• Goal
- Develop a user-aware, emotion-recognizing model that combines face recognition and representation learning.
- Integrate test-time adaptation for robust performance across various datasets (e.g., Koln, FERV39K, DFEW).
- Optimize the final model for NVIDIA Orin inference, including Docker environment setup, ONNX conversion, quantization, and TensorRT deployment.
- Utilize RADIOv2, a foundation model, to extract robust facial features for downstream tasks.

1.1 Feature Extraction with RADIOv2
- Foundation Model: Leveraged RADIOv2, a pre-trained vision transformer, to extract high-quality facial features: $F_{\text{RADIOv2}} = f_{\text{RADIOv2}}(I)$, where $I$ is the input image and $F_{\text{RADIOv2}}$ is the resulting feature embedding.
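A minimal extraction sketch, assuming the torch.hub entry point published in the NVlabs/RADIO repository (the version string and input handling may differ per release):

```python
import torch

# Load RADIOv2 through torch.hub (entry point from the NVlabs/RADIO repo;
# the exact version string depends on the release being used).
model = torch.hub.load("NVlabs/RADIO", "radio_model",
                       version="radio_v2", progress=True)
model.cuda().eval()

# One 224x224 RGB face crop; RADIO expects pixel values in [0, 1].
image = torch.rand(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    # The model returns a (summary, spatial) pair; the summary vector
    # serves as the global embedding F_RADIOv2 for the downstream heads.
    summary, spatial = model(image)
```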

1.2 Test-Time Adaptation (Face Recognition)
- Motivation: Achieve stable, adaptive face recognition under domain shifts such as lighting variations, pose changes, and occlusions.
- Approach:
- Researched test-time adaptation methods that fine-tune model parameters during inference: $W_{\text{new}} = W_{\text{old}} - \eta \cdot \nabla L_{\text{TTA}}$, where $W_{\text{new}}$ denotes the updated weights, $W_{\text{old}}$ the original weights, $\eta$ the adaptation step size, and $L_{\text{TTA}}$ the test-time adaptation loss. One adaptation step is sketched below.
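A minimal sketch of one adaptation step, using the feature-consistency objective from Section 5 as $L_{\text{TTA}}$ (other choices, e.g. entropy minimization, plug in the same way; the optimizer setup here is an assumption, not the project's configuration):

```python
import torch

def tta_step(model, image, f_base, optimizer):
    """One test-time update: W_new = W_old - eta * grad(L_TTA), with eta
    given by the optimizer's learning rate. In practice only a small
    parameter subset (e.g. normalization layers) is usually trainable."""
    optimizer.zero_grad()
    f_adapted, _ = model(image)          # adapted embedding F_Adapted
    # L_TTA = ||F_Base - F_Adapted||^2 against the frozen reference embedding
    loss = ((f_adapted - f_base.detach()) ** 2).sum()
    loss.backward()
    optimizer.step()
    return loss.item()
```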

2. Face Recognition + Representation Learning
2.1 Pipeline Design
The pipeline combines RADIOv2, ArcFace, and SimCLR to achieve robust identity verification and emotion recognition.
- Feature Extraction
- Used RADIOv2 to extract foundational facial embeddings:
$F_{\text{Base}} = f_{\text{RADIOv2}}(I)$, where $I$ is the input image and $F_{\text{Base}}$ is the extracted feature embedding.
- Representation Learning
- Applied SimCLR for contrastive learning: $L_{\text{SimCLR}} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$, where $z_i$ and $z_j$ are projections of $F_{\text{Base}}$, $\mathrm{sim}(\cdot)$ is cosine similarity, $\tau$ is the temperature parameter, and $N$ is the total number of samples.
- Classification (ArcFace)
- Integrated an ArcFace head for face recognition: $L_{\text{ArcFace}} = -\log \frac{\exp(s \cdot \cos(\theta_i + m))}{\sum_{j=1}^{C} \exp(s \cdot \cos(\theta_j))}$, where $s$ is the scale factor, $\theta_i$ is the angle between the feature and the class-$i$ weight, $m$ is the margin penalty, and $C$ is the number of classes.
- Emotion Recognition Branch
- Added an optional emotion recognition head trained on $F_{\text{Base}}$ for classification: $L_{\text{Emotion}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$, where $y_c$ is the ground truth for class $c$ and $\hat{y}_c$ is the predicted probability for class $c$. The ArcFace and SimCLR loss heads are sketched below.
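The two training heads can be sketched as follows. This is a minimal illustration of the standard ArcFace and NT-Xent (SimCLR) formulations; $s$, $m$, $\tau$, and all dimensions are illustrative defaults, not values from this project. The emotion branch is an ordinary linear classifier with cross-entropy on top of $F_{\text{Base}}$ and is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin head: L_ArcFace with scale s and margin m."""
    def __init__(self, dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        # cos(theta_j) between L2-normalized features and class weights
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the margin m only to the ground-truth class angle
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = self.s * torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(logits, labels)

def simclr_loss(z_i, z_j, tau=0.1):
    """NT-Xent over a batch of positive pairs (z_i[k], z_j[k])."""
    z = F.normalize(torch.cat([z_i, z_j]), dim=1)
    sim = z @ z.t() / tau                  # sim(z_a, z_b) / tau
    sim.fill_diagonal_(float("-inf"))      # exclude self-similarity
    n = z_i.size(0)
    # positives: row k pairs with row k+n, and vice versa
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, pos)
```

A hypothetical training step would sum the three terms, e.g. `arc(f_base, ids) + simclr_loss(proj(f1), proj(f2)) + F.cross_entropy(emotion_head(f_base), emotions)`, where the projection MLP `proj` and linear `emotion_head` are assumed components.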
3. Training & Optimization
3.1 Multi-Teacher Distillation with Loss Formulation
To improve model generalization, the framework uses multi-teacher distillation, adapting methods from AM-RADIO:
- Summary Feature Loss
- The student model matches the summary feature vectors of the teachers: $L_{\text{Summary}}(x) = \sum_i \lambda_i \cdot L_{\text{Cos}}(y_i^{(s)}, z_i^{(t)})$, where $y_i^{(s)}$ is the student's summary feature for teacher $i$, $z_i^{(t)}$ is teacher $i$'s summary feature, $\lambda_i$ is the weight for teacher $i$, and $L_{\text{Cos}}$ is the cosine similarity loss.
- Spatial Feature Loss
- Spatial features of the student are matched to those of each teacher: $L_{\text{Spatial}}(x) = \sum_i \gamma_i \cdot \left( \alpha L_{\text{Cos}}(y_i^{(s)}, z_i^{(t)}) + \beta L_{\text{Smooth-L1}}(y_i^{(s)}, z_i^{(t)}) \right)$, where $\gamma_i$ is the per-teacher weight and $\alpha$, $\beta$ control the balance between cosine similarity and smooth L1 loss.
- Combined Loss
- The total loss for distillation is $L_{\text{Total}} = L_{\text{Summary}} + L_{\text{Spatial}}$; a compact sketch of this combined objective follows.
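A minimal sketch under the formulation above (per-teacher adaptor layers and spatial-feature reshaping are omitted; $\alpha = 0.9$, $\beta = 0.1$ follow Section 5, and the per-teacher weight lists are assumed inputs):

```python
import torch.nn.functional as F

def cos_loss(y, z):
    # L_Cos: 1 - cosine similarity, averaged over the batch
    return (1 - F.cosine_similarity(y, z, dim=-1)).mean()

def distill_loss(stu_sum, tea_sum, stu_spa, tea_spa,
                 lambdas, gammas, alpha=0.9, beta=0.1):
    # L_Summary: weighted cosine loss on per-teacher summary vectors
    l_summary = sum(l * cos_loss(ys, zs)
                    for l, ys, zs in zip(lambdas, stu_sum, tea_sum))
    # L_Spatial: cosine + smooth-L1 blend on per-teacher spatial features
    l_spatial = sum(g * (alpha * cos_loss(ys, zs) +
                         beta * F.smooth_l1_loss(ys, zs))
                    for g, ys, zs in zip(gammas, stu_spa, tea_spa))
    return l_summary + l_spatial           # L_Total
```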
3.2 Deployment Optimization
ONNX Conversion
The PyTorch models were converted to ONNX format for hardware-agnostic optimization: $\text{Model}_{\text{ONNX}} = \mathrm{Export}(\text{Model}_{\text{PyTorch}})$, where $\text{Model}_{\text{PyTorch}}$ is the original model.
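A minimal export sketch (the file path, input/output names, and opset version are illustrative assumptions, not project settings):

```python
import torch

def export_onnx(model: torch.nn.Module, path: str = "face_emotion.onnx"):
    """Trace-based ONNX export with a fixed batch-1, 224x224 input,
    matching the deployment configuration in Section 4."""
    model.eval()
    dummy = torch.randn(1, 3, 224, 224)
    torch.onnx.export(model, dummy, path,
                      input_names=["image"], output_names=["embedding"],
                      opset_version=17)
```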
Quantization
Model precision was reduced to INT8 or FP16 to improve latency and reduce memory usage: $Q(x) = \frac{\mathrm{round}(x \cdot 2^n)}{2^n}$, where $x$ is the original model parameter and $n$ determines the quantization bit-width.
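The formula describes a uniform quantizer; a toy illustration follows (real INT8 deployment instead uses TensorRT's calibration-derived per-tensor scales rather than a fixed $2^n$):

```python
import torch

def quantize(x: torch.Tensor, n: int) -> torch.Tensor:
    # Q(x) = round(x * 2^n) / 2^n: snap x to a grid with step 2^-n
    scale = 2.0 ** n
    return torch.round(x * scale) / scale

# Example: quantize(torch.tensor([0.1234]), 8) -> 0.1250 (= 32 / 256)
```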
TensorRT Deployment
The TensorRT-optimized model was deployed on NVIDIA Orin, achieving real-time inference with high throughput.
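A build sketch using the TensorRT 8.x Python API (on Orin the same conversion is often done with the trtexec CLI; INT8 additionally requires a calibrator, omitted here):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Parse the ONNX model produced in the previous step
parser = trt.OnnxParser(network, logger)
with open("face_emotion.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

# Enable FP16 and serialize the optimized engine
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine = builder.build_serialized_network(network, config)
with open("face_emotion.engine", "wb") as f:
    f.write(engine)
```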
4. NVIDIA Orin Inference & Performance
- Batch Size: 1
- Input Resolution: 224×224
- Approximate Latency: 15–20 ms per frame (≈50–65 FPS); a simple measurement harness is sketched after this list.
- Pipeline Integration: Integrated with DeepStream for multi-camera video streaming and real-time analysis.
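A rough way to reproduce the latency numbers, assuming an onnxruntime build with the TensorRT execution provider (provider availability depends on the JetPack image; the input name matches the export sketch above):

```python
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "face_emotion.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"])
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)

for _ in range(20):                 # warm-up (engine build, caches)
    sess.run(None, {"image": frame})

runs = 200
start = time.perf_counter()
for _ in range(runs):
    sess.run(None, {"image": frame})
ms = (time.perf_counter() - start) / runs * 1e3
print(f"{ms:.1f} ms/frame ({1000 / ms:.0f} FPS)")
```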
5. Key Challenges & Solutions
Test-Time Adaptation
- Challenge: Adapting to domain shifts such as lighting changes and occlusions during inference.
- Solution: Implemented test-time loss minimization to dynamically update weights: $L_{\text{TTA}} = \| F_{\text{Base}} - F_{\text{Adapted}} \|^2$, where $F_{\text{Base}}$ is the original feature embedding and $F_{\text{Adapted}}$ is the adapted feature embedding.
Representation Learning
- Challenge: Balancing supervised learning (ArcFace) and unsupervised learning (SimCLR).
- Solution: Introduced a weighted multi-task loss to alternate between classification and contrastive learning.
Multi-Teacher Distillation
- Challenge: Combining features from heterogeneous teacher models.
- Solution: Implemented loss balancing with cosine similarity and smooth L1 for effective spatial feature learning: $L_{\text{Spatial}} = \alpha L_{\text{Cos}} + \beta L_{\text{Smooth-L1}}$, with $\alpha = 0.9$ and $\beta = 0.1$ to prioritize cosine similarity.

6. Results & Conclusion
Face Recognition
- Achieved state-of-the-art accuracy with ArcFace and RADIOv2 on WebFace and Celeb-1M datasets.
Emotion Recognition
- Demonstrated robust performance on FERV39K and DFEW datasets, showcasing strong generalization across different domains.
Real-Time Feasibility
- Achieved 50–65 FPS inference speed on NVIDIA Orin, enabling real-time emotion detection and face recognition.
Overall
This project highlights the successful integration of RADIOv2, multi-task learning, and hardware optimization. It delivers a robust, real-time solution for face recognition and emotion analysis that performs reliably across diverse conditions and datasets.