• Goal
- Build models that perform computer vision tasks reliably in resource-constrained environments
- Refactor existing legacy code
- Study models capable of instance segmentation, using a semantic segmentation dataset
1. Object Detection: Network Design and Optimization
• NVIDIA Xavier Resources
- CPU: 8-core ARM v8.2 64-bit CPU (Custom NVIDIA Carmel)
- GPU: Volta Architecture with 512 NVIDIA CUDA cores and 64 Tensor Cores
- RAM: 16 GB 256-bit LPDDR4x memory
- USB: 4 USB 3.1 Gen 1 ports, 1 USB 3.1 Gen 2 port, 1 USB-C port
- Camera: 16 lanes MIPI CSI-2 (D-PHY 1.2), up to 6 simultaneous cameras
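The target platform can be verified at runtime; below is a minimal sanity-check sketch using PyTorch's CUDA API (the printed values should line up with the specs above):

```python
import torch

# Query the Xavier's integrated Volta GPU (CUDA device 0). On Jetson,
# the CPU and GPU share the same LPDDR4x memory pool.
props = torch.cuda.get_device_properties(0)
print(props.name)
print(props.multi_processor_count, "SMs")
print(round(props.total_memory / 2**30), "GiB shared memory")
```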
• YOLOv5 Model Performance Improvement and Quantization
- Applied quantization to the YOLOv5-based model to improve performance and reduce its footprint.
- The new developments in YOLOv5 improved model accuracy and speed on GPUs, but added complexity for CPU deployments.
- Compound scaling resulted in small, memory-bound networks such as YOLOv5s and larger, compute-bound networks such as YOLOv5l.
- Post-processing and Focus blocks slowed down YOLOv5l, especially at larger input sizes.
- Deployment performance differed sharply between GPUs and CPUs: at batch size 1 and 640x640 input, there was a more than 7x performance gap between a T4 FP16 GPU instance on AWS running PyTorch and a 24-core C5 CPU instance on AWS running ONNX Runtime.
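This kind of gap can be measured with a simple latency harness. A minimal sketch, assuming the Ultralytics YOLOv5 repo is reachable via torch.hub and a CUDA device is available; note that the repo's own export.py handles ONNX export corner cases more robustly than the bare torch.onnx.export call used here:

```python
import time
import torch
import onnxruntime as ort

def bench(fn, runs=100, warmup=10):
    """Average wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - t0) / runs * 1e3

x = torch.randn(1, 3, 640, 640)

# Bare YOLOv5s network from torch.hub (autoshape=False skips the wrapper).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", autoshape=False).eval()

# FP16 GPU latency; synchronize so kernels, not launches, are timed.
gpu, xg = model.cuda().half(), x.cuda().half()
with torch.no_grad():
    ms = bench(lambda: (gpu(xg), torch.cuda.synchronize()))
print(f"GPU FP16:         {ms:.1f} ms/img")

# Export to ONNX and measure CPU latency with ONNX Runtime.
torch.onnx.export(model.float().cpu(), x, "yolov5s.onnx", opset_version=12)
sess = ort.InferenceSession("yolov5s.onnx", providers=["CPUExecutionProvider"])
name = sess.get_inputs()[0].name
ms = bench(lambda: sess.run(None, {name: x.numpy()}))
print(f"CPU ONNX Runtime: {ms:.1f} ms/img")
```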
• Class Imbalance Problem Solution
- Mitigated the class imbalance problem by weighting the loss function with the effective number of samples, a measure of data overlap per class.
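For reference, the effective number of samples for a class observed n times is E_n = (1 - beta^n) / (1 - beta), and the per-class loss weight is set inversely proportional to it (the class-balanced loss of Cui et al.). A minimal sketch with hypothetical class counts:

```python
import torch
import torch.nn.functional as F

def class_balanced_weights(samples_per_class, beta=0.9999):
    # Effective number of samples per class: E_n = (1 - beta^n) / (1 - beta).
    n = torch.as_tensor(samples_per_class, dtype=torch.float)
    effective_num = (1.0 - torch.pow(beta, n)) / (1.0 - beta)
    w = 1.0 / effective_num
    return w / w.sum() * len(n)  # normalize so the weights sum to #classes

counts = [95_000, 3_000, 500]    # hypothetical per-class sample counts
weights = class_balanced_weights(counts)

# Plug the weights into the usual cross-entropy loss.
logits, targets = torch.randn(8, 3), torch.randint(0, 3, (8,))
loss = F.cross_entropy(logits, targets, weight=weights)
```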
2. Segmentation Model: Network Design and Optimization
• Model Exploration and Adaptation
- Explored various models such as DDRNet, DeepLab V3+, and ESPNet.
- Optimized the models for autonomous driving scenarios through lightweight design and structural optimization.
• Model Comparison
Table 1: Performance comparison of semantic segmentation models on the Cityscapes dataset.

| Model | Model Size | Inference Speed (FPS) | Accuracy (mIoU) |
|---|---|---|---|
| LiteSeg | 1.2 MB | 88.2 | 70.6 |
| EfficientPS | 4.3 MB | 23.5 | 72.3 |
| FastDepthSeg | 1.9 MB | 140.8 | 70.1 |
| DDRNet | 7.5 MB | 82.7 | 78.2 |
| DeepLab V3+ | 8.4 MB | 37.6 | 77.7 |
3. Model Quantization and Comparison
- Applied post-training static quantization, post-training dynamic quantization, and quantization-aware training (sketched below).
- Compared the performance of the quantized models.
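In PyTorch's eager-mode quantization API, the three methods look roughly as follows; this is a minimal sketch on a toy network, not the actual project models:

```python
import torch
import torch.nn as nn

# Toy stand-in network with the quant/dequant boundary marked explicitly.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

calib = [torch.randn(1, 3, 64, 64) for _ in range(8)]  # stand-in calib data

# 1) Post-training static quantization: observe activation ranges on
#    calibration data, then convert weights and activations to int8.
m = TinyNet().eval()
m.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # "qnnpack" on ARM
torch.quantization.prepare(m, inplace=True)
for x in calib:
    m(x)
static_int8 = torch.quantization.convert(m)

# 2) Post-training dynamic quantization: int8 weights, activations
#    quantized on the fly (applies to Linear/LSTM-type modules).
dynamic_int8 = torch.quantization.quantize_dynamic(
    nn.Sequential(nn.Linear(64, 32)).eval(), {nn.Linear}, dtype=torch.qint8)

# 3) Quantization-aware training: insert fake-quant ops, fine-tune,
#    then convert to a real int8 model.
m = TinyNet().train()
m.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(m, inplace=True)
# ... run a few fine-tuning epochs here ...
qat_int8 = torch.quantization.convert(m.eval())
```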
• Quantization Results
Table 2: Performance comparison of models under different quantization methods. Each cell shows mIoU / Model Size / FPS.

| Model | Quantization-aware Training | Post-training Static Quantization | Post-training Dynamic Quantization |
|---|---|---|---|
| LiteSeg | 65.1 / 0.6 MB / 172.4 | 62.8 / 0.5 MB / 194.5 | 64.0 / 0.6 MB / 178.7 |
| EfficientPS | 68.7 / 2.15 MB / 45.9 | 65.3 / 1.7 MB / 57.1 | 67.0 / 2.15 MB / 51.3 |
| FastDepthSeg | 67.0 / 0.95 MB / 274.6 | 64.8 / 0.8 MB / 294.8 | 65.8 / 0.95 MB / 281.3 |
| DDRNet | 73.4 / 3.75 MB / 160.2 | 71.0 / 3.2 MB / 185.3 | 72.4 / 3.75 MB / 171.4 |
| DeepLab V3+ | 69.7 / 4.2 MB / 72.9 | 67.5 / 3.5 MB / 87.4 | 68.8 / 4.2 MB / 79.6 |
Table 3: Results of quantization-aware training and post-training quantization on various models. Each cell shows mIoU change / Model Size Decrease (%) / RAM Usage Decrease (%).

| Model | Quantization-aware Training | Post-training Static Quantization | Post-training Dynamic Quantization |
|---|---|---|---|
| LiteSeg | -4.3 / 16.7 / 0 | -2.6 / 16.7 / -7 | -1.8 / 0 / -5 |
| EfficientPS | -4.8 / 15.5 / -17 | -4.8 / 20.9 / -7 | -2.6 / 15.5 / -12 |
| FastDepthSeg | -2.2 / 2.3 / 42 | -3.3 / 10.5 / 14 | -1.9 / 2.3 / 23 |
| DDRNet | -3.3 / 9.7 / 44 | -3.3 / 13.3 / 22 | -1.8 / 9.7 / 36 |
| DeepLab V3+ | -2.7 / 19 / 56 | -2.9 / 16.7 / 38 | -1.6 / 19 / 48 |
• ONNX Conversion and Optimization
- Converted the models to ONNX and applied graph-level optimizations for deployment
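A minimal export-and-optimize sketch; the two-layer network below is a stand-in for the trained segmentation models, and the file names are illustrative:

```python
import torch
import torch.nn as nn
import onnxruntime as ort

# Stand-in for a trained segmentation network (19 Cityscapes classes).
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 19, 1)).eval()

# Export with a fixed input shape; dynamic shapes would require
# optimization profiles later in TensorRT.
dummy = torch.randn(1, 3, 512, 1024)
torch.onnx.export(model, dummy, "seg.onnx", opset_version=13,
                  input_names=["input"], output_names=["output"])

# Let ONNX Runtime fold constants and fuse ops, saving the result.
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "seg.opt.onnx"
ort.InferenceSession("seg.onnx", so, providers=["CPUExecutionProvider"])
```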
• TensorRT Optimization for the NVIDIA Xavier Environment
- Optimized models with TensorRT for inference on NVIDIA Xavier devices
- Achieved a significant improvement in inference speed while maintaining acceptable accuracy
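A minimal engine-build sketch using the TensorRT 8.x Python API, consuming the seg.onnx file from the previous step and enabling FP16 for the Xavier's Tensor Cores (trtexec provides the same workflow from the command line):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("seg.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(str(parser.get_error(0)))

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)  # use the Volta Tensor Cores

# Serialize the engine; at runtime it is deserialized and executed.
with open("seg.plan", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```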
4. Legacy Code Refactoring
• GStreamer Pipeline Optimization
- Optimized the GStreamer-based video pipeline
• Decoding Improvement
- Replaced SW decoding with HW decoding to reduce CPU overhead (sketched below)
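Concretely, the change swaps the CPU decoder element for the Jetson's hardware-decoder element. A minimal sketch with PyGObject; the source file and pipeline are illustrative, not the exact project pipeline:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# avdec_h264 decodes on the CPU; nvv4l2decoder uses the Xavier's NVDEC
# block and keeps frames in NVMM device memory until nvvidconv copies out.
SW = ("filesrc location=cam.h264 ! h264parse ! avdec_h264 "
      "! videoconvert ! fakesink")
HW = ("filesrc location=cam.h264 ! h264parse ! nvv4l2decoder "
      "! nvvidconv ! video/x-raw ! fakesink")

pipeline = Gst.parse_launch(HW)
pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```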
• Sensor Input and Modularization
- Routed the various sensor inputs through the HW decoding module
- Modularized the GStreamer code and improved the deep learning model loading process with a plugin loader (sketched below)
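The plugin loader reduces to resolving a model by name at runtime. A minimal sketch with importlib; the models/<name>.py layout and the build() factory are hypothetical:

```python
import importlib

def load_model(name: str):
    """Resolve a model plugin by name: models/<name>.py must expose build()."""
    module = importlib.import_module(f"models.{name}")
    return module.build()

# Swapping the deployed network becomes a config change, not a code change:
# model = load_model("ddrnet")
```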
5. Weakly Supervised Instance Segmentation Model Development
- Developed a weakly supervised instance segmentation model using semantic segmentation data
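One common way to bootstrap instance supervision from semantic masks, shown as an illustrative sketch rather than the project's exact method, is to treat each connected component of a "thing" class as an instance pseudo-label:

```python
import numpy as np
from scipy import ndimage

def pseudo_instances(semantic_mask, thing_ids):
    """Label each connected component of every 'thing' class as one instance."""
    instances = np.zeros_like(semantic_mask, dtype=np.int32)
    next_id = 1
    for cls in thing_ids:
        labeled, n = ndimage.label(semantic_mask == cls)
        instances[labeled > 0] = labeled[labeled > 0] + next_id - 1
        next_id += n
    return instances

mask = np.array([[0, 1, 1, 0, 2],
                 [0, 1, 0, 0, 2],
                 [3, 0, 0, 2, 2]])
print(pseudo_instances(mask, thing_ids=[1, 2]))  # yields instance ids 1 and 2
```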