Goal
- Achieved 1st Place in the 2024 CVPR AIS Depth Compression Challenge.
- Develop a robust super-resolution method for degraded and noisy depth maps.
- Demonstrate a highly efficient approach capable of near real-time performance.
1. Problem Definition & Dataset Analysis
• Context of the Challenge
- Low-resolution (LR) depth maps suffer from degradation and noise, making them unreliable for direct upsampling.
- Given a low-resolution depth map $D_{\text{LR}}$ and a corresponding RGB input $I$, the goal is to reconstruct a high-resolution depth map $D_{\text{SR}}$ such that: \(D_{\text{SR}} \approx D_{\text{GT}}\) where $D_{\text{GT}}$ is the ground truth depth map.
• Key Observations
- Resolution Degradation: $D_{\text{LR}}$ suffers from downsampling artifacts and spatial corruption: \(D_{\text{LR}} = \downarrow (D_{\text{GT}}) + \eta\) where $\downarrow$ denotes downsampling and $\eta$ is additive noise.
- Noise Impact: The added noise $\eta$ significantly disrupts depth reconstruction quality.
2. Proposed Model & Approach
• Utilizing Relative Depth (Depth Anything)
- The pre-trained ‘Depth Anything’ model extracts relative depth, used as a supplementary guide for depth super-resolution: \(D_{\text{Rel}} = f_{\text{DepthAnything}}(I)\)
- Inputs include the relative depth map, low-resolution (LR) depth map, and the input image, which are concatenated to form a multi-channel input: \(X = [D_{\text{Rel}}, D_{\text{LR}}, I]\) where $[\cdot]$ denotes concatenation.
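As a concrete illustration, the sketch below prepares the multi-channel input $X$ (1 + 1 + 3 = 5 channels) using the Hugging Face `transformers` depth-estimation pipeline as one convenient way to run Depth Anything. The checkpoint name (`LiheYoung/depth-anything-small-hf`), the min-max normalization of $D_{\text{Rel}}$, and the bilinear upsampling of $D_{\text{LR}}$ are illustrative assumptions, not necessarily the exact pipeline used in the submission.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import to_tensor
from transformers import pipeline

# One convenient way to obtain D_Rel; the checkpoint choice is an assumption.
depth_anything = pipeline(task="depth-estimation",
                          model="LiheYoung/depth-anything-small-hf")

def build_input(rgb_path: str, d_lr: torch.Tensor) -> torch.Tensor:
    """Build X = [D_Rel, D_LR, I] as a 5-channel tensor (1 + 1 + 3 channels)."""
    image = Image.open(rgb_path).convert("RGB")
    h, w = image.height, image.width

    # Relative depth from Depth Anything, resized to the RGB resolution and
    # min-max normalized (the exact normalization is an assumption).
    d_rel = depth_anything(image)["predicted_depth"]
    d_rel = d_rel.reshape(1, 1, *d_rel.shape[-2:]).float()
    d_rel = F.interpolate(d_rel, size=(h, w), mode="bilinear", align_corners=False)
    d_rel = (d_rel - d_rel.min()) / (d_rel.max() - d_rel.min() + 1e-8)

    # LR depth brought to the target resolution (bilinear upsampling assumed).
    d_lr = d_lr.reshape(1, 1, *d_lr.shape[-2:]).float()
    d_lr = F.interpolate(d_lr, size=(h, w), mode="bilinear", align_corners=False)

    # RGB image scaled to [0, 1].
    rgb = to_tensor(image).unsqueeze(0)              # (1, 3, h, w)

    return torch.cat([d_rel, d_lr, rgb], dim=1)      # (1, 5, h, w)
```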
• U-Net-like Structure with Tailored Design
Our architecture is a U-Net-style encoder-decoder with skip connections, extended with a dual-path encoder and an AdaIN-based fusion module to address the challenges of noisy LR depth maps.
Encoder
- The encoder processes the components of the multi-channel input $X$ through two parallel paths:
- Relative Depth Path: \(F_{\text{Rel}} = \text{NAFNet}(D_{\text{Rel}})\)
- LR Depth Path: \(F_{\text{LR}} = \text{NAFNet}(D_{\text{LR}})\)
- Features $F_{\text{Rel}}$ and $F_{\text{LR}}$ are progressively downsampled across multiple encoder levels, with NAFNet blocks performing feature extraction at each level.
Fusion Module
- Features from the relative depth and LR depth paths are fused using Adaptive Instance Normalization (AdaIN) to align their distributions: \(F_{\text{Fusion}} = \text{AdaIN}(F_{\text{Rel}}, F_{\text{LR}})\) where \(\text{AdaIN}(F_{\text{Rel}}, F_{\text{LR}}) = \sigma_{\text{LR}} \cdot \frac{F_{\text{Rel}} - \mu_{\text{Rel}}}{\sigma_{\text{Rel}}} + \mu_{\text{LR}}\) and $\mu$ and $\sigma$ denote the channel-wise mean and standard deviation of the corresponding features.
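A minimal PyTorch sketch of this fusion step is given below. The module name, the choice of per-channel statistics over the spatial dimensions, and the epsilon term are our illustrative assumptions rather than the exact competition implementation.

```python
import torch
import torch.nn as nn

class AdaINFusion(nn.Module):
    """Re-normalize relative-depth features to match LR-depth feature statistics."""

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, f_rel: torch.Tensor, f_lr: torch.Tensor) -> torch.Tensor:
        # Per-channel statistics over spatial dimensions, inputs are (N, C, H, W).
        mu_rel = f_rel.mean(dim=(2, 3), keepdim=True)
        sigma_rel = f_rel.std(dim=(2, 3), keepdim=True) + self.eps
        mu_lr = f_lr.mean(dim=(2, 3), keepdim=True)
        sigma_lr = f_lr.std(dim=(2, 3), keepdim=True)

        # AdaIN(F_Rel, F_LR) = sigma_LR * (F_Rel - mu_Rel) / sigma_Rel + mu_LR
        return sigma_lr * (f_rel - mu_rel) / sigma_rel + mu_lr

# Example: fuse two 64-channel feature maps at one encoder level.
fusion = AdaINFusion()
f_fused = fusion(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64))
```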
Decoder
- The decoder reconstructs the high-resolution (HR) depth map by progressively upsampling the fused features $F_{\text{Fusion}}$, while utilizing skip connections from the encoder: \(D_{\text{SR}} = \text{Decoder}(F_{\text{Fusion}})\)
• Detailed Architecture
- The encoder performs four stages of downsampling, progressively extracting deeper, lower-resolution features from the input.
- At each stage, the Fusion Module normalizes and combines features from the two encoder paths.
- The decoder restores the fused features to the target resolution via upsampling, integrating skip connections for enhanced reconstruction quality.
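The sketch below captures this overall structure as a hedged PyTorch prototype: two encoder paths with four downsampling stages, AdaIN fusion at every stage feeding the skip connections, and an upsampling decoder. Plain convolutional blocks stand in for NAFNet blocks, the channel widths are illustrative, the RGB channels' routing is omitted for brevity, and both depth inputs are assumed to be single-channel maps at the target resolution.

```python
import torch
import torch.nn as nn

def adain(f_rel, f_lr, eps=1e-5):
    # Same fusion formula as in the Fusion Module above.
    mu_r = f_rel.mean((2, 3), keepdim=True)
    sd_r = f_rel.std((2, 3), keepdim=True) + eps
    mu_l = f_lr.mean((2, 3), keepdim=True)
    sd_l = f_lr.std((2, 3), keepdim=True)
    return sd_l * (f_rel - mu_r) / sd_r + mu_l

def block(c_in, c_out):
    # Plain conv block standing in for a NAFNet block (illustrative only).
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.GELU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.GELU())

class DualPathDepthSR(nn.Module):
    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        self.down_rel, self.down_lr = nn.ModuleList(), nn.ModuleList()
        self.enc_rel, self.enc_lr = nn.ModuleList(), nn.ModuleList()
        c_prev = 1
        for c in widths:  # four downsampling stages per encoder path
            self.down_rel.append(nn.Conv2d(c_prev, c, 2, stride=2))
            self.down_lr.append(nn.Conv2d(c_prev, c, 2, stride=2))
            self.enc_rel.append(block(c, c))
            self.enc_lr.append(block(c, c))
            c_prev = c
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(widths[i + 1], widths[i], 2, stride=2)
            for i in reversed(range(len(widths) - 1)))
        self.dec = nn.ModuleList(
            block(2 * widths[i], widths[i])
            for i in reversed(range(len(widths) - 1)))
        self.final_up = nn.ConvTranspose2d(widths[0], widths[0], 2, stride=2)
        self.head = nn.Conv2d(widths[0], 1, 3, padding=1)

    def forward(self, d_rel, d_lr):
        fused = []
        x_rel, x_lr = d_rel, d_lr
        for dwn_r, dwn_l, enc_r, enc_l in zip(self.down_rel, self.down_lr,
                                              self.enc_rel, self.enc_lr):
            x_rel, x_lr = enc_r(dwn_r(x_rel)), enc_l(dwn_l(x_lr))
            fused.append(adain(x_rel, x_lr))   # fused skip features at each scale
        x = fused[-1]                          # deepest fused features
        for up, dec, skip in zip(self.up, self.dec, reversed(fused[:-1])):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(self.final_up(x))     # predicted HR depth, 1 channel

model = DualPathDepthSR()
d_sr = model(torch.randn(1, 1, 256, 256), torch.randn(1, 1, 256, 256))
```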
3. Implementation & Training
• Loss Function
The total loss function consists of a pixel-level reconstruction term and an edge preservation term: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{pixel}} + \mathcal{L}_{\text{edge}}\)
- Pixel Reconstruction Loss: \(\mathcal{L}_{\text{pixel}} = \| D_{\text{SR}} - D_{\text{GT}} \|_1\)
- Edge Preservation Loss: \(\mathcal{L}_{\text{edge}} = \| \text{Sobel}(D_{\text{SR}}) - \text{Sobel}(D_{\text{GT}}) \|_1\)
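A compact PyTorch sketch of this loss, assuming fixed 3×3 Sobel kernels and the equal weighting shown in the total-loss equation above:

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 Sobel kernels for horizontal and vertical gradients, shape (2, 1, 3, 3).
_SOBEL = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]],
                       [[[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]]])

def sobel(d: torch.Tensor) -> torch.Tensor:
    """x/y gradient maps of a single-channel depth tensor of shape (N, 1, H, W)."""
    return F.conv2d(d, _SOBEL.to(d.device, d.dtype), padding=1)

def total_loss(d_sr: torch.Tensor, d_gt: torch.Tensor) -> torch.Tensor:
    l_pixel = F.l1_loss(d_sr, d_gt)                 # L_pixel: L1 reconstruction
    l_edge = F.l1_loss(sobel(d_sr), sobel(d_gt))    # L_edge: L1 on Sobel gradients
    return l_pixel + l_edge                         # equal weighting, as above
```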
• Dataset & Pre-Training
- Pre-trained Model: Initialized by pre-training on the MVS-Synthetic dataset, with depth values normalized to \(D_{\text{LR}}, D_{\text{GT}} \in [0, 1]\)
- Depth Clipping: Depth values are clipped to a maximum of \(D_{\text{GT, max}} = 300\) (see the preprocessing sketch after this list).
- Validation Set: Last 100 samples reserved for validation.
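A minimal sketch of the assumed clip-and-rescale preprocessing; dividing by the clipping value to map depths into $[0, 1]$ is our assumption, as only the range and the clip value are stated above.

```python
import torch

D_MAX = 300.0  # clipping value from the training setup above

def preprocess_depth(d: torch.Tensor) -> torch.Tensor:
    """Clip depth at D_MAX and rescale to [0, 1] (rescaling by D_MAX is assumed)."""
    return torch.clamp(d, min=0.0, max=D_MAX) / D_MAX
```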
• Training Settings
- Batch Size: 8
- Learning Rate: \(\alpha_{\text{DepthAnything}} = 2 \cdot 10^{-6}, \quad \alpha_{\text{U-Net}} = 2 \cdot 10^{-4}\) (see the optimizer sketch after this list)
- Epochs: 500
- Hardware: Single NVIDIA A6000 GPU (training time: ~3 days).
- Inference Speed: ~24 FPS on RTX 3090.
- Parameters: 29M
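The differential learning rates translate naturally into two optimizer parameter groups, as sketched below with stand-in modules; AdamW is an assumption, since the optimizer type is not stated above. A common motivation for the 100× smaller backbone rate is to preserve the pre-trained relative-depth representation while the SR network adapts to the noisy LR depth.

```python
import torch
import torch.nn as nn

# Stand-in modules for the two trainable components: the fine-tuned Depth Anything
# backbone and the U-Net-like SR network (the real modules are far larger).
depth_anything = nn.Conv2d(3, 1, 3, padding=1)
sr_network = nn.Conv2d(5, 1, 3, padding=1)

# Two parameter groups reproduce the learning rates listed above. AdamW is an
# assumption; the optimizer type is not specified in the text.
optimizer = torch.optim.AdamW([
    {"params": depth_anything.parameters(), "lr": 2e-6},
    {"params": sr_network.parameters(), "lr": 2e-4},
])
```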
4. Results & Conclusion
• Enhanced Detail
- Achieves finer edge and detail reconstruction than the baseline: \(\| D_{\text{SR}}^{\text{Ours}} - D_{\text{GT}} \| < \| D_{\text{SR}}^{\text{Baseline}} - D_{\text{GT}} \|\)
• Noise Robustness
- Effectively mitigates noise from $D_{\text{LR}}$, retaining high accuracy: \(\| D_{\text{SR}} - D_{\text{GT}} \| < \| D_{\text{LR}} - D_{\text{GT}} \|\)
• Real-Time Feasibility
- Operates efficiently at ~24 FPS on an RTX 3090, enabling near real-time applications: \(\text{Speed}_{\text{Ours}} \approx 24 \, \text{FPS}\)
• Summary
By leveraging $D_{\text{Rel}}$ from the Depth Anything model and integrating a tailored U-Net architecture, our approach achieves robust depth super-resolution. It effectively handles noise, reconstructs fine details, and runs in near real-time (~24 FPS), making it well suited for practical deployment.