CVPR 2022 Oral. [Paper] [Github]
Changqian Yu, Changxin Gao, Jingbo Wang, Gang Yu, Chunhua Shen, Nong Sang
5 Apr 2020
BiSeNet V2
Introduction
Bilateral Network:
- Detail Branch: to capture the spatial details with wide channels and shallow layers
- Semantic Branch: to extract the categorical semantics with narrow channels and deep layers
→ a large receptive field
→ lightweight with fewer channels and a fast-downsampling strategy
- Guided Aggregation Layer: to merge both types of features
- Booster Training Strategy with a series of auxiliary prediction heads (discarded in the inference phase)
- 72.6% mean IoU on the Cityscapes test set with the speed of 156 FPS on one NVIDIA GeForce GTX 1080Ti card
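As a reminder of what the headline metric measures, here is a minimal sketch of mean IoU computed from a confusion matrix. The function name `mean_iou` and the toy 3-class matrix are illustrative assumptions, not taken from the paper or its evaluation code.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[g][p] counts pixels with ground-truth class g predicted as class p.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                                # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp  # wrongly assigned to c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both ground truth and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 3-class example (made-up counts, for illustration only).
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 0, 10],
]
toy_score = mean_iou(toy_confusion)
```

Per-class IoU is intersection over union of the predicted and ground-truth pixel sets; the mean is taken over classes, so small classes weigh as much as large ones.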
Speed-accuracy trade-off comparison on the Cityscapes test set
Related Work
Core Concepts of BiSeNetV2
1. Detail Branch
Why Wide Channels (?) and Shallow Layers (?)
- The output feature map is 1/8 of the original input resolution
- Rich spatial details due to the high channel capacity.
Because of the high channel capacity and the large spatial dimension, a residual structure (He et al., 2016) would increase the memory access cost (Ma et al., 2018). Therefore, this branch mainly follows the philosophy of VGG nets (Simonyan and Zisserman, 2015) and simply stacks the layers.
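To see how a shallow, VGG-style stack reaches 1/8 resolution, the sketch below traces feature-map shapes through three stages that each open with a stride-2 3×3 convolution. The stage layout and channel widths in `DETAIL_STAGES` are assumptions chosen for illustration; only the 1/8 output scale comes from the notes above.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical layout: each stage starts with a stride-2 conv, then stride-1 convs.
DETAIL_STAGES = [
    {"channels": 64, "strides": [2, 1]},
    {"channels": 64, "strides": [2, 1, 1]},
    {"channels": 128, "strides": [2, 1, 1]},
]


def trace_detail_branch(height, width, stages=DETAIL_STAGES):
    """Return (channels, height, width) after each stage."""
    shapes = []
    for stage in stages:
        for stride in stage["strides"]:
            height = conv_out(height, stride=stride)
            width = conv_out(width, stride=stride)
        shapes.append((stage["channels"], height, width))
    return shapes


# A full-resolution Cityscapes image is 1024 pixels high and 2048 wide.
detail_shapes = trace_detail_branch(1024, 2048)
```

Three stride-2 steps halve each side three times, so the last stage ends at 128 × 256, which is 1/8 of the input on each side.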
2. Semantic Branch
Why Narrow Channels (?) and Deep Layers
- Semantic Branch can be any lightweight convolutional model
- For large receptive field and efficient computation simultaneously
- Inspired by the philosophy of the lightweight recognition model, e.g., Xception, MobileNet, ShuffleNet
Stem Block (Why?)
- (??) For efficient computation and effective feature expression ability
- Inspired by Inception V4
- Two different downsampling to shrink the feature representation
- The output features of the two downsampling paths are concatenated to form the output
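The two-path downsampling can be illustrated by tracing shapes only. In this sketch, one path uses a strided convolution and the other uses max pooling, and their outputs are concatenated along the channel axis. The kernel sizes, the shared stride-2 entry convolution, and `base_channels=16` are assumptions for illustration rather than values stated in these notes.

```python
def out_size(size, kernel, stride, padding):
    """Output length of a conv or pooling window along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_block_shapes(height, width, base_channels=16):
    """Trace (channels, height, width) through a two-path stem.

    base_channels is a hypothetical width chosen for illustration.
    """
    # Shared entry: 3x3 convolution with stride 2.
    h, w = out_size(height, 3, 2, 1), out_size(width, 3, 2, 1)

    # Path A: 1x1 convolution, then 3x3 convolution with stride 2.
    h_a, w_a = out_size(h, 1, 1, 0), out_size(w, 1, 1, 0)
    h_a, w_a = out_size(h_a, 3, 2, 1), out_size(w_a, 3, 2, 1)
    path_a = (base_channels, h_a, w_a)

    # Path B: 3x3 max pooling with stride 2 keeps the channel count.
    path_b = (base_channels, out_size(h, 3, 2, 1), out_size(w, 3, 2, 1))

    # Concatenation needs matching spatial sizes; the channel counts add up.
    assert path_a[1:] == path_b[1:]
    concat = (path_a[0] + path_b[0], path_a[1], path_a[2])
    return {"path_a": path_a, "path_b": path_b, "concat": concat}


stem = stem_block_shapes(1024, 2048)
```

Both paths shrink the map by the same factor, which is what makes channel-wise concatenation possible.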
Context Embedding Block
- For large receptive field to capture high-level semantics
- Inspired by ParseNet, PSPNet (Pyramid Scene Parsing Network), and DeepLab V3
- Global average pooling and residual connection to embed the global contextual information efficiently
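The core idea of the block, global average pooling plus a residual connection, can be sketched on plain nested lists. This is a deliberately reduced illustration: it leaves out any normalization and convolution layers the real block contains, and the function names are hypothetical.

```python
def global_average_pool(feature):
    """feature is a list of channels, each a list of rows; returns one mean per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]


def context_embedding(feature):
    """Add each channel's global mean back onto every spatial position."""
    context = global_average_pool(feature)
    # Residual connection: the pooled context is broadcast over height and width.
    return [
        [[value + ctx for value in row] for row in channel]
        for channel, ctx in zip(feature, context)
    ]


# Two channels of a 2x2 feature map (made-up values).
toy_feature = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
embedded = context_embedding(toy_feature)
```

Every position receives the same per-channel summary of the whole image, so the receptive field becomes global at the cost of one pooling pass.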
Gather(?)-and-Expansion(?) Layer (vs Inverted Bottleneck)
- Taking advantage of the benefit of depth-wise convolution
- 3 × 3 convolution to gather local feature response and expand to higher-dimensional space (?)
- 3 × 3 depth-wise convolution performed independently over each individual output channel of the expansion layer
- 1×1 convolution as the projection layer to project the output of depth-wise convolution into a low channel capacity space.
- (?) When the stride is 2, the layer uses two 3×3 depth-wise convolutions on the main path and a 3×3 separable convolution as the shortcut.
- Recent works (Tan et al., 2019; Howard et al., 2019) rely heavily on 5×5 separable convolutions to enlarge the receptive field; under some conditions a 5×5 separable convolution has fewer FLOPs than two 3×3 separable convolutions. In this layer, we replace the 5×5 depth-wise convolution inside the separable convolution with two 3×3 depth-wise convolutions, which have fewer FLOPs and the same receptive field.
- Compared with the inverted bottleneck in MobileNetV2, the GE Layer has one more 3×3 convolution. However, this layer is still friendly to computation cost and memory access cost (Ma et al., 2018; Sandler et al., 2018), because the 3×3 convolution is specially optimized in the cuDNN library (Chetlur et al., 2014; Ma et al., 2018). Thanks to this extra layer, the GE Layer also has higher feature expression ability than the inverted bottleneck.
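The FLOPs and receptive-field claims above can be checked with a little arithmetic. The sketch below counts multiply-accumulates (MACs) per layer, assuming stride 1, "same" padding, and equal input and output channel counts. The helper names and the feature-map size are illustrative assumptions.

```python
def depthwise_macs(kernel, channels, height, width):
    """MACs of one depth-wise convolution with a same-size output."""
    return kernel * kernel * channels * height * width


def separable_macs(kernel, channels, height, width):
    """Depth-wise convolution followed by a 1x1 point-wise convolution."""
    pointwise = channels * channels * height * width
    return depthwise_macs(kernel, channels, height, width) + pointwise


def stacked_receptive_field(kernels):
    """Receptive field of stride-1 convolutions applied in sequence."""
    field = 1
    for kernel in kernels:
        field += kernel - 1
    return field


C, H, W = 96, 64, 128  # illustrative feature-map size

# Depth-wise part only: one 5x5 versus two 3x3 (25 versus 18 MACs per element).
one_5x5_dw = depthwise_macs(5, C, H, W)
two_3x3_dw = 2 * depthwise_macs(3, C, H, W)

# Full separable convolutions: the second point-wise layer changes the picture.
one_5x5_sep = separable_macs(5, C, H, W)
two_3x3_sep = 2 * separable_macs(3, C, H, W)
```

Two stacked 3×3 depth-wise convolutions cost 18/25 of a single 5×5 one and cover the same 5×5 window. Under these assumptions a single 5×5 separable convolution costs 25C + C² MACs per position against 18C + 2C² for two 3×3 separable convolutions, so it is cheaper whenever C > 7. That is one way to read "in some conditions".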
Bilateral Guided Aggregation
- There are several ways to merge the two types of feature response, e.g., element-wise summation and concatenation. However, the outputs of the two branches have different levels of feature representation, so simply combining them ignores this difference.
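To make the contrast concrete, the sketch below fuses a per-pixel detail vector and a semantic vector in three ways: summation, concatenation, and a gated product in which one branch re-weights the other. The gated variant is only a simplified illustration of guidance under the assumption of sigmoid gating. It is not the full aggregation layer, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_sum(detail, semantic):
    """Element-wise summation: both inputs are treated as the same kind of feature."""
    return [d + s for d, s in zip(detail, semantic)]


def fuse_concat(detail, semantic):
    """Concatenation: keeps both inputs but doubles the channel count."""
    return list(detail) + list(semantic)


def fuse_guided(detail, semantic):
    """Simplified guidance: the semantic response becomes a (0, 1) gate on the detail feature."""
    return [d * sigmoid(s) for d, s in zip(detail, semantic)]


detail_vec = [2.0, -1.0]
semantic_vec = [0.0, 3.0]
summed = fuse_sum(detail_vec, semantic_vec)
stacked = fuse_concat(detail_vec, semantic_vec)
guided = fuse_guided(detail_vec, semantic_vec)
```

Summation and concatenation leave the two feature types to be reconciled by later layers, while a gate lets the high-level response decide how much of each low-level response passes through.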
Booster Training Strategy
- Auxiliary segmentation heads are inserted at different positions of the Semantic Branch.
- (Why also not in Detail Branch?)
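The train-only nature of the auxiliary heads can be sketched with plain callables. The function names, the `aux_weight` parameter, and the equal weighting of the auxiliary losses are assumptions made for illustration; the notes only state that the heads attach to the Semantic Branch and are discarded at inference.

```python
def forward_with_booster(features, main_head, aux_heads, training):
    """features holds per-stage outputs of the Semantic Branch, shallow to deep."""
    main_out = main_head(features[-1])
    if not training:
        # Inference: auxiliary heads are skipped, so they add no cost.
        return main_out, []
    aux_outs = [head(feat) for head, feat in zip(aux_heads, features)]
    return main_out, aux_outs


def booster_loss(main_loss, aux_losses, aux_weight=1.0):
    """Total training loss: main loss plus weighted auxiliary losses (weight is hypothetical)."""
    return main_loss + aux_weight * sum(aux_losses)


# Stand-ins for real feature maps and heads, just to exercise the control flow.
toy_features = [1, 2, 3]
toy_main_head = lambda f: f * 10
toy_aux_heads = [lambda f: f + 1, lambda f: f + 2]

train_out = forward_with_booster(toy_features, toy_main_head, toy_aux_heads, training=True)
infer_out = forward_with_booster(toy_features, toy_main_head, toy_aux_heads, training=False)
```

Because the auxiliary outputs only enter the loss, dropping them at inference leaves the main prediction unchanged.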
Experimental Results
Cityscapes
- Training, validation, and test sets with 2,975, 500, and 1,525 images, respectively
- 30 classes, 19 of which are used for the semantic segmentation task
- Challenging for real-time segmentation because of the high resolution of 2,048 × 1,024
Ablative Evaluation on Cityscapes
Comparison to BiSeNet V1
- Simplify the original structure to present an efficient and effective architecture for real-time semantic segmentation
- Remove the time-consuming cross-layer connections in the original version to obtain a clearer and simpler architecture.
- Re-design the overall architecture with more compact network structures and well-designed components.
- Deepen the Detail Path to encode more details.
- Design light-weight components based on the depth-wise convolutions for the Semantic Path.
- Propose an efficient aggregation layer to enhance the mutual connections between both paths.
- Comprehensive ablative experiments to elaborate on the effectiveness and efficiency of the proposed method.
- Significantly improve the accuracy and speed of the method for a 2048×1024 input.
- Achieve 72.6% mean IoU on the Cityscapes test set at 156 FPS on one NVIDIA GeForce GTX 1080Ti card.