0. Abstract

Encoder-decoder networks such as FCN and U-Net share two limitations.

  • The optimal depth of the model for a given dataset cannot be known in advance, so finding it requires an expensive and inefficient search, or an ensemble of models of various depths.
  • The skip connections have a restrictive structure in which only encoder and decoder nodes at the same depth are connected.

To overcome these two limitations, UNet++ proposes a new type of architecture.

[Figure: overview of the UNet++ architecture]

  • U-Nets of various depths that share a single encoder are trained jointly and ensembled through deep supervision.
  • The skip connections are redesigned so that all feature maps at the same depth are combined, producing flexible feature fusion.
  • A pruning scheme is proposed to speed up inference.

UNet++, built as described above, was applied to six different image datasets with the following results.

  • UNet++ shows consistently high performance on all six datasets.
  • UNet++ produces high-quality segmentation for objects of various sizes.
  • Mask RCNN++, which applies the new skip connections to Mask R-CNN, shows high performance on instance segmentation.
  • With pruning applied, UNet++ achieves fast inference while maintaining high performance.

1. Introduction

As mentioned in the Abstract, existing encoder-decoder models have two limitations.

  • First, the optimal depth differs from dataset to dataset. Previous approaches either search for it or train models of several depths and combine them afterwards. This is inefficient because the encoders are trained separately rather than shared, and models trained independently like this also lose the benefit of multi-task learning.
  • Second, the design of the skip connections is unnecessarily restrictive: it forces fusion only between encoder and decoder feature maps of the same scale.

[Figure: overview of the UNet++ architecture]

To overcome these two limitations, UNet++ adopts the densely connected form shown in the figure above. This configuration has several advantages.

  • UNet++ trains U-Nets of various depths that share a single encoder. With deep supervision, the image representation is learned jointly, and there is no need to train models of each depth and then select one. Besides improving performance, this also enables faster inference through pruning.
  • UNet++ drops the restrictive rule that only same-size encoder and decoder feature maps are fused. By connecting nodes densely, features from various feature maps are aggregated into each decoder feature map.

As a result, high performance was achieved on six datasets, and the main contributions of the model are summarized as follows.

  1. UNet++ embeds an internal ensemble of U-Nets of different depths, improving performance on objects of different sizes.
  2. The skip connections are redesigned so that the decoders can fuse features flexibly. This yields a significant improvement over U-Net, which only combines encoder and decoder maps of the same depth.
  3. A pruning method is proposed that improves inference speed while maintaining performance.
  4. Training the U-Nets of various depths embedded in UNet++ jointly leads to collaborative learning between them, outperforming the same U-Nets trained individually.
  5. UNet++ can be trained with various encoder backbones, demonstrating scalability and practicality through high performance on a range of medical image datasets.

2. Proposed Network Architecture: UNet++

2.A Motivation behind the new architecture

To determine how model performance depends on the depth of a U-Net, and what happens when U-Nets of different depths are combined, experiments on three datasets yield two insights.

[Figure: performance of U-Nets of depths L1-L4 on the three datasets]

  • Making U-Net deeper does not necessarily improve performance.
  • The optimal depth of the model is different for different datasets. (EM: L4, Cell: L3, Brain: L3)

To work around this, the usual approach is to train independent models and combine their results afterwards. UNet++ instead proposes training and combining the various depths within a single network. Moreover, unlike previous papers, deep supervision is placed at the nodes X^{0,j} rather than X^{4-j,j}, which presents the embedded U-Net structures as an ensemble.

[Figure: deep supervision placed at X^{0,j} versus X^{4-j,j}]

In this way, the U-Net painted in yellow keeps expanding to form an ensemble, as shown in the animated image below. In other words, UNet++ is a combination of four U-Nets.

[Animation: the embedded U-Nets expanding one depth at a time]

U-Net^e offers the advantage of sharing knowledge through a partially shared encoder, just like UNet++. However, this network has the following two disadvantages.

[Figure: the U-Net^e ensemble architecture]

  • The decoders X^{4-j,j} are disconnected, so the deeper U-Nets offer no supervision signal to the decoders of the shallower U-Nets in the ensemble.
  • The U-Net^e decoders unnecessarily fuse only feature maps of the same size, which makes them inflexible with respect to object size.

To overcome these limitations, an intermediate design called UNet+ was tried, which removes the original skip connections and instead connects every pair of neighboring nodes.

[Figure: overview of the UNet++ architecture]

UNet+ alleviates the restrictive skip connections by fusing feature maps between neighboring nodes, but there is still room for improvement. UNet++ therefore goes further and uses dense connections between all nodes at the same resolution.

2.B Technical details

$$x^{i,j} = \begin{cases} H\left(D\left(x^{i-1,j}\right)\right), & j = 0 \\ H\left(\left[\left[x^{i,k}\right]_{k=0}^{j-1},\; U\left(x^{i+1,j-1}\right)\right]\right), & j > 0 \end{cases}$$

where i indexes the down-sampling level and j indexes the convolution block along the skip pathway.

  • H : Convolution Operation
  • D : Down Sampling
  • U : Up Sampling
  • [] : Concatenation

[Figure: example computations along the skip pathways]

For example, x^{0,1} is the convolution of x^{0,0} concatenated with the upsampled x^{1,0}: x^{0,1} = H([x^{0,0}, U(x^{1,0})]). Moving toward the decoder, the number of inputs to each node grows, and the last decoder node x^{0,4} receives all preceding feature maps at that resolution. For j = 0, each encoder node is simply the convolution of the downsampled node above it, e.g. x^{1,0} = H(D(x^{0,0})).
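To make the recursion concrete, here is a minimal PyTorch sketch of the first few nodes. The convolution block, channel widths, and up/downsampling choices are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """H(.): two 3x3 conv + BN + ReLU layers (a common choice; the
    paper's exact block may differ)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

down = nn.MaxPool2d(2)                                  # D(.)
up = nn.Upsample(scale_factor=2, mode="bilinear",       # U(.)
                 align_corners=True)

ch = [32, 64]                            # assumed channel widths
h00 = ConvBlock(1, ch[0])                # x^{0,0}
h10 = ConvBlock(ch[0], ch[1])            # x^{1,0} = H(D(x^{0,0}))
h01 = ConvBlock(ch[0] + ch[1], ch[0])    # x^{0,1} = H([x^{0,0}, U(x^{1,0})])

x = torch.randn(1, 1, 96, 96)
x00 = h00(x)
x10 = h10(down(x00))
x01 = h01(torch.cat([x00, up(x10)], dim=1))   # [.] = channel concatenation
print(x01.shape)  # torch.Size([1, 32, 96, 96])
```

Deeper nodes such as x^{0,2} follow the same pattern, concatenating every earlier x^{0,k} with the upsampled x^{1,1}.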

$$\mathcal{L}(Y, P) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C}\left(y_{n,c}\,\log p_{n,c} + \frac{2\,y_{n,c}\,p_{n,c}}{y_{n,c}^{2} + p_{n,c}^{2}}\right)$$

For deep supervision, each of the nodes X^{0,1}, X^{0,2}, X^{0,3}, X^{0,4} is given an output head consisting of a 1x1 convolution followed by a sigmoid. Training uses a hybrid of pixel-wise cross-entropy and soft dice-coefficient loss, given by the equation above: the cross-entropy term is on the left and the dice term on the right. The hybrid loss provides smooth gradients and handles class imbalance. (A code sketch of this loss follows the variable list below.)

  • N : number of pixels in one batch
  • n : pixel index within the batch
  • C : number of classes
  • y_{n,c} : target label
  • p_{n,c} : predicted probability
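A minimal sketch of this hybrid loss for the binary case, assuming sigmoid probabilities as input; the smoothing constant and the equal weighting of the two terms are my assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, eps=1e-6):
    """Pixel-wise binary cross-entropy + soft dice loss.

    pred   : sigmoid probabilities, shape (B, 1, H, W)
    target : binary ground-truth labels (float), same shape
    eps    : smoothing term to avoid division by zero (implementation detail)
    """
    bce = F.binary_cross_entropy(pred, target)
    # Soft dice computed over the whole batch
    intersection = (pred * target).sum()
    dice = (2 * intersection + eps) / (
        (pred ** 2).sum() + (target ** 2).sum() + eps)
    return bce + (1 - dice)   # minimizing (1 - dice) maximizes the dice term
```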

Deep supervision also makes pruning possible, giving two inference modes: an accurate mode that averages the outputs of all embedded U-Nets, and a fast mode that truncates the network at a chosen depth and uses only that output. A sketch of both modes follows.
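The two modes could look like the sketch below. It assumes a model whose forward pass returns the four side outputs from X^{0,1} to X^{0,4} as a list; that interface is an assumption for illustration, not the paper's code.

```python
import torch

@torch.no_grad()
def predict(model, x, mode="ensemble", depth=4):
    """Deep-supervision inference for a UNet++-style model.

    Assumes model(x) returns a list of four sigmoid maps, one per side
    head X^{0,1}..X^{0,4} (an assumed interface).
    """
    outputs = model(x)
    if mode == "ensemble":
        # Accurate mode: average the outputs of all embedded U-Nets
        return torch.stack(outputs).mean(dim=0)
    # Fast (pruned) mode: use only the head at `depth`; in a truly pruned
    # deployment the nodes deeper than `depth` would not be computed at all
    return outputs[depth - 1]
```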

3. Experiments

3.A Datasets

  • Electron Microscopy (EM)
    • 30 images (512 x 512)
    • 2 classes
    • 96 x 96 patches extracted with a sliding window overlapping half a patch; predictions on overlapping regions are aggregated (see the sketch after this list)
  • Cell
    • training 212 / validation 70 / test 72 images
    • 2 classes
  • Nuclei
    • training 335 / validation 134 / test 201 images
    • 96 x 96 patches with a 32-pixel sliding-window stride
  • Brain Tumor
    • 256 x 256 images from 30 patients
  • Liver
    • training 100 / validation 15 / test 15 patients
  • Lung Nodule
    • training 510 / validation 100 / test 408 images
    • 64 x 64 x 64 crops
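Since several datasets are preprocessed with overlapping sliding-window patches whose predictions are aggregated, here is a sketch of that scheme for a 96 x 96 patch with half-patch overlap. Averaging the overlapping predictions is a common aggregation choice and an assumption here; the paper may aggregate differently.

```python
import numpy as np

def _starts(size, patch, stride):
    """Patch start offsets covering [0, size), clamping the last patch
    to the image border."""
    s = list(range(0, size - patch + 1, stride))
    if s[-1] + patch < size:
        s.append(size - patch)
    return s

def sliding_window_predict(image, predict_fn, patch=96, stride=48):
    """Tile `image` (H, W) into overlapping patches, run `predict_fn` on
    each tile, and average the predictions where patches overlap."""
    H, W = image.shape
    out = np.zeros((H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)
    for y in _starts(H, patch, stride):
        for x in _starts(W, patch, stride):
            out[y:y + patch, x:x + patch] += predict_fn(
                image[y:y + patch, x:x + patch])
            count[y:y + patch, x:x + patch] += 1
    return out / count
```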

3.B Baseline and implementation

  • Early stopping
  • Metrics: pixel-wise sensitivity, specificity, F1, and F2 scores
  • Hardware: NVIDIA TITAN X

[Figure: baseline models and implementation details]

4. Results

4.A Semantic Segmentation Results

[Table: semantic segmentation results (IoU)]

Looking at IoU, the trend is U-Net < wide U-Net < UNet+ < UNet++. Comparing UNet++ against U-Net on the six datasets, the IoU gains were: neuronal structure (0.62±0.10, 0.55±0.01), cell (2.30±0.30, 2.12±0.09), nuclei (1.87±0.06, 1.71±0.06), brain tumor (2.00±0.87, 1.86±0.81), liver (2.62±0.09, 2.26±0.02), and lung nodule (5.06±1.42, 3.12±0.88). The margin of improvement is noticeably larger when deep supervision is used. The gains on EM and lung nodule are especially large because these datasets contain objects of widely varying sizes.

[Figure]

In addition, the results of swapping the UNet++ encoder backbone for VGG-19, ResNet-152, and DenseNet-201 are as follows.

[Table: IoU with VGG-19, ResNet-152, and DenseNet-201 encoders]

The results consistently show U-Net < UNet+ < UNet++ across all encoders. The statistical significance of the differences is verified with a t-test over the results of 20 repeated experiments; a minimal example of such a test is sketched below.
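Such a comparison could be reproduced with an independent two-sample t-test over the 20 per-run IoU scores of each model. The scores below are placeholders, not values from the paper.

```python
import numpy as np
from scipy import stats

# Placeholder IoU scores from 20 repeated runs of each model
rng = np.random.default_rng(0)
unet_scores = rng.normal(0.75, 0.01, size=20)
unetpp_scores = rng.normal(0.77, 0.01, size=20)

t, p = stats.ttest_ind(unetpp_scores, unet_scores)
print(f"t = {t:.2f}, p = {p:.2e}")  # a small p-value means the difference
                                    # is unlikely to be due to chance
```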

4.B Instance Segmentation Results

In the FPN of Mask R-CNN, shown on the left of the figure below, the output of a convolution is combined with the 2x upsampled map from the level above. Mask RCNN++ makes these connections dense. (This part may only exist in the code; since the paper provides no figure, the right-hand side of the figure is my personal guess at the configuration, with a reference sketch after the figure.)

[Figure: FPN of Mask R-CNN (left) and the guessed dense variant in Mask RCNN++ (right)]
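For reference, a single merge step of the standard FPN looks like the sketch below. How exactly Mask RCNN++ densifies these connections is not shown in the paper; presumably intermediate nodes are inserted between pyramid levels in the same spirit as the UNet++ skip pathways, but that remains a guess.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """One standard FPN top-down step: a 1x1 lateral convolution of the
    backbone feature is added to the 2x-upsampled coarser pyramid map."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.lateral = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, c, p_coarser):
        # c: backbone feature at this scale; p_coarser: map one level up
        return self.lateral(c) + F.interpolate(
            p_coarser, scale_factor=2, mode="nearest")
```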

As the table below shows, modifying only the FPN of Mask R-CNN raised the IoU from 93.28 to 95.10.

[Table: instance segmentation results of Mask R-CNN and Mask RCNN++]

5. Discussion

[Figure: performance by object size]

  • UNet++ performs better than U-Net across a range of object sizes, but for very large objects the two are comparable.

[Figure: qualitative results by decoder depth, with and without deep supervision]

  • With deep supervision, and as the decoder grows deeper, the segmentation result becomes sharper.

7. Conclusion

7.1 Advantages

[Figure]

  • Dense connectivity is a concept that already appeared in DenseNet, but it is grafted onto U-Net and put to use very effectively here.
  • Introducing wide U-Net as a baseline convincingly shows that UNet++'s gains are not simply due to having more parameters. This closes a potential gap in the argument.

7.2 Disadvantages

  • On first reading, I took "overcoming the limitation of same-scale feature maps" to mean that feature maps of different sizes are fused so that both small and large objects are detected. Reading further, it actually means something narrower: instead of connecting only the encoder node to the decoder node at the same layer, all nodes at the same depth are connected, with information from deeper nodes brought in through upsampling. The later paper UNet 3+ in fact tries to overcome this remaining limitation by feeding each decoder both smaller and larger feature maps.

[Figure: UNet 3+ architecture]

  • Looking at the number of parameters in TABLE IV, the increase from 7.8M to 9.0M is not large, but from a memory point of view the dense connections force many intermediate feature maps to be kept around, which is a real cost.

7.3 Appendix

  • https://jinglescode.github.io/2019/12/02/biomedical-image-segmentation-u-net-nested/

7.4 Unintelligible sentences

  • First, the decoders are disconnected: deeper U-Nets do not offer a supervision signal to the decoders of the shallower U-Nets in the ensemble.
  • there is no guarantee that the same-scale feature maps are the best match for the feature fusion. (The term "same-scale" keeps appearing, but I am not sure whether it means feature maps of the same spatial size or the encoder and decoder nodes at the same layer.)