Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes

Semantic segmentation is a key technology for autonomous vehicles to understand the surrounding scenes. The appealing performance of contemporary models usually comes at the expense of heavy computation and lengthy inference time, which is intolerable for self-driving. Using lightweight architectures (encoder-decoder or two-pathway) or reasoning on low-resolution images, recent methods achieve very fast scene parsing, even running at more than 100 FPS on a single 1080Ti GPU. However, there is still a significant performance gap between these real-time methods and models based on dilation backbones. To tackle this problem, we propose a family of efficient backbones specially designed for real-time semantic segmentation. The proposed deep dual-resolution networks (DDRNets) are composed of two deep branches between which multiple bilateral fusions are performed. Additionally, we design a new contextual information extractor named the Deep Aggregation Pyramid Pooling Module (DAPPM), which enlarges effective receptive fields and fuses multi-scale context from low-resolution feature maps. Our method achieves a new state-of-the-art trade-off between accuracy and speed on both the Cityscapes and CamVid datasets. In particular, on a single 2080Ti GPU, DDRNet-23-slim yields 77.4% mIoU at 102 FPS on the Cityscapes test set and 74.7% mIoU at 230 FPS on the CamVid test set. With widely used test augmentation, our method is superior to most state-of-the-art models while requiring much less computation. Code and trained models are available online.
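To make the bilateral-fusion idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: each fusion lets the high-resolution branch receive upsampled low-resolution context while the low-resolution branch receives downsampled spatial detail. The channel count, spatial sizes, the 8x resolution ratio, and the use of average pooling / nearest-neighbour resampling are all illustrative assumptions.

```python
import numpy as np

def downsample(x, factor):
    # Average-pool a (C, H, W) feature map by `factor` in each spatial dim
    # (stand-in for the strided convolutions a real network would use).
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def upsample(x, factor):
    # Nearest-neighbour upsample a (C, H, W) feature map by `factor`
    # (stand-in for bilinear upsampling plus 1x1 convolution).
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def bilateral_fusion(high, low, factor):
    # Exchange information between the two branches in both directions:
    # high-res branch gains context, low-res branch gains spatial detail.
    new_high = high + upsample(low, factor)
    new_low = low + downsample(high, factor)
    return new_high, new_low

# Illustrative shapes: 32 channels, 64x64 high-res map, 8x8 low-res map.
high = np.random.randn(32, 64, 64)
low = np.random.randn(32, 8, 8)
new_high, new_low = bilateral_fusion(high, low, 8)
print(new_high.shape, new_low.shape)  # -> (32, 64, 64) (32, 8, 8)
```

Note that the fusion preserves each branch's resolution, which is what allows the two branches to stay deep while repeatedly exchanging information.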