MTANet: Multitask-Aware Network With Hierarchical Multimodal Fusion for RGB-T Urban Scene Understanding
Understanding urban scenes is a fundamental requirement for assisted driving and autonomous vehicles. Most of the available urban scene understanding methods use red-green-blue (RGB) images; however, their segmentation performance is prone to degradation under adverse lighting conditions. Recently, many effective artificial neural networks have been presented for urban scene understanding, showing that incorporating RGB and thermal (RGB-T) images can improve segmentation accuracy even under unsatisfactory lighting conditions. However, the potential of multimodal feature fusion has not been fully exploited, because operations such as simply concatenating the RGB and thermal features or averaging their maps are typically adopted. To improve the fusion of multimodal features and the segmentation accuracy, we propose a multitask-aware network (MTANet) with hierarchical multimodal fusion (a multiscale fusion strategy) for RGB-T urban scene understanding. We developed a hierarchical multimodal fusion module to enhance feature fusion and built a high-level semantic module to extract semantic information for merging with coarse features at various abstraction levels. Using the multilevel fusion module, we exploited low-, mid-, and high-level fusion to improve segmentation accuracy. The multitask module uses boundary, binary, and semantic supervision to optimize the MTANet parameters. Extensive experiments were performed on two benchmark RGB-T datasets to verify the improved performance of the proposed MTANet compared with state-of-the-art methods.
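To make the two ideas summarized above more concrete, the following PyTorch sketch illustrates (i) an attention-gated fusion of RGB and thermal features at one abstraction level, as an alternative to plain concatenation or averaging, and (ii) a combined objective with boundary, binary, and semantic supervision. The module name FusionBlock, the channel-attention gating, and the loss weights are illustrative assumptions for exposition only, not the exact MTANet design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionBlock(nn.Module):
    """Illustrative fusion of RGB and thermal features at one level.

    A channel-attention gate re-weights the concatenated modalities before
    merging, instead of simple concatenation or averaging. This is a sketch,
    not the paper's exact fusion module.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, thermal], dim=1)   # (B, 2C, H, W)
        x = x * self.gate(x)                   # modality-aware channel re-weighting
        return self.merge(x)                   # fused features (B, C, H, W)


def multitask_loss(sem_logits, bin_logits, bnd_logits,
                   sem_gt, bin_gt, bnd_gt,
                   weights=(1.0, 1.0, 1.0)):
    """Combine semantic, binary, and boundary supervision into one objective.

    The individual loss terms and equal weights are assumptions; the paper's
    formulation may use different losses or weightings.
    """
    l_sem = F.cross_entropy(sem_logits, sem_gt)                       # per-pixel classes
    l_bin = F.binary_cross_entropy_with_logits(bin_logits, bin_gt)    # foreground mask
    l_bnd = F.binary_cross_entropy_with_logits(bnd_logits, bnd_gt)    # object boundaries
    return weights[0] * l_sem + weights[1] * l_bin + weights[2] * l_bnd
```

In this sketch, one FusionBlock would be applied at each of the low-, mid-, and high-level stages of the two-stream encoder, and the three supervision signals are summed into a single training loss.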