CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers

Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting complementary features from a supplementary modality (X-modality). However, covering a wide variety of sensors with a modality-agnostic model remains an unresolved problem due to variations in sensor characteristics among different modalities. Unlike previous modality-specific methods, in this work, we propose a unified fusion framework, CMX, for RGB-X semantic segmentation. To generalize well across different modalities, which often carry complementary information as well as uncertainties, a unified cross-modal interaction is crucial for modality fusion. Specifically, we design a Cross-Modal Feature Rectification Module (CM-FRM) to calibrate bi-modal features by leveraging the features from one modality to rectify those of the other. With rectified feature pairs, we deploy a Feature Fusion Module (FFM) to perform a sufficient exchange of long-range contexts before mixing. To verify CMX, for the first time, we unify five modalities complementary to RGB, i.e., depth, thermal, polarization, event, and LiDAR. Extensive experiments show that CMX generalizes well to diverse multi-modal fusion, achieving state-of-the-art performance on five RGB-Depth benchmarks, as well as on RGB-Thermal, RGB-Polarization, and RGB-LiDAR datasets. Furthermore, to investigate generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets the new state of the art. The source code of CMX is publicly available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation.
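To make the two-stage fusion described above concrete, the following is a minimal PyTorch sketch of the idea, not the paper's actual implementation: a simplified channel-wise rectification standing in for CM-FRM, and a cross-attention exchange followed by mixing standing in for FFM. Only the module names come from the abstract; all internals (gating, attention layout, tensor shapes) are illustrative assumptions.

```python
# Illustrative sketch only; the real CMX modules (see the linked repository)
# use more elaborate channel- and spatial-wise rectification and mixing.
import torch
import torch.nn as nn


class CMFRM(nn.Module):
    """Simplified Cross-Modal Feature Rectification: each modality's features
    are rectified with channel weights predicted from the other modality."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate_rgb = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.gate_x = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, f_rgb, f_x):
        # f_rgb, f_x: (B, C, H, W)
        ctx_rgb = f_rgb.mean(dim=(2, 3))                  # global RGB context
        ctx_x = f_x.mean(dim=(2, 3))                      # global X-modality context
        w_rgb = self.gate_rgb(ctx_x)[:, :, None, None]    # X rectifies RGB
        w_x = self.gate_x(ctx_rgb)[:, :, None, None]      # RGB rectifies X
        return f_rgb + f_rgb * w_rgb, f_x + f_x * w_x


class FFM(nn.Module):
    """Simplified Feature Fusion: cross-attention exchange of long-range
    context between the two streams, followed by channel mixing."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn_rgb = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.attn_x = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.mix = nn.Linear(2 * channels, channels)

    def forward(self, f_rgb, f_x):
        b, c, h, w = f_rgb.shape
        t_rgb = f_rgb.flatten(2).transpose(1, 2)          # (B, HW, C)
        t_x = f_x.flatten(2).transpose(1, 2)
        # Each stream queries the other to exchange long-range context.
        e_rgb, _ = self.attn_rgb(t_rgb, t_x, t_x)
        e_x, _ = self.attn_x(t_x, t_rgb, t_rgb)
        fused = self.mix(torch.cat([t_rgb + e_rgb, t_x + e_x], dim=-1))
        return fused.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    rgb = torch.randn(2, 64, 32, 32)   # RGB features from one encoder stage
    x = torch.randn(2, 64, 32, 32)     # X-modality features (e.g., depth)
    rgb_r, x_r = CMFRM(64)(rgb, x)     # rectify each modality with the other
    fused = FFM(64)(rgb_r, x_r)        # exchange long-range contexts, then mix
    print(fused.shape)                 # torch.Size([2, 64, 32, 32])
```

In CMX this rectify-then-fuse pattern is applied at each encoder stage, and the fused features are passed to a segmentation decoder; the sketch shows a single stage only.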