Unified Object Detector for Different Modalities based on Vision Transformers

Traditional systems typically require different models for processing different modalities, such as one model for RGB images and another for depth images. Recent research has demonstrated that a single model trained on one modality can be adapted to another using cross-modality transfer learning. In this paper, we extend this approach by combining cross/inter-modality transfer learning with a vision transformer to develop a unified detector that achieves superior performance across diverse modalities. Our research envisions an application scenario for robotics, where the unified system seamlessly switches between RGB cameras and depth sensors under varying lighting conditions. Importantly, the system requires no model architecture or weight updates to enable this smooth transition. Specifically, the system uses the depth sensor in low-light conditions (night time) and either both the RGB camera and depth sensor or the RGB camera alone in well-lit environments. We evaluate our unified model on the SUN RGB-D dataset and demonstrate that it achieves similar or better performance in terms of mAP50 compared to state-of-the-art methods on the SUNRGBD16 category, and comparable performance in point-cloud-only mode. We also introduce a novel inter-modality mixing method that enables our model to achieve significantly better results than previous methods. We provide our code, including training/inference logs and model checkpoints, to facilitate reproducibility and further research: \url{https://github.com/liketheflower/UODDM}
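For illustration only, the modality switching described above could be wrapped in logic as simple as the following sketch; the function and argument names are hypothetical and not taken from the released code, and the unified detector itself is assumed to accept any of the three input combinations with the same weights.

```python
# Hypothetical sketch of the inference-time modality selection described in the abstract:
# the same unified detector is used in every mode, with no architecture or weight changes.
def select_modalities(is_low_light, rgb_frame, depth_frame):
    """Choose which sensor inputs to feed the unified detector based on lighting."""
    if is_low_light:
        # Night time: rely on the depth sensor only.
        return {"depth": depth_frame}
    if depth_frame is not None:
        # Well-lit and depth available: use both RGB and depth.
        return {"rgb": rgb_frame, "depth": depth_frame}
    # Well-lit, RGB camera only.
    return {"rgb": rgb_frame}

# Usage (hypothetical): detections = unified_detector(**select_modalities(dark, rgb, depth))
```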