RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network

Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi-modal sensors to collect comprehensive environmental data. Among these, the radar-camera multi-modal perception system is especially favored for its excellent sensing capabilities and cost-effectiveness. However, the substantial modality differences between radar and camera sensors pose challenges in fusing their information. To address this problem, this paper presents RCBEVDet, a radar-camera fusion 3D object detection framework. Specifically, RCBEVDet is built on an existing camera-based 3D object detector and is supplemented by a specially designed radar feature extractor, RadarBEVNet, and a Cross-Attention Multi-layer Fusion (CAMF) module. First, RadarBEVNet encodes sparse radar points into a dense bird's-eye-view (BEV) feature using a dual-stream radar backbone and a Radar Cross Section (RCS) aware BEV encoder. Second, the CAMF module uses a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet's capabilities, we introduce RCBEVDet++, which advances the CAMF module through sparse fusion, supports query-based multi-view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on the nuScenes dataset show that our method integrates seamlessly with existing camera-based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state-of-the-art radar-camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi-object tracking. Notably, with ViT-L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test-time augmentation or model ensembling.
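
To make the channel-and-spatial fusion step of CAMF concrete, the following PyTorch sketch fuses radar and camera BEV feature maps that are assumed to have already been aligned (the deformable-attention alignment is omitted here). This is an illustrative assumption, not the authors' implementation: the module name ChannelSpatialFusion, the channel width, and the BEV grid size are hypothetical choices.

    # Minimal sketch (assumed design, not the released RCBEVDet code):
    # fuse two aligned BEV feature maps with channel and spatial gating.
    import torch
    import torch.nn as nn

    class ChannelSpatialFusion(nn.Module):
        def __init__(self, channels: int = 256):
            super().__init__()
            # Channel fusion: re-weight each channel of the concatenated features.
            self.channel_gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
                nn.Sigmoid(),
            )
            # Spatial fusion: re-weight each BEV location.
            self.spatial_gate = nn.Sequential(
                nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3),
                nn.Sigmoid(),
            )
            # Project back to the camera-branch channel width.
            self.out_proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, radar_bev: torch.Tensor, camera_bev: torch.Tensor) -> torch.Tensor:
            x = torch.cat([radar_bev, camera_bev], dim=1)  # (B, 2C, H, W)
            x = x * self.channel_gate(x)                   # channel re-weighting
            x = x * self.spatial_gate(x)                   # spatial re-weighting
            return self.out_proj(x)                        # fused BEV feature (B, C, H, W)

    # Usage: radar and camera BEV features on the same (hypothetical) 128x128 grid.
    fusion = ChannelSpatialFusion(channels=256)
    fused = fusion(torch.randn(1, 256, 128, 128), torch.randn(1, 256, 128, 128))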