CRN: Camera Radar Net for Accurate, Robust, Efficient 3D Perception

Autonomous driving requires an accurate and fast 3D perception system that includes 3D object detection, tracking, and segmentation. Although recent low-cost camera-based approaches have shown promising results, they are susceptible to poor illumination or bad weather conditions and suffer from large localization error. Hence, fusing cameras with low-cost radar, which provides precise long-range measurement and operates reliably in all environments, is promising but has not yet been thoroughly investigated. In this paper, we propose Camera Radar Net (CRN), a novel camera-radar fusion framework that generates a semantically rich and spatially accurate bird's-eye-view (BEV) feature map for various tasks. To overcome the lack of spatial information in an image, we transform perspective-view image features to BEV with the help of sparse but accurate radar points. We further aggregate image and radar feature maps in BEV using multi-modal deformable attention designed to tackle the spatial misalignment between inputs. CRN with a real-time setting operates at 20 FPS while achieving performance comparable to LiDAR detectors on nuScenes, and even outperforms them at far distances in the 100 m setting. Moreover, CRN with an offline setting yields 62.4% NDS and 57.5% mAP on the nuScenes test set and ranks first among all camera and camera-radar 3D object detectors.
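The core idea of using sparse but accurate radar points to lift perspective-view image features into BEV can be illustrated with a minimal sketch. The function below is a simplified, hypothetical stand-in for the paper's view transformation (the actual CRN pipeline is more involved): for each pixel that has a radar return, it back-projects the pixel into camera coordinates using the radar depth and the camera intrinsics, then scatters that pixel's feature vector into the corresponding BEV grid cell. All names, shapes, and parameters here are illustrative assumptions, not the authors' API.

```python
import numpy as np

def lift_image_to_bev(img_feats, radar_depth, K, bev_cells=64, bev_range=51.2):
    """Scatter image features into a BEV grid at locations given by sparse
    radar depth (illustrative sketch, not the paper's implementation).

    img_feats:   (C, H, W) per-pixel image features.
    radar_depth: (H, W) metric depth from projected radar points; 0 = no return.
    K:           3x3 camera intrinsic matrix.
    bev_cells:   number of cells per BEV axis.
    bev_range:   metric extent of the BEV grid (forward and lateral).
    """
    C, H, W = img_feats.shape
    bev = np.zeros((C, bev_cells, bev_cells), dtype=img_feats.dtype)
    vs, us = np.nonzero(radar_depth > 0)  # pixels with a radar return
    for u, v in zip(us, vs):
        d = radar_depth[v, u]
        # Back-project pixel (u, v) at radar depth d into camera coordinates.
        x = (u - K[0, 2]) * d / K[0, 0]   # lateral offset (meters)
        z = d                             # forward distance (meters)
        # Map metric (x, z) to a BEV cell; lateral axis is centered on the ego.
        col = int((x + bev_range / 2) / bev_range * bev_cells)
        row = int(z / bev_range * bev_cells)
        if 0 <= row < bev_cells and 0 <= col < bev_cells:
            bev[:, row, col] += img_feats[:, v, u]
    return bev
```

Because radar points are sparse, the resulting BEV map is mostly empty; the paper's multi-modal deformable attention then aggregates image and radar BEV features while compensating for the spatial misalignment between the two modalities.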