SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection

Wolters, Philipp; Gilg, Johannes; Teepe, Torben; Herzog, Fabian; Fent, Felix; Rigoll, Gerhard
Abstract

In this work, we present SpaRC, a novel Sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird's Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance metrics of 67.1 NDS and 63.1 AMOTA. The code and pretrained models are available at https://github.com/phi-wol/sparc.
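The abstract lists local self-attention (LSA) for focused query aggregation as a contribution but does not describe the mechanism here. Below is a minimal, hypothetical PyTorch sketch of one common way to realize such locality, namely a self-attention layer whose attention mask restricts each object query to neighbors within a fixed metric radius. This is not the SpaRC implementation; the class name LocalSelfAttention, the radius parameter, and the reference-point inputs are illustrative assumptions. See the repository at https://github.com/phi-wol/sparc for the authors' actual code.

# Hypothetical sketch of local self-attention over 3D object queries.
# Not the SpaRC implementation; it only illustrates restricting attention
# to spatially nearby queries via a distance-based attention mask.
import torch
import torch.nn as nn

class LocalSelfAttention(nn.Module):
    """Self-attention where each query attends only to queries whose
    3D reference points lie within a fixed metric radius (assumption)."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8, radius: float = 2.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.radius = radius

    def forward(self, queries: torch.Tensor, query_pos: torch.Tensor) -> torch.Tensor:
        # queries:   (B, N, C) object query features
        # query_pos: (B, N, 3) 3D reference points of the queries
        dist = torch.cdist(query_pos, query_pos)   # (B, N, N) pairwise distances
        attn_mask = dist > self.radius             # True = masked out (do not attend)
        # MultiheadAttention expects a per-head mask of shape (B * num_heads, N, N).
        attn_mask = attn_mask.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(queries, queries, queries, attn_mask=attn_mask)
        return out

if __name__ == "__main__":
    lsa = LocalSelfAttention()
    feats = torch.randn(2, 100, 256)     # 100 object queries per sample
    pos = torch.rand(2, 100, 3) * 50.0   # reference points spread over ~50 m
    print(lsa(feats, pos).shape)         # torch.Size([2, 100, 256])

Because each query's self-distance is zero, every query always attends at least to itself, so the mask never produces an empty attention row under this assumed formulation.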
