MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer

Monocular 3D object detection is an important yet challenging task in autonomous driving. Some existing methods leverage depth information from an off-the-shelf depth estimator to assist 3D detection, but they suffer from an additional computational burden and achieve limited performance due to inaccurate depth priors. To alleviate this, we propose MonoDTR, a novel end-to-end depth-aware transformer network for monocular 3D object detection. It mainly consists of two components: (1) the Depth-Aware Feature Enhancement (DFE) module, which implicitly learns depth-aware features with auxiliary supervision without requiring extra computation, and (2) the Depth-Aware Transformer (DTR) module, which globally integrates context- and depth-aware features. Moreover, unlike conventional pixel-wise positional encodings, we introduce a novel depth positional encoding (DPE) to inject depth positional hints into the transformer. Our proposed depth-aware modules can be easily plugged into existing image-only monocular 3D object detectors to improve their performance. Extensive experiments on the KITTI dataset demonstrate that our approach outperforms previous state-of-the-art monocular methods and achieves real-time detection. Code is available at https://github.com/kuanchihhuang/MonoDTR
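
To make the depth positional encoding idea concrete, below is a minimal PyTorch sketch of how depth hints could replace pixel-wise positional encodings. It assumes depth is discretized into bins, with one learnable embedding per bin selected by a per-pixel categorical depth distribution; the class name `DepthPositionalEncoding`, the argument names, and the bin count are illustrative assumptions, not the authors' exact implementation (see the repository above for that).

```python
import torch
import torch.nn as nn

class DepthPositionalEncoding(nn.Module):
    """Hypothetical sketch of a depth positional encoding (DPE).

    Unlike a pixel-wise positional encoding indexed by (row, col),
    each feature location here receives an embedding chosen by its
    estimated depth bin, so the transformer sees depth hints.
    """

    def __init__(self, num_depth_bins: int = 96, embed_dim: int = 256):
        super().__init__()
        # One learnable embedding per discretized depth bin (assumption).
        self.depth_embed = nn.Embedding(num_depth_bins, embed_dim)

    def forward(self, features: torch.Tensor, depth_probs: torch.Tensor) -> torch.Tensor:
        # features:    (B, C, H, W) image features fed to the transformer
        # depth_probs: (B, D, H, W) per-pixel categorical depth distribution,
        #              e.g. from an auxiliary depth head trained with
        #              auxiliary supervision
        bin_idx = depth_probs.argmax(dim=1)      # (B, H, W) most likely depth bin
        pos = self.depth_embed(bin_idx)          # (B, H, W, C) per-bin embedding
        pos = pos.permute(0, 3, 1, 2)            # (B, C, H, W) to match features
        return features + pos                    # inject depth positional hints

# Usage with dummy shapes:
dpe = DepthPositionalEncoding(num_depth_bins=96, embed_dim=256)
feats = torch.randn(2, 256, 24, 80)
probs = torch.softmax(torch.randn(2, 96, 24, 80), dim=1)
out = dpe(feats, probs)                          # (2, 256, 24, 80)
```

Because the encoding is looked up from predicted depth rather than fixed pixel coordinates, locations at similar depths share similar positional signals, which is one plausible way such a module integrates with context features in the transformer.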