DSGN++: Exploiting Visual-Spatial Relation for Stereo-based 3D Detectors

Camera-based 3D object detectors are attractive owing to their wider deployment and lower cost than LiDAR sensors. We first revisit the prior stereo detector DSGN, examining how it constructs stereo volumes to represent both 3D geometry and semantics. We then refine the stereo modeling and propose an advanced version, DSGN++, which aims to enhance effective information flow throughout the 2D-to-3D pipeline in three main aspects. First, to effectively lift 2D information into the stereo volume, we propose depth-wise plane sweeping (DPS), which allows denser connections and extracts depth-guided features. Second, to capture features at different spacings, we present a novel stereo volume, the Dual-view Stereo Volume (DSV), which integrates front-view and top-view features and reconstructs sub-voxel depth in the camera frustum. Third, as the foreground region becomes less dominant in 3D space, we propose a multi-modal data-editing strategy, Stereo-LiDAR Copy-Paste, which ensures cross-modal alignment and improves data efficiency. Without bells and whistles, extensive experiments under various modality setups on the popular KITTI benchmark show that our method consistently outperforms other camera-based 3D detectors across all categories. Code is available at https://github.com/chenyilun95/DSGN2.
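
For context, the sketch below illustrates the conventional plane-sweep (concatenation) stereo volume that DSGN-style detectors build before 3D aggregation; it is a minimal illustrative baseline, not the released DSGN++ code, and the function name `build_plane_sweep_volume` and its shapes are assumptions. The proposed depth-wise plane sweeping modifies this lifting step to form denser depth-wise connections and depth-guided features.

```python
import torch

def build_plane_sweep_volume(left_feat: torch.Tensor,
                             right_feat: torch.Tensor,
                             max_disp: int) -> torch.Tensor:
    """Conventional concatenation-based plane-sweep stereo volume.

    left_feat, right_feat: (B, C, H, W) feature maps from a shared 2D backbone.
    max_disp: number of disparity hypotheses at feature resolution.
    Returns a volume of shape (B, 2C, max_disp, H, W).
    """
    B, C, H, W = left_feat.shape
    volume = left_feat.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = left_feat
            volume[:, C:, d] = right_feat
        else:
            # Shift right-view features by disparity hypothesis d and pair them
            # with the (unshifted) left-view features at the overlapping columns.
            volume[:, :C, d, :, d:] = left_feat[:, :, :, d:]
            volume[:, C:, d, :, d:] = right_feat[:, :, :, :-d]
    return volume
```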