
Abstract

Monocular 3D object detection poses a significant challenge in 3D scene understanding due to its inherently ill-posed nature in monocular depth estimation. Existing methods heavily rely on supervised learning using abundant 3D labels, typically obtained through expensive and labor-intensive annotation on LiDAR point clouds. To tackle this problem, we propose a novel weakly supervised 3D object detection framework named VSRD (Volumetric Silhouette Rendering for Detection) to train 3D object detectors without any 3D supervision but only weak 2D supervision. VSRD consists of multi-view 3D auto-labeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage. In the auto-labeling stage, we represent the surface of each instance as a signed distance field (SDF) and render its silhouette as an instance mask through our proposed instance-aware volumetric silhouette rendering. To directly optimize the 3D bounding boxes through rendering, we decompose the SDF of each instance into the SDF of a cuboid and the residual distance field (RDF) that represents the residual from the cuboid. This mechanism enables us to optimize the 3D bounding boxes in an end-to-end manner by comparing the rendered instance masks with the ground truth instance masks. The optimized 3D bounding boxes serve as effective training data for 3D object detection. We conduct extensive experiments on the KITTI-360 dataset, demonstrating that our method outperforms the existing weakly supervised 3D object detection methods. The code is available at https://github.com/skmhrk1209/VSRD.
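
The SDF decomposition described above can be illustrated with a minimal NumPy sketch. It is not the released VSRD implementation: the function names (`cuboid_sdf`, `instance_sdf`, `render_silhouette`), the sigmoid mapping from signed distance to per-sample opacity, the `sharpness` parameter, and the zero-valued residual are all assumptions made for illustration. The sketch handles a single instance with an axis-aligned box at the origin, and it only evaluates the forward pass; optimizing the box against ground truth masks would additionally require an automatic differentiation framework.

```python
import numpy as np


def cuboid_sdf(points, half_extents):
    """Signed distance from points (N, 3) to an axis-aligned box centered at the origin."""
    q = np.abs(points) - half_extents
    # Distance to the box surface for points outside the box.
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=-1)
    # Negative distance to the nearest face for points inside the box.
    inside = np.minimum(np.max(q, axis=-1), 0.0)
    return outside + inside


def instance_sdf(points, half_extents, residual_fn):
    """Instance SDF = cuboid SDF + residual distance field (RDF)."""
    return cuboid_sdf(points, half_extents) + residual_fn(points)


def render_silhouette(origin, direction, half_extents, residual_fn,
                      near=0.0, far=10.0, num_samples=128, sharpness=50.0):
    """Accumulate opacity along one ray; returns a soft mask value in [0, 1]."""
    t = np.linspace(near, far, num_samples)
    points = origin + t[:, None] * direction
    sdf = instance_sdf(points, half_extents, residual_fn)
    # Assumption: opacity is a sigmoid of the negated, scaled signed distance.
    logits = np.clip(sharpness * sdf, -60.0, 60.0)
    alpha = 1.0 / (1.0 + np.exp(logits))
    # Silhouette = 1 - transmittance through all samples on the ray.
    return 1.0 - np.prod(1.0 - alpha)


# Hypothetical placeholder for a learned RDF: zero residual everywhere.
zero_residual = lambda p: np.zeros(len(p))

half_extents = np.array([1.0, 1.0, 2.0])
direction = np.array([0.0, 0.0, 1.0])
hit = render_silhouette(np.array([0.0, 0.0, -5.0]), direction, half_extents, zero_residual)
miss = render_silhouette(np.array([5.0, 0.0, -5.0]), direction, half_extents, zero_residual)
```

Here `hit` comes from a ray that passes through the box and `miss` from a ray that stays well outside it, so the first should be close to 1 and the second close to 0. Because the box dimensions enter the silhouette only through `cuboid_sdf`, a mask loss on the rendered value depends directly on the box parameters, which is the property the decomposition is meant to provide.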
