MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors

Perspective projection has been extensively utilized in monocular 3D object detection methods. It introduces geometric priors from 2D bounding boxes and 3D object dimensions to reduce the uncertainty of depth estimation. However, due to depth errors originating from the object's visual surface, the height of the bounding box often fails to represent the actual projected central height, which undermines the effectiveness of geometric depth. Directly predicting the projected height unavoidably discards 2D priors, while multi-depth prediction with complex branches does not fully leverage geometric depth. This paper presents a Transformer-based monocular 3D object detection method called MonoDGP, which adopts perspective-invariant geometry errors to modify the projection formula. We also systematically discuss and explain the mechanisms and efficacy behind geometry errors, which serve as a simple but effective alternative to multi-depth prediction. Additionally, MonoDGP decouples the depth-guided decoder and constructs a 2D decoder that depends only on visual features, providing 2D priors and initializing object queries without interference from 3D detection. To further optimize and fine-tune the input tokens of the transformer decoder, we also introduce a Region Segment Head (RSH) that generates enhanced features and segment embeddings. Our monocular method demonstrates state-of-the-art performance on the KITTI benchmark without extra data. Code is available at https://github.com/PuFanqi23/MonoDGP.
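
For intuition, a minimal sketch of the geometric-depth relation the abstract refers to (the symbols $f$, $H$, $h_{2d}$, $\Delta h$, and $\Delta d$ are illustrative notation, not necessarily the paper's): under a pinhole camera with focal length $f$, an object of 3D height $H$ whose projected central height is $h$ satisfies

$$ d_{\mathrm{geo}} = \frac{f \cdot H}{h}. $$

Because the 2D box height $h_{2d}$ measured on the object's visual surface generally differs from the true projected central height, one hedged way to express the geometry-error idea is to predict a correction term instead of regressing the projected height directly, e.g.

$$ d = \frac{f \cdot H}{h_{2d} + \Delta h} \qquad \text{or equivalently} \qquad d = d_{\mathrm{geo}} + \Delta d, $$

so that the 2D priors carried by $h_{2d}$ are retained while the surface-induced error is absorbed by the learned correction.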