PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images

In this paper, we propose PETRv2, a unified framework for 3D perception from multi-view images. Based on PETR, PETRv2 explores the effectiveness of temporal modeling, which utilizes the temporal information of previous frames to boost 3D object detection. More specifically, we extend the 3D position embedding (3D PE) in PETR for temporal modeling. The 3D PE achieves temporal alignment of object positions across different frames. A feature-guided position encoder is further introduced to improve the data adaptability of the 3D PE. To support multi-task learning (e.g., BEV segmentation and 3D lane detection), PETRv2 provides a simple yet effective solution by introducing task-specific queries, which are initialized under different spaces. PETRv2 achieves state-of-the-art performance on 3D object detection, BEV segmentation, and 3D lane detection. A detailed robustness analysis is also conducted on the PETR framework. We hope PETRv2 can serve as a strong baseline for 3D perception. Code is available at \url{https://github.com/megvii-research/PETR}.
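The temporal alignment mentioned above can be illustrated as an ego-pose transformation: 3D coordinates expressed in the previous frame's ego coordinate system are mapped into the current frame's ego coordinate system before position encoding. The sketch below is a minimal illustration of this idea, assuming 4x4 ego-to-global pose matrices; the function name `align_prev_to_current` and the NumPy-based formulation are our own assumptions, not the paper's exact implementation.

```python
import numpy as np

def align_prev_to_current(points_prev, pose_prev, pose_cur):
    """Map 3D points from the previous frame's ego coordinates
    into the current frame's ego coordinates.

    points_prev: (N, 3) array of 3D points in ego(t-1) coordinates.
    pose_prev, pose_cur: 4x4 ego-to-global transformation matrices
    for frames t-1 and t (illustrative convention).
    """
    # Chain the transforms: ego(t-1) -> global -> ego(t)
    T = np.linalg.inv(pose_cur) @ pose_prev
    # Lift points to homogeneous coordinates, apply T, drop the last column
    pts_h = np.concatenate([points_prev, np.ones((len(points_prev), 1))], axis=1)
    return (pts_h @ T.T)[:, :3]
```

With identical poses for both frames the points are unchanged; with ego motion between frames, static scene points are shifted accordingly, which is what lets the 3D PE of different frames be compared in a common coordinate system.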