BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from its regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Our approach achieves a new state-of-the-art 56.9\% NDS on the nuScenes \texttt{test} set, which is 9.0 points higher than the previous best methods and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at \url{https://github.com/zhiqi-li/BEVFormer}.
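The sketch below is a minimal, illustrative rendering of the idea summarized above: a grid of learnable BEV queries first attends to the previous frame's BEV map (temporal self-attention) and then to flattened multi-camera image features (spatial cross-attention). It is not the authors' implementation; the shapes, module names, and the use of plain multi-head attention in place of the paper's deformable attention are assumptions made for brevity.

\begin{verbatim}
# Toy sketch of grid-shaped BEV queries with temporal self-attention and
# spatial cross-attention (standard attention stands in for deformable attention).
import torch
import torch.nn as nn


class ToyBEVFormerLayer(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, bev_h=50, bev_w=50):
        super().__init__()
        # Learnable grid-shaped BEV queries, one per BEV cell (hypothetical grid size).
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim))
        # Temporal self-attention: current queries attend to the previous BEV map.
        self.temporal_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Spatial cross-attention: BEV queries attend to multi-camera image features.
        self.spatial_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(embed_dim, embed_dim * 4), nn.ReLU(),
                                 nn.Linear(embed_dim * 4, embed_dim))

    def forward(self, cam_feats, prev_bev=None):
        # cam_feats: (B, num_cams * H * W, C) flattened multi-camera features.
        B = cam_feats.shape[0]
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        if prev_bev is not None:
            # Recurrently fuse history BEV information.
            q = q + self.temporal_attn(q, prev_bev, prev_bev)[0]
        # Each BEV query aggregates spatial features from the camera views.
        q = q + self.spatial_attn(q, cam_feats, cam_feats)[0]
        return q + self.ffn(q)  # current BEV representation, reused at the next timestep


if __name__ == "__main__":
    layer = ToyBEVFormerLayer()
    cams = torch.randn(2, 6 * 15 * 25, 256)   # e.g. 6 cameras with 15x25 feature maps
    bev_t0 = layer(cams)                       # first frame: no history
    bev_t1 = layer(cams, prev_bev=bev_t0)      # later frame: fuse previous BEV
    print(bev_t1.shape)                        # torch.Size([2, 2500, 256])
\end{verbatim}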