HyperAI
BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Zhiqi Li; Wenhai Wang; Hongyang Li; Enze Xie; Chonghao Sima; Tong Lu; Yu Qiao; Jifeng Dai

Abstract

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from its regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves a new state of the art of 56.9% NDS on the nuScenes test set, which is 9.0 points higher than the previous best results and on par with LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
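The two mechanisms the abstract describes can be sketched very roughly in plain numpy. This is a toy illustration only, not the paper's implementation: the grid size, feature dimension, and the use of plain dot-product attention are all simplifying assumptions (BEVFormer itself uses deformable attention and projects each query's 3D reference points into the cameras to pick its regions of interest).

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8          # tiny BEV grid and channel dim (illustrative; the paper uses e.g. 200x200)
NUM_CAMS, N_FEAT = 6, 10   # 6 surround-view cameras, a few image features per view

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(queries, keys_values):
    """Plain scaled dot-product attention: each query gathers a weighted
    combination of the key/value features."""
    attn = softmax(queries @ keys_values.T / np.sqrt(C))
    return attn @ keys_values

# Grid-shaped BEV queries: one vector per bird's-eye-view cell.
bev_queries = rng.standard_normal((H * W, C))

# Temporal self-attention: recurrently fuse the history BEV by letting the
# current queries attend jointly to themselves and the previous frame's BEV.
prev_bev = rng.standard_normal((H * W, C))
bev = attend(bev_queries, np.concatenate([bev_queries, prev_bev], axis=0))

# Spatial cross-attention: each BEV query extracts features from the
# multi-camera image features (pooled over all views here; the paper
# restricts each query to the views its 3D pillar projects onto).
cam_feats = rng.standard_normal((NUM_CAMS * N_FEAT, C))
bev = attend(bev, cam_feats)

print(bev.shape)  # the unified BEV representation, (H*W, C)
```

Stacking several such layers, with learned projections instead of raw dot products, gives the encoder structure the abstract outlines; downstream detection and map-segmentation heads then consume the resulting BEV feature map.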

