LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation

Recent works in autonomous driving have widely adopted the bird's-eye-view (BEV) semantic map as an intermediate representation of the world. Online prediction of these BEV maps involves non-trivial operations such as multi-camera data extraction, as well as fusion and projection into a common top-view grid. This is usually done with error-prone geometric operations (e.g., homography or back-projection from monocular depth estimation) or expensive direct dense mapping between image pixels and pixels in BEV (e.g., with MLP or attention). In this work, we present 'LaRa', an efficient encoder-decoder, transformer-based model for vehicle semantic segmentation from multiple cameras. Our approach uses a system of cross-attention to aggregate information over multiple sensors into a compact, yet rich, collection of latent representations. These latent representations, after being processed by a series of self-attention blocks, are then reprojected with a second cross-attention into the BEV space. We demonstrate that our model outperforms the best previous works using transformers on nuScenes. The code and trained models are available at https://github.com/valeoai/LaRa
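
To make the described pipeline concrete, the following is a minimal sketch (not the authors' released code) of a LaRa-style encoder-decoder in PyTorch: learned latents cross-attend to flattened multi-camera tokens, are refined by self-attention blocks, and are then queried by per-cell BEV embeddings through a second cross-attention. All module names, dimensions, and the single-channel vehicle-occupancy head are illustrative assumptions; the official implementation is at the repository linked above.

```python
# Hedged sketch of a LaRa-like pipeline, assuming image features have already
# been extracted by a backbone and enriched with camera/ray positional embeddings.
import torch
import torch.nn as nn


class LaRaSketch(nn.Module):
    def __init__(self, dim=128, num_latents=256, num_self_blocks=4, bev_size=100):
        super().__init__()
        self.bev_size = bev_size
        # Compact set of learned latent vectors that aggregate multi-camera information.
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # First cross-attention: latents (queries) attend to all camera tokens (keys/values).
        self.encode_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Self-attention blocks that refine the latent representation.
        self.self_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_self_blocks)
        )
        # One learned query per BEV grid cell (hypothetical output resolution).
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, dim))
        # Second cross-attention: BEV queries attend to the refined latents.
        self.decode_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Per-cell head producing a vehicle segmentation logit.
        self.head = nn.Linear(dim, 1)

    def forward(self, cam_tokens):
        # cam_tokens: (B, N_cams * H * W, dim) flattened multi-camera features.
        B = cam_tokens.shape[0]
        latents = self.latents.unsqueeze(0).expand(B, -1, -1)
        latents, _ = self.encode_attn(latents, cam_tokens, cam_tokens)
        for block in self.self_blocks:
            latents = block(latents)
        bev = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        bev, _ = self.decode_attn(bev, latents, latents)
        logits = self.head(bev)  # (B, bev_size**2, 1)
        return logits.view(B, 1, self.bev_size, self.bev_size)


if __name__ == "__main__":
    model = LaRaSketch()
    # Toy example: 6 cameras with a 14x30 feature map each, batch of 2.
    cam_tokens = torch.randn(2, 6 * 14 * 30, 128)
    print(model(cam_tokens).shape)  # torch.Size([2, 1, 100, 100])
```

Note that the latent bottleneck keeps the attention cost independent of both the number of camera tokens and the BEV resolution: cameras only interact with the latents, and BEV cells only read from the latents, rather than attending densely to every image pixel.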