Cross-view Transformers for real-time Map-view Semantic Segmentation

We present cross-view transformers, an efficient attention-based model for map-view semantic segmentation from multiple cameras. Our architecture implicitly learns a mapping from individual camera views into a canonical map-view representation using a camera-aware cross-view attention mechanism. Each camera uses positional embeddings that depend on its intrinsic and extrinsic calibration. These embeddings allow a transformer to learn the mapping across different views without ever explicitly modeling it geometrically. The architecture consists of a convolutional image encoder for each view and cross-view transformer layers to infer a map-view semantic segmentation. Our model is simple, easily parallelizable, and runs in real-time. The presented architecture performs at state of the art on the nuScenes dataset, with 4x faster inference speeds. Code is available at https://github.com/bradyz/cross_view_transformers.
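The sketch below illustrates the idea described above: per-camera image features receive a positional embedding derived from that camera's intrinsic and extrinsic calibration, and learned map-view queries attend across the features of all cameras. It is a minimal illustration under assumed names, shapes, and embedding layers, not the authors' released implementation (see the repository linked above for that).

```python
# Minimal sketch of camera-aware cross-view attention (illustrative only;
# module names, shapes, and the calibration MLP are assumptions).
import torch
import torch.nn as nn


class CameraAwareCrossViewAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, map_size: int = 25):
        super().__init__()
        # Learned map-view queries: one embedding per cell of a map_size x map_size grid.
        self.map_queries = nn.Parameter(torch.randn(1, map_size * map_size, dim))
        # Projects per-camera calibration (flattened 3x3 intrinsics + 4x4 extrinsics)
        # into a positional embedding added to that camera's image features.
        self.calib_embed = nn.Sequential(
            nn.Linear(9 + 16, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats, intrinsics, extrinsics):
        # feats:      (B, N_cams, L, dim) flattened image features per camera
        # intrinsics: (B, N_cams, 3, 3)
        # extrinsics: (B, N_cams, 4, 4)
        B, N, L, D = feats.shape
        calib = torch.cat([intrinsics.flatten(2), extrinsics.flatten(2)], dim=-1)
        pos = self.calib_embed(calib).unsqueeze(2)        # (B, N, 1, D), one embedding per camera
        keys = (feats + pos).reshape(B, N * L, D)         # all views pooled into one key/value set
        queries = self.map_queries.expand(B, -1, -1)      # (B, H*W, D) map-view queries
        map_feats, _ = self.attn(queries, keys, keys)     # each map cell attends across all cameras
        return map_feats                                  # (B, H*W, D), decoded into segmentation downstream


# Usage with dummy inputs: 6 cameras, feature maps flattened to L = 28 * 60 = 1680.
model = CameraAwareCrossViewAttention(dim=128)
feats = torch.randn(2, 6, 1680, 128)
K = torch.eye(3).expand(2, 6, 3, 3)
E = torch.eye(4).expand(2, 6, 4, 4)
out = model(feats, K, E)
print(out.shape)  # torch.Size([2, 625, 128])
```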