
MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception

Zhou, Hongyu ; Ge, Zheng ; Li, Zeming ; Zhang, Xiangyu
Abstract

This paper proposes an efficient multi-camera to Bird's-Eye-View (BEV) view transformation method for 3D perception, dubbed MatrixVT. Existing view transformers either suffer from poor transformation efficiency or rely on device-specific operators, hindering the broad application of BEV models. In contrast, our method generates BEV features efficiently with only convolutions and matrix multiplications (MatMul). Specifically, we propose describing the BEV feature as the MatMul of image feature and a sparse Feature Transporting Matrix (FTM). A Prime Extraction module is then introduced to compress the dimension of image features and reduce FTM's sparsity. Moreover, we propose the Ring & Ray Decomposition to replace the FTM with two matrices and reformulate our pipeline to reduce calculation further. Compared to existing methods, MatrixVT enjoys a faster speed and less memory footprint while remaining deploy-friendly. Extensive experiments on the nuScenes benchmark demonstrate that our method is highly efficient but obtains results on par with the SOTA method in object detection and map segmentation tasks.
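
The central idea, writing the BEV feature as a matrix product of image features and a sparse transporting matrix, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the FTM is built, so the example assumes a binary matrix in which each source feature point is assigned to exactly one BEV cell. The names `build_ftm`, `matmul`, `scatter_add`, and `cell_index` are hypothetical. It only shows the general fact that sum-pooling points into grid cells gives the same result as one multiplication by such a matrix.

```python
# Toy illustration (assumption, not the paper's code): pooling features into
# BEV cells can be written as a product with a sparse binary matrix.

def build_ftm(cell_index, num_cells):
    """Return a (num_cells x num_points) binary matrix.

    Entry [r][p] is 1.0 when source point p is assigned to BEV cell r.
    The assignment in cell_index is hypothetical toy data.
    """
    ftm = [[0.0] * len(cell_index) for _ in range(num_cells)]
    for point, cell in enumerate(cell_index):
        ftm[cell][point] = 1.0
    return ftm


def matmul(a, b):
    """Plain dense matrix product of two nested-list matrices."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def scatter_add(cell_index, feats, num_cells):
    """Reference: accumulate each point's feature into its BEV cell."""
    channels = len(feats[0])
    bev = [[0.0] * channels for _ in range(num_cells)]
    for cell, feat in zip(cell_index, feats):
        for c, value in enumerate(feat):
            bev[cell][c] += value
    return bev


# Four source points with two channels each, mapped onto three BEV cells.
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
cell_index = [0, 2, 2, 1]
num_cells = 3

ftm = build_ftm(cell_index, num_cells)
bev_matmul = matmul(ftm, feats)
bev_scatter = scatter_add(cell_index, feats, num_cells)
print(bev_matmul == bev_scatter)
```

In this toy the matrix is mostly zeros, which mirrors the sparsity that the abstract says Prime Extraction and the Ring & Ray Decomposition are meant to reduce. Those two steps are not modelled here.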
