MVT: Multi-view Vision Transformer for 3D Object Recognition

Inspired by the success of CNNs in image recognition, view-based methods apply CNNs to model the projected views of a 3D object and achieve excellent performance. Nevertheless, multi-view CNN models cannot model the communication between patches from different views, which limits their effectiveness in 3D object recognition. Inspired by the recent success of the vision Transformer in image recognition, we propose a Multi-view Vision Transformer (MVT) for 3D object recognition. Since each patch feature in a Transformer block has a global receptive field, the model naturally achieves communication between patches from different views. Meanwhile, it carries much less inductive bias than its CNN counterparts. Considering both effectiveness and efficiency, we develop a global-local structure for our MVT. Our experiments on two public benchmarks, ModelNet40 and ModelNet10, demonstrate the competitive performance of our MVT.
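
The cross-view communication described above can be illustrated with a minimal sketch. It is not the MVT implementation: it assumes single-head self-attention with identity query/key/value projections, random vectors in place of real patch embeddings, and one plausible reading of "global-local" in which a local stage attends within each view and a global stage attends over the patches of all views jointly. The names `local_attention`, `global_attention`, and `make_views` are hypothetical and chosen only for this illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Each token is scored against every token passed in: a global receptive field.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, tokens))
            for d in range(dim)
        ])
    return outputs


def local_attention(views):
    """Assumed local stage: attend within each view separately (no cross-view mixing)."""
    return [self_attention(patches) for patches in views]


def global_attention(views):
    """Assumed global stage: attend over all patches of all views, then split per view."""
    flat = [patch for patches in views for patch in patches]
    mixed = self_attention(flat)
    out, start = [], 0
    for patches in views:
        out.append(mixed[start:start + len(patches)])
        start += len(patches)
    return out


def make_views(num_views, patches_per_view, dim, seed=0):
    """Hypothetical stand-in for the patch embeddings of rendered views."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(patches_per_view)]
        for _ in range(num_views)
    ]


views = make_views(num_views=3, patches_per_view=4, dim=8)
perturbed = [[list(p) for p in patches] for patches in views]
perturbed[1][0][0] += 5.0  # change a single patch, in view 1 only

# Compare the outputs for view 0 before and after the change to view 1.
local_same = local_attention(views)[0] == local_attention(perturbed)[0]
global_same = global_attention(views)[0] == global_attention(perturbed)[0]
print(local_same, global_same)
```

In this sketch, the output for view 0 is unchanged by the edit to view 1 under per-view attention, while under joint attention it changes, which is the kind of cross-view interaction that per-view CNN feature extraction does not provide. Joint attention scales quadratically with the total number of patches across views, which is one reason an efficiency-minded design might combine it with a cheaper per-view stage.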