Points to Patches: Enabling the Use of Self-Attention for 3D Shape Recognition

While the Transformer architecture has become ubiquitous in the machine learning field, its adaptation to 3D shape recognition is non-trivial. Due to its quadratic computational complexity, the self-attention operator quickly becomes inefficient as the set of input points grows larger. Furthermore, we find that the attention mechanism struggles to find useful connections between individual points on a global scale. To alleviate these problems, we propose a two-stage Point Transformer-in-Transformer (Point-TnT) approach which combines local and global attention mechanisms, enabling both individual points and patches of points to attend to each other effectively. Experiments on shape classification show that such an approach provides more useful features for downstream tasks than the baseline Transformer, while also being more computationally efficient. In addition, we extend our method to feature matching for scene reconstruction, showing that it can be used in conjunction with existing scene reconstruction pipelines.
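The two-stage local/global idea described above can be illustrated with a toy sketch: points are grouped into patches, self-attention runs among the points within each patch, each patch is pooled to a single token, and a second round of self-attention runs among the patch tokens. This is only a hedged illustration of the general pattern, not the paper's implementation: the random patch anchors, the linear point embedding, the single-head unparameterized attention, and the names `attention` and `point_tnt_sketch` are all assumptions made here for brevity (the actual method's sampling, embeddings, heads, and layer counts differ).

```python
import numpy as np

def attention(q, k, v):
    # Plain scaled dot-product self-attention (single head, no projections).
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def point_tnt_sketch(points, num_patches=8, dim=16, seed=0):
    # Toy two-stage local/global attention over a point cloud (N, 3).
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    # 1) Pick patch anchors (randomly here; a real pipeline would use a
    #    principled sampling scheme) and assign each point to its nearest anchor.
    anchors = points[rng.choice(n, num_patches, replace=False)]
    d2 = ((points[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    # 2) Embed raw coordinates with a toy random linear map.
    W = rng.standard_normal((3, dim)) * 0.1
    feats = points @ W
    # 3) Local stage: self-attention among the points inside each patch,
    #    then mean-pool to one token per patch.
    patch_tokens = []
    for p in range(num_patches):
        local = feats[assign == p]
        if local.shape[0] == 0:
            patch_tokens.append(np.zeros(dim))
            continue
        local = attention(local, local, local)
        patch_tokens.append(local.mean(axis=0))
    tokens = np.stack(patch_tokens)
    # 4) Global stage: self-attention among the patch tokens. This runs on
    #    num_patches tokens instead of n points, which is where the
    #    quadratic-cost saving comes from.
    tokens = attention(tokens, tokens, tokens)
    # Pool to a single global shape descriptor.
    return tokens.mean(axis=0)

cloud = np.random.default_rng(1).standard_normal((128, 3))
descriptor = point_tnt_sketch(cloud)
```

Because the global attention operates on a small number of patch tokens rather than all n points, the quadratic term is paid only within small local neighborhoods and across the patch set, mirroring the efficiency argument in the abstract.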