Self-positioning Point-based Transformer for Point Cloud Understanding

Transformers have shown superior performance on various computer vision tasks with their capabilities to capture long-range dependencies. Despite the success, it is challenging to directly apply Transformers to point clouds due to their quadratic cost in the number of points. In this paper, we present a Self-Positioning point-based Transformer (SPoTr), which is designed to capture both local and global shape contexts with reduced complexity. Specifically, this architecture consists of local self-attention and self-positioning point-based global cross-attention. The self-positioning points, adaptively located based on the input shape, consider both spatial and semantic information with disentangled attention to improve expressive power. With the self-positioning points, we propose a novel global cross-attention mechanism for point clouds, which improves the scalability of global self-attention by allowing the attention module to compute attention weights with only a small set of self-positioning points. Experiments show the effectiveness of SPoTr on three point cloud tasks: shape classification, part segmentation, and scene segmentation. In particular, our proposed model achieves an accuracy gain of 2.6% over the previous best models on shape classification with ScanObjectNN. We also provide qualitative analyses to demonstrate the interpretability of self-positioning points. The code of SPoTr is available at https://github.com/mlvlab/SPoTr.
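
To illustrate the scalability argument, the sketch below shows a minimal, hypothetical cross-attention layer in which N input points attend to a small set of M adaptively located "self-positioning" points, reducing the attention cost from O(N^2) to O(N*M). All module and parameter names here (e.g., GlobalCrossAttentionSketch, num_sp_points) are illustrative assumptions and do not reproduce the authors' implementation, which additionally uses disentangled spatial/semantic attention; see the official repository for the actual code.

```python
# Minimal sketch (not the official SPoTr code): cross-attention from N input
# points to a small set of M self-positioning points, so the attention matrix
# is N x M instead of N x N. Names and details are illustrative assumptions.
import torch
import torch.nn as nn


class GlobalCrossAttentionSketch(nn.Module):
    def __init__(self, dim: int, num_sp_points: int = 32):
        super().__init__()
        self.num_sp_points = num_sp_points           # M << N
        self.to_q = nn.Linear(dim, dim)              # queries from input points
        self.to_kv = nn.Linear(dim, 2 * dim)         # keys/values from SP points
        self.to_sp = nn.Linear(dim, num_sp_points)   # soft assignment to SP points
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor, xyz: torch.Tensor):
        # feats: (B, N, C) point features, xyz: (B, N, 3) point coordinates.
        # 1) Adaptively locate M self-positioning points as weighted averages
        #    of the input points (weights depend on the input shape).
        assign = self.to_sp(feats).softmax(dim=1)                # (B, N, M)
        sp_feats = torch.einsum('bnm,bnc->bmc', assign, feats)   # (B, M, C)
        sp_xyz = torch.einsum('bnm,bnd->bmd', assign, xyz)       # (B, M, 3)

        # 2) Cross-attention: every input point attends to the M SP points,
        #    giving an (N, M) attention map instead of (N, N).
        q = self.to_q(feats)                                     # (B, N, C)
        k, v = self.to_kv(sp_feats).chunk(2, dim=-1)             # (B, M, C) each
        attn = (q @ k.transpose(-2, -1)) * self.scale            # (B, N, M)
        out = attn.softmax(dim=-1) @ v                           # (B, N, C)
        return out, sp_xyz


if __name__ == "__main__":
    layer = GlobalCrossAttentionSketch(dim=64, num_sp_points=32)
    feats, xyz = torch.randn(2, 2048, 64), torch.randn(2, 2048, 3)
    out, sp_xyz = layer(feats, xyz)
    print(out.shape, sp_xyz.shape)  # (2, 2048, 64) and (2, 32, 3)
```

With M fixed (e.g., 32) while N grows to thousands of points, the attention map stays N x M, which is the source of the claimed scalability over full N x N global self-attention.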