Beyond local patches: Preserving global–local interactions by enhancing self-attention via 3D point cloud tokenization
Transformer-based architectures have recently shown impressive performance on various point cloud understanding tasks such as 3D object shape classification and semantic segmentation. In particular, this can be attributed to their self-attention mechanism, which can capture long-range dependencies. However, current methods restrict self-attention to local patches because of its quadratic memory cost, which hinders generalization and scaling capacity due to the loss of non-locality in early layers. To tackle this issue, we propose a window-based transformer architecture that captures long-range dependencies while aggregating information within local patches. We achieve this by letting each window interact with a set of global point cloud tokens — a representative subset of the entire scene — and by augmenting the local geometry through a 3D Histogram of Oriented Gradients (HOG) descriptor. Through a series of experiments on segmentation and classification tasks, we show that our model exceeds the state of the art on S3DIS semantic segmentation (+1.67% mIoU) and ShapeNetPart part segmentation (+1.03% instance mIoU), and performs competitively on ScanObjectNN 3D object classification. The code and trained models will be made publicly available.
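The window-to-global interaction described above can be illustrated with a minimal cross-attention sketch. This is an assumption-laden toy example, not the authors' implementation: the function name, shapes, and the residual fusion step are all illustrative choices. Queries come from the points inside one local window, while keys and values come from a small set of global tokens summarizing the scene, so every window gains non-local context at a cost linear in the number of global tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_global_attention(window_tokens, global_tokens):
    """Hypothetical sketch: local window tokens cross-attend to global tokens.

    window_tokens: (n, d) features of the points inside one local window
    global_tokens: (m, d) features of a representative subset of the scene
    returns: (n, d) window tokens enriched with global context
    """
    d = window_tokens.shape[-1]
    # queries from the local window; keys/values from the global token set
    scores = window_tokens @ global_tokens.T / np.sqrt(d)   # (n, m)
    weights = softmax(scores, axis=-1)                      # attention over global tokens
    context = weights @ global_tokens                       # (n, d) aggregated global context
    return window_tokens + context                          # simple residual fusion (illustrative)

rng = np.random.default_rng(0)
local = rng.normal(size=(16, 32))    # 16 points in one window, 32-dim features
global_ = rng.normal(size=(8, 32))   # 8 global tokens for the whole scene
out = window_global_attention(local, global_)
print(out.shape)  # (16, 32)
```

Because each of the n window points attends to only m global tokens, the attention matrix is n×m rather than n×n, which is what keeps the global interaction affordable inside every window.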