
3D-JEPA: A Joint Embedding Predictive Architecture for 3D Self-Supervised Representation Learning

Hu, Naiwen; Cheng, Haozhe; Xie, Yifan; Li, Shiqi; Zhu, Jihua
Abstract

Invariance-based and generative methods have shown strong performance for 3D self-supervised representation learning (SSRL). However, the former relies on hand-crafted data augmentations that introduce biases not universally applicable to all downstream tasks, and the latter indiscriminately reconstructs masked regions, resulting in irrelevant details being stored in the representation space. To address these problems, we introduce 3D-JEPA, a novel non-generative 3D SSRL framework. Specifically, we propose a multi-block sampling strategy that produces a sufficiently informative context block and several representative target blocks. We present a context-aware decoder to enhance the reconstruction of the target blocks. Concretely, the context information is fed to the decoder continuously, facilitating the encoder in learning semantic modeling rather than memorizing the context information related to target blocks. Overall, 3D-JEPA predicts the representations of target blocks from a context block using the encoder and context-aware decoder architecture. Various downstream tasks on different datasets demonstrate 3D-JEPA's effectiveness and efficiency, achieving higher accuracy with fewer pretraining epochs, e.g., 88.65% accuracy on PB_T50_RS with 150 pretraining epochs.
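To make the multi-block idea concrete, the following is a minimal toy sketch of sampling a context block and several target blocks from a point cloud. It is a hypothetical simplification, not the paper's actual algorithm: target blocks are simply the nearest neighbours of randomly chosen centers, and the context block is everything left over; the function name `multi_block_sample` and all parameters are illustrative assumptions.

```python
import numpy as np

def multi_block_sample(points, num_targets=4, block_size=64, seed=None):
    """Toy multi-block sampling (illustrative, not the paper's method).

    points: (N, 3) array of xyz coordinates.
    Returns (context_idx, target_blocks): indices of the context block
    and a list of index arrays, one per target block.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    used = np.zeros(n, dtype=bool)
    target_blocks = []
    for _ in range(num_targets):
        center = points[rng.integers(n)]
        # A "block" here is the block_size nearest neighbours of the center.
        dist = np.linalg.norm(points - center, axis=1)
        idx = np.argsort(dist)[:block_size]
        target_blocks.append(idx)
        used[idx] = True
    # Context block: every point not covered by any target block, so the
    # predictor must infer target representations rather than copy them.
    context_idx = np.flatnonzero(~used)
    return context_idx, target_blocks

pts = np.random.default_rng(0).normal(size=(1024, 3))
ctx, tgts = multi_block_sample(pts, num_targets=4, block_size=64, seed=0)
```

In the actual framework, the context block would be encoded, and the representations of the (held-out) target blocks would be predicted in feature space by the context-aware decoder, rather than reconstructing the raw points.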