
SAM4D: Segment Anything in Camera and LiDAR Streams

Jianyun Xu, Song Wang, Ziqian Ni, Chunyong Hu, Sheng Yang, Jianke Zhu, Qiang Li
Abstract

We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg benchmark, which demonstrate the powerful cross-modal segmentation ability of the proposed SAM4D and its great potential for data annotation.
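The abstract does not specify UMPE's exact form, but its core idea, encoding both modalities against one shared 3D coordinate frame, can be sketched with standard tools. Everything below (the sinusoidal encoding, the `lift_pixels_to_3d` helper, the calibration matrices) is an illustrative assumption rather than the paper's implementation: LiDAR points already live in 3D, while camera pixels must first be back-projected into the shared ego frame before both receive the same positional encoding.

```python
import numpy as np

def sinusoidal_pe_3d(xyz, num_freqs=8):
    """Sinusoidal positional encoding of 3D coordinates.
    Hypothetical stand-in for UMPE's shared-space encoding.
    xyz: (N, 3) points in the shared ego frame."""
    freqs = 2.0 ** np.arange(num_freqs)                       # (F,)
    angles = xyz[..., None] * freqs                           # (N, 3, F)
    pe = np.concatenate([np.sin(angles), np.cos(angles)], -1) # (N, 3, 2F)
    return pe.reshape(xyz.shape[0], -1)                       # (N, 6F)

def lift_pixels_to_3d(uv, depth, K, cam_to_ego):
    """Back-project pixels into the shared ego frame using per-pixel depth
    and camera calibration, so camera tokens and LiDAR points can share
    one positional encoding. uv: (N, 2), depth: (N,), K: (3, 3) intrinsics,
    cam_to_ego: (4, 4) extrinsics (all hypothetical names)."""
    ones = np.ones((uv.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([uv, ones]).T   # (3, N) rays, z = 1
    pts_cam = rays.T * depth[:, None]                   # (N, 3) camera frame
    pts_h = np.hstack([pts_cam, ones])                  # homogeneous coords
    return (cam_to_ego @ pts_h.T).T[:, :3]              # (N, 3) ego frame

# Usage: both modalities end up with encodings in the same 3D space.
# pe_lidar  = sinusoidal_pe_3d(lidar_points)
# pe_camera = sinusoidal_pe_3d(lift_pixels_to_3d(uv, depth, K, cam_to_ego))
```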
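Likewise, the ego-motion compensation underlying MCMA can be illustrated with standard pose algebra: 3D anchors memorized in a past frame are re-expressed in the current ego frame before memory attention, so static structure stays spatially aligned as the vehicle moves. The function below is a minimal sketch under the common convention that ego poses are 4x4 ego-to-world matrices; the names and signature are hypothetical, not the paper's API.

```python
import numpy as np

def compensate_ego_motion(points_past, pose_past, pose_curr):
    """Warp points stored with a past frame into the current ego frame.
    pose_past, pose_curr: (4, 4) ego-to-world matrices, so the relative
    transform inv(pose_curr) @ pose_past maps past-ego -> current-ego.
    points_past: (N, 3)."""
    rel = np.linalg.inv(pose_curr) @ pose_past
    pts_h = np.hstack([points_past, np.ones((points_past.shape[0], 1))])
    return (rel @ pts_h.T).T[:, :3]
```

With the memory anchors warped this way, long-horizon retrieval reduces to attention over features whose positions are already consistent with the current frame, which is the temporal-consistency effect the abstract attributes to MCMA.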