MCBLT: Multi-Camera Multi-Object 3D Tracking in Long Videos

Object perception from multi-view cameras is crucial for intelligent systems, particularly in indoor environments such as warehouses, retail stores, and hospitals. Most traditional multi-target multi-camera (MTMC) detection and tracking methods rely on 2D object detection, single-view multi-object tracking (MOT), and cross-view re-identification (ReID) techniques, and do not properly exploit the 3D information available through multi-view image aggregation. In this paper, we propose a 3D object detection and tracking framework, named MCBLT, which first aggregates multi-view images using the necessary camera calibration parameters to obtain 3D object detections in bird's-eye view (BEV). We then introduce hierarchical graph neural networks (GNNs) that track these 3D detections in BEV to produce MTMC tracking results. Unlike existing methods, MCBLT generalizes well across different scenes and diverse camera settings, and handles long-term association particularly well. As a result, MCBLT establishes a new state of the art on the AICity'24 dataset with $81.22$ HOTA, and on the WildTrack dataset with $95.6$ IDF1.
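
To illustrate the role that camera calibration parameters play in relating a shared 3D or BEV location to several camera views, the following is a minimal sketch of standard pinhole projection. It is not the MCBLT implementation: the function names `project_point` and `visible_views`, the camera dictionary layout, and the image size are all assumptions made for illustration, and lens distortion is ignored.

```python
from typing import Dict, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]
Camera = Tuple[Mat3, Mat3, Vec3]  # (intrinsics K, rotation R, translation t); hypothetical layout


def project_point(K: Mat3, R: Mat3, t: Vec3, X: Vec3) -> Optional[Tuple[float, float]]:
    """Project a world point X to pixel coordinates with the pinhole model x ~ K (R X + t).

    Returns None when the point lies behind the camera.
    """
    # World frame -> camera frame.
    cam = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if cam[2] <= 0:
        return None
    # Camera frame -> homogeneous pixel coordinates, then normalize by depth.
    pix = [sum(K[i][j] * cam[j] for j in range(3)) for i in range(3)]
    return (pix[0] / pix[2], pix[1] / pix[2])


def visible_views(
    cameras: Dict[str, Camera], X: Vec3, width: int, height: int
) -> Dict[str, Tuple[float, float]]:
    """Return {camera_name: (u, v)} for every camera whose image contains X."""
    hits = {}
    for name, (K, R, t) in cameras.items():
        uv = project_point(K, R, t, X)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            hits[name] = uv
    return hits


if __name__ == "__main__":
    K = [[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]]
    I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    cams = {"cam0": (K, I, (0.0, 0.0, 0.0)), "cam1": (K, I, (0.0, 0.0, -20.0))}
    print(visible_views(cams, (1.0, 0.5, 10.0), 1920, 1080))
```

In this toy setup the point is in front of `cam0` but behind `cam1`, so only `cam0` reports a pixel location. A multi-view aggregation stage could use such correspondences to decide which image regions contribute to a given BEV location.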