One Homography is All You Need: IMM-based Joint Homography and Multiple Object State Estimation

A novel online MOT algorithm, IMM Joint Homography State Estimation (IMM-JHSE), is proposed. IMM-JHSE uses an initial homography estimate as the only additional 3D information, whereas other 3D MOT methods use regular 3D measurements. By jointly modelling the homography matrix and its dynamics as part of track state vectors, IMM-JHSE removes the explicit influence of camera motion compensation techniques on predicted track position states, which was prevalent in previous approaches. Expanding upon this, static and dynamic camera motion models are combined using an IMM filter. A simple bounding box motion model is used to predict bounding box positions to incorporate image plane information. In addition to applying an IMM to camera motion, a non-standard IMM approach is applied where bounding-box-based BIoU scores are mixed with ground-plane-based Mahalanobis distances in an IMM-like fashion to perform association only, making IMM-JHSE robust to motion away from the ground plane. Finally, IMM-JHSE makes use of dynamic process and measurement noise estimation techniques. IMM-JHSE improves upon related techniques, including UCMCTrack, OC-SORT, C-BIoU and ByteTrack, on the DanceTrack and KITTI-car datasets, increasing HOTA by 2.64 and 2.11, respectively, while offering competitive performance on the MOT17, MOT20 and KITTI-pedestrian datasets. Using publicly available detections, IMM-JHSE outperforms almost all other 2D MOT methods and is outperformed only by 3D MOT methods (some of which are offline) on the KITTI-car dataset. Compared to tracking-by-attention methods, IMM-JHSE shows remarkably similar performance on the DanceTrack dataset and outperforms them on the MOT17 dataset. The code is publicly available: https://github.com/Paulkie99/imm-jhse.
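
Two of the building blocks named above, projection through a homography and the buffered IoU (BIoU) score, can be illustrated with a minimal sketch. This is not the IMM-JHSE implementation: the function names, the convention that H maps image coordinates to the ground plane, the (x1, y1, x2, y2) box format, and the default buffer scale of 0.3 are all assumptions made for illustration.

```python
import numpy as np


def project_to_ground(H, u, v):
    """Map an image point (u, v) to ground-plane coordinates.

    Assumes H is a 3x3 homography from image to ground plane
    (a hypothetical convention chosen for this sketch).
    """
    p = H @ np.array([u, v, 1.0])  # homogeneous coordinates
    return p[:2] / p[2]            # normalise by the scale component


def buffered_iou(box_a, box_b, b=0.3):
    """IoU of two (x1, y1, x2, y2) boxes after expanding each side
    by a fraction b of the box width/height (b is an assumed value)."""
    def expand(box):
        x1, y1, x2, y2 = box
        w, h = x2 - x1, y2 - y1
        return x1 - b * w, y1 - b * h, x2 + b * w, y2 + b * h

    ax1, ay1, ax2, ay2 = expand(box_a)
    bx1, by1, bx2, by2 = expand(box_b)
    # Overlap extents, clipped at zero when the boxes do not intersect
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

With b = 0 the second function reduces to the ordinary IoU; a positive buffer lets boxes that are close but not overlapping still receive a non-zero score. How such a score is mixed with ground-plane Mahalanobis distances is specific to the method and is not reproduced here.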