Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

Multi-view crowd localization predicts the ground locations of all people in the scene. Typical methods first estimate crowd density maps on the ground plane and then obtain the crowd locations from them. However, the performance of existing methods is limited by the ambiguity of the density maps in crowded areas, where local peaks can be smoothed away. To mitigate this weakness of density-map supervision, optimal transport-based point supervision methods have been proposed for single-image crowd localization, but they have not yet been explored for multi-view crowd localization. In this paper, we therefore propose a novel Mahalanobis distance-based multi-view optimal transport (M-MVOT) loss specifically designed for multi-view crowd localization. First, we replace the Euclidean-based transport cost with the Mahalanobis distance, which defines elliptical iso-contours in the cost function whose long-axis and short-axis directions are guided by the view-ray direction. Second, the object-to-camera distance in each view is used to further adjust the optimal transport cost of each location, so that wrong predictions far from the camera are penalized more heavily. Finally, we propose a strategy to account for all input camera views in the model loss (M-MVOT) by computing the optimal transport cost of each ground-truth point based on its closest camera. Experiments demonstrate the advantage of the proposed method over density map-based and common Euclidean distance-based optimal transport losses on several multi-view crowd localization datasets. Project page: https://vcc.tech/research/2024/MVOT.
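The view-ray-guided Mahalanobis cost described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the axis scales `sigma_long` and `sigma_short`, the single-camera setup, and the linear distance weighting are all assumptions made for clarity. For each ground-truth point, the ray from the camera defines the long axis of the elliptical iso-contours, and the resulting quadratic-form cost is scaled by the object-to-camera distance.

```python
import numpy as np

def mahalanobis_ot_cost(gt_points, pred_points, cam_pos,
                        sigma_long=4.0, sigma_short=1.0):
    """Sketch of a Mahalanobis distance-based transport cost between
    ground-truth and predicted ground-plane points (2D).

    For each ground-truth point, the view ray from cam_pos defines the
    long-axis direction of the elliptical iso-contours; sigma_long and
    sigma_short (hypothetical values) set the axis scales. The cost is
    further scaled by the object-to-camera distance, so wrong
    predictions far from the camera are penalized more heavily.
    """
    n, m = len(gt_points), len(pred_points)
    cost = np.zeros((n, m))
    for i, g in enumerate(gt_points):
        ray = g - cam_pos
        dist = np.linalg.norm(ray)
        u = ray / dist                       # long axis: along the view ray
        v = np.array([-u[1], u[0]])          # short axis: perpendicular
        R = np.stack([u, v], axis=1)         # rotation into ray-aligned frame
        # Inverse "covariance" defining the elliptical iso-contours.
        M_inv = R @ np.diag([1.0 / sigma_long**2,
                             1.0 / sigma_short**2]) @ R.T
        d = pred_points - g                  # (m, 2) offsets to predictions
        # Quadratic form d^T M^{-1} d per prediction, scaled by distance.
        cost[i] = dist * np.einsum('mj,jk,mk->m', d, M_inv, d)
    return cost

# A prediction offset along the view ray costs less than an equal
# offset perpendicular to it, reflecting the elliptical iso-contours.
cost = mahalanobis_ot_cost(np.array([[10.0, 0.0]]),
                           np.array([[12.0, 0.0], [10.0, 2.0]]),
                           cam_pos=np.array([0.0, 0.0]))
print(cost)
```

In a multi-view setting, the abstract's strategy would pick, for each ground-truth point, the closest of the available cameras before evaluating this cost; the resulting matrix would then feed a standard optimal transport solver.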