Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)

Multiview detection incorporates multiple camera views to deal with occlusions, and its central problem is multiview aggregation. Given feature map projections from multiple views onto a common ground plane, the state-of-the-art method addresses this problem via convolution, which applies the same calculation regardless of object locations. However, such translation-invariant behavior might not be the best choice, as object features undergo various projection distortions according to their positions and cameras. In this paper, we propose a novel multiview detector, MVDeTr, that adopts a newly introduced shadow transformer to aggregate multiview information. Unlike convolutions, the shadow transformer attends differently at different positions and cameras to deal with various shadow-like distortions. We propose an effective training scheme that includes a new view-coherent data augmentation method, which applies random augmentations while maintaining multiview consistency. On two multiview detection benchmarks, we report new state-of-the-art accuracy with the proposed system. Code is available at https://github.com/hou-yz/MVDeTr.
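
To illustrate what "maintaining multiview consistency" can mean in practice, the following is a minimal sketch and not the released MVDeTr code. It assumes each view has an image-to-ground homography and that the augmentation is a random affine transform of the image; the function names (`random_affine`, `augment_view`, `apply_h`) and the homography values are hypothetical. Under these assumptions, composing the homography with the inverse of the affine transform keeps each augmented pixel mapped to the same ground-plane location.

```python
import numpy as np

def random_affine(rng, max_shift=20.0, max_scale=0.2, max_rot_deg=10.0):
    """Sample a 3x3 affine matrix acting on homogeneous pixel coordinates."""
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    t = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    return np.array([
        [s * np.cos(t), -s * np.sin(t), tx],
        [s * np.sin(t),  s * np.cos(t), ty],
        [0.0,            0.0,           1.0],
    ])

def augment_view(img_to_ground, affine):
    """Return the image-to-ground homography valid for the augmented image."""
    # A pixel p moves to affine @ p, so undo the affine before projecting.
    return img_to_ground @ np.linalg.inv(affine)

def apply_h(H, pts):
    """Apply a 3x3 homography to an (N, 2) array of points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

# Hypothetical homography and pixel locations, for illustration only.
H = np.array([
    [0.02,   0.001, -3.0],
    [0.0005, 0.03,  -2.0],
    [1e-5,   2e-4,   1.0],
])
pixels = np.array([[100.0, 200.0], [640.0, 360.0]])

rng = np.random.default_rng(0)
A = random_affine(rng)  # in practice, sampled independently for each view

ground_before = apply_h(H, pixels)
ground_after = apply_h(augment_view(H, A), apply_h(A, pixels))

# The augmented pixels still land on the same ground-plane locations.
assert np.allclose(ground_before, ground_after)
```

In this sketch each camera can receive its own random transform, because the correction is applied to that camera's homography alone; the ground-plane annotations shared across views are left unchanged.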