MiPa: Mixed Patch Infrared-Visible Modality Agnostic Object Detection

In real-world scenarios, using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD). Multimodal learning is a common way to leverage these modalities, where multiple modality-specific encoders and a fusion module are used to improve performance. In this paper, we tackle a different way to employ RGB and IR modalities, where either one modality or the other is observed by a single shared vision encoder. This realistic setting requires a lower memory footprint and is more suitable for applications such as autonomous driving and surveillance, which commonly rely on RGB and IR data. However, when learning a single encoder on multiple modalities, one modality can dominate the other, producing uneven recognition results. This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder while countering the effects of modality imbalance. To this end, we introduce a novel training technique to Mix Patches (MiPa) from the two modalities, in conjunction with a patch-wise modality-agnostic module, for learning a common representation of both modalities. Our experiments show that MiPa can learn a representation that reaches competitive results on traditional RGB/IR benchmarks while only requiring a single modality during inference. Our code is available at: https://github.com/heitorrapela/MiPa.
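
The patch-mixing idea can be illustrated with a short, self-contained sketch. The snippet below is not the authors' implementation (see the linked repository for that); it only assumes PyTorch, spatially aligned RGB/IR image pairs, and a ViT-style patchification, and the helper names (patchify, mix_patches) and the ir_ratio parameter are hypothetical.

```python
# Minimal sketch: for each patch position, feed the shared encoder either the
# RGB patch or the IR patch, so a single encoder sees both modalities mixed.
import torch

def patchify(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split images (B, C, H, W) into non-overlapping patch vectors (B, N, C*P*P)."""
    b, c, h, w = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)           # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).contiguous()   # (B, H/p, W/p, C, p, p)
    return patches.view(b, -1, c * p * p)                      # (B, N, C*p*p)

def mix_patches(rgb: torch.Tensor, ir: torch.Tensor, ir_ratio: float = 0.5) -> torch.Tensor:
    """Per patch position, keep either the RGB or the IR patch.

    rgb, ir: aligned patch sequences of shape (B, N, D).
    ir_ratio: probability of selecting the IR patch at a given position
              (a hypothetical knob for balancing the two modalities).
    """
    b, n, _ = rgb.shape
    use_ir = torch.rand(b, n, 1, device=rgb.device) < ir_ratio
    return torch.where(use_ir, ir, rgb)

# Usage: the mixed patch sequence is what a single shared encoder would consume.
rgb_imgs = torch.randn(2, 3, 224, 224)
ir_imgs = torch.randn(2, 3, 224, 224)   # IR replicated to 3 channels for shape parity
mixed = mix_patches(patchify(rgb_imgs), patchify(ir_imgs), ir_ratio=0.5)
print(mixed.shape)  # torch.Size([2, 196, 768])
```

At inference time only one modality needs to be patchified and passed to the encoder, which is why the approach requires a single modality and a smaller memory footprint than a two-encoder fusion pipeline.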