MMFusion: Combining Image Forensic Filters for Visual Manipulation Detection and Localization

Recent image manipulation localization and detection techniques typically leverage forensic artifacts and traces that are produced by a noise-sensitive filter, such as SRM or Bayar convolution. In this paper, we showcase that different filters commonly used in such approaches excel at unveiling different types of manipulations and provide complementary forensic traces. Thus, we explore ways of combining the outputs of such filters to leverage the complementary nature of the produced artifacts for performing image manipulation localization and detection (IMLD). We assess two distinct combination methods: one that produces independent features from each forensic filter and then fuses them (this is referred to as late fusion) and one that performs early mixing of different modal outputs and produces combined features (this is referred to as early fusion). We use the latter as a feature encoding mechanism, accompanied by a new decoding mechanism that encompasses feature re-weighting, for formulating the proposed MMFusion architecture. We demonstrate that MMFusion achieves competitive performance for both image manipulation localization and detection, outperforming state-of-the-art models across several image and video datasets. We also further investigate the contribution of each forensic filter within MMFusion for addressing different types of manipulations, building on recent AI explainability measures.
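To make the distinction between the two combination methods concrete, the following is a minimal sketch (not the authors' implementation) of late versus early fusion of forensic filter outputs, assuming each filter (e.g., an SRM-style or Bayar-style convolution) produces a 3-channel noise map for an RGB input; all module names and dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn


class LateFusion(nn.Module):
    """Encode each forensic filter output independently, then fuse the features."""

    def __init__(self, num_filters: int = 2, feat_dim: int = 64):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU())
            for _ in range(num_filters)
        )
        self.fuse = nn.Conv2d(num_filters * feat_dim, feat_dim, 1)

    def forward(self, filter_outputs: list[torch.Tensor]) -> torch.Tensor:
        feats = [enc(x) for enc, x in zip(self.encoders, filter_outputs)]
        return self.fuse(torch.cat(feats, dim=1))


class EarlyFusion(nn.Module):
    """Mix the raw filter outputs first, then encode the combined signal."""

    def __init__(self, num_filters: int = 2, feat_dim: int = 64):
        super().__init__()
        self.mix = nn.Conv2d(num_filters * 3, 3, 1)  # early channel mixing
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU()
        )

    def forward(self, filter_outputs: list[torch.Tensor]) -> torch.Tensor:
        return self.encoder(self.mix(torch.cat(filter_outputs, dim=1)))


if __name__ == "__main__":
    # Two stand-in "forensic filter" responses for a 256x256 image.
    srm_like = torch.randn(1, 3, 256, 256)
    bayar_like = torch.randn(1, 3, 256, 256)
    print(LateFusion()([srm_like, bayar_like]).shape)   # torch.Size([1, 64, 256, 256])
    print(EarlyFusion()([srm_like, bayar_like]).shape)  # torch.Size([1, 64, 256, 256])
```

In this sketch, late fusion keeps a separate (hypothetical) encoder per filter and concatenates their features, whereas early fusion mixes the filter responses before a single shared encoder; MMFusion adopts the early-fusion scheme as its feature encoder, paired with a decoder that re-weights features.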