Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

Large multimodal models (LMMs) have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and to provide justifiable explanations for their predictions. To address this, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts, carefully chosen for their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of data synthesis and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both the accuracy and explainability of specialized visual classification tasks.
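The following is a minimal sketch of the iterative synthesize-filter-finetune loop described above. The callables `generate_answers` and `finetune` are hypothetical placeholders for LMM sampling and supervised fine-tuning, and the reward model-free filter shown (keeping only candidates whose predicted label matches the target) is an assumed criterion for illustration, not necessarily the paper's exact mechanism.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical data shapes: a "sample" is (image, query, target_label);
# a "candidate" is a synthesized interpretable answer plus its predicted label.
Sample = Tuple[object, str, str]
Candidate = Dict[str, str]


def iterative_visual_rejection_sampling(
    model: object,
    dataset: List[Sample],
    generate_answers: Callable[[object, object, str, int], List[Candidate]],
    finetune: Callable[[object, List[Dict]], object],
    rounds: int = 3,
    num_candidates: int = 8,
) -> object:
    """Sketch of the iterative data synthesis and fine-tuning loop."""
    for _ in range(rounds):
        training_data: List[Dict] = []
        for image, query, target_label in dataset:
            # Synthesize several interpretable candidate answers per image/query pair.
            candidates = generate_answers(model, image, query, num_candidates)
            # Reward model-free filtering: keep label-consistent answers
            # (an assumption; the actual selection criterion may differ).
            kept = [c for c in candidates if c.get("predicted_label") == target_label]
            if kept:
                # Use a surviving candidate as the training target for the next round.
                training_data.append(
                    {"image": image, "query": query, "answer": kept[0]["explanation"]}
                )
        # Fine-tune on the filtered, self-synthesized data before the next round.
        model = finetune(model, training_data)
    return model
```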