
Abstract

This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine-grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different). Then we optimize the model with rule-based reinforcement learning. Due to the high visual similarity and the presence of augmentations, the model must attend to subtle visual changes and perform logical reasoning to succeed. Experiments show that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
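
The triplet construction and rule-based reward described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: images are stood in for by plain strings, `augment` is a hypothetical placeholder for any augmentation function, and the `<answer>...</answer>` output format and the 0/1 reward values are assumed for illustration, since the abstract does not specify them.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ComparisonSample:
    """One same/different comparison drawn from an image triplet."""
    image_a: str
    image_b: str
    label: str  # "same" or "different"


def build_comparisons(
    anchor: str, similar: str, augment: Callable[[str], str]
) -> List[ComparisonSample]:
    """Turn a triplet into labeled comparisons without human annotation.

    `anchor` and `similar` are hypothetical image identifiers; `similar`
    is assumed to be visually close to `anchor` but a distinct image.
    """
    view_1 = augment(anchor)  # first augmented view of the anchor image
    view_2 = augment(anchor)  # second augmented view of the same image
    return [
        ComparisonSample(view_1, view_2, "same"),
        ComparisonSample(view_1, similar, "different"),
        ComparisonSample(view_2, similar, "different"),
    ]


# Assumed output format: the final verdict is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>\s*(same|different)\s*</answer>", re.IGNORECASE)


def rule_based_reward(response: str, label: str) -> float:
    """Return 1.0 if the response holds exactly one verdict matching the label."""
    verdicts = ANSWER_RE.findall(response)
    if len(verdicts) != 1:
        return 0.0  # missing or ambiguous verdict earns no reward
    return 1.0 if verdicts[0].lower() == label else 0.0
```

The label for each comparison follows directly from how the triplet was built, which is what lets the reward be computed by a simple rule rather than from curated question-answer pairs.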
