
Visually Interpretable Subtask Reasoning for Visual Question Answering

Yu Cheng, Arushi Goel, Hakan Bilen
Published: 5/15/2025
Abstract

Answering complex visual questions like 'Which red furniture can be used for sitting?' requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
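
To make the idea of a step-by-step subtask decomposition concrete, the sketch below shows what a Subtask-of-Thought rationale for the example question could look like. This is a minimal illustration only: the abstract does not specify VISTAR's actual rationale schema, so the step types (`select`, `filter_attribute`, `relate`, `answer`) and field names are assumptions for exposition.

```python
# Illustrative sketch: a hypothetical structured rationale for
# 'Which red furniture can be used for sitting?'. The schema is assumed,
# not taken from the VISTAR paper or its released code.
from dataclasses import dataclass


@dataclass
class SubtaskStep:
    operation: str   # sub-task type, e.g. "select", "filter_attribute", "relate"
    argument: str    # what the operation acts on
    result: str      # intermediate textual outcome of this step


def example_rationale() -> list[SubtaskStep]:
    """Hypothetical Subtask-of-Thought decomposition of the example question."""
    return [
        SubtaskStep("select", "furniture", "found: chair, sofa, table"),
        SubtaskStep("filter_attribute", "red", "kept: chair, sofa"),
        SubtaskStep("relate", "can be used for sitting", "kept: chair, sofa"),
        SubtaskStep("answer", "", "chair and sofa"),
    ]


if __name__ == "__main__":
    for i, step in enumerate(example_rationale(), 1):
        print(f"Step {i}: {step.operation}({step.argument}) -> {step.result}")
```

In this view, each step corresponds to one of the reasoning abilities named in the abstract (object recognition, attribute filtering, relational understanding), and the model is fine-tuned to emit such a sequence alongside its answer rather than delegating each step to an external module.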