Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum
Abstract

Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce Zebra-CoT, a diverse, large-scale dataset with 182,384 samples containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; and visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to a +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support the development and evaluation of visual CoT.
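
For readers who want a concrete picture of what working with interleaved text-image reasoning traces might look like, here is a minimal sketch using the Hugging Face `datasets` library. The dataset ID ("org/Zebra-CoT") and the column name ("reasoning_trace") are assumptions for illustration only, not the paper's confirmed release schema.

```python
# Minimal sketch: load an interleaved visual-CoT corpus and walk one sample's
# reasoning trace. The dataset ID and field names below are hypothetical.
from datasets import load_dataset

# Hypothetical repository ID; substitute the actual Zebra-CoT release once known.
ds = load_dataset("org/Zebra-CoT", split="train")

sample = ds[0]
print(sample.keys())  # inspect the available fields for this sample

# A visual CoT sample is expected to alternate between text steps and
# intermediate images (e.g., sketches or diagrams) before the final answer.
for step in sample.get("reasoning_trace", []):  # hypothetical field name
    if isinstance(step, str):
        print("[text]", step[:80])
    else:
        print("[image]", type(step))
```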