3 months ago

Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng

Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and cannot represent dynamic processes or continuous changes, and (2) text and vision remain separate modalities, which hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K and MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses them on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that video generation models have the potential to serve as unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.
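The abstract reports that self-consistency can improve Sora-2's performance but does not spell out the procedure. As a rough illustration only, self-consistency is commonly implemented as a majority vote over the final answers of several independent samples. The sketch below assumes the answers have already been extracted from each generated video as strings (the extraction step is not shown), and the function name `majority_vote` is hypothetical, not taken from the paper or its code.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Assumption: each entry is the final answer read from one sampled
    generation, or None when no answer could be extracted.
    """
    # Normalise lightly so that " 18 " and "18" count as the same answer.
    cleaned = [a.strip().lower() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    # Counter.most_common breaks ties by first appearance in the input.
    return Counter(cleaned).most_common(1)[0][0]


if __name__ == "__main__":
    # Five hypothetical samples for one question; one sample yielded no answer.
    samples = ["18", "18", "20", None, " 18 "]
    print(majority_vote(samples))
```

Voting on extracted answers rather than on raw outputs keeps the aggregation independent of how each sample was produced, which is why the same routine could in principle be applied to text or video generations alike.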

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, and 4 more
