
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang
Publication date: 6/9/2025
Abstract

Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including Gemini 2.5 Pro and OpenAI o3, the strongest models available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
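
To make the "programmatically controllable" idea concrete, the sketch below shows one way a scripted, difficulty-scalable clip could be generated with NumPy and MoviePy. It is a minimal illustration, not the authors' pipeline: the function `make_clip` and its parameters (`seed`, `n_distractors`, etc.) are hypothetical names chosen here, and the actual MORSE-500 scripts (released by the authors) use Manim, Matplotlib, and MoviePy in more elaborate ways.

```python
import numpy as np
from moviepy.editor import ImageSequenceClip  # MoviePy 1.x import path

def make_clip(seed: int, n_distractors: int, n_frames: int = 90,
              size: int = 256, fps: int = 30, out_path: str = "clip.mp4"):
    """Render a clip in which one white 'target' square sweeps left to right
    while `n_distractors` gray squares follow random linear trajectories.
    Raising `n_distractors` scales difficulty; the seed makes the clip
    fully deterministic and reproducible."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, size - 16, size=(n_distractors, 2))  # start positions
    vels = rng.integers(-3, 4, size=(n_distractors, 2))           # per-frame velocities
    frames = []
    for t in range(n_frames):
        frame = np.zeros((size, size, 3), dtype=np.uint8)
        # Target: white square moving horizontally at a fixed height.
        x = int((t / n_frames) * (size - 16))
        frame[120:136, x:x + 16] = 255
        # Distractors: gray squares on independent straight-line paths.
        for (sx, sy), (vx, vy) in zip(starts, vels):
            dx = int(np.clip(sx + vx * t, 0, size - 16))
            dy = int(np.clip(sy + vy * t, 0, size - 16))
            frame[dy:dy + 16, dx:dx + 16] = 128
        frames.append(frame)
    ImageSequenceClip(frames, fps=fps).write_videofile(out_path, logger=None)

# Example: regenerate a harder instance at any time by increasing distractor count.
make_clip(seed=7, n_distractors=12)
```

Because every visual property is a script parameter, new and harder instances can be regenerated on demand, which is the property the abstract credits for keeping the benchmark from saturating.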