4 months ago

Abstract

Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Code and weights will be released.
