Command Palette
Search for a command to run...
Wenn die Vision für den Ton spricht
Wenn die Vision für den Ton spricht
Xiaofei Wen Wenjie Jacky Mo Xingyu Fu Rui Cai Tinghui Zhu Wendi Li Yanan Xie Muhao Chen Peng Qi
Zusammenfassung
Trotz der raschen Fortschritte bei multimodalen großen Sprachmodellen mit Videofähigkeit (MLLMs) stellen wir fest, dass ihr scheinbares Audioverständnis in Videos oft visuell gesteuert ist: Modelle stützen sich auf visuelle Hinweise, um akustische Informationen zu erschließen oder zu halluzinieren, anstatt den Audio-Stream zu überprüfen. Dieses Problem tritt sowohl bei modernsten Open-Source-Omni-Modellen als auch bei führenden Closed-Source-Modellen von Anbietern wie Google und OpenAI auf. Wir charakterisieren diese Fehlerart als einen audio-visuellen Clever-Hans-Effekt, bei dem Modelle (fälschlicherweise) audio-verankert erscheinen, tatsächlich jedoch visuelle und akustische Korrelationen ausnutzen, ohne zu überprüfen, ob die Audio- und Video-Streams tatsächlich übereinstimmen. Um dieses Verhalten systematisch zu untersuchen, führen wir Thud ein, einen interventionalen Probing-Rahmen, der auf drei kontrafaktischen Audio-Bearbeitungen basiert: Shift, das die zeitliche Synchronisation testet; Mute, das das Vorhandensein von Ton testet; und Swap, das die Audio-Video-Konsistenz testet. Über die Diagnose hinaus untersuchen wir zudem ein zweistufiges Alignments-Rezept: Interventions-abgeleitete Präferenzpaare lehren die Audio-Verifikation, während ereignisbezogene allgemeine Video-Präferenzen das Modell vor Über-Spezialisierung regularisieren. Unser bestes Rezept mit 10.000 Stichproben verbessert die durchschnittliche Leistung über die drei Interventionsdimensionen hinweg um 28 Prozentpunkte, während es die Leistung bei allgemeinen Video- und Audio-Video-QA-Benchmarks leicht verbessert.
One-sentence Summary
The authors propose THUD, an intervention-driven probing framework that diagnoses and mitigates the audio-visual Clever Hans effect in video MLLMs by employing Shift, Mute, and Swap counterfactual audio edits alongside a two-stage alignment recipe to enforce genuine acoustic verification, improving average performance across intervention dimensions by 28 percentage points while slightly enhancing general video and audio-visual QA benchmarks.
Key Contributions
- Introduce THUD, an intervention-driven probing framework that evaluates audio-visual grounding through three counterfactual audio edits: Shift for temporal synchronization, Mute for sound existence, and Swap for audio-visual consistency. This framework systematically diagnoses how video-capable multimodal models rely on visual-semantic shortcuts rather than verifying actual acoustic streams.
- Develop a two-stage preference alignment recipe that combines intervention-derived audio verification signals with event-level general video preferences to prevent over-specialization. This approach explicitly trains models to cross-check audio presence and alignment under broken visual-acoustic correlations.
- Demonstrate that applying this 10K-sample alignment recipe increases average performance across the three intervention dimensions by 28 percentage points while delivering modest gains on standard video and audio-visual question answering benchmarks.
Introduction
Video-capable multimodal large language models are rapidly advancing, making audio-visual understanding a critical capability for applications ranging from assistive technologies to safety-critical monitoring. Despite strong benchmark scores, the authors find that these models frequently exhibit a Clever Hans effect, relying on visual-semantic shortcuts to hallucinate acoustic information instead of genuinely verifying the audio stream. Standard evaluation protocols mask this limitation by preserving natural cross-modal correlations, leaving a significant gap in reliable audio-visual grounding. To address this challenge, the authors introduce THUD, an intervention-driven diagnostic framework that uses counterfactual audio edits to systematically audit temporal synchronization, audio existence, and cross-modal consistency. They further demonstrate that targeted preference alignment combining these counterfactual interventions with general video data substantially improves genuine audio verification while maintaining robust general video understanding.
Dataset
-
Dataset Composition and Sources: The authors construct a two-part dataset designed for audio-visual grounding evaluation and model alignment. The core intervention dataset is sourced from the Oops collection, which features in-the-wild videos of unintentional human actions and failures. For broader contextual training, the authors supplement this with general video instruction data drawn from FineVideo.
-
Subset Details and Filtering Rules: The intervention subset applies three controlled modifications to original clips: Shift (displacing audio by a temporal offset), Mute (replacing audio with silence), and Swap (replacing audio with a track from a different video). Exact sample counts are not specified, but the authors enforce strict quality filters. Visual timestamps must align within 0.8 seconds across Gemini, GPT, and Claude, while acoustic timestamps require human verification within 0.5 seconds. Clips with unclear event onsets, weak or masked audio, or swapped tracks that are trivially identical or completely unrelated are discarded. The general video subset is enriched through Gemini re-annotation and human verification, yielding four instruction types: description, localization, attribution, and audio-dependent QA.
-
Training Usage and Mixture Strategy: The paper employs a two-stage alignment pipeline. During the supervised fine-tuning warm-up phase, the model trains exclusively on the intervention-derived preference pairs to establish audio-aware response patterns. In the subsequent direct preference optimization stage, the authors mix intervention preference pairs with the general video instruction data. This mixture prevents over-specialization to counterfactual cases while encouraging the model to verify audio evidence rather than relying on visual shortcuts.
-
Processing, Cropping, and Metadata Construction: To enable precise temporal verification, the authors convert videos into temporally ordered frame units by dividing each clip into non-overlapping windows and sampling representative frames per unit. Initial annotations are generated via Gemini using a structured JSON prompt that captures event descriptions, precise timestamps, and confidence levels. These annotations are cross-verified using the frame-unit format. After validation, the three intervention operators are applied to generate final preference pairs, where the chosen response reflects accurate audio-visual grounding and the rejected response highlights visually plausible but incorrect assumptions. The resulting assets are paired with intervention metadata and reserved for diagnostic evaluation and alignment research.
Method
The authors leverage a two-stage post-training pipeline to align multimodal models beyond visual shortcuts by integrating intervention-based data with general video instruction data. The overall framework is designed to detect and correct reliance on visual-semantic correlations while preserving broad video understanding capabilities.
The pipeline begins with the construction of intervention data, which deliberately breaks natural audio-visual correlations through three types of physical interventions: Shift, Mute, and Swap. As shown in the figure below, these interventions manipulate the temporal alignment, presence, or consistency of audio and visual signals. Specifically, Shift introduces temporal displacement by shifting the audio track relative to the video, Mute removes the audio stream entirely, and Swap substitutes the original audio with a mismatched one. These interventions are applied to source videos that exhibit salient acoustic consequences, ensuring that the resulting data exposes model behavior under counterfactual conditions.
Following intervention, the data undergoes annotation and verification. Event-time labels are assigned to both visual and audio events, and preference pairs are constructed by comparing chosen and rejected responses based on cross-model verification and human review. This process generates labeled examples where the model is expected to reject visually plausible but audio-incorrect interpretations, thereby reinforcing correct audio-visual grounding.
The training process consists of two distinct stages. In Stage 1, a supervised fine-tuning (SFT) warm-up is performed on the intervention data to establish basic audio-aware patterns. This stage initializes the model to recognize and respond appropriately to audio-visual mismatches. In Stage 2, the model undergoes preference optimization using a mixture of intervention preference pairs and general video data. The intervention pairs guide the model to reject visually plausible but audio-inconsistent outputs, while the general video data acts as a regularizer to preserve broad multimodal comprehension. This dual objective ensures that the model learns to verify audio signals rather than relying on visual priors.
Experiment
The evaluation assesses audio-visual grounding under natural and counterfactual conditions targeting temporal synchronization, sound existence, and cross-modal consistency, validating that current models heavily rely on visual priors and default synchronization assumptions rather than genuine audio verification. A subsequent experiment validates whether targeted preference alignment can correct these flaws without an alignment tax, demonstrating that counterfactual temporal and general video data substantially enhance robust temporal grounding and offset localization while preserving general multimodal capabilities. Finally, extending the training to audio existence and source consistency further reduces performance collapses across all interventions, confirming that deliberate cross-modal supervision effectively mitigates shortcut reliance and fosters reliable audio-visual understanding.
The authors analyze the reliance of multimodal models on visual shortcuts in audio-visual grounding tasks, using a heatmap to show failure rates across different models and intervention conditions. Results indicate that most models exhibit strong audio hallucination and temporal misalignment, with significant performance drops under counterfactual interventions, suggesting a dependence on visual priors rather than genuine audio-visual alignment. The analysis highlights systematic biases toward synced predictions and poor temporal offset detection, especially in models that fail to verify audio presence and timing. Most models show high failure rates in audio hallucination and temporal misalignment, indicating reliance on visual shortcuts rather than true audio-visual grounding. Models consistently prefer synced predictions, leading to high false sync alarms and poor detection of audio offsets, especially under small temporal shifts. The heatmap reveals systematic biases in error patterns, with models rarely denying real audio or correctly identifying the direction of temporal mismatches.
The authors examine audio-visual grounding in multimodal models, focusing on their reliance on visual shortcuts and the effectiveness of targeted alignment training. Results show that most models exhibit strong performance collapses under counterfactual interventions, indicating a dependence on visual-semantic priors rather than genuine audio-visual alignment. A targeted alignment approach improves temporal synchronization while preserving general capabilities, with the best model achieving significantly higher accuracy across both binary and fine-grained temporal tasks. Most models show large performance drops under counterfactual interventions, suggesting reliance on visual shortcuts rather than true audio-visual grounding. Targeted alignment training improves temporal synchronization without harming general video understanding, achieving better accuracy on both binary and fine-grained temporal tasks. The best model maintains robust performance across different temporal offset magnitudes and outperforms baselines in detecting and localizing desynchronization.
The authors analyze model performance across three audio-visual intervention tasks—Mute, Swap, and Shift—using stacked bar charts to show the distribution of predictions. Results indicate that most models exhibit strong reliance on visual shortcuts, particularly in hallucinating audio presence or synchrony, with minimal correct detection of audio absence or temporal misalignment. The best-performing models show improved accuracy in identifying desynchronized audio-visual pairs, especially in the Shift task, suggesting targeted training can enhance temporal grounding without degrading general capabilities. Models frequently hallucinate audio presence or synchrony, with minimal correct detection of audio absence or temporal misalignment. Performance drops significantly under counterfactual interventions, indicating strong reliance on visual shortcuts rather than genuine audio-visual alignment. Targeted training improves temporal grounding, with models showing stronger accuracy in detecting and localizing audio-visual desynchronization.
The authors analyze the performance of several multimodal models on audio-visual grounding tasks, focusing on their reliance on visual shortcuts and their ability to detect audio-visual inconsistencies. Results show that most models exhibit large performance drops under counterfactual interventions, indicating a strong dependence on visual priors rather than genuine audio-visual alignment. Targeted alignment training improves temporal grounding without degrading general capabilities, and the best-performing models demonstrate robustness to temporal shifts and accurate localization of audio-video misalignments. Most models show significant performance drops under counterfactual interventions, suggesting reliance on visual shortcuts rather than verifying audio presence, timing, and consistency. Targeted alignment training improves temporal synchronization without harming general video understanding, indicating effective learning of cross-modal alignment. The best-performing models exhibit robustness to temporal shifts and can accurately localize the direction and magnitude of audio-video misalignments, moving beyond simple desynchronization detection.
The authors compare the performance of various models on combined accuracy tasks involving audio-visual interventions and original conditions. Results show that their model outperforms others in both Mute and Swap intervention settings, indicating improved alignment and reduced reliance on visual shortcuts. The proposed model achieves the highest combined accuracy in both Mute and Swap intervention scenarios compared to other models. Performance improvements are observed across different models, with the proposed model showing significant gains in Mute and Swap tasks. The results suggest that the model is more effective at handling audio-visual interventions without falling back on visual priors.
The evaluation assesses multimodal models' audio-visual grounding capabilities by applying counterfactual interventions such as muting, swapping, and shifting audio to validate their reliance on visual shortcuts versus genuine cross-modal alignment. Results reveal that most models heavily depend on visual priors, frequently hallucinating audio presence or synchrony and failing to detect temporal misalignments under disrupted conditions. Targeted alignment training effectively addresses this limitation by improving temporal synchronization and desynchronization localization without compromising general video understanding. Ultimately, the experiments demonstrate that explicit cross-modal training enables models to move beyond superficial visual dependencies and achieve robust audio-visual grounding.