il y a 2 mois

Table des matières

Résumé

Les avancées récentes dans la génération vidéo ont permis de produire des contenus très réalistes, souvent indiscernables des vidéos réelles, ce qui pose aujourd’hui un défi sociétal croissant en matière de détection des vidéos générées par IA (AIGC). Les benchmarks précédents de détection AIGC se sont principalement concentrés sur des vidéos sans audio, ciblant des domaines narratifs larges et se limitant à une classification simple. Il reste toutefois incertain que les modèles de génération vidéo les plus performants soient capables de produire des vidéos immersives associées à un audio, capables de tromper de manière fiable à la fois les humains et les modèles linguistiques visuels (VLM). À cet effet, nous introduisons Video Reality Test, une suite de benchmarks vidéo fondée sur des sources ASMR, conçue pour évaluer la réalité perceptive dans des conditions de couplage audio-visuel serré. Cette plateforme présente les dimensions suivantes : (i) sources vidéo-audio ASMR immersives. Construites à partir de vidéos ASMR réelles soigneusement sélectionnées, cette base cible des interactions fines entre actions et objets, avec une grande diversité en termes d’objets, d’actions et de fonds. (ii) Évaluation par pairs. Protocole adversarial où les modèles de génération vidéo agissent comme créateurs cherchant à tromper des évaluateurs humains, tandis que les VLM jouent le rôle d’évaluateurs chargés d’identifier les contenus falsifiés. Nos résultats expérimentaux montrent que le meilleur modèle créateur, Veo3.1-Fast, parvient même à tromper la majorité des VLM : le meilleur évaluateur (Gemini 2.5-Pro) atteint une précision de seulement 56 % (aléatoire : 50 %), bien en dessous de celle des experts humains (81,25 %). L’ajout d’un audio améliore la discrimination entre vidéo réelle et fausse, mais des indices superficiels, tels que des filigranes, peuvent encore fortement induire en erreur les modèles. Ces résultats définissent la frontière actuelle de la réalisme dans la génération vidéo et mettent en évidence les limites des VLM en matière de fidélité perceptive et de cohérence audio-visuelle. Le code est disponible à l’adresse suivante : https://github.com/video-reality-test/video-reality-test.

One-sentence Summary

Researchers from CUHK, NUS, Video Rebirth, and University of Oxford propose Video Reality Test, an ASMR-based benchmark evaluating audio-visual realism in AI-generated videos through an adversarial creator-reviewer protocol; their tests reveal generators like Veo3.1-Fast fool VLMs (Gemini 2.5-Pro: 56% accuracy) despite human superiority (81.25%), highlighting audio's role and watermark vulnerabilities in AIGC detection.

Key Contributions

Introduces Video Reality Test, the first benchmark specifically designed to evaluate AI-generated video detection under tight audio-visual coupling using immersive ASMR content, addressing the gap where prior work ignored audio and focused on non-deceptive classification tasks without testing if videos can fool humans or VLMs. This benchmark features 149 curated real ASMR videos with diversity across objects, actions, and backgrounds, paired with AI-generated counterparts for rigorous real-vs-fake assessment.
Proposes a novel Peer-Review evaluation framework where video generation models act as adversarial "creators" attempting to deceive "reviewers" (VLMs or humans), establishing a competitive arena to jointly benchmark generation and detection capabilities through reciprocal performance metrics. This protocol reveals the dynamic interplay between creators and reviewers, highlighting which models excel at deception or detection.
Demonstrates through experiments that state-of-the-art generators like Veo3.1-Fast significantly challenge current detection systems, with the strongest VLM reviewer (Gemini 2.5-Pro) achieving only 56% accuracy—far below human experts (81.25%)—while showing that incorporating audio improves discrimination by approximately 5 points, though superficial cues like watermarks still mislead models. These results quantify the realism boundary of modern video generation and expose critical VLM limitations in perceptual fidelity.

Introduction

The rapid proliferation of realistic AI-generated videos threatens information integrity by enabling sophisticated misinformation campaigns. Prior detection methods focused narrowly on facial forgeries or spatial artifacts using benchmarks like FaceForensics++, but struggle with diffusion models that exhibit unprecedented temporal coherence and reduced visual inconsistencies. Existing approaches also fail to holistically evaluate audio-visual realism or model the adversarial dynamic between generators and detectors. The authors introduce Video Reality Test as a unified benchmark that frames detection as a competitive peer-review process between video generators ("creators") and multimodal detectors ("reviewers"), specifically using high-fidelity ASMR content to evaluate both real and synthetic videos across audio-visual dimensions. This framework uniquely addresses the limitations of prior isolated evaluations by enabling scalable assessment of generation quality and detection robustness within a single adversarial ecosystem.

Dataset

The authors use Video Reality Test, a benchmark built from 149 curated real ASMR videos sourced from YouTube (selected from 1,400 candidates with >900k views as a proxy for human immersion).
Subset details:
- Easy set: 49 videos; short (3–5 seconds), single-step actions (e.g., cutting food), minimal backgrounds (e.g., dark indoor), low object diversity.
- Hard set: 100 videos; longer (up to 20+ seconds), multi-step actions (e.g., squeezing, peeling), diverse settings (indoor/outdoor), and varied objects (solids like soap, liquids like paint).
- Fake videos are dynamically generated per experiment by K video models, yielding a total dataset size of 149 × (1 + K).
Usage in the paper:
- Real videos form the baseline; fake videos are generated on-demand using first frames or Gemini-2.5-Pro text descriptions as prompts.
- Data trains a "Peer-Review" game: VLMs ("reviewers") classify videos as real/fake, while generators ("creators") aim to minimize detection rates.
- Evaluation splits measure reviewer accuracy (target: 1 for real, 0 for fake) and creator fakeness rates.
Processing details:
- Cropping: Raw videos are split at scene transitions (using frame color histograms/Bhattacharyya distance >0.5), then backgrounds/watermarks are removed via grayscale column analysis.
- Metadata: Gemini-2.5-Pro generates immersive text descriptions from 8 evenly sampled frames, avoiding structural markers.
- Curation: Human verification ensures single-theme segments; Qwen3-embedding-4B clusters descriptions into 8 classes, with 12–13 high-quality samples per class selected to balance representation (100 total).

Experiment

Evaluated VLMs on Video Reality Test ASMR benchmark: top model Gemini-3-Pro achieved 76.27% accuracy across 7 VLMs, significantly below human performance (89.11%), validating current VLMs' limited real-vs-fake detection capability.
Tested VGMs via fooling rate: Veo3.1-Fast generated videos detected as fake only 12.54% of the time, surpassing Sora2 and open-source models, demonstrating strongest deception ability against VLMs.
Analyzed audio impact: incorporating audio improved VLM detection accuracy by ~5 points on average, as mismatched audio (e.g., Sora2's human voices in ASMR) provided discriminative cues.
Assessed watermark influence: GPT-5 and Gemini-2.5-Pro detection accuracy dropped from 95.43% to 52.4% after Sora2 watermark removal, revealing reliance on superficial shortcuts over genuine perceptual analysis.
Identified VLM bias: models classified 71% of videos as real (including 33% of fake videos), confirming systematic conservative bias toward authenticating content regardless of source.

The authors use a peer-review evaluation to assess how well video generation models fool video understanding models, with results showing Veo3.1-fast generates the most deceptive videos, as only 19 out of 96 were correctly identified as fake. Sora2 and StepVideo-t2v videos are more easily detected, with 63 and 84 false predictions respectively, indicating weaker realism. Across all models, VLMs exhibit a strong bias toward classifying videos as real, with real videos being correctly identified 88 times out of 95, while generated videos are often misclassified as real.

The authors use this table to report how often a video understanding model misclassifies real and generated ASMR videos as real, revealing a strong bias toward labeling videos as authentic. Veo3.1-fast generated videos are most frequently mistaken for real (83 out of 100), while StepVideo T2V videos are most often correctly identified as fake (84 out of 100). Across all models, real videos are overwhelmingly predicted as real (94 out of 100), confirming the model’s conservative tendency to default to “real” even when evaluating synthetic content.

The authors use a peer-review evaluation to test how well video generation models can fool video understanding models, with results showing Veo3.1-fast generates the most deceptive videos, as 89% are misclassified as real. Sora2 and Hunyuan models also produce highly realistic outputs, with detection rates below 15%, while StepVideo T2V is the least convincing, with 92% of its videos correctly identified as fake. Across all models, video understanding systems exhibit a strong bias toward labeling videos as real, even when they are AI-generated.

The authors evaluate video understanding models under different prompt and reasoning settings, finding that requiring models to think before answering improves performance for stronger models but offers little benefit for weaker ones. Results also show that preference-based evaluation (comparing real and fake pairs) is easier than single-video judgment, with models achieving higher accuracy in paired settings regardless of whether the fake video is generated from the same or a different real source.

The authors use this table to show how often different video generation models fool a video understanding model, with Veo3.1-fast generating the most deceptive videos as only 5% were correctly identified as fake. Sora2 models show extreme bias, with Sora2 T2V being misclassified as real 100% of the time, while Sora2 IT2V was detected as fake in 91% of cases, suggesting model configuration significantly affects detectability. Across all models, the VLM consistently predicts videos as real rather than fake, confirming a strong classification bias toward authenticity.

PDF source Voir le code

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

il y a 2 mois

Vision Par Ordinateur

Tâche

Jiaqi Wang Weijia Wu Yi Zhan Rui Zhao Ming Hu James Cheng Wei Liu Philip Torr Kevin Qinghong Lin

Table des matières

Résumé

One-sentence Summary

Key Contributions

Introduces Video Reality Test, the first benchmark specifically designed to evaluate AI-generated video detection under tight audio-visual coupling using immersive ASMR content, addressing the gap where prior work ignored audio and focused on non-deceptive classification tasks without testing if videos can fool humans or VLMs. This benchmark features 149 curated real ASMR videos with diversity across objects, actions, and backgrounds, paired with AI-generated counterparts for rigorous real-vs-fake assessment.
Proposes a novel Peer-Review evaluation framework where video generation models act as adversarial "creators" attempting to deceive "reviewers" (VLMs or humans), establishing a competitive arena to jointly benchmark generation and detection capabilities through reciprocal performance metrics. This protocol reveals the dynamic interplay between creators and reviewers, highlighting which models excel at deception or detection.
Demonstrates through experiments that state-of-the-art generators like Veo3.1-Fast significantly challenge current detection systems, with the strongest VLM reviewer (Gemini 2.5-Pro) achieving only 56% accuracy—far below human experts (81.25%)—while showing that incorporating audio improves discrimination by approximately 5 points, though superficial cues like watermarks still mislead models. These results quantify the realism boundary of modern video generation and expose critical VLM limitations in perceptual fidelity.

Introduction

Dataset

The authors use Video Reality Test, a benchmark built from 149 curated real ASMR videos sourced from YouTube (selected from 1,400 candidates with >900k views as a proxy for human immersion).
Subset details:
- Easy set: 49 videos; short (3–5 seconds), single-step actions (e.g., cutting food), minimal backgrounds (e.g., dark indoor), low object diversity.
- Hard set: 100 videos; longer (up to 20+ seconds), multi-step actions (e.g., squeezing, peeling), diverse settings (indoor/outdoor), and varied objects (solids like soap, liquids like paint).
- Fake videos are dynamically generated per experiment by K video models, yielding a total dataset size of 149 × (1 + K).
Usage in the paper:
- Real videos form the baseline; fake videos are generated on-demand using first frames or Gemini-2.5-Pro text descriptions as prompts.
- Data trains a "Peer-Review" game: VLMs ("reviewers") classify videos as real/fake, while generators ("creators") aim to minimize detection rates.
- Evaluation splits measure reviewer accuracy (target: 1 for real, 0 for fake) and creator fakeness rates.
Processing details:
- Cropping: Raw videos are split at scene transitions (using frame color histograms/Bhattacharyya distance >0.5), then backgrounds/watermarks are removed via grayscale column analysis.
- Metadata: Gemini-2.5-Pro generates immersive text descriptions from 8 evenly sampled frames, avoiding structural markers.
- Curation: Human verification ensures single-theme segments; Qwen3-embedding-4B clusters descriptions into 8 classes, with 12–13 high-quality samples per class selected to balance representation (100 total).

Experiment

Evaluated VLMs on Video Reality Test ASMR benchmark: top model Gemini-3-Pro achieved 76.27% accuracy across 7 VLMs, significantly below human performance (89.11%), validating current VLMs' limited real-vs-fake detection capability.
Tested VGMs via fooling rate: Veo3.1-Fast generated videos detected as fake only 12.54% of the time, surpassing Sora2 and open-source models, demonstrating strongest deception ability against VLMs.
Analyzed audio impact: incorporating audio improved VLM detection accuracy by ~5 points on average, as mismatched audio (e.g., Sora2's human voices in ASMR) provided discriminative cues.
Assessed watermark influence: GPT-5 and Gemini-2.5-Pro detection accuracy dropped from 95.43% to 52.4% after Sora2 watermark removal, revealing reliance on superficial shortcuts over genuine perceptual analysis.
Identified VLM bias: models classified 71% of videos as real (including 33% of fake videos), confirming systematic conservative bias toward authenticating content regardless of source.

PDF source Voir le code

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

Command Palette

Test de Réalité Vidéo : Les vidéos ASMR générées par l’IA peuvent-elles tromper les VLM et les humains ?

Jiaqi Wang Weijia Wu Yi Zhan Rui Zhao Ming Hu James Cheng Wei Liu Philip Torr Kevin Qinghong Lin

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

Test de Réalité Vidéo : Les vidéos ASMR générées par l’IA peuvent-elles tromper les VLM et les humains ?

Jiaqi Wang Weijia Wu Yi Zhan Rui Zhao Ming Hu James Cheng Wei Liu Philip Torr Kevin Qinghong Lin

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

Test de Réalité Vidéo : Les vidéos ASMR générées par l’IA peuvent-elles tromper les VLM et les humains ?

Jiaqi Wang Weijia Wu Yi Zhan Rui Zhao Ming Hu James Cheng Wei Liu Philip Torr Kevin Qinghong Lin

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters