2 months ago

Yujiao Shen Shulin Tian Jingkang Yang Ziwei Liu

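Abstract

Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. This paper challenges that trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations show that the value of longer context depends on the backbone rather than increasing uniformly with model scale, and they reveal a consistent perception-memory trade-off: adding more historical context improves recall but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance gains from added complexity can be evaluated more clearly.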

One-sentence Summary

Researchers from Nanyang Technological University introduce SIMPLESTREAM, a minimal baseline that feeds only recent frames to off-the-shelf VLMs, outperforming complex memory-centric models on OVO-Bench and StreamingBench while revealing a critical perception-memory trade-off.

Key Contributions

The paper introduces SIMPLESTREAM, a minimal streaming baseline that processes only the most recent $N$ frames with an off-the-shelf VLM, eliminating the need for complex memory banks, retrieval systems, or compression modules.
Comprehensive evaluations on OVO-Bench and StreamingBench demonstrate that this simple recent-context approach achieves state-of-the-art performance while maintaining lower peak GPU memory usage and competitive latency compared to prior streaming methods.
Controlled ablation studies reveal that the benefit of longer context is backbone-dependent rather than uniform across model scales, and that adding historical context often improves memory recall at the expense of real-time perception.

Introduction

Streaming video understanding is critical for real-time applications where models must process continuous video feeds under strict causal and memory constraints. Prior research has increasingly relied on complex memory mechanisms, such as external banks, retrieval systems, or compression modules, based on the assumption that managing long-term history requires elaborate architectural designs. However, these sophisticated approaches often yield modest gains while introducing significant computational overhead and a trade-off where enhanced memory recall can degrade real-time scene perception. The authors introduce SIMPLESTREAM, a minimal baseline that feeds only the most recent N frames directly to an off-the-shelf VLM without additional memory or training. They demonstrate that this simple recency-based approach matches or surpasses complex streaming models on major benchmarks like OVO-Bench and StreamingBench, revealing that longer context benefits are backbone-dependent rather than universal and arguing for a new evaluation standard that separates perception from memory performance.

Method

The authors introduce SimpleStream as a deliberately simple baseline designed to isolate the capabilities of current off-the-shelf Vision Language Models (VLMs) using only recent visual context. Unlike prior streaming systems that incorporate mechanisms for managing long-range history, SimpleStream relies on a sliding window approach. The paper's framework diagram illustrates how the system processes a continuous video stream by selecting a "Recent N-frames window" that ends at the current frame and feeding it into the Vision Language Model.

Let the video stream be represented as a sequence of frames where $f_i$ denotes the visual frame at time step $i$. Given a question $q_t$ at time $t$, the method feeds the base VLM only the most recent $N$ frames and the text query. This process is formalized as:

$\mathbf{SIMPLESTREAM}(t) = \mathrm{VLM}\big(\{f_{t-N+1}, \ldots, f_t\},\, q_t\big)$

By construction, SimpleStream omits additional memory mechanisms, meaning frames outside the sliding window are discarded. Consequently, per-query memory and computation remain bounded by $N$ and do not grow with the stream length. The method introduces no architectural modification, memory module, or additional training; it functions strictly as an inference-time input policy applied to an off-the-shelf VLM.
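The input policy described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `RecentWindow`, `simple_stream_answer`, and `fake_vlm` are hypothetical names, and `fake_vlm` is a stand-in callable rather than a real model API. It assumes frames arrive one at a time and that the VLM can be called with a list of frames plus a question.

```python
from collections import deque
from typing import Any, Callable, Deque, List


class RecentWindow:
    """Holds only the N most recent frames; anything older is discarded."""

    def __init__(self, n: int) -> None:
        if n < 1:
            raise ValueError("window size N must be at least 1")
        # deque(maxlen=n) evicts the oldest item automatically once full,
        # so storage is bounded by N regardless of stream length.
        self._frames: Deque[Any] = deque(maxlen=n)

    def push(self, frame: Any) -> None:
        self._frames.append(frame)

    def frames(self) -> List[Any]:
        # Oldest-to-newest order: f_{t-N+1}, ..., f_t.
        return list(self._frames)


def simple_stream_answer(
    window: RecentWindow,
    question: str,
    vlm: Callable[[List[Any], str], str],
) -> str:
    """Answer q_t from the current window only (no memory, no retrieval)."""
    return vlm(window.frames(), question)


def fake_vlm(frames: List[Any], question: str) -> str:
    # Hypothetical stand-in for an off-the-shelf VLM call.
    return f"{question} -> saw frames {frames}"


if __name__ == "__main__":
    window = RecentWindow(n=4)
    for i in range(1, 11):  # integers stand in for frames f_1 .. f_10
        window.push(i)
    print(simple_stream_answer(window, "What is happening now?", fake_vlm))
```

With N = 4 and ten frames pushed, only frames 7 through 10 reach the model. The bounded `deque` is what keeps per-query cost independent of how long the stream has been running.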

The paper's architectural comparison highlights how SimpleStream differs from other context management strategies. While alternative approaches use External Memory, Retrieval, Compression, or Latent Memory to handle long-term dependencies, SimpleStream bypasses these components entirely. It serves as a controlled reference baseline to determine how much streaming performance can be obtained from recent visual context alone while minimizing confounding effects from additional training or system-level engineering.

Experiment

Experiments on OVO-Bench and StreamingBench validate that SIMPLESTREAM, a minimalist approach using only a fixed recent frame window, outperforms complex streaming systems with dedicated memory banks or retrieval modules, particularly in real-time visual perception tasks.
Ablation studies on recency window size and model scale demonstrate that performance does not improve monotonically with longer context; while modest window expansions help, further increases often yield diminishing returns or degradation, indicating that more historical context is not universally beneficial.
Visual-RAG analysis reveals a distinct perception-memory trade-off where retrieving historical chunks improves episodic memory recall but consistently degrades real-time perception, suggesting that current memory injection techniques often corrupt the model's understanding of the present scene.
Efficiency evaluations confirm that SIMPLESTREAM maintains low latency and stable GPU memory usage regardless of stream length, indicating that persistent historical state is not required for competitive streaming inference.
Overall conclusions indicate that current benchmarks heavily favor recent perception capabilities, and future progress requires methods that can leverage historical evidence without sacrificing the clarity of current-scene understanding.
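One way to make the perception-memory trade-off visible is to report accuracy per task group instead of a single average. The sketch below is an assumption-laden illustration: the group labels `"perception"` and `"memory"`, the function `accuracy_by_group`, and the toy results are all hypothetical and are not taken from the paper or from either benchmark's tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_group(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy separately for each task group.

    Each item is (group_label, was_the_answer_correct).
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for group, correct in results:
        totals[group] += 1
        hits[group] += int(correct)
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Made-up outcomes for illustration only; not results from the paper.
    toy_results = [
        ("perception", True),
        ("perception", True),
        ("perception", False),
        ("memory", True),
        ("memory", False),
    ]
    for group, acc in accuracy_by_group(toy_results).items():
        print(f"{group}: {acc:.3f}")
```

Reporting the two groups side by side lets a change in window size or an added memory module be judged on each axis separately, so a gain in one cannot hide a loss in the other.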

A Simple Baseline for Streaming Video Understanding
