HyperAIHyperAI

Command Palette

Search for a command to run...

PEARL: Personalized Streaming Video Understanding Model

초록

인간이 새로운 개념을 인지하는 과정은 본질적으로 스트리밍 방식입니다. 우리는 지속적으로 새로운 객체나 정체성을 인식하고 시간에 따라 기억을 업데이트합니다. 그러나 현재 다중 모달 개인화 방법론은 주로 정적 이미지나 오프라인 비디오에 국한되어 있습니다. 이로 인해 연속적인 시각적 입력과 즉각적인 현실 세계 피드백 간의 단절이 발생하여, 향후 AI 어시스턴트에 필수적인 실시간 상호작용형 개인화 응답 제공 능력이 제한됩니다. 이러한 간극을 해소하기 위해, 우리는 먼저 '개인화 스트리밍 비디오 이해(Personalized Streaming Video Understanding, PSVU)'라는 새로운 태스크를 제안하고 엄밀하게 정의합니다. 이 새로운 연구 방향을 촉진하기 위해, 본 과제에 특화된 최초의 포괄적인 벤치마크인 PEARL-Bench를 소개합니다. PEARL-Bench는 두 가지 모드에서 정확한 타임스탬프 시점에 개인화된 개념에 응답하는 모델의 능력을 평가합니다: (1) 프레임 수준(Frame-level) 모드는 이산적인 프레임 내 특정 인물이거나 객체에 초점을 맞추며, (2) 새로운 비디오 수준(Video-level) 모드는 연속적인 프레임에 걸쳐 전개되는 개인화된 행동에 초점을 맞춥니다. PEARL-Bench는 132 개의 고유한 비디오와 정확한 타임스탬프가 포함된 2,173 개의 세밀한 주석을 구성합니다. 개념의 다양성과 주석의 품질은 자동 생성과 인간 검증을 결합한 파이프라인을 통해 엄격히 보장됩니다. 이러한 도전적인 새로운 설정을 해결하기 위해, 우리는 추가적으로 플러그 앤 플레이 방식이며 학습이 불필요한 강력한 베이스라인 전략인 PEARL 을 제안합니다. 8 개의 오프라인 및 온라인 모델에 대한 광범위한 평가를 통해 PEARL 이 최첨단 성능을 달성함을 입증했습니다. 특히, PEARL 은 3 가지 서로 다른 아키텍처에 적용될 때 일관된 PSVU 성능 향상을 가져와 매우 효과적이고 견고한 전략임을 입증했습니다. 본 연구가 비전 - 언어 모델(VLM) 개인화 분야를 발전시키고, 스트리밍 기반 개인화 AI 어시스턴트에 대한 추가 연구를 촉진하기를 기대합니다. 관련 코드는 https://github.com/Yuanhong-Zheng/PEARL 에서 확인 가능합니다.

One-sentence Summary

Researchers from Peking University and collaborators propose PEARL, a training-free framework for Personalized Streaming Video Understanding that introduces a dual-grained memory system to enable real-time, personalized responses in continuous video streams, outperforming existing offline and online models on their new PEARL-Bench.

Key Contributions

  • The paper introduces the Personalized Streaming Video Understanding task and PEARL-Bench, the first comprehensive benchmark featuring 132 videos and 2,173 fine-grained annotations to evaluate frame-level and video-level personalization with precise temporal localization.
  • A training-free, plug-and-play framework named PEARL is presented, which utilizes a Dual-grained Memory System to decouple concept knowledge from stream observations and a Concept-aware Retrieval Algorithm to enable real-time responses without parameter updates.
  • Extensive evaluations across eight offline and online models demonstrate that the proposed method achieves state-of-the-art performance, delivering average improvements of 13.79% at the frame-level and 12.80% at the video-level across three distinct architectures.

Introduction

Current Vision-Language Models excel at personalization but remain confined to static images or offline video processing, creating a disconnect with the continuous, real-time nature of human cognition and real-world AI assistant applications. Existing methods struggle to handle streaming visual inputs, dynamically define user concepts on the fly, or provide instant interactive feedback without computationally expensive retraining. To address these gaps, the authors formally define the novel task of Personalized Streaming Video Understanding and introduce PEARL-Bench, the first comprehensive benchmark for evaluating frame-level and video-level personalization in continuous streams. They further propose PEARL, a training-free, plug-and-play framework that leverages a dual-grained memory system and concept-aware retrieval to enable off-the-shelf models to achieve state-of-the-art performance in real-time personalized video understanding.

Dataset

  • Dataset Composition and Sources: The authors introduce PEARL-Bench, a benchmark comprising 132 videos and 2,173 annotations designed for long-form streaming video personalization. The frame-level split draws from diverse public internet sources including anime, movies, and reality shows, while the video-level split utilizes digital human synthesis via Mixamo assets to ensure repeated, clean action annotations across 8 characters, 20 actions, and 20 backgrounds.

  • Key Details for Each Subset: The dataset supports two distinct concept types: Frame-level concepts (static characters) and Video-level concepts (dynamic actions). Annotations are categorized into three QA types: Concept-Definition for registering new concepts, Real-Time for querying current states, and Past-Time for retrieving historical evidence. The average video duration is 1,458 seconds, with all annotations linked to precise timestamps.

  • Model Usage and Training Strategy: The authors use Concept-Definition QA exclusively for memory registration during the interaction phase and exclude it from final evaluation metrics. Real-Time and Past-Time QA serve as the primary evaluation splits, where models must ground recognized concepts in current scenes or retrieve historical clips to answer questions. The benchmark includes fine-grained sub-categories such as Presence, Behavior, Appearance, Location, Relation, and Action to test multi-dimensional perception.

  • Processing and Metadata Construction: The curation pipeline involves manual filtering for high dynamics and resolution (480p+), followed by a rigorous quality control phase using automated ablation tests and human verification. To enhance robustness, the authors replace original names with 10,000 common names from the U.S. SSA database. They also employ specific prompting strategies to generate compact descriptions that focus on stable features for frame-level concepts and kinematic patterns for video-level concepts, ensuring generalizability across different scenes and subjects.

Method

The authors propose PEARL, a plug-and-play framework designed for Personalized Streaming Video Understanding (PSVU). Unlike prior approaches restricted to static definitions or single-turn interactions, PEARL enables dynamic concept definition and real-time responses within an online streaming video context. Refer to the framework diagram to observe the transition from static image/video understanding to the proposed dynamic, multi-turn streaming capability.

Formally, the streaming video is defined as an infinite sequence V=[X1,X2,]V = [\mathcal{X}_1, \mathcal{X}_2, \ldots]V=[X1,X2,], where Xi\mathcal{X}_iXi represents a semantic scene. Users can dynamically introduce new concepts C\mathcal{C}C at arbitrary timestamps via instructions. For a query QQQ issued at time tqt_qtq, the model M\mathcal{M}M constructs a context to generate a response AAA: A=M(Csub,Vcontext,Q)A = \mathcal{M}(\mathcal{C}_{sub}, \mathcal{V}_{context}, Q)A=M(Csub,Vcontext,Q) where Csub\mathcal{C}_{sub}Csub is the query-relevant concept subset and Vcontext\mathcal{V}_{context}Vcontext is the necessary visual context.

To manage the unbounded stream and evolving concepts, the authors design a Dual-grained Memory System that decouples concept-centric knowledge from stream-centric observations. This system comprises a Streaming Memory that incrementally archives segmented clips with compact multimodal embeddings, and a Concept Memory that stores structured representations of user-defined concepts. As shown in the figure below, the architecture explicitly separates these two memory streams to facilitate efficient retrieval.

The framework supports both frame-level and video-level concepts. For frame-level concepts, the system stores the last frame of the current clip as visual evidence. For video-level concepts, the entire clip serves as evidence, focusing on core kinematics and movement patterns rather than static identity features. The figure below illustrates examples of these different concept granularities, showing how specific actions or character appearances are defined and queried.

Upon receiving a query, the Concept-aware Retrieval Algorithm is triggered to retrieve relevant information. The model first identifies concept names in the query and retrieves corresponding entries from the Concept Memory. It then rewrites the query by replacing concept names with their textual descriptions to improve retrieval accuracy. This rewritten query is encoded and compared against clip embeddings in the Streaming Memory to select the top-KKK most relevant historical clips. Finally, the retrieved concepts, historical clips, current clip, and original query are fed into the VLM to generate the response. The interaction flow demonstrates how the system registers video-level concepts via tool calls and subsequently recognizes these actions in real-time.

Experiment

  • Baseline comparisons validate that offline models fail on streaming tasks due to limited context and lack of memory, while PEARL significantly outperforms both offline and online baselines across frame-level and video-level settings by effectively managing continuous visual streams.
  • Human score and text-only experiments confirm that the benchmark requires visual grounding, as text priors alone yield poor performance while human annotators achieve high accuracy with full visual access.
  • Ablation studies demonstrate that Concept Memory is indispensable for linking user-defined names to entities, Streaming Memory is essential for reasoning over historical evidence in past-time queries, and Query Rewriting improves retrieval accuracy by converting personalized names into descriptive semantics.
  • Efficiency analysis shows that PEARL introduces minimal latency overhead compared to base models while maintaining real-time capabilities, with inference time dominated by the LLM rather than retrieval modules.
  • Hyperparameter and scale analyses reveal that moderate retrieval suffices for accurate answers, excessive historical data can introduce noise for real-time queries, and the framework consistently boosts performance across various model sizes, whereas scaling offline models without a streaming framework yields negligible gains.

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp