HyperAI


MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang

Abstract

As embodied models grow more capable, humans are expected to collaborate with multiple embodied AI agents in their workplaces and homes. To support effective interaction between human users and a multi-agent system, it is essential to interpret information from the agents in parallel and to reference the context relevant to each query. Outstanding challenges include efficiently compressing and transmitting massive amounts of individual sensory input in video form, as well as accurately aggregating multiple egocentric videos to build a system-level memory. In this paper, we first formally define a new problem: understanding multiple long-horizon egocentric videos collected concurrently from embodied agents. To facilitate research in this direction, we propose a new benchmark called MultiAgent-EgoQA (MA-EgoQA), designed to systematically evaluate current models in our scenario. MA-EgoQA provides 1,700 unique questions over multiple egocentric video streams, distributed across five categories: social interaction, task coordination, theory of mind, temporal reasoning, and environmental interaction. We also propose a simple baseline for MA-EgoQA, named EgoMAS, based on shared memory across embodied agents and agent-wise dynamic retrieval. Through a comprehensive evaluation of a variety of baseline models, including EgoMAS, on MA-EgoQA, we find that existing approaches cannot effectively process multiple egocentric video streams, highlighting the need for future advances in system-level understanding across agents. Code and benchmark are available at: https://ma-egoqa.github.io.

One-sentence Summary

Researchers from KAIST, New York University, and collaborators introduce MA-EgoQA, a benchmark for answering questions across multiple long-horizon egocentric video streams, alongside EgoMAS, a baseline using shared memory and dynamic retrieval to outperform existing models in complex multi-agent scenarios.

Key Contributions

  • The paper addresses the critical challenge of interpreting parallel sensory inputs from multiple embodied agents to enable effective human-AI communication and system-level memory aggregation.
  • It introduces MultiAgent-EgoQA, a new benchmark featuring 1.7k questions across five categories, including social interaction and temporal reasoning, derived from long-horizon egocentric video streams.
  • The authors propose EgoMAS, a baseline model using shared memory and dynamic retrieval that outperforms existing approaches by 4.48% and demonstrates the limitations of current video LLMs on this task.

Introduction

As embodied AI agents become common in shared environments like homes and workplaces, the ability for humans to query these multi-agent systems for progress monitoring or anomaly detection is critical for transparency and control. Prior research has largely focused on task allocation and action execution, leaving a significant gap in systems that can integrate long-horizon egocentric video streams from multiple agents to answer complex questions. Existing video models struggle with the massive data volume generated over days and fail to effectively aggregate experiences across different agents to form a coherent system-level memory. To address this, the authors introduce MA-EgoQA, a new benchmark featuring 1.7k questions across five reasoning categories derived from six agents operating over seven days. They also propose EgoMAS, a baseline model that utilizes shared memory and agent-wise dynamic retrieval to efficiently locate relevant events, demonstrating that current state-of-the-art models cannot yet handle the complexities of multi-agent egocentric understanding.

Dataset

MA-EgoQA Dataset Overview

  • Dataset Composition and Sources The authors construct MA-EgoQA using the EgoLife dataset, which consists of super-long egocentric video recordings from six individuals wearing camera-equipped glasses over seven consecutive days in a shared house. This foundation allows the benchmark to evaluate reasoning across multiple, temporally aligned video streams rather than relying on single-agent assumptions found in prior work.

  • Key Details for Each Subset The benchmark contains 1,741 high-quality multiple-choice questions distributed across five distinct categories designed to capture unique multi-agent dynamics:

    • Social Interaction (SI): Evaluates grounding of conversations and affiliative behaviors, including 15.9k generated single-span and multi-span samples.
    • Task Coordination (TC): Focuses on role assignment and goal completion, featuring 16.3k multi-span samples alongside single-span variants.
    • Theory of Mind (ToM): Assesses reasoning about the mental states, beliefs, and intentions of others.
    • Temporal Reasoning (TR): Divided into concurrency and comparison subcategories to test timeline alignment across agents.
    • Environmental Interaction (EI): Tracks object usage and environmental state changes distributed among agents.
  • Data Usage and Generation Strategy The authors employ a hybrid generation pipeline to create the dataset, utilizing GPT-4o and GPT-5 for candidate creation followed by rigorous filtering.

    • Open-ended Categories (SI, TC, ToM): The team generates large pools of samples by providing 5-minute video segments with dense captions and transcripts to the model, instructing it to create questions grounded by at least two agents.
    • Structured Categories (TR, EI): The authors use predefined templates and specific temporal windows (30 seconds to 1 hour) to generate queries regarding event ordering and object interaction frequency.
    • Multi-span Construction: For SI and TC, the authors group semantically similar single-span questions using cosine similarity on text embeddings to synthesize complex questions requiring reasoning across non-contiguous time windows.
  • Processing and Quality Control To ensure the benchmark is challenging and strictly multi-agent, the authors implement a multi-stage filtering and verification process:

    • LLM Filtering: Candidates undergo zero-shot testing to remove trivial questions and single-agent filtering to eliminate samples answerable by one person's memory.
    • Cross-model Validation: External models (Gemini-2.5-Flash and Claude-Sonnet-4) verify correctness and option validity to prevent model-specific biases.
    • Human Verification: Four human reviewers manually inspect 3,436 candidates against full video and transcript context, ultimately selecting the final 1,741 samples for the benchmark.
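The multi-span construction step described above, which groups semantically similar single-span questions by cosine similarity of their text embeddings, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the greedy grouping strategy, the similarity threshold, and the assumption that embeddings are already computed (e.g., by a sentence-embedding model) are all illustrative choices.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_similar_questions(embeddings: list, threshold: float = 0.8) -> list:
    """Greedily group question indices whose embeddings are similar
    (cosine similarity >= threshold) to a group's first member."""
    groups: list = []
    for i, emb in enumerate(embeddings):
        for group in groups:
            if cosine_sim(embeddings[group[0]], emb) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # start a new group for dissimilar questions
    return groups
```

Each resulting group of single-span questions would then be passed to the generation model to synthesize a multi-span question spanning the non-contiguous time windows the group covers.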

Method

The authors propose EgoMAS (Egocentric Multi-Agent System), a centralized, training-free baseline designed to address the challenges of multi-agent egocentric reasoning. The system operates through a two-stage architecture comprising an event-based shared memory and an agent-wise dynamic retrieval mechanism.

Event-based Shared Memory To achieve a system-level global understanding, the system aggregates fragmented events from multiple agents. At every 10-minute interval, each embodied agent provides a caption summarizing its observations. A centralized manager then integrates these individual captions into a system-level summary. Rather than producing a flat textual condensation, the manager identifies key events across agents and explicitly records the corresponding 4W1H fields: When, What, Where, Who, and How. This produces a coherent global memory that aligns agent perspectives while preserving critical details for reasoning.
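The 4W1H event records described above might be represented as in the following sketch. The field names mirror the paper's When/What/Where/Who/How scheme, but the class layout, agent identifiers, and lookup helper are illustrative assumptions rather than the authors' actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One system-level memory entry with explicit 4W1H fields."""
    when: str    # timestamp or interval, e.g. "Day 1, 09:00-09:10"
    what: str    # the key event itself
    where: str   # location within the shared environment
    who: list    # agents involved in the event
    how: str     # manner or means of the event

class SharedMemory:
    """Centralized store aggregating per-agent captions into events."""
    def __init__(self) -> None:
        self.events: list = []

    def add_event(self, event: Event) -> None:
        self.events.append(event)

    def events_involving(self, agent: str) -> list:
        """Return all recorded events that involve the given agent."""
        return [e for e in self.events if agent in e.who]
```

Recording the Who field explicitly is what lets later retrieval route agent-specific sub-queries to the right per-agent memory.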

Agent-wise Dynamic Retrieval Given a query $q$, EgoMAS employs a hierarchical retrieval strategy to ensure fine-grained reasoning across multiple perspectives. First, the system retrieves the top-$n$ system-level memories from the shared memory $\mathcal{M}_{\text{shared}}$ using BM25 ranking:

$$\mathcal{R}_{\text{sys}}(q) = \mathrm{Top}\text{-}n \,\{ (m, s(m, q)) \mid m \in \mathcal{M}_{\text{shared}} \},$$

where $s(m, q)$ denotes the BM25 score between memory $m$ and query $q$. From the retrieved system-level context, EgoMAS generates a set of agent-specific retrieval requests $\mathcal{Q}_{\text{agent}} = \{(a_j, q_j)\}_{j=1}^{J}$, where each request consists of an agent identifier $a_j$ and a sub-query $q_j$. For each $(a_j, q_j)$, the system performs agent-level retrieval from that agent's memory $\mathcal{M}_{a_j}$:

$$\mathcal{R}_{a_j}(q_j) = \mathrm{Top}\text{-}k \,\{ (m, s(m, q_j)) \mid m \in \mathcal{M}_{a_j} \}.$$

To ensure relevance, memories with scores below a threshold $\tau$ are filtered out:

$$\widetilde{\mathcal{R}}_{a_j}(q_j) = \{ (m, s(m, q_j)) \in \mathcal{R}_{a_j}(q_j) \mid s(m, q_j) \geq \tau \}.$$

Finally, the system generates the response by conditioning on both the retrieved system-level context $\mathcal{R}_{\text{sys}}(q)$ and the aggregated agent-level results $\widetilde{\mathcal{R}} = \bigcup_{j=1}^{J} \widetilde{\mathcal{R}}_{a_j}(q_j)$:

$$\hat{y} = F\big(q, \mathcal{R}_{\text{sys}}(q), \widetilde{\mathcal{R}}\big),$$

where $\hat{y}$ and $F$ denote the response and the response-generation function, respectively.
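The two-stage retrieval above can be sketched with a minimal BM25 scorer. This is an illustrative approximation, not the paper's implementation: the whitespace tokenizer, the BM25 hyperparameters (`k1`, `b`), and the placeholder that reuses the original query as every agent's sub-query (where the real system has an LLM derive agent-specific sub-queries from the system-level context) are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each document against the query with standard BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(t for d in tokenized for t in set(d))  # document frequency
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def retrieve(query: str, memories: list, top: int, tau: float = 0.0) -> list:
    """Top-`top` memories by BM25 score, dropping any scored below tau."""
    ranked = sorted(zip(memories, bm25_scores(query, memories)), key=lambda x: -x[1])
    return [(m, s) for m, s in ranked[:top] if s >= tau]

def answer(query: str, shared: list, agent_mems: dict, n: int = 3, k: int = 2, tau: float = 0.1):
    """EgoMAS-style two-stage retrieval: system-level, then agent-level."""
    sys_ctx = retrieve(query, shared, top=n)  # R_sys(q), no threshold
    # Placeholder: reuse the query as each agent's sub-query q_j.
    agent_ctx = {a: retrieve(query, mems, top=k, tau=tau) for a, mems in agent_mems.items()}
    return sys_ctx, agent_ctx
```

The threshold `tau` plays the role of $\tau$ in the agent-level filtering step, while the system-level stage keeps all top-$n$ hits, matching the asymmetry in the formulation.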

Benchmark Generation Process To support this research, the authors also establish a rigorous data generation pipeline. This process involves three stages: QA Generation, Filtering, and Manual Verification. In Stage I, questions are generated based on categories such as Single-span QA, Multi-span QA, and Template-based queries (TR, EI). Stage II applies zero-shot filtering, single-agent filtering, and cross-model verification to ensure quality. Finally, Stage III involves human verification to validate the dataset.
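The Stage II filtering logic can be sketched as a simple pipeline over candidate questions. The three predicates below stand in for the LLM-based checks named above (zero-shot testing, single-agent filtering, cross-model verification); their names and the dict-based candidate representation are hypothetical.

```python
def filter_candidates(candidates: list,
                      is_trivial,
                      single_agent_answerable,
                      cross_model_valid) -> list:
    """Stage II filtering: drop trivial, single-agent, or invalid questions."""
    kept = []
    for qa in candidates:
        if is_trivial(qa):
            continue              # answerable without any video context
        if single_agent_answerable(qa):
            continue              # answerable from one agent's memory alone
        if cross_model_valid(qa):
            kept.append(qa)       # external models agree the answer holds
    return kept
```

Survivors of this stage then proceed to Stage III, the human verification pass.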

Experiment

  • Evaluation on the MA-EgoQA benchmark demonstrates that current models struggle with multi-agent egocentric video understanding, with even top proprietary models achieving low accuracy, highlighting the task's difficulty.
  • Experiments comparing input strategies reveal that concatenating all captions or frames without retrieval introduces noise and high computational costs, whereas retrieval-based approaches significantly improve efficiency and performance.
  • The EgoMAS framework outperforms all baselines by effectively aggregating memories from multiple agents, proving that multi-agent memory access is essential for accurate reasoning.
  • Analysis of sub-categories shows that performance degrades as the number of required agents or time spans increases, and Theory of Mind tasks remain the most challenging due to the need for inferring latent mental states.
  • Ablation studies confirm that EgoMAS benefits from combining shared memory construction with agent-wise dynamic retrieval, and that an event-based memory structure is superior to alternative methods.
  • Sensitivity analysis indicates that accuracy improves with the number of available agents, while modality experiments suggest that visual frames are crucial for specific queries but can distract models if not selected adaptively.
