8 months ago

Video Understanding

Visual Question Answering

Method/Architecture

Computer Vision

Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu

Abstract

We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., spanning days and weeks) egocentric videos, which leverages a structured Chain-of-Tool-Thought (CoTT) process orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking specific tools, one per step, to iteratively and collaboratively answer sub-questions covering tasks such as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm involving supervised finetuning (SFT) of a pretrained language model on CoTT data, followed by RL, to enable our agent to dynamically propose step-by-step tools for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
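The one-tool-per-step loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tool names (`temporal_retrieval`, `visual_understanding`), their canned outputs, the `scripted_policy` stand-in for the trained agent, and the `max_steps` limit are all hypothetical, chosen only to show the control flow of thinking, calling one tool, observing, and repeating until an answer is produced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One CoTT-style step: a thought, a single tool call, and its observation."""
    thought: str
    tool: str
    argument: str
    observation: str


# Hypothetical tools with canned outputs (assumptions, not the paper's toolset).
def temporal_retrieval(query: str) -> str:
    return f"segments matching '{query}': DAY3 18:00-18:30"


def visual_understanding(segment: str) -> str:
    return f"clip {segment}: the camera wearer is cooking pasta"


TOOLS: Dict[str, Callable[[str], str]] = {
    "temporal_retrieval": temporal_retrieval,
    "visual_understanding": visual_understanding,
}

# A policy returns (thought, tool_name, argument); tool_name None means
# "stop", in which case the argument is the final answer.
Policy = Callable[[str, List[Step]], Tuple[str, Optional[str], str]]


def scripted_policy(question: str, history: List[Step]) -> Tuple[str, Optional[str], str]:
    """Stand-in for the trained agent: a fixed script, purely for illustration."""
    if len(history) == 0:
        return ("First locate when dinner happened.", "temporal_retrieval", "dinner")
    if len(history) == 1:
        return ("Now inspect the retrieved segment.", "visual_understanding", "DAY3 18:00-18:30")
    return ("Enough evidence has been gathered.", None, "pasta")


def run_cott(
    question: str,
    policy: Policy,
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Run the loop: at most one tool call per step, until the policy answers."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, tool, argument = policy(question, history)
        if tool is None:
            return argument, history  # final answer plus the full trace
        observation = tools[tool](argument)
        history.append(Step(thought, tool, argument, observation))
    return None, history  # step budget exhausted without an answer


answer, trace = run_cott("What did I cook for dinner on day 3?", scripted_policy, TOOLS)
for step in trace:
    print(f"{step.tool}({step.argument!r}) -> {step.observation}")
print("answer:", answer)
```

In the framework itself the policy would be a language model finetuned on CoTT traces and then optimized with RL; the scripted function here only marks where that model would plug in.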

