EgoThink: A First-person Visual Question Answering Benchmark Dataset

EgoThink is a first-person-perspective visual question answering benchmark dataset proposed by Tsinghua University. The dataset contains 700 images covering 6 core capabilities, broken down into 12 dimensions. EgoThink's images are sampled from the Ego4D first-person video dataset; to ensure diversity, at most 2 images are sampled from each video.
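The per-video sampling cap described above can be sketched as follows. This is a minimal illustration, not EgoThink's actual construction pipeline; the function and variable names are hypothetical.

```python
import random

def sample_frames(frame_index, max_per_video=2, seed=0):
    """Sample at most `max_per_video` frames from each source video.

    `frame_index` maps a video id to the list of candidate frame ids
    (hypothetical structure, assumed for illustration only).
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    sampled = {}
    for video_id, frames in frame_index.items():
        k = min(max_per_video, len(frames))  # cap at 2 frames per video
        sampled[video_id] = rng.sample(frames, k)
    return sampled
```

Capping the number of frames drawn from any single video prevents a few long videos from dominating the benchmark, which is the diversity goal the dataset description states.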
During dataset construction, only high-quality images that clearly reflect first-person-perspective reasoning were selected. The dataset is manually annotated, and each dimension contains at least 50 detailed question-answer pairs drawn from real-life first-person scenarios. EgoThink is broadly applicable, especially for evaluating and improving the performance of vision-language models (VLMs) on first-person tasks, and provides a valuable resource for future research in embodied artificial intelligence and robotics.