MMVP Multimodal Motion Capture Dataset

MMVP (Multimodal MoCap Dataset with Vision and Pressure Sensors) is a motion capture dataset that combines vision and pressure sensing, jointly developed by Beihang University, Tsinghua University, and Nanjing University.
The dataset covers a wide range of rapid human movements, such as running, skipping, and the standing long jump. In total, more than 44k synchronized frames of RGBD video and pressure data were collected from 16 subjects. The researchers used an Azure Kinect camera to record RGBD video at 30 frames per second and Xsensor pressure insoles to capture plantar pressure at up to 150 frames per second. The two data streams were synchronized manually, and deep learning models such as FPP-Net and CLIFF were applied to process and analyze the data. The dataset provides a new resource for human motion capture research based on vision and pressure sensors and can advance progress in this field.
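
Because the RGBD stream (30 FPS) and the pressure stream (up to 150 FPS) run at different rates, a typical preprocessing step for such data is to pair each RGBD frame with its temporally nearest pressure sample. Below is a minimal Python sketch of that nearest-neighbor alignment; the function name and the assumption of per-frame timestamps in seconds are illustrative, not part of the dataset's published tooling.

```python
import numpy as np

def align_pressure_to_rgbd(rgbd_ts, pressure_ts):
    """Pair each RGBD frame timestamp with the index of the temporally
    nearest pressure sample (simple nearest-neighbor alignment)."""
    rgbd_ts = np.asarray(rgbd_ts)
    pressure_ts = np.asarray(pressure_ts)  # assumed sorted ascending
    # Insertion points of RGBD timestamps in the pressure timeline.
    right = np.searchsorted(pressure_ts, rgbd_ts)
    left = np.clip(right - 1, 0, len(pressure_ts) - 1)
    right = np.clip(right, 0, len(pressure_ts) - 1)
    # Choose whichever neighbor is closer in time.
    use_right = np.abs(pressure_ts[right] - rgbd_ts) < np.abs(pressure_ts[left] - rgbd_ts)
    return np.where(use_right, right, left)

# Example: a 30 Hz RGBD stream against a 150 Hz pressure stream over 2 s.
rgbd_ts = np.arange(0.0, 2.0, 1 / 30)
pressure_ts = np.arange(0.0, 2.0, 1 / 150)
indices = align_pressure_to_rgbd(rgbd_ts, pressure_ts)
print(indices[:5])  # pressure-sample indices matched to the first five RGBD frames
```

Using `searchsorted` keeps the alignment near-linear in cost rather than building a full pairwise distance matrix between the two timestamp arrays.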
Description: The MMVP (Multimodal Visual Patterns) benchmark, a separate resource that happens to share the MMVP acronym, focuses on identifying "CLIP-blind pairs": images that CLIP considers similar despite having obvious visual differences. MMVP benchmarks the performance of state-of-the-art systems, including GPT-4V, on nine basic visual patterns. It highlights the challenges these systems face in answering simple questions about such pairs, often producing incorrect responses and hallucinated interpretations. (A sketch of the pair-mining procedure follows the list below.)
- Content type: images (CLIP-blind pairs)
- Quantity: 300 images
- Data source: derived from ImageNet-1k and LAION-Aesthetics
- Data collection method: identification of CLIP-blind pairs by comparative analysis
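
The comparative analysis can be made concrete: a CLIP-blind pair is one whose CLIP embeddings are nearly identical while embeddings from a vision-only encoder such as DINOv2 are clearly different. The sketch below assumes precomputed, row-per-image embedding matrices; the function name and the exact thresholds are illustrative.

```python
import numpy as np

def find_clip_blind_pairs(clip_emb, dino_emb, clip_thresh=0.95, dino_thresh=0.6):
    """Return (i, j) index pairs that CLIP sees as near-identical but a
    vision-only encoder (e.g. DINOv2) sees as clearly different."""
    # L2-normalize rows so dot products are cosine similarities.
    clip = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    dino = dino_emb / np.linalg.norm(dino_emb, axis=1, keepdims=True)
    clip_sim = clip @ clip.T
    dino_sim = dino @ dino.T
    n = clip_sim.shape[0]
    return [(i, j)
            for i in range(n) for j in range(i + 1, n)
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh]

# Random stand-in embeddings so the example is self-contained; in practice
# these would come from pretrained CLIP and DINOv2 encoders run over a
# candidate image pool.
rng = np.random.default_rng(0)
clip_emb = rng.normal(size=(100, 512))
dino_emb = rng.normal(size=(100, 768))
print(find_clip_blind_pairs(clip_emb, dino_emb))
```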