
The CompreCap dataset was jointly created by the University of Science and Technology of China and Ant Group in 2024 to evaluate how accurately and comprehensively large vision-language models (LVLMs) generate detailed image descriptions. It accompanies the paper "Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning". The dataset contains 560 images, each with fine-grained semantic segmentation and annotations of objects, attributes, and relationships that together form a complete directed scene graph.

The dataset builds on the MSCOCO panoptic segmentation dataset, extending and refining it. The researchers built a vocabulary of common object categories from several well-known datasets and re-annotated these categories to provide more accurate semantic segmentation masks. To ensure the annotations are complete, only images whose segmented regions cover more than 95% of the image area are retained. The researchers then manually added detailed attribute descriptions for these objects and annotated the important relationships between them, forming a complete directed scene graph.
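The 95% coverage filter described above can be illustrated with a minimal sketch. It assumes each object mask is a boolean grid the same size as the image; the names `coverage_ratio` and `keep_image` and the mask representation are hypothetical and not taken from the CompreCap release.

```python
from typing import List, Sequence

# Hypothetical representation: one boolean grid per annotated object,
# where mask[y][x] is True if the pixel belongs to that object.
Mask = Sequence[Sequence[bool]]


def coverage_ratio(masks: List[Mask], height: int, width: int) -> float:
    """Return the fraction of pixels covered by at least one object mask."""
    covered = 0
    for y in range(height):
        for x in range(width):
            if any(mask[y][x] for mask in masks):
                covered += 1
    return covered / (height * width)


def keep_image(masks: List[Mask], height: int, width: int,
               threshold: float = 0.95) -> bool:
    """Keep an image only if its masks cover more than `threshold` of its area."""
    return coverage_ratio(masks, height, width) > threshold
```

Overlapping masks are counted once per pixel, so the ratio measures the union of the segmented regions rather than the sum of their areas.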

The annotations in CompreCap comprise semantic segmentation masks of objects, detailed attribute descriptions, and directed relationships between objects. They cover common object categories and capture the complex relationships between objects as a directed scene graph, which allows the dataset to evaluate detailed image descriptions comprehensively.
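To make the annotation structure concrete, the following is a minimal sketch of how objects, attributes, and directed relationships could be held in memory as a scene graph. The class and field names are illustrative assumptions; the actual CompreCap file format is not described here and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectNode:
    """A graph node: one annotated object (hypothetical schema)."""
    category: str    # object category from the shared vocabulary
    attributes: str  # free-text attribute description


@dataclass
class SceneGraph:
    """Directed scene graph: objects are nodes, relationships are directed edges."""
    objects: Dict[str, ObjectNode] = field(default_factory=dict)
    # Each edge is (subject_id, predicate, object_id); direction runs subject -> object.
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_object(self, obj_id: str, category: str, attributes: str) -> None:
        self.objects[obj_id] = ObjectNode(category, attributes)

    def add_relation(self, subject_id: str, predicate: str, object_id: str) -> None:
        # Reject edges that point at objects missing from the graph.
        if subject_id not in self.objects or object_id not in self.objects:
            raise KeyError("both endpoints must be annotated objects")
        self.relations.append((subject_id, predicate, object_id))

    def outgoing(self, subject_id: str) -> List[Tuple[str, str]]:
        """Return (predicate, object_id) pairs for edges leaving `subject_id`."""
        return [(p, o) for s, p, o in self.relations if s == subject_id]


# Usage with made-up annotations:
graph = SceneGraph()
graph.add_object("dog", "dog", "brown, mid-jump")
graph.add_object("frisbee", "frisbee", "red, plastic")
graph.add_relation("dog", "catching", "frisbee")
```

Because the edges are directed, "dog catching frisbee" is stored only as an edge leaving the dog node, so querying the frisbee node for outgoing relations returns nothing.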

CompreCap Image Description Dataset