VisDial Image Dialogue Dataset
License: CC BY 4.0

VisDial (short for Visual Dialog) is a dataset of manually annotated questions based on images from the MS COCO dataset.
The dataset was collected by pairing two workers on Amazon Mechanical Turk to chat about an image. One acts as the questioner and the other as the answerer. The questioner sees only a text description of the image (the image caption from the MS COCO dataset), not the image itself, and is asked to pose questions about the image to "better imagine the scene". The answerer sees both the image and the caption and answers the questioner's questions. The two alternate between questions and answers for up to 10 rounds.
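The collection protocol above can be sketched as a small data structure. This is a minimal illustration only: the `Dialog` class, its field names, and the `questioner_view` helper are hypothetical and do not reflect the official VisDial file schema.

```python
from dataclasses import dataclass, field

MAX_ROUNDS = 10  # the protocol described above caps a dialog at 10 rounds


@dataclass
class Dialog:
    """Hypothetical record for one image's dialog (not the official schema)."""
    image_id: int
    caption: str  # MS COCO caption, the only image information the questioner gets
    rounds: list = field(default_factory=list)  # (question, answer) pairs in order

    def add_round(self, question: str, answer: str) -> None:
        # Refuse to grow the dialog past the 10-round limit.
        if len(self.rounds) >= MAX_ROUNDS:
            raise ValueError("a dialog holds at most 10 rounds")
        self.rounds.append((question, answer))

    def questioner_view(self) -> dict:
        # The questioner sees the caption and the dialog so far, never the image.
        return {"caption": self.caption, "history": list(self.rounds)}
```

A dialog is built by calling `add_round` once per question-answer exchange; `questioner_view` returns only the caption and history, which mirrors the constraint that the questioner never sees the image.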
VisDial v1.0 includes:
- Training set: 123,287 images, 10 rounds of dialogue per image;
- Validation set: 2,064 images, 10 rounds of dialogue per image;
- Test set: 8,000 images, 1 round of dialogue per image.
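As a quick check on scale, the number of question-answer rounds in each split follows from multiplying the image count by the rounds per image. The sketch below uses only the figures listed above; the `SPLITS` table and the `qa_pairs` helper are illustrative names, not part of any VisDial tooling.

```python
# Split sizes as listed above: (number of images, dialogue rounds per image).
SPLITS = {
    "train": (123_287, 10),
    "val": (2_064, 10),
    "test": (8_000, 1),
}


def qa_pairs(split: str) -> int:
    """Return images * rounds for the named split."""
    images, rounds = SPLITS[split]
    return images * rounds


# Per-split totals, e.g. totals["train"] == 1_232_870.
totals = {name: qa_pairs(name) for name in SPLITS}
```

Under these figures the training split alone accounts for roughly 1.2 million question-answer rounds.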