M³IT: Multi-Modal Multilingual Instruction Tuning Dataset
Date
2 years ago
M³IT comprises 40 datasets, totaling 2.4 million instances and 400 manually written task instructions, all reformatted into a vision-to-text structure. It draws on classic vision-language tasks, including captioning, visual question answering (VQA), visual conditional generation, reasoning, and classification.
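To make the vision-to-text structure described above more concrete, here is a minimal Python sketch of what a single instance might look like. The class name, field names, and the `to_prompt` helper are all hypothetical and chosen for illustration only; they are not the dataset's actual schema or API. Only the list of task families is taken from the description.

```python
from dataclasses import dataclass

# Task families named in the dataset description.
TASK_FAMILIES = (
    "captioning",
    "visual question answering",
    "visual conditional generation",
    "reasoning",
    "classification",
)


@dataclass
class VisionToTextInstance:
    """Hypothetical record layout (assumption): an image plus text in, text out."""

    task: str          # one of TASK_FAMILIES
    instruction: str   # a manually written task instruction
    image_path: str    # reference to the visual input (illustrative field)
    text_input: str    # optional extra text such as a question; "" if unused
    target: str        # the text the model is expected to generate

    def __post_init__(self):
        # Reject task labels outside the families listed in the description.
        if self.task not in TASK_FAMILIES:
            raise ValueError(f"unknown task family: {self.task!r}")


def to_prompt(inst: VisionToTextInstance) -> str:
    """Join the instruction and any extra text input into one text prompt."""
    parts = [inst.instruction]
    if inst.text_input:
        parts.append(inst.text_input)
    return "\n".join(parts)


example = VisionToTextInstance(
    task="visual question answering",
    instruction="Answer the question about the image.",
    image_path="example.jpg",  # placeholder path, not a real file
    text_input="What color is the bus?",
    target="Red.",
)
prompt = to_prompt(example)
```

In this sketch every task family shares the same shape: the image and a text prompt go in, and a single text target comes out, which is one plausible reading of "vision-to-text".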