Llama Nemotron VLM v1 Multimodal Image and Text Dataset

Date: 2025

Organization: NVIDIA

Publish URL: huggingface.co

License: CC BY 4.0

Llama Nemotron VLM v1 is a high-quality image-and-text dataset released by NVIDIA in 2025 for vision-language model (VLM) post-training. It supports NVIDIA's Llama-3.1-Nemotron-Nano-VL-8B-V1 document understanding model, which covers scenarios such as document question answering, chart question answering, and AI2D diagram question answering.

The dataset consists of 21 subsets totaling 2,863,854 samples across three categories: visual question answering (VQA), captioning (image description), and optical character recognition (OCR). It includes re-annotated public image datasets; fully and semi-synthetic OCR data in Chinese and English at the character, word, and page levels; and internally annotated OCR sets. The original QA pairs and captions have also been refined and augmented, making the data suitable for multimodal training and evaluation of applications such as intelligent agents, chat assistants, and retrieval-augmented generation (RAG).

The data includes:

  • VQA (Visual Question Answering): 1,917,755 samples
  • Captioning: 131,718 samples
  • OCR (text recognition): 814,381 samples
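
For readers who want to explore the data programmatically, here is a minimal sketch using the Hugging Face `huggingface_hub` and `datasets` libraries. The repository id `nvidia/Llama-Nemotron-VLM-Dataset-v1`, the subset layout, and the availability of a streamable `train` split are assumptions not stated on this page; verify them on the dataset card at huggingface.co.

```python
# Minimal exploration sketch. Assumptions (verify on the Hugging Face
# dataset card): the repo id below is correct, and the repo can be
# loaded directly with datasets.load_dataset.
from huggingface_hub import list_repo_files
from datasets import load_dataset

REPO_ID = "nvidia/Llama-Nemotron-VLM-Dataset-v1"  # assumed repo id

# List the repository files to see how the 21 subsets are laid out.
for path in list_repo_files(REPO_ID, repo_type="dataset"):
    print(path)

# Stream the train split so the ~2.86M samples are not downloaded up
# front, then inspect the schema of the first record.
ds = load_dataset(REPO_ID, split="train", streaming=True)
print(next(iter(ds)))
```

Streaming avoids materializing the full dataset locally; once the card reveals the actual subset names, a single subset can be selected by passing its config name as the second argument to `load_dataset`.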