Llama Nemotron VLM v1 Multimodal Image and Text Dataset

Date: 2025

Organization: NVIDIA

Publish URL: huggingface.co

License: CC BY 4.0

Llama Nemotron VLM v1 is a high-quality image-and-text dataset released by NVIDIA in 2025 for vision-language model (VLM) post-training. It supports NVIDIA's Llama-3.1-Nemotron-Nano-VL-8B-V1 document understanding model, which covers scenarios such as document question answering, chart question answering, and AI2D diagram question answering.

The dataset consists of 21 subsets totaling 2,863,854 samples across three categories: visual question answering (VQA), captioning (image description), and optical character recognition (OCR). It includes re-annotated public image datasets; fully and semi-synthetic OCR data in Chinese and English at the character, word, and page levels; and internally annotated OCR sets. The original QA pairs and captions have also been refined and augmented, making the data suitable for multimodal training and evaluation of applications such as intelligent agents, chat assistants, and retrieval-augmented generation (RAG).

The data includes:

  • VQA (Visual Question Answering): 1,917,755 samples
  • Captioning: 131,718 samples
  • OCR (text recognition): 814,381 samples
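
For readers who want to explore the data programmatically, here is a minimal sketch using the Hugging Face `huggingface_hub` and `datasets` libraries. The repository id `nvidia/Llama-Nemotron-VLM-Dataset-v1`, the subset layout, and the availability of a streamable `train` split are assumptions not stated on this page; verify them on the dataset card at huggingface.co.

```python
# Minimal exploration sketch. Assumptions (verify on the Hugging Face
# dataset card): the repo id below is correct, and the repo can be
# loaded directly with datasets.load_dataset.
from huggingface_hub import list_repo_files
from datasets import load_dataset

REPO_ID = "nvidia/Llama-Nemotron-VLM-Dataset-v1"  # assumed repo id

# List the repository files to see how the 21 subsets are laid out.
for path in list_repo_files(REPO_ID, repo_type="dataset"):
    print(path)

# Stream the train split so the ~2.86M samples are not downloaded up
# front, then inspect the schema of the first record.
ds = load_dataset(REPO_ID, split="train", streaming=True)
print(next(iter(ds)))
```

Streaming avoids materializing the full dataset locally; once the card reveals the actual subset names, a single subset can be selected by passing its config name as the second argument to `load_dataset`.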