Command Palette
Search for a command to run...
Luis Wiedmann Orr Zohar Amir Mahla Xiaohan Wang Rui Li Thibaud Frere Leandro von Werra Aritra Roy Gosthipaty Andrés Marafioti

Abstract
The advancement of vision-language models (VLMs) is hampered by a fragmentedlandscape of inconsistent and contaminated public datasets. We introduceFineVision, a meticulously collected, curated, and unified corpus of 24 millionsamples - the largest open resource of its kind. We unify more than 200 sourcesinto 185 subsets via a semi-automated, human-in-the-loop pipeline: automationperforms bulk ingestion and schema mapping, while reviewers audit mappings andspot-check outputs to verify faithful consumption of annotations, appropriateformatting and diversity, and safety; issues trigger targeted fixes andre-runs. The workflow further applies rigorous de-duplication within and acrosssources and decontamination against 66 public benchmarks. FineVision alsoencompasses agentic/GUI tasks with a unified action space; reviewers validateschemas and inspect a sample of trajectories to confirm executable fidelity.Models trained on FineVision consistently outperform those trained on existingopen mixtures across a broad evaluation suite, underscoring the benefits ofscale, data hygiene, and balanced automation with human oversight. We releasethe corpus and curation tools to accelerate data-centric VLM research.
Build AI with AI
From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.