4 months ago

Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti

Abstract

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples, the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
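The de-duplication and decontamination steps can be illustrated with a minimal sketch. The abstract does not say how duplicates are detected, so the exact byte-level hashing below is an assumed stand-in that catches only identical images, not near-duplicates; the function names and the `(source, image_bytes)` sample layout are hypothetical and are not taken from the released curation tools.

```python
import hashlib
from typing import Iterable, List, Tuple

Sample = Tuple[str, bytes]  # (source name, raw image bytes): hypothetical layout


def fingerprint(image_bytes: bytes) -> str:
    """Exact-match fingerprint; an assumed stand-in for a real similarity method."""
    return hashlib.sha256(image_bytes).hexdigest()


def decontaminate_and_dedup(
    samples: Iterable[Sample], benchmark_images: Iterable[bytes]
) -> List[Sample]:
    """Drop samples whose image appears in a benchmark or was already kept."""
    blocked = {fingerprint(b) for b in benchmark_images}
    seen = set()  # shared across sources, so cross-source duplicates are caught too
    kept = []
    for source, image_bytes in samples:
        fp = fingerprint(image_bytes)
        if fp in blocked or fp in seen:
            continue
        seen.add(fp)
        kept.append((source, image_bytes))
    return kept


samples = [
    ("source_a", b"img1"),
    ("source_a", b"img1"),  # duplicate within a source
    ("source_b", b"img1"),  # duplicate across sources
    ("source_b", b"img2"),
    ("source_b", b"bench"),  # overlaps with a benchmark image
]
kept = decontaminate_and_dedup(samples, benchmark_images=[b"bench"])
print(kept)
```

Keeping a single `seen` set for all sources is what makes the check apply both within and across sources; a per-source set would only handle the first case.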

