OpenDataArena: A Fair and Open Arena for Evaluating the Value of Post-Training Datasets

Abstract

The rapid progress of large language models (LLMs) depends on the quality and diversity of the datasets used for post-training. Yet a fundamental paradox persists: while models are rigorously benchmarked, the data that fuels them remains a "black box" with opaque composition, uncertain provenance, and no systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behavior. To bridge this gap, we introduce OpenDataArena (ODA), a holistic, open platform designed to evaluate the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem built on four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along dozens of distinct axes; (iii) an interactive data lineage explorer for visualizing the genealogical tree of datasets and analyzing their constituent sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring, intended to spur research on data. Extensive experiments on ODA, covering more than 120 training datasets across multiple domains, tested on 22 benchmarks and validated through over 600 training runs and 40 million processed data points, reveal non-trivial insights. Our analysis highlights inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through provenance tracing, and maps genealogical relationships among datasets. We publicly release all results, tools, and configurations to democratize access to high-quality data evaluation. Beyond simply growing leaderboards, ODA marks a paradigm shift: from empirical, trial-and-error data curation toward a rigorous science of data-centric AI, paving the way for deeper studies of data mixing laws and the strategic composition of foundation models.

One-sentence Summary

The OpenDataArena Team from the Shanghai Artificial Intelligence Laboratory and OpenDataLab introduces OpenDataArena (ODA), a transparent platform that benchmarks post-training data value through a unified training-evaluation pipeline spanning 22 benchmarks, multi-dimensional quality scoring, and interactive data lineage tracing, replacing opaque "black box" dataset practices with systematic evaluation to advance reproducible data-centric AI research for large language models.

Key Contributions

  • The paper addresses the critical problem of opaque post-training data composition in LLM development, which hinders reproducibility and obscures causal links between data characteristics and model performance, by introducing OpenDataArena (ODA) as a holistic platform for systematic data benchmarking. ODA establishes a unified training-evaluation pipeline enabling fair comparisons across diverse models and domains, validated through extensive experiments on over 120 datasets across 22 benchmarks with 600+ training runs and 40 million processed data points.
  • It proposes a novel multi-dimensional scoring framework that profiles data quality across tens of distinct axes beyond single-metric evaluations, revealing non-trivial insights such as inherent trade-offs between data complexity and task performance, and identifying redundancy in popular benchmarks through lineage tracing. This framework provides granular quality assessment validated by correlation analyses between fine-grained metrics and downstream results across models like Llama3.1 and Qwen series.
  • The platform introduces an interactive data lineage explorer for visualizing dataset genealogy and source provenance, alongside a fully open-source toolkit for training, evaluation, and scoring, enabling transparent dissection of dataset components and reproducible research. This ecosystem also supports efficiency analyses that map "genealogical" relationships among datasets and identify high-yield data sources to inform strategic curation.

Introduction

The authors address a critical gap in Large Language Model (LLM) development: while models undergo rigorous benchmarking, the post-training datasets that shape their behavior remain poorly understood "black boxes" with opaque composition and uncertain provenance. This lack of standardized evaluation hinders reproducibility, obscures how data characteristics influence model performance, and forces data curation into costly trial-and-error processes. Prior efforts failed to isolate dataset quality as the sole variable due to inconsistent training protocols and evaluation metrics.

To solve this, the authors introduce OpenDataArena (ODA), an open platform establishing fair, reproducible benchmarking for post-training data. Its core innovation is a unified training-evaluation pipeline that fixes base models and hyperparameters, enabling direct "apples-to-apples" dataset comparisons across models like Llama and Qwen. ODA further provides a multi-dimensional scoring framework to profile data quality across diverse axes, an interactive lineage explorer for tracing dataset provenance, and fully open-source tools validated across 120 datasets, 600+ training runs, and 40 million data points. This infrastructure shifts data evaluation from ad-hoc experimentation toward a principled science of Data-Centric AI.
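
To make this control-variable design concrete, here is a minimal sketch of what a fixed fine-tuning configuration could look like, with only the dataset varying between runs; the hyperparameter names and values are illustrative assumptions, not ODA's published protocol.

```python
# Illustrative sketch, not ODA's actual configuration: every run in a comparison
# group shares the same base model and hyperparameters, so the training dataset
# is the only variable that changes between runs.
FIXED_PROTOCOL = {
    "base_model": "Qwen2.5-7B",   # held constant within a comparison group
    "epochs": 3,
    "learning_rate": 2e-5,
    "lr_scheduler": "cosine",
    "max_seq_length": 4096,
    "global_batch_size": 128,
}

def make_run_config(dataset_name: str) -> dict:
    """Return a run configuration in which only the training dataset varies."""
    return {**FIXED_PROTOCOL, "dataset": dataset_name}

configs = [make_run_config(name) for name in ("Tulu3-SFT", "LIMO", "OpenThoughts3")]
```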

Dataset

  • The authors analyze over 120 publicly available SFT training datasets totaling 40 million+ samples, sourced primarily from Hugging Face based on community impact (minimum downloads/likes), recency (post-2023), and SFT suitability. Key examples include OpenThoughts3, LIMO, and Tulu3-SFT, with individual dataset sizes ranging from thousands to hundreds of thousands of samples.
  • Domain distribution is heavily skewed: Math (34.3%) and Code (30.6%) dominate, followed by General (20.8%) and Science (14.4%). Datasets underwent safety reviews and format standardization, with mixed-domain collections included to reflect real-world complexity. Benchmarks for evaluation span 22+ tests across General (e.g., MMLU-PRO), Math (e.g., OlympiadBenchMath), Code (e.g., LiveCodeBench), and Reasoning (e.g., GPQA diamond).
  • The paper uses these datasets to build the OpenDataArena platform, analyzing their intrinsic properties and downstream performance via leaderboard evaluations. No explicit training/validation splits are defined; instead, datasets are assessed holistically for impact across domains, with lineage analysis tracing dependencies between high-performing collections.
  • Processing includes automated data lineage tracing to map derivations and redundancies, revealing systemic homogenization (e.g., 70 seed datasets expand to 411 nodes with 941 edges globally); a minimal graph sketch follows this list. Critical findings include benchmark contamination, where training data incorporates test sets like Omni-MATH, and domain-specific patterns, such as Math datasets averaging 5.18 derivation steps via iterative refinement.
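
As referenced in the lineage bullet above, the sketch below shows one way such a derivation graph could be assembled and queried. The edge list is hypothetical and networkx is a tooling choice of this sketch, not necessarily what ODA uses.

```python
# Lineage-graph sketch with an assumed edge list; each edge records that a child
# dataset was derived from a parent dataset. The edges and tooling (networkx) are
# illustrative, not ODA's actual lineage data.
import networkx as nx

derivations = [
    ("NuminaMath", "NuminaMath-CoT"),      # hypothetical derivation records
    ("NuminaMath-CoT", "OpenThoughts3"),
    ("ShareGPT", "Tulu3-SFT"),
]

lineage = nx.DiGraph(derivations)

# Seed datasets are nodes with no incoming derivation edges.
seeds = [n for n in lineage.nodes if lineage.in_degree(n) == 0]

def derivation_depth(graph: nx.DiGraph, node: str) -> int:
    """Longest chain of derivation steps from any seed dataset to `node`."""
    return max(
        (len(path) - 1
         for seed in seeds
         for path in nx.all_simple_paths(graph, seed, node)),
        default=0,
    )

print(f"{lineage.number_of_nodes()} nodes, {lineage.number_of_edges()} edges")
print("seed datasets:", seeds)
print("derivation depth of OpenThoughts3:", derivation_depth(lineage, "OpenThoughts3"))
```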

Method

The OpenDataArena platform is structured as a systematic, end-to-end workflow designed to evaluate post-training datasets through standardized, reproducible benchmarks. At its core, the platform orchestrates a four-stage pipeline that transforms raw datasets into actionable insights, supported by a suite of open-source tools and interactive visualizations.

The process begins at the Data Input Layer, where datasets are ingested, normalized into a unified format, and classified by domain to ensure consistency across evaluations. This layer serves as the foundational entry point, preparing heterogeneous data sources for downstream processing. As shown in the figure below, this stage feeds directly into the Data Evaluation Layer, which acts as the computational engine of the platform.
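
As an illustration of this ingestion step, the sketch below maps two common raw record schemas into a single instruction-response format with a domain tag. The field names and the toy keyword classifier are assumptions of this sketch rather than ODA's actual schema.

```python
# Sketch of normalizing heterogeneous records into one unified SFT format.
# Field names ("instruction", "response", "domain") and the keyword classifier
# are illustrative assumptions, not ODA's actual schema.
from dataclasses import dataclass

@dataclass
class UnifiedSample:
    instruction: str   # the prompt / question (Q)
    response: str      # the reference answer (A)
    domain: str        # e.g., "math", "code", "science", "general"
    source: str        # originating dataset name, kept for lineage tracing

def classify_domain(text: str) -> str:
    """Toy keyword-based domain tagger; a real pipeline would use a trained classifier."""
    lowered = text.lower()
    if any(k in lowered for k in ("prove", "integral", "equation")):
        return "math"
    if "def " in text or "function" in lowered:
        return "code"
    return "general"

def normalize(record: dict, source: str) -> UnifiedSample:
    """Map common raw schemas (Alpaca-style or chat-style) into the unified format."""
    if "conversations" in record:                      # chat-style record
        turns = record["conversations"]
        instruction, response = turns[0]["value"], turns[-1]["value"]
    else:                                              # Alpaca-style record
        instruction = record.get("instruction", "") + "\n" + record.get("input", "")
        response = record.get("output", "")
    return UnifiedSample(
        instruction=instruction.strip(),
        response=response.strip(),
        domain=classify_domain(instruction),
        source=source,
    )
```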

In the Data Evaluation Layer, each dataset is used to fine-tune a common pre-trained base model—such as Qwen or LLaMA—under a fixed training protocol. The resulting model is then evaluated across a diverse set of downstream benchmarks, with its aggregated performance serving as a proxy for the dataset’s intrinsic value. Concurrently, the layer executes a multi-dimensional scoring process that separately assesses the instruction (Q) and the full instruction-response pair (Q&A), capturing distinct facets of data quality. This scoring system leverages three methodological categories: model-based evaluation for quantifying complexity and reasoning depth, LLM-as-judge for subjective qualitative attributes like coherence and clarity, and heuristic rules for objective metrics such as token length or response structure.
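
The sketch below makes these scoring routes tangible for a single sample: rule-based heuristics are computed directly, while a hypothetical `call_llm` callable stands in for the LLM-as-judge route. The metric names and the judge prompt are illustrative, not ODA's scoring definitions.

```python
# Sketch of multi-dimensional scoring over a single sample, covering the Q-only
# and Q&A views. Metric names and the judge prompt are illustrative assumptions.

def heuristic_scores(instruction: str, response: str) -> dict:
    """Objective, rule-based metrics (the 'heuristic rules' category)."""
    fence = "`" * 3  # markdown code-fence marker
    return {
        "q_token_length": len(instruction.split()),
        "qa_token_length": len(instruction.split()) + len(response.split()),
        "has_code_block": fence in response,
        "num_response_lines": response.count("\n") + 1,   # crude structure proxy
    }

JUDGE_PROMPT = (
    "Rate the following instruction-response pair for coherence and clarity "
    "on a 1-10 scale. Reply with a single integer.\n\n"
    "Instruction: {q}\n\nResponse: {a}"
)

def llm_judge_score(instruction: str, response: str, call_llm) -> int:
    """Subjective quality via LLM-as-judge; `call_llm` is a hypothetical API callable."""
    return int(call_llm(JUDGE_PROMPT.format(q=instruction, a=response)).strip())

def score_sample(instruction: str, response: str, call_llm) -> dict:
    scores = heuristic_scores(instruction, response)
    scores["judge_coherence"] = llm_judge_score(instruction, response, call_llm)
    # A model-based axis (e.g., perplexity or reasoning depth from a reference
    # model) would be added here as the third category of scores.
    return scores
```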

The outputs from the evaluation stage are then passed to the Data Analysis Layer, which synthesizes performance metrics and scoring results to enable cross-model comparisons, domain-specific efficacy assessments, and exploration of data family relationships. This layer facilitates deeper diagnostic insights by correlating dataset properties with model behavior, allowing researchers to identify patterns in data utility and redundancy.

Finally, the Data Visualization Layer renders these analytical outputs into interactive leaderboards, comparative charts, and score visualizations for end users. The platform’s deliverables include a public leaderboard for intuitive performance ranking, a multi-dimensional scoring framework detailing over 15 intrinsic dataset properties, an interactive data lineage platform for tracing provenance, and a fully open-source evaluation toolkit to ensure reproducibility and community extensibility.
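
As a rough illustration of how per-benchmark results could be rolled up into such a leaderboard, the pandas sketch below averages scores per dataset and ranks them; the dataset names and numbers are placeholders, not ODA's published results.

```python
# Sketch of rolling per-benchmark scores up into a leaderboard (placeholder data).
import pandas as pd

results = pd.DataFrame(
    [
        {"dataset": "dataset_A", "benchmark": "MMLU-PRO", "score": 48.2},
        {"dataset": "dataset_A", "benchmark": "LiveCodeBench", "score": 31.0},
        {"dataset": "dataset_B", "benchmark": "MMLU-PRO", "score": 50.1},
        {"dataset": "dataset_B", "benchmark": "LiveCodeBench", "score": 27.4},
    ]
)

leaderboard = (
    results.groupby("dataset")["score"]
    .mean()
    .rename("avg_score")
    .sort_values(ascending=False)
    .reset_index()
)
leaderboard["rank"] = range(1, len(leaderboard) + 1)
print(leaderboard)
```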

Refer to the framework diagram for a high-level overview of how these components interconnect to form a cohesive evaluation ecosystem.

Experiment

  • A standardized pipeline with 600+ training runs across Llama3.1-8B, Qwen2.5-7B, and Qwen3-8B ensures that observed performance differences are attributable to dataset quality alone.
  • Qwen3 achieves highest median scores across all domains (e.g., Math: ~56 on 2025Q3 datasets), confirming stronger base models provide higher performance floors and robustness to data noise.
  • Math datasets surged from ~35 to ~56 (2023Q2–2025Q3) due to synthetic Chain-of-Thought techniques, while Code domain remains volatile with inconsistent quality.
  • Response Length shows 0.81 Spearman correlation with Math performance; verbose reasoning (e.g., OpenThought3) significantly boosts learning, but Code requires conciseness (negative length correlation: -0.29).
  • High-Density Volume strategy (moderate-sized curated datasets) outperforms extreme efficiency; e.g., AM-Thinking achieves top Math/Code results, while tiny datasets like LIMO degrade Llama3.1 performance in Math.
  • Dataset rankings show high Math consistency (0.902 rank correlation between Qwen2.5/Qwen3) but General domain saturation (negative correlation: -0.323), indicating specialized domains benefit more from tailored data.
  • Code domain exhibits unique evaluation criteria: metrics like Thinking Probability positively correlate (0.54) versus negative Math correlation (-0.69), necessitating domain-specific assessment frameworks.

The authors measure dataset ranking consistency between Qwen2.5 and Qwen3 models using Spearman correlation, revealing that Math datasets show exceptionally high alignment (0.902), indicating their value is stable across model generations. In contrast, General datasets exhibit negative correlation (-0.323), suggesting diminishing returns as stronger models absorb common instruction patterns during pre-training. Science and Code domains show weak positive correlations, reflecting partial but inconsistent transfer of dataset value as base models evolve.
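
A minimal sketch of such a rank-consistency computation is shown below, using scipy's spearmanr; the dataset names and scores are placeholders, and the reported correlations (0.902 for Math, -0.323 for General) come from the paper rather than this toy data.

```python
# Sketch: Spearman rank consistency of dataset value across two base models.
# Scores below are placeholders; the reported correlations come from the paper.
from scipy.stats import spearmanr

# Benchmark score obtained when each dataset is used to fine-tune each base model.
qwen25_scores = {"ds_a": 41.0, "ds_b": 55.5, "ds_c": 48.2, "ds_d": 39.7}
qwen3_scores = {"ds_a": 52.1, "ds_b": 63.0, "ds_c": 58.9, "ds_d": 50.3}

datasets = sorted(qwen25_scores)
rho, p_value = spearmanr(
    [qwen25_scores[d] for d in datasets],
    [qwen3_scores[d] for d in datasets],
)
print(f"rank consistency (Spearman rho): {rho:.3f}, p={p_value:.3f}")
```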

The authors use standardized fine-tuning and evaluation pipelines across multiple base models to rank datasets by performance in General, Math, Code, and Science domains. Results show that Qwen3 consistently achieves top global rankings, with AM-Thinking-Math and MegaScience leading in Math and Science respectively, while Code datasets like Code-Feedback and Raiden-DeepSeek-R1 show strong performance on Qwen3. Dataset rankings exhibit high consistency in Math (Spearman 0.902) but negative correlation in General (-0.323), indicating that advanced models like Qwen3 derive less benefit from general instruction data due to pre-training saturation.

