OpenDataArena: A Fair and Open Arena for Evaluating the Value of Post-Training Datasets

Abstract

The rapid progress of large language models (LLMs) depends critically on the quality and diversity of post-training datasets. Yet a critical dichotomy persists: while models are rigorously evaluated on benchmarks, the data behind them remains a black box, characterized by opaque composition, unclear provenance, and a lack of systematic assessment. This lack of transparency hinders reproducibility and obscures the causal link between data characteristics and model behavior. To close this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem built on four central pillars: (i) a unified training-evaluation pipeline that enables fair, open comparisons across models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along ten distinct dimensions; (iii) an interactive data lineage explorer that visualizes dataset genealogy and reveals the sources of individual components; and (iv) a fully open-source toolkit for training, evaluation, and scoring that supports data-centric research. Extensive experiments on ODA, spanning more than 120 training datasets from multiple domains, tested against 22 benchmarks, and validated through more than 600 training runs and 40 million processed samples, yield substantial insights. Our analysis reveals inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships between datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. ODA goes beyond merely adding another leaderboard and pursues a new paradigm: the shift from trial-and-error data collection toward a principled science of data-centric AI, laying the groundwork for rigorous studies of data-mixing laws and the strategic composition of foundation models.

One-sentence Summary

The Shanghai Artificial Intelligence Laboratory and OpenDataLab's OpenDataArena Team introduces OpenDataArena (ODA), a transparent platform that benchmarks post-training data value through unified evaluation pipelines, multi-dimensional scoring across 22 benchmarks, and interactive data lineage tracing, replacing opaque "black box" dataset practices with systematic evaluation to advance reproducible data-centric AI research for large language models.

Key Contributions

  • The paper addresses the critical problem of opaque post-training data composition in LLM development, which hinders reproducibility and obscures causal links between data characteristics and model performance, by introducing OpenDataArena (ODA) as a holistic platform for systematic data benchmarking. ODA establishes a unified training-evaluation pipeline enabling fair comparisons across diverse models and domains, validated through extensive experiments on over 120 datasets across 22 benchmarks with 600+ training runs and 40 million processed data points.
  • It proposes a novel multi-dimensional scoring framework that profiles data quality across tens of distinct axes beyond single-metric evaluations, revealing non-trivial insights such as inherent trade-offs between data complexity and task performance, and identifying redundancy in popular benchmarks through lineage tracing. This framework provides granular quality assessment validated by correlation analyses between fine-grained metrics and downstream results across models like Llama3.1 and Qwen series.
  • The platform introduces an interactive data lineage explorer for visualizing dataset genealogy and source provenance alongside a fully open-source toolkit for training, evaluation, and scoring, enabling transparent dissection of dataset components and reproducible research. This ecosystem facilitated efficiency analysis mapping "genealogical" dataset relationships and identifying high-yield data sources to inform strategic curation.

Introduction

The authors address a critical gap in Large Language Model (LLM) development: while models undergo rigorous benchmarking, the post-training datasets that shape their behavior remain poorly understood "black boxes" with opaque composition and uncertain provenance. This lack of standardized evaluation hinders reproducibility, obscures how data characteristics influence model performance, and forces data curation into costly trial-and-error processes. Prior efforts failed to isolate dataset quality as the sole variable due to inconsistent training protocols and evaluation metrics.

To solve this, the authors introduce OpenDataArena (ODA), an open platform establishing fair, reproducible benchmarking for post-training data. Its core innovation is a unified training-evaluation pipeline that fixes base models and hyperparameters, enabling direct "apples-to-apples" dataset comparisons across models like Llama and Qwen. ODA further provides a multi-dimensional scoring framework to profile data quality across diverse axes, an interactive lineage explorer for tracing dataset provenance, and fully open-source tools validated across 120 datasets, 600+ training runs, and 40 million data points. This infrastructure shifts data evaluation from ad-hoc experimentation toward a principled science of Data-Centric AI.
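To make the "fix everything except the data" principle concrete, here is a minimal Python sketch of what such a pipeline entry point could look like. The FixedSFTConfig fields and the fine_tune_and_evaluate stub are illustrative assumptions for this summary, not ODA's actual code or configuration values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FixedSFTConfig:
    """Everything except the dataset is held constant per base model."""
    base_model: str
    learning_rate: float = 2e-5
    num_epochs: int = 3
    max_seq_len: int = 4096
    seed: int = 42

def fine_tune_and_evaluate(dataset_name: str, cfg: FixedSFTConfig) -> dict:
    """Stub standing in for the actual SFT + benchmark harness.
    Only the dataset varies between calls, so score differences can be
    attributed to the data rather than to tuning choices."""
    # ... launch SFT with cfg on dataset_name, then run the benchmark suite ...
    return {"dataset": dataset_name, "base_model": cfg.base_model, "scores": {}}

# One fixed config per base model; each candidate dataset is the sole
# independent variable in the comparison.
cfg = FixedSFTConfig(base_model="Qwen/Qwen2.5-7B")
for ds in ["tulu-3-sft-mixture", "OpenThoughts3", "LIMO"]:
    print(fine_tune_and_evaluate(ds, cfg))
```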

Dataset

  • The authors analyze over 120 publicly available SFT training datasets totaling 40 million+ samples, sourced primarily from Hugging Face based on community impact (minimum downloads/likes), recency (post-2023), and SFT suitability. Key examples include OpenThoughts3, LIMO, and Tulu3-SFT, with individual dataset sizes ranging from thousands to hundreds of thousands of samples.
  • Domain distribution is heavily skewed: Math (34.3%) and Code (30.6%) dominate, followed by General (20.8%) and Science (14.4%). Datasets underwent safety reviews and format standardization, with mixed-domain collections included to reflect real-world complexity. Benchmarks for evaluation span 22+ tests across General (e.g., MMLU-PRO), Math (e.g., OlympiadBenchMath), Code (e.g., LiveCodeBench), and Reasoning (e.g., GPQA diamond).
  • The paper uses these datasets to build the OpenDataArena platform, analyzing their intrinsic properties and downstream performance via leaderboard evaluations. No explicit training/validation splits are defined; instead, datasets are assessed holistically for impact across domains, with lineage analysis tracing dependencies between high-performing collections.
  • Processing includes automated data lineage tracing to map derivations and redundancies, revealing systemic homogenization (e.g., 70 seed datasets expand to 411 nodes with 941 edges globally). Critical findings include benchmark contamination—where training data incorporates test sets like Omni-MATH—and domain-specific patterns, such as Math datasets averaging 5.18 derivation steps via iterative refinement.
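As a rough illustration of this kind of lineage bookkeeping, the sketch below builds a toy derivation graph with networkx and reports node/edge counts plus derivation depth. The edge list is invented for demonstration and does not reproduce ODA's real lineage graph.

```python
import networkx as nx

# (parent, child) pairs meaning "child was derived from parent"; toy values only.
derivations = [
    ("seed_math_A", "augmented_math_v1"),
    ("seed_math_B", "augmented_math_v1"),
    ("augmented_math_v1", "distilled_math_v2"),
    ("distilled_math_v2", "distilled_math_v3"),
]

lineage = nx.DiGraph(derivations)
print("nodes:", lineage.number_of_nodes(), "edges:", lineage.number_of_edges())

# Derivation depth = longest chain from any seed (in-degree 0) down to each node.
depths = {n: 0 for n in lineage if lineage.in_degree(n) == 0}
for node in nx.topological_sort(lineage):
    for child in lineage.successors(node):
        depths[child] = max(depths.get(child, 0), depths[node] + 1)
print("mean derivation depth:", sum(depths.values()) / len(depths))
```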

Method

The OpenDataArena platform is structured as a systematic, end-to-end workflow designed to evaluate post-training datasets through standardized, reproducible benchmarks. At its core, the platform orchestrates a four-stage pipeline that transforms raw datasets into actionable insights, supported by a suite of open-source tools and interactive visualizations.

The process begins at the Data Input Layer, where datasets are ingested, normalized into a unified format, and classified by domain to ensure consistency across evaluations. This layer serves as the foundational entry point, preparing heterogeneous data sources for downstream processing. As shown in the platform's framework diagram, this stage feeds directly into the Data Evaluation Layer, which acts as the computational engine of the platform.
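A minimal sketch of what such normalization could look like follows, assuming two common SFT layouts (Alpaca-style and ShareGPT-style) and an illustrative unified record with instruction and response fields; the schema shown here is our own choice, not ODA's published format.

```python
def to_unified(record: dict) -> dict:
    """Map a heterogeneous SFT sample onto a single {instruction, response} pair."""
    if "conversations" in record:                      # ShareGPT-style multi-turn
        turns = record["conversations"]
        instruction = next(t["value"] for t in turns if t["from"] == "human")
        response = next(t["value"] for t in turns if t["from"] == "gpt")
    else:                                              # Alpaca-style single turn
        instruction = record["instruction"]
        if record.get("input"):
            instruction += "\n\n" + record["input"]
        response = record["output"]
    return {"instruction": instruction, "response": response}

print(to_unified({"instruction": "Add 2 and 3.", "input": "", "output": "5"}))
print(to_unified({"conversations": [
    {"from": "human", "value": "Name a prime number."},
    {"from": "gpt", "value": "7"},
]}))
```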

In the Data Evaluation Layer, each dataset is used to fine-tune a common pre-trained base model—such as Qwen or LLaMA—under a fixed training protocol. The resulting model is then evaluated across a diverse set of downstream benchmarks, with its aggregated performance serving as a proxy for the dataset’s intrinsic value. Concurrently, the layer executes a multi-dimensional scoring process that separately assesses the instruction (Q) and the full instruction-response pair (Q&A), capturing distinct facets of data quality. This scoring system leverages three methodological categories: model-based evaluation for quantifying complexity and reasoning depth, LLM-as-judge for subjective qualitative attributes like coherence and clarity, and heuristic rules for objective metrics such as token length or response structure.
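The heuristic-rule branch is the easiest to make concrete. The sketch below computes a few objective per-sample metrics in Python; the metric names, patterns, and the split into Q versus Q&A scores are illustrative assumptions rather than ODA's exact definitions, and the model-based and LLM-as-judge scorers would plug into the same per-sample interface.

```python
import re

def heuristic_scores(instruction: str, response: str) -> dict:
    """Objective per-sample metrics computed without any model call."""
    return {
        "q_length": len(instruction.split()),            # instruction-only (Q) metric
        "qa_response_length": len(response.split()),      # crude token proxy (Q&A)
        "qa_has_boxed_answer": "\\boxed{" in response,     # LaTeX final-answer marker
        "qa_step_markers": len(re.findall(r"(?m)^\s*(?:Step \d+|\d+\.)", response)),
    }

sample = {
    "instruction": "Explain how to compute 12 * 9 mentally.",
    "response": "Step 1: 12 * 10 = 120.\nStep 2: Subtract 12 to get 108.",
}
print(heuristic_scores(sample["instruction"], sample["response"]))
```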

The outputs from the evaluation stage are then passed to the Data Analysis Layer, which synthesizes performance metrics and scoring results to enable cross-model comparisons, domain-specific efficacy assessments, and exploration of data family relationships. This layer facilitates deeper diagnostic insights by correlating dataset properties with model behavior, allowing researchers to identify patterns in data utility and redundancy.

Finally, the Data Visualization Layer renders these analytical outputs into interactive leaderboards, comparative charts, and score visualizations for end users. The platform’s deliverables include a public leaderboard for intuitive performance ranking, a multi-dimensional scoring framework detailing over 15 intrinsic dataset properties, an interactive data lineage platform for tracing provenance, and a fully open-source evaluation toolkit to ensure reproducibility and community extensibility.

Refer to the framework diagram for a high-level overview of how these components interconnect to form a cohesive evaluation ecosystem.

Experiment

  • Standardized pipeline with 600+ training runs across Llama3.1-8B, Qwen2.5-7B, and Qwen3-8B models validates that dataset quality solely drives performance variations.
  • Qwen3 achieves highest median scores across all domains (e.g., Math: ~56 on 2025Q3 datasets), confirming stronger base models provide higher performance floors and robustness to data noise.
  • Math dataset scores surged from ~35 to ~56 between 2023Q2 and 2025Q3, driven by synthetic Chain-of-Thought techniques, while the Code domain remains volatile with inconsistent quality.
  • Response Length shows a 0.81 Spearman correlation with Math performance; verbose reasoning (e.g., OpenThoughts3) significantly boosts learning, whereas Code favors conciseness (length correlation: -0.29).
  • A high-density-volume strategy (moderate-sized, carefully curated datasets) outperforms extreme efficiency; e.g., AM-Thinking achieves top Math/Code results, while tiny datasets like LIMO degrade Llama3.1 performance in Math.
  • Dataset rankings show high Math consistency (0.902 rank correlation between Qwen2.5/Qwen3) but General domain saturation (negative correlation: -0.323), indicating specialized domains benefit more from tailored data.
  • Code domain exhibits unique evaluation criteria: metrics like Thinking Probability positively correlate (0.54) versus negative Math correlation (-0.69), necessitating domain-specific assessment frameworks.

The authors measure dataset ranking consistency between Qwen2.5 and Qwen3 models using Spearman correlation, revealing that Math datasets show exceptionally high alignment (0.902), indicating their value is stable across model generations. In contrast, General datasets exhibit negative correlation (-0.323), suggesting diminishing returns as stronger models absorb common instruction patterns during pre-training. Science and Code domains show weak positive correlations, reflecting partial but inconsistent transfer of dataset value as base models evolve.
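For readers who want to run the same kind of rank-consistency check on their own leaderboards, the snippet below uses scipy.stats.spearmanr; the score arrays are made-up toy numbers, not the paper's actual results.

```python
from scipy.stats import spearmanr

# Benchmark scores for the same five Math datasets fine-tuned on two base models.
qwen25_scores = [41.2, 52.8, 47.5, 39.1, 55.0]
qwen3_scores  = [48.9, 60.3, 56.1, 45.7, 62.4]

rho, p_value = spearmanr(qwen25_scores, qwen3_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
# A rho near +1 (as reported for Math, 0.902) means dataset rankings carry over
# across base models; a negative rho (General, -0.323) means they do not.
```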

The authors use standardized fine-tuning and evaluation pipelines across multiple base models to rank datasets by performance in General, Math, Code, and Science domains. Results show that Qwen3 consistently achieves top global rankings, with AM-Thinking-Math and MegaScience leading in Math and Science respectively, while Code datasets like Code-Feedback and Raiden-DeepSeek-R1 show strong performance on Qwen3. Dataset rankings exhibit high consistency in Math (Spearman 0.902) but negative correlation in General (-0.323), indicating that advanced models like Qwen3 derive less benefit from general instruction data due to pre-training saturation.

