HyperAIHyperAI

Command Palette

Search for a command to run...

Au-delà de l'IID : Dans quelle mesure les modèles tabulaires fondamentaux sont-ils vraiment généraux ?

Lennart Purucker Andrej Tschalzev Nick Erickson Gioia Blayer David Holzmüller Alan Arazi Alexander Pfefferle Mustafa Tajjar Gaël Varoquaux Frank Hutter

Résumé

Les modèles fondamentaux pour l'apprentissage automatique prédictif sur données tabulaires ont récemment gagné en popularité dans le monde universitaire et industriel. Les communautés de recherche de diverses disciplines évaluent de plus en plus ces modèles sur des ensembles de données et des tâches variés. Cependant, ces évaluations spécifiques aux tâches et aux disciplines restent largement inaccessibles aux chercheurs en modélisation en raison de la fragmentation des logiciels de référence et des protocoles d'évaluation. Par conséquent, les chercheurs s'appuient sur des benchmarks standards, principalement définis pour des tâches où les modèles tabulaires fondamentaux excellent déjà. Les scénarios les plus difficiles sont exclus, limitant les progrès significatifs dans le domaine en se concentrant sur des améliorations marginales sur des données IID plutôt que sur des défis plus larges et exigeants. Pour remédier à cela, nous présentons BeyondArena, le premier benchmark holistique unifié pour les données tabulaires, prenant en charge divers types de tâches (IID, temporelles, groupées), à différentes échelles de taille d'échantillon et de dimensionnalité des caractéristiques, avec des types de caractéristiques variés (avec texte, à cardinalité élevée) issus d'un large éventail de disciplines. Afin de permettre une évaluation unifiée au-delà des benchmarks standards, nous introduisons DataFoundry, un framework Python et un schéma de métadonnées pour la curation d'ensembles de données tabulaires destinés à l'apprentissage automatique prédictif. Nos résultats sur 11 modèles et 142 ensembles de données curatés montrent que les modèles tabulaires fondamentaux existants excellent sur des données IID de petite à moyenne taille, tandis que les modèles traditionnels basés sur les arbres et l'apprentissage profond dominent encore sur les ensembles de données non-IID, volumineux et de haute dimension. BeyondArena oriente la recherche en modélisation vers les défis les plus exigeants des données tabulaires, permettant des progrès vers des modèles tabulaires véritablement fondamentaux.

One-sentence Summary

Researchers from Prior Labs, University of Freiburg, INRIA Saclay, et al. introduce BeyondArena, the first unified benchmark for tabular data supporting IID, temporal, and grouped tasks, along with DataFoundry, a curation framework, and their evaluation across 11 models and 142 datasets shows that existing tabular foundation models excel only on small IID data while tree-based and deep learning models dominate on non-IID, large, and high-dimensional tasks, highlighting the need for truly general tabular AI.

Key Contributions

  • BeyondArena is a unified benchmark consolidating IID, temporal, and grouped tabular tasks across dataset scales, feature dimensionalities, and modalities (text, high cardinality), providing standardized evaluation on 142 curated datasets from multiple disciplines.
  • DataFoundry is a Python framework and metadata schema for reproducible dataset curation, enabling consistent preprocessing, validation, and benchmarking across task paradigms.
  • Evaluating 11 models shows that tabular foundation models excel on small- to medium-sized IID data but are outperformed by tree-based and deep learning methods on non-IID, large, and high-dimensional data, revealing critical gaps for future models.

Introduction

Tabular foundation models (TFMs) have been evaluated extensively, but the tabular research community has mostly restricted benchmarks to independent and identically distributed (IID) tasks, ignoring the grouped and temporal splits common in real-world predictive applications. Prior efforts also suffer from fragmentation, with separate infrastructures and standards for tiny, temporal, or text-rich tables, which confuses practitioners and slows progress. The authors introduce BeyondArena, a unified benchmark that spans IID, temporal, and grouped tasks across a wide range of sample sizes, dimensionality, and feature types (including high-cardinality categorical and text data). By curating 142 high-quality datasets and integrating 11 models into the open-source TabArena ecosystem, they provide a rigorous, standardized evaluation that reveals where current TFMs excel (small- to medium-scale IID data) and where traditional models still dominate (non-IID, large-scale, high-dimensional settings).

Dataset

The authors curate BeyondArena, a manually verified collection of 142 tabular datasets designed to reflect challenging real-world prediction tasks. The collection extends earlier benchmarks by including non-IID scenarios (temporal and grouped splits), datasets with fewer than 500 or more than 100,000 training samples, and tables containing text and date features.

Dataset composition and sources

  • The initial pool of 1,128 candidates was drawn from 21 prior tabular benchmark studies, public repositories (UCI, OpenML, Hugging Face, Kaggle, Zindi, ASlib), and government websites.
  • 304 datasets from 14 earlier TabArena benchmarks (both accepted and previously rejected) were re-evaluated.
  • 7 non-IID or multimodal benchmarks (TabReD, TableShift, CARTE, TextTabBench, TabSTAR, AutoGluon Multimodal, and a string vectorizing benchmark) contributed additional candidates.

Key details for each dataset

  • Selection rules: A dataset must be unique within the benchmark, contain at least 100 training samples, support a random, temporal, or grouped validation split, be explicitly published for classification or regression, represent a real application (not purely synthetic or vectorized images), and raise no obvious ethical concerns.
  • Filtering outcome: After manual review, 142 datasets satisfied all criteria; the majority of rejected candidates were duplicates, lacked a real-world task, or were not originally designed for supervised prediction.
  • Characteristics: The final collection spans tiny to large-scale datasets, includes diverse feature types (numerical, categorical, date, text), covers both classification and regression, and comes from varied application domains. The 142 data artifacts are hosted on Hugging Face.

How the paper uses the data

  • Each dataset is processed through the DataFoundry framework, which unifies file formats (CSV, Parquet, Excel) and annotates feature types, target column, and metadata for group or time indices.
  • The authors define a train-test split for every dataset – random, temporal, or grouped – according to the original task description. For very large datasets, subsampling is applied to keep training manageable while preserving the task’s predictive challenge.
  • No explicit mixture ratios or cropping are used; instead, the curated datasets serve as an independent benchmark suite where each dataset is evaluated individually with its designated split and preprocessing.
  • Feature engineering is applied only when necessary to align with the original task (e.g., as described in a paper or winning competition solution), and all processing steps are documented in reproducible Python notebooks.

Method

The authors aim to extend tabular benchmarking to better represent challenging real-world predictive machine learning applications. To achieve this, they overhaul selection criteria to support non-IID data (temporal and grouped), datasets with fewer than 500 or more than 100,000 training samples, and tabular data containing text and date features. To ensure high quality and representativeness, the authors opt for a labor-intensive manual curation process where all results are verified by humans, which is crucial for scientific rigor.

To enable extensive and reproducible dataset curation, the authors introduce DataFoundry, a Python framework and metadata schema for tabular datasets and predictive machine learning. DataFoundry provides a mature Python package for dataset checks, train-test splits, and an extensible metadata schema, alongside reproducible Python notebooks for processing each dataset.

The authors gather datasets from 21 tabular benchmark studies and public data repositories, including UCI Machine Learning Repository, OpenML, Hugging Face, Kaggle, and others. They re-evaluate datasets from prior benchmarks and collect from non-IID or multimodal tabular benchmarks.

A dataset must fulfill specific criteria to be included in BeyondArena: it must represent a unique predictive machine learning task, have at least 100 training samples to avoid few-shot tasks, support appropriate validation protocols such as random, temporal, or grouped splits, be published for predictive classification or regression, represent a real-world application problem, and raise no obvious ethical concerns.

The selection workflow is illustrated in the figure below:

Starting with 1128 datasets from various sources, the authors filter out duplicates, few-shot prediction tasks, non-predictive tasks, and datasets that do not represent real applications. This rigorous filtering results in 142 selected datasets ranging from tiny to large-scale, covering tabular IID and non-IID tasks.

Each selected dataset is thoroughly investigated and integrated into DataFoundry. The authors unify data formats and feature type annotations, including numbers, categorical, dates, and strings. They store necessary metadata for tasks and curation annotations in a machine-readable schema. Feature engineering is applied when necessary to create appropriate tasks.

The characteristics of the curated datasets are detailed in the following figure:

The datasets vary widely in size regarding rows, columns, and cells, as well as age and feature types such as numeric, categorical, binary, text, and date/time. They cover diverse problem types including binary classification, regression, and multiclass classification, along with various task types like IID, grouped, and temporal. The collection spans multiple sources and application domains, including medical and healthcare, business and marketing, finance, and biology.

Experiment

The BeyondArena benchmark evaluates 11 tabular models, including several tabular foundation models (TFMs), across diverse dataset characteristics like scale, dimensionality, task type, and feature types. While TFMs perform strongly on small, tiny, and IID datasets, they are consistently outperformed by well-tuned traditional models—especially deep networks such as RealMLP—on temporal, grouped, large-scale, high-dimensional, and high-cardinality tasks. Overall, TFMs can match or beat the best traditional model on 70% of datasets, but their pretrained generality falls short in many real-world non-IID regimes, illustrating the need for further development. Ablations confirm the importance of using appropriate non-IID splits and preprocessing, and show that post-hoc calibration generally improves log-loss performance.

Ablations confirm that BeyondArena's inner split choices (repeated 5-fold CV and non-IID splits) are robust, consistently improving average performance while maintaining ranking stability. Preprocessing decisions for grouped and text data have mixed effects: handling group indices for label-per-sample data heavily alters rankings, and using TF-IDF over Qwen3 embeddings for long text features yields large performance gains without disrupting model ordering. Post-hoc probability calibration nearly always beats the uncalibrated default, except for a few specific models. Replacing 8-fold CV with 5×5-fold CV improves performance across models while keeping rankings nearly unchanged (τ=0.93). Using non-IID inner splits instead of IID splits for temporal and grouped data achieves better average performance and leaves model ranks identical (τ=1.00). For label-per-sample grouped datasets, dropping index preprocessing substantially scrambles model rankings (τ=0.43), though BeyondArena's preprocessing still wins 71% of the time. On long text datasets, switching from Qwen3 embeddings to TF-IDF drastically improves predictive performance, with BeyondArena's Qwen3 setting winning only 6% of comparisons yet rankings remaining highly correlated (τ=0.89). Applying post-hoc probability calibration outperforms the no-calibration default in 82% of cases, but can negatively impact TabPFN-2.6 and RealMLP.

Ablation experiments confirm that BeyondArena's inner split strategies—repeated 5-fold cross-validation and non-IID splits—consistently boost average performance while keeping model rankings stable. Preprocessing decisions for grouped and text data have mixed effects: handling group indices is essential for label-per-sample datasets, and using TF-IDF instead of Qwen3 embeddings on long text features yields large predictive gains without disrupting model ordering. Post-hoc probability calibration nearly always outperforms the uncalibrated default, though it can negatively affect TabPFN-2.6 and RealMLP. Overall, the core design choices are robust, with calibration and careful preprocessing being key to performance without harming rank consistency.


Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA
GPU prêts à l’emploi
Tarifs les plus avantageux

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour
Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin
Propulsé par MailChimp