HyperAIHyperAI

Command Palette

Search for a command to run...

IIDを超えて:表形式基盤モデルは本当にどれほど汎用的か?

Lennart Purucker Andrej Tschalzev Nick Erickson Gioia Blayer David Holzmüller Alan Arazi Alexander Pfefferle Mustafa Tajjar Gaël Varoquaux Frank Hutter

概要

表形式データに対する予測機械学習の基盤モデルが、近年、学界と産業界で大きな注目を集めている。様々な分野の研究コミュニティが、多様なデータセットとタスクで表形式基盤モデルを評価し始めている。しかし、これらのタスクや分野に特化した評価は、ベンチマークソフトウェアや評価プロトコルが断片化しているため、モデル研究者にとってほとんど利用できない状況にある。その結果、モデル研究者は標準的なベンチマークに依存しており、それらは主に表形式基盤モデルが既に優れた性能を示すタスク向けに定義されている。最も困難なシナリオが除外されることで、IIDデータにおける限界的な改善に焦点が当てられ、より広範で要求の厳しい課題への取り組みが制限され、分野の有意義な進展が妨げられている。この問題を克服するため、我々はBeyondArenaを導入する。これは、多様なタスクタイプ(IID、時系列、グループ化)、サンプルサイズや特徴次元のスケール、多様な特徴タイプ(テキスト付き、高カーディナリティ)を、幅広い分野から収集したデータセットでサポートする、初の統一的かつ包括的な表形式データ用ベンチマークである。標準ベンチマークを超えた統一的な評価を可能にするため、予測機械学習用の表形式データセットをキュレーションするPythonフレームワークおよびメタデータスキーマであるDataFoundryも併せて紹介する。11のモデルと142のキュレーションされたデータセットを用いた結果、既存の表形式基盤モデルは小~中規模のIIDデータでは優れている一方、非IID、大規模、高次元のデータセットでは従来のツリーベースおよび深層学習モデルが依然として支配的であることが示された。BeyondArenaは、表形式データにおける最も困難な課題にモデル研究を導き、真に基盤的な表形式モデルへの進歩を促進する。

One-sentence Summary

Researchers from Prior Labs, University of Freiburg, INRIA Saclay, et al. introduce BeyondArena, the first unified benchmark for tabular data supporting IID, temporal, and grouped tasks, along with DataFoundry, a curation framework, and their evaluation across 11 models and 142 datasets shows that existing tabular foundation models excel only on small IID data while tree-based and deep learning models dominate on non-IID, large, and high-dimensional tasks, highlighting the need for truly general tabular AI.

Key Contributions

  • BeyondArena is a unified benchmark consolidating IID, temporal, and grouped tabular tasks across dataset scales, feature dimensionalities, and modalities (text, high cardinality), providing standardized evaluation on 142 curated datasets from multiple disciplines.
  • DataFoundry is a Python framework and metadata schema for reproducible dataset curation, enabling consistent preprocessing, validation, and benchmarking across task paradigms.
  • Evaluating 11 models shows that tabular foundation models excel on small- to medium-sized IID data but are outperformed by tree-based and deep learning methods on non-IID, large, and high-dimensional data, revealing critical gaps for future models.

Introduction

Tabular foundation models (TFMs) have been evaluated extensively, but the tabular research community has mostly restricted benchmarks to independent and identically distributed (IID) tasks, ignoring the grouped and temporal splits common in real-world predictive applications. Prior efforts also suffer from fragmentation, with separate infrastructures and standards for tiny, temporal, or text-rich tables, which confuses practitioners and slows progress. The authors introduce BeyondArena, a unified benchmark that spans IID, temporal, and grouped tasks across a wide range of sample sizes, dimensionality, and feature types (including high-cardinality categorical and text data). By curating 142 high-quality datasets and integrating 11 models into the open-source TabArena ecosystem, they provide a rigorous, standardized evaluation that reveals where current TFMs excel (small- to medium-scale IID data) and where traditional models still dominate (non-IID, large-scale, high-dimensional settings).

Dataset

The authors curate BeyondArena, a manually verified collection of 142 tabular datasets designed to reflect challenging real-world prediction tasks. The collection extends earlier benchmarks by including non-IID scenarios (temporal and grouped splits), datasets with fewer than 500 or more than 100,000 training samples, and tables containing text and date features.

Dataset composition and sources

  • The initial pool of 1,128 candidates was drawn from 21 prior tabular benchmark studies, public repositories (UCI, OpenML, Hugging Face, Kaggle, Zindi, ASlib), and government websites.
  • 304 datasets from 14 earlier TabArena benchmarks (both accepted and previously rejected) were re-evaluated.
  • 7 non-IID or multimodal benchmarks (TabReD, TableShift, CARTE, TextTabBench, TabSTAR, AutoGluon Multimodal, and a string vectorizing benchmark) contributed additional candidates.

Key details for each dataset

  • Selection rules: A dataset must be unique within the benchmark, contain at least 100 training samples, support a random, temporal, or grouped validation split, be explicitly published for classification or regression, represent a real application (not purely synthetic or vectorized images), and raise no obvious ethical concerns.
  • Filtering outcome: After manual review, 142 datasets satisfied all criteria; the majority of rejected candidates were duplicates, lacked a real-world task, or were not originally designed for supervised prediction.
  • Characteristics: The final collection spans tiny to large-scale datasets, includes diverse feature types (numerical, categorical, date, text), covers both classification and regression, and comes from varied application domains. The 142 data artifacts are hosted on Hugging Face.

How the paper uses the data

  • Each dataset is processed through the DataFoundry framework, which unifies file formats (CSV, Parquet, Excel) and annotates feature types, target column, and metadata for group or time indices.
  • The authors define a train-test split for every dataset – random, temporal, or grouped – according to the original task description. For very large datasets, subsampling is applied to keep training manageable while preserving the task’s predictive challenge.
  • No explicit mixture ratios or cropping are used; instead, the curated datasets serve as an independent benchmark suite where each dataset is evaluated individually with its designated split and preprocessing.
  • Feature engineering is applied only when necessary to align with the original task (e.g., as described in a paper or winning competition solution), and all processing steps are documented in reproducible Python notebooks.

Method

The authors aim to extend tabular benchmarking to better represent challenging real-world predictive machine learning applications. To achieve this, they overhaul selection criteria to support non-IID data (temporal and grouped), datasets with fewer than 500 or more than 100,000 training samples, and tabular data containing text and date features. To ensure high quality and representativeness, the authors opt for a labor-intensive manual curation process where all results are verified by humans, which is crucial for scientific rigor.

To enable extensive and reproducible dataset curation, the authors introduce DataFoundry, a Python framework and metadata schema for tabular datasets and predictive machine learning. DataFoundry provides a mature Python package for dataset checks, train-test splits, and an extensible metadata schema, alongside reproducible Python notebooks for processing each dataset.

The authors gather datasets from 21 tabular benchmark studies and public data repositories, including UCI Machine Learning Repository, OpenML, Hugging Face, Kaggle, and others. They re-evaluate datasets from prior benchmarks and collect from non-IID or multimodal tabular benchmarks.

A dataset must fulfill specific criteria to be included in BeyondArena: it must represent a unique predictive machine learning task, have at least 100 training samples to avoid few-shot tasks, support appropriate validation protocols such as random, temporal, or grouped splits, be published for predictive classification or regression, represent a real-world application problem, and raise no obvious ethical concerns.

The selection workflow is illustrated in the figure below:

Starting with 1128 datasets from various sources, the authors filter out duplicates, few-shot prediction tasks, non-predictive tasks, and datasets that do not represent real applications. This rigorous filtering results in 142 selected datasets ranging from tiny to large-scale, covering tabular IID and non-IID tasks.

Each selected dataset is thoroughly investigated and integrated into DataFoundry. The authors unify data formats and feature type annotations, including numbers, categorical, dates, and strings. They store necessary metadata for tasks and curation annotations in a machine-readable schema. Feature engineering is applied when necessary to create appropriate tasks.

The characteristics of the curated datasets are detailed in the following figure:

The datasets vary widely in size regarding rows, columns, and cells, as well as age and feature types such as numeric, categorical, binary, text, and date/time. They cover diverse problem types including binary classification, regression, and multiclass classification, along with various task types like IID, grouped, and temporal. The collection spans multiple sources and application domains, including medical and healthcare, business and marketing, finance, and biology.

Experiment

The BeyondArena benchmark evaluates 11 tabular models, including several tabular foundation models (TFMs), across diverse dataset characteristics like scale, dimensionality, task type, and feature types. While TFMs perform strongly on small, tiny, and IID datasets, they are consistently outperformed by well-tuned traditional models—especially deep networks such as RealMLP—on temporal, grouped, large-scale, high-dimensional, and high-cardinality tasks. Overall, TFMs can match or beat the best traditional model on 70% of datasets, but their pretrained generality falls short in many real-world non-IID regimes, illustrating the need for further development. Ablations confirm the importance of using appropriate non-IID splits and preprocessing, and show that post-hoc calibration generally improves log-loss performance.

Ablations confirm that BeyondArena's inner split choices (repeated 5-fold CV and non-IID splits) are robust, consistently improving average performance while maintaining ranking stability. Preprocessing decisions for grouped and text data have mixed effects: handling group indices for label-per-sample data heavily alters rankings, and using TF-IDF over Qwen3 embeddings for long text features yields large performance gains without disrupting model ordering. Post-hoc probability calibration nearly always beats the uncalibrated default, except for a few specific models. Replacing 8-fold CV with 5×5-fold CV improves performance across models while keeping rankings nearly unchanged (τ=0.93). Using non-IID inner splits instead of IID splits for temporal and grouped data achieves better average performance and leaves model ranks identical (τ=1.00). For label-per-sample grouped datasets, dropping index preprocessing substantially scrambles model rankings (τ=0.43), though BeyondArena's preprocessing still wins 71% of the time. On long text datasets, switching from Qwen3 embeddings to TF-IDF drastically improves predictive performance, with BeyondArena's Qwen3 setting winning only 6% of comparisons yet rankings remaining highly correlated (τ=0.89). Applying post-hoc probability calibration outperforms the no-calibration default in 82% of cases, but can negatively impact TabPFN-2.6 and RealMLP.

Ablation experiments confirm that BeyondArena's inner split strategies—repeated 5-fold cross-validation and non-IID splits—consistently boost average performance while keeping model rankings stable. Preprocessing decisions for grouped and text data have mixed effects: handling group indices is essential for label-per-sample datasets, and using TF-IDF instead of Qwen3 embeddings on long text features yields large predictive gains without disrupting model ordering. Post-hoc probability calibration nearly always outperforms the uncalibrated default, though it can negatively affect TabPFN-2.6 and RealMLP. Overall, the core design choices are robust, with calibration and careful preprocessing being key to performance without harming rank consistency.


AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助
すぐに使える GPU
最適な料金体系

HyperAI Newsletters

最新情報を購読する
北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします
メール配信サービスは MailChimp によって提供されています