
QuitoBench: A High-Quality Open Benchmark for Time Series Forecasting

Siqiao Xue Zhaoyang Zhu Wei Zhang Rongyao Cai Rui Wang Yixiang Mu Fan Zhou Jianguo Li Peng Di Hang Yu

Abstract

Time series forecasting is crucial in finance, healthcare, and cloud computing, yet progress remains hampered by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To close this gap, we introduce QuitoBench, a regime-balanced benchmark for time series forecasting that spans eight regimes combining trend, seasonality, and forecastability (TSF), designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built on Quito, a billion-point time series corpus drawn from Alipay application traffic and covering nine business domains. Evaluating 10 models spanning deep learning approaches, foundation models, and statistical baselines on 232,200 evaluation instances, we report four key findings: (i) a context-length crossover, where deep learning models outperform at short contexts (L=96) while foundation models dominate at long contexts (L≥576); (ii) forecastability is the dominant driver of difficulty, producing a 3.64× gap in mean absolute error (MAE) across regimes; (iii) deep learning models match or surpass foundation models with 59× fewer parameters; and (iv) scaling training data yields substantially greater benefit than scaling model size, for both model families. These findings are corroborated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation in time series forecasting research.

One-sentence Summary

Ant Group researchers introduce QUITO-BENCH, a regime-balanced benchmark built on a billion-scale Alipay corpus, to address data scarcity in time series forecasting. It reveals that forecastability drives difficulty and that scaling data outperforms model size, enabling reproducible evaluation across finance and cloud computing.

Key Contributions

  • The paper introduces QUITO-BENCH, a regime-balanced benchmark that categorizes time series by intrinsic statistical properties like trend, seasonality, and forecastability rather than application domains to ensure uniform coverage across eight distinct forecasting regimes.
  • A billion-scale, single-provenance time series corpus named QUITO is presented, featuring uniformly long series from Alipay that eliminate information leakage and support rigorous evaluation at context lengths up to 1,024.
  • Extensive experiments across 232,200 evaluation instances reveal that deep learning models outperform foundation models at short contexts while foundation models dominate at long contexts, and that scaling training data yields greater benefits than scaling model size.

Introduction

Time series forecasting is essential for high-stakes decisions in finance, healthcare, and cloud computing, yet the field faces an evaluation crisis due to a lack of large-scale, high-quality benchmarks. Prior work suffers from coarse domain-based categorization that ignores intrinsic data properties, severe distributional skew where most data falls into a single regime, and information leakage from reusing public datasets across training and testing pipelines. To address these issues, the authors introduce QUITOBENCH, a regime-balanced benchmark built on QUITO, a billion-scale time series corpus from Alipay that ensures leakage-free evaluation across eight distinct trend, seasonality, and forecastability regimes. This new standard enables rigorous comparison of deep learning and foundation models, revealing that forecastability drives difficulty and that data scaling offers greater benefits than increasing model size.

Dataset

  • Dataset Composition and Sources The authors construct the QUITO corpus from production traffic telemetry on Alipay, a major digital payment platform. The data spans nine business verticals, including finance, commerce, and infrastructure, ensuring diversity across a full-scale digital economy rather than a single narrow domain. Each series represents the workload of a distinct application service, recorded as a 5-dimensional vector of anonymized traffic subtypes.

  • Key Details for Each Subset The corpus is divided into two disjoint subsets with no overlap in application identifiers:

    • QUITO-MIN: Contains 22,522 series at 10-minute resolution spanning from July 10, 2023, to August 19, 2023. This subset reflects high-frequency telemetry subject to shorter retention windows.
    • QUITO-HOUR: Contains 12,544 series at 1-hour resolution spanning from November 18, 2021, to August 19, 2023. This subset consists of long-term archived hourly aggregates.
    • QUITOBENCH: A curated evaluation benchmark derived from the full corpus, containing 1,290 test series (773 from QUITO-MIN and 517 from QUITO-HOUR) selected to ensure balanced representation across different time series behaviors.
  • Data Usage and Processing Strategy The authors employ a rigorous pipeline to prepare the data for training and evaluation:

    • Aggregation: Raw 1-second telemetry is aggregated into 10-minute or 1-hour bins using max pooling to preserve workload peaks.
    • Sanitization: The authors apply a two-stage deduplication process to remove exact and near-duplicate series (Pearson correlation > 0.99) and standardize the 5 variates.
    • Regime Labeling: Each series is characterized by a TSF profile (Trend, Seasonality, Forecastability) using STL decomposition and spectral entropy. These metrics are binarized to assign one of eight discrete regime labels.
    • Benchmark Construction: To prevent evaluation bias toward common patterns, the authors use stratified sampling to select approximately 162 series per regime cell for QUITOBENCH, ensuring near-uniform coverage of all eight behavioral types.
  • Splitting and Leakage Prevention The authors enforce a global temporal cutoff at July 28, 2023, to guarantee leakage-free splits across both granularities. Data prior to this date is divided into training (80%) and validation (20%) sets, while data from the cutoff onward forms the test set. This chronological ordering ensures that no future information contaminates the training process.
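The split protocol above can be sketched in a few lines. This is an illustrative implementation, not the authors' code: the cutoff date and 80/20 train/validation fractions come from the text, while the function and argument names are made up here.

```python
from datetime import datetime

# Global temporal cutoff from the paper; data at or after this date is test-only.
CUTOFF = datetime(2023, 7, 28)

def split_series(timestamps, values, cutoff=CUTOFF, val_frac=0.2):
    """Chronological split of one series around a global temporal cutoff.

    Pre-cutoff observations are divided 80/20 into train/validation;
    everything from the cutoff onward becomes the test region.
    """
    pre = [(t, v) for t, v in zip(timestamps, values) if t < cutoff]
    test = [(t, v) for t, v in zip(timestamps, values) if t >= cutoff]
    n_val = int(round(len(pre) * val_frac))
    train = pre[: len(pre) - n_val]
    val = pre[len(pre) - n_val :]
    return train, val, test
```

Because the split is purely chronological and shared across both granularities, no post-cutoff observation can leak into training or validation for any series.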

Method

The authors leverage a structured pipeline to construct a robust time series forecasting benchmark, ensuring contamination-free evaluation and diverse regime coverage. The overall workflow, illustrated in the framework diagram, proceeds through five distinct stages: Raw Collection of production monitoring streams, Standardization involving quality filtering and deduplication, Split Protocol to establish global time cut-offs, Curation to compute dynamic regime labels, and finally Bench Construction to create a balanced evaluation set.

To ensure rigorous evaluation, the data is partitioned using a global time cut-off strategy. As shown in the figure below, the full series is split into training (70%), validation (20%), and test (10%) sets. For the QuitoBench protocol, the authors employ dense rolling windows with a stride of 1 to maximize data utilization during the test phase, contrasting with the non-overlapping windows used in GIFT-Eval-style evaluations. This approach generates multiple context-forecast pairs (context length L and horizon H) from the test region, allowing for a more granular assessment of model performance across different temporal horizons.
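The dense rolling-window enumeration can be sketched as follows; this is a minimal illustration of the stride-1 protocol described above, with hypothetical function and argument names.

```python
def rolling_windows(series, context_len, horizon, stride=1):
    """Enumerate (context, target) pairs over an evaluation region.

    With stride=1 every admissible offset yields a pair (dense rolling
    windows); a stride of context_len + horizon would instead give the
    non-overlapping windows of GIFT-Eval-style protocols.
    """
    pairs = []
    last_start = len(series) - context_len - horizon
    for start in range(0, last_start + 1, stride):
        context = series[start : start + context_len]
        target = series[start + context_len : start + context_len + horizon]
        pairs.append((context, target))
    return pairs
```

For a test region of length N, the dense protocol produces N - L - H + 1 evaluation instances per series, which is how a benchmark of 1,290 series yields hundreds of thousands of instances.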

A core component of the curation process involves characterizing the dynamic regime of each time series using three scalar diagnostics: trend strength ($T$), seasonality strength ($S$), and forecastability ($F$). Each metric is bounded within $[0, 1]$, where higher values indicate a stronger presence of the respective property.

To quantify trend and seasonality, the authors decompose each univariate series $\{x_t\}$ using Seasonal-Trend decomposition via LOESS (STL). This produces three additive components:

$x_t = \tau_t + s_t + r_t.$

Here, $\tau_t$ represents the trend, $s_t$ the seasonal component, and $r_t$ the residual. The strengths of seasonality ($S$) and trend ($T$) are defined from the variances of these components:

$S = \max\left(0,\ 1 - \frac{\mathrm{Var}(r)}{\mathrm{Var}(s + r)}\right), \qquad T = \max\left(0,\ 1 - \frac{\mathrm{Var}(r)}{\mathrm{Var}(\tau + r)}\right).$

A value near 1 implies the component dominates the residual, while a value near 0 suggests the component is negligible relative to noise. The seasonal period $p$ is determined by the series resolution, set to 144 for QUITO-MIN and 24 for QUITO-HOUR.
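Given the STL components, the strength formulas above reduce to a few variance ratios. The sketch below assumes the components are already available (in practice they would come from an STL implementation such as statsmodels' `STL`); the function names are illustrative.

```python
def _var(x):
    """Population variance of a sequence."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def strengths(trend, seasonal, resid):
    """Trend strength T and seasonality strength S from additive STL
    components (x_t = tau_t + s_t + r_t), following the formulas above."""
    S = max(0.0, 1.0 - _var(resid) / _var([s + r for s, r in zip(seasonal, resid)]))
    T = max(0.0, 1.0 - _var(resid) / _var([t + r for t, r in zip(trend, resid)]))
    return T, S
```

A noiseless series (zero residual) scores 1.0 on both diagnostics, while a series whose trend and seasonal components are flat scores 0.0, matching the interpretation in the text.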

Forecastability is measured as the complement of normalized spectral entropy. Using Welch's method with a Hann window, the power spectral density $P_k$ is computed for the mean-subtracted series. The normalized entropy $H$ is calculated as:

$H = -\sum_k \hat{p}_k \log \hat{p}_k \,\Big/\, \log K, \qquad \hat{p}_k = \frac{P_k}{\sum_j P_j},$

where $K$ is the number of frequency bins. Forecastability is then defined as $F = 1 - H$, where $F = 1$ indicates a perfectly deterministic series and $F = 0$ corresponds to white noise.
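The definition of $F$ can be illustrated with a simplified sketch in which a plain DFT periodogram stands in for the paper's Welch/Hann spectral estimate (e.g. `scipy.signal.welch`); only the entropy normalization follows the text exactly.

```python
import cmath
import math

def forecastability(x):
    """F = 1 - normalized spectral entropy of the mean-subtracted series.

    Simplified sketch: a plain periodogram replaces the Welch/Hann
    estimate used in the paper, so values are illustrative."""
    n = len(x)
    mu = sum(x) / n
    xc = [v - mu for v in x]  # mean-subtracted series
    K = n // 2  # positive-frequency bins (DC vanishes after centering)
    P = [
        abs(sum(xc[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(1, K + 1)
    ]
    total = sum(P)
    if total == 0.0 or K < 2:
        return 0.0  # flat series: no spectral mass to distribute
    p = [pk / total for pk in P]
    H = -sum(pk * math.log(pk) for pk in p if pk > 0.0) / math.log(K)
    return 1.0 - H
```

A pure sinusoid concentrates all spectral mass in one bin, so $H \approx 0$ and $F \approx 1$, consistent with "perfectly deterministic"; spreading the mass evenly over all bins drives $F$ toward 0.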

Since each series is multivariate with five variates, the authors compute $T$, $S$, and $F$ independently for each channel and aggregate them by averaging:

$T_i = \frac{1}{5}\sum_{j=1}^{5} T_{i,j}, \quad S_i = \frac{1}{5}\sum_{j=1}^{5} S_{i,j}, \quad F_i = \frac{1}{5}\sum_{j=1}^{5} F_{i,j}.$

Finally, each diagnostic is binarized using a fixed threshold of 0.4 to assign a label of HIGH or LOW. These three binary labels are combined to form one of eight distinct TSF regime cells (e.g., TREND × SEASON × FORECAST), ensuring a balanced distribution of difficulty levels across the benchmark.
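The binarization step maps each (T, S, F) triple to one of the eight regime cells. The 0.4 threshold follows the text; the label format and the tie-breaking convention at exactly 0.4 are assumptions of this sketch.

```python
def regime_label(T, S, F, threshold=0.4):
    """Binarize the (T, S, F) diagnostics into one of eight TSF regime cells.

    Assumption: values at or above the threshold count as HIGH; the paper
    does not specify the boundary convention."""
    parts = [
        "TREND-HIGH" if T >= threshold else "TREND-LOW",
        "SEASON-HIGH" if S >= threshold else "SEASON-LOW",
        "FORECAST-HIGH" if F >= threshold else "FORECAST-LOW",
    ]
    return " x ".join(parts)
```

Enumerating both sides of the threshold for each diagnostic yields exactly the eight cells over which QuitoBench balances its 1,290 test series.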

Experiment

  • Comprehensive Benchmark Evaluation: Ten models across deep learning, foundation, and statistical families were tested under 18 task configurations using dense rolling windows. This validates that deep learning models generally outperform foundation models in short-context scenarios, while foundation models excel with long historical context, and confirms that statistical baselines are insufficient for complex traffic forecasting.
  • Scaling and Efficiency Analysis: Experiments varying data volume and model size demonstrate that increasing training data yields significantly larger performance gains than increasing model parameters. This validates that task-specific deep learning models are far more parameter-efficient, achieving comparable or superior accuracy to massive foundation models with orders of magnitude fewer parameters.
  • Context Length and Horizon Sensitivity: Analysis of context length reveals a functional split where deep learning models are specialists for short histories, whereas foundation models leverage pre-training to exploit long-range dependencies. Forecast horizon tests further show that task-specific architectures maintain stability over long prediction windows better than foundation models, which degrade more rapidly as uncertainty accumulates.
  • TSF Regime Specialization: Evaluation across Trend, Seasonality, and Forecastability (TSF) regimes identifies forecastability as the primary driver of difficulty. Results validate that foundation models are more robust in low-forecastability (noisy) environments and high-seasonality regimes, while deep learning models dominate in trend-driven, low-seasonality scenarios.
  • Robustness and Generalization: Cross-metric and cross-benchmark comparisons confirm that model rankings are consistent regardless of the error metric (MAE vs. MSE) or the specific dataset provenance. This validates that the observed performance differences reflect intrinsic model capabilities rather than artifacts of specific evaluation settings or data sources.
