
QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

Siqiao Xue Zhaoyang Zhu Wei Zhang Rongyao Cai Rui Wang Yixiang Mu Fan Zhou Jianguo Li Peng Di Hang Yu

Abstract

Time series forecasting is critical across domains such as finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To close this gap, we propose QuitoBench, a regime-balanced time series forecasting benchmark that spans eight trend × seasonality × forecastability (TSF) regimes, designed to capture properties directly relevant to forecasting rather than application-specific domain labels. The benchmark is built on Quito, a billion-scale time series corpus of application traffic spanning nine business domains at Alipay. Benchmarking ten models drawn from deep learning, foundation models, and statistical baselines across 232,200 evaluation instances yields four key findings: (i) a context-length crossover: deep learning models lead at short contexts (L = 96), while foundation models dominate at long contexts (L ≥ 576); (ii) forecastability is the primary driver of difficulty, producing a 3.64× spread in MAE across regimes; (iii) deep learning models match or exceed foundation models with 59× fewer parameters; (iv) for both model families, scaling training data yields markedly larger gains than scaling model size. These findings are corroborated by strong consistency across benchmarks and evaluation metrics. Our work is released as open source, enabling reproducible, regime-aware evaluation for time series forecasting research.

One-sentence Summary

Ant Group researchers introduce QUITO-BENCH, a regime-balanced benchmark built on a billion-scale Alipay corpus, to address data scarcity in time series forecasting. It reveals that forecastability drives difficulty and that scaling data outperforms model size, enabling reproducible evaluation across finance and cloud computing.

Key Contributions

  • The paper introduces QUITO-BENCH, a regime-balanced benchmark that categorizes time series by intrinsic statistical properties like trend, seasonality, and forecastability rather than application domains to ensure uniform coverage across eight distinct forecasting regimes.
  • A billion-scale, single-provenance time series corpus named QUITO is presented, featuring uniformly long series from Alipay that eliminate information leakage and support rigorous evaluation at context lengths up to 1,024.
  • Extensive experiments across 232,200 evaluation instances reveal that deep learning models outperform foundation models at short contexts while foundation models dominate at long contexts, and that scaling training data yields greater benefits than scaling model size.

Introduction

Time series forecasting is essential for high-stakes decisions in finance, healthcare, and cloud computing, yet the field faces an evaluation crisis due to a lack of large-scale, high-quality benchmarks. Prior work suffers from coarse domain-based categorization that ignores intrinsic data properties, severe distributional skew where most data falls into a single regime, and information leakage from reusing public datasets across training and testing pipelines. To address these issues, the authors introduce QUITOBENCH, a regime-balanced benchmark built on QUITO, a billion-scale time series corpus from Alipay that ensures leakage-free evaluation across eight distinct trend, seasonality, and forecastability regimes. This new standard enables rigorous comparison of deep learning and foundation models, revealing that forecastability drives difficulty and that data scaling offers greater benefits than increasing model size.

Dataset

  • Dataset Composition and Sources The authors construct the QUITO corpus from production traffic telemetry on Alipay, a major digital payment platform. The data spans nine business verticals, including finance, commerce, and infrastructure, ensuring diversity across a full-scale digital economy rather than a single narrow domain. Each series represents the workload of a distinct application service, recorded as a 5-dimensional vector of anonymized traffic subtypes.

  • Key Details for Each Subset The corpus is divided into two disjoint subsets with no overlap in application identifiers:

    • QUITO-MIN: Contains 22,522 series at 10-minute resolution spanning from July 10, 2023, to August 19, 2023. This subset reflects high-frequency telemetry subject to shorter retention windows.
    • QUITO-HOUR: Contains 12,544 series at 1-hour resolution spanning from November 18, 2021, to August 19, 2023. This subset consists of long-term archived hourly aggregates.
    • QUITOBENCH: A curated evaluation benchmark derived from the full corpus, containing 1,290 test series (773 from QUITO-MIN and 517 from QUITO-HOUR) selected to ensure balanced representation across different time series behaviors.
  • Data Usage and Processing Strategy The authors employ a rigorous pipeline to prepare the data for training and evaluation:

    • Aggregation: Raw 1-second telemetry is aggregated into 10-minute or 1-hour bins using max pooling to preserve workload peaks.
    • Sanitization: The authors apply a two-stage deduplication process to remove exact and near-duplicate series (Pearson correlation > 0.99) and standardize the 5 variates.
    • Regime Labeling: Each series is characterized by a TSF profile (Trend, Seasonality, Forecastability) using STL decomposition and spectral entropy. These metrics are binarized to assign one of eight discrete regime labels.
    • Benchmark Construction: To prevent evaluation bias toward common patterns, the authors use stratified sampling to select approximately 162 series per regime cell for QUITOBENCH, ensuring near-uniform coverage of all eight behavioral types.
  • Splitting and Leakage Prevention The authors enforce a global temporal cutoff at July 28, 2023, to guarantee leakage-free splits across both granularities. Data prior to this date is divided into training (80%) and validation (20%) sets, while data from the cutoff onward forms the test set. This chronological ordering ensures that no future information contaminates the training process.
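As a sketch, the chronological protocol above can be written in a few lines. The cutoff date and the 80/20 pre-cutoff division follow the paper; the `temporal_split` helper name and its per-series form are illustrative assumptions:

```python
from datetime import datetime

# Hypothetical helper: split one timestamped series at the global cutoff,
# then divide the pre-cutoff portion 80/20 into train/validation.
# Everything at or after the cutoff becomes test data, so no future
# information can leak into training.
def temporal_split(timestamps, values, cutoff=datetime(2023, 7, 28)):
    pre = [(t, v) for t, v in zip(timestamps, values) if t < cutoff]
    test = [(t, v) for t, v in zip(timestamps, values) if t >= cutoff]
    n_train = int(0.8 * len(pre))
    return pre[:n_train], pre[n_train:], test

# Usage: 10 daily points straddling the cutoff.
ts = [datetime(2023, 7, d) for d in range(21, 31)]
train, val, test = temporal_split(ts, range(10))
# 7 points precede the cutoff -> 5 train, 2 validation; 3 form the test set.
```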

Method

The authors leverage a structured pipeline to construct a robust time series forecasting benchmark, ensuring contamination-free evaluation and diverse regime coverage. The overall workflow, illustrated in the framework diagram, proceeds through five distinct stages: Raw Collection of production monitoring streams, Standardization involving quality filtering and deduplication, Split Protocol to establish global time cut-offs, Curation to compute dynamic regime labels, and finally Bench Construction to create a balanced evaluation set.

To ensure rigorous evaluation, the data is partitioned using a global time cut-off strategy. As shown in the figure below, the full series is split into training (70%), validation (20%), and test (10%) sets. For the QuitoBench protocol, the authors employ dense rolling windows with a stride of 1 to maximize data utilization during the test phase, contrasting with the non-overlapping windows used in GIFT-Eval style evaluations. This approach generates multiple context-forecast pairs (context length $L$ and horizon $H$) from the test region, allowing for a more granular assessment of model performance across different temporal horizons.
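A minimal sketch of the dense rolling-window protocol, assuming a simple list-based series; the `rolling_windows` helper is our illustration, not the paper's code:

```python
def rolling_windows(series, context_len, horizon, stride=1):
    """Yield dense (context, target) pairs from a test region.

    With stride=1 every admissible offset becomes an evaluation
    instance, unlike non-overlapping GIFT-Eval style windows.
    """
    pairs = []
    for start in range(0, len(series) - context_len - horizon + 1, stride):
        ctx = series[start:start + context_len]
        tgt = series[start + context_len:start + context_len + horizon]
        pairs.append((ctx, tgt))
    return pairs

# A length-10 test region with L=4, H=2 yields 5 overlapping pairs.
pairs = rolling_windows(list(range(10)), context_len=4, horizon=2)
```

With non-overlapping windows the same region would yield only one pair, which illustrates why the dense protocol produces far more evaluation instances per series.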

A core component of the curation process involves characterizing the dynamic regime of each time series using three scalar diagnostics: trend strength ($T$), seasonality strength ($S$), and forecastability ($F$). Each metric is bounded within $[0, 1]$, where higher values indicate a stronger presence of the respective property.

To quantify trend and seasonality, the authors decompose each univariate series $\{x_t\}$ using Seasonal-Trend decomposition via LOESS (STL). This produces three additive components:

$$x_t = \tau_t + s_t + r_t.$$

Here, $\tau_t$ represents the trend, $s_t$ the seasonal component, and $r_t$ the residual. The strengths of seasonality ($S$) and trend ($T$) are defined from the variances of these components:

$$S = \max\!\left(0,\; 1 - \frac{\mathrm{Var}(r)}{\mathrm{Var}(s + r)}\right), \qquad T = \max\!\left(0,\; 1 - \frac{\mathrm{Var}(r)}{\mathrm{Var}(\tau + r)}\right).$$

A value near 1 implies the component dominates the residual, while a value near 0 suggests the component is negligible relative to noise. The seasonal period $p$ is determined by the series resolution, set to 144 for QUITO-MIN and 24 for QUITO-HOUR.
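The two strength formulas can be sketched directly from decomposed components. The paper obtains $\tau$, $s$, $r$ via STL; this illustration simply takes them as given arrays, so any decomposition backend could supply them:

```python
import numpy as np

def strengths(tau, s, r):
    """Trend and seasonality strength from additive components x = tau + s + r.

    Implements T = max(0, 1 - Var(r)/Var(tau + r)) and
               S = max(0, 1 - Var(r)/Var(s + r)).
    """
    T = max(0.0, 1.0 - np.var(r) / np.var(tau + r))
    S = max(0.0, 1.0 - np.var(r) / np.var(s + r))
    return T, S

# Synthetic check: strong linear trend, strong period-24 cycle, small noise.
rng = np.random.default_rng(0)
t = np.arange(24 * 14)
tau = 0.05 * t                        # trend component
s = np.sin(2 * np.pi * t / 24)        # seasonal component, period 24
r = 0.1 * rng.standard_normal(t.size)
T, S = strengths(tau, s, r)           # both land near 1
```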

Forecastability is measured as the complement of normalized spectral entropy. Using Welch's method with a Hann window, the power spectral density $P_k$ is computed for the mean-subtracted series. The normalized entropy $H$ is calculated as:

$$H = -\sum_k \hat{p}_k \log \hat{p}_k \,\Big/\, \log K, \qquad \hat{p}_k = \frac{P_k}{\sum_j P_j},$$

where $K$ is the number of frequency bins. Forecastability is then defined as $F = 1 - H$, with $F = 1$ indicating a perfectly deterministic series and $F = 0$ corresponding to white noise.
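A self-contained sketch of the forecastability score. Note that the paper estimates the PSD with Welch's method and a Hann window; this version substitutes a plain FFT periodogram to stay dependency-free, so the numbers are only an approximation of the paper's metric:

```python
import numpy as np

def forecastability(x):
    """F = 1 - normalized spectral entropy of the mean-subtracted series.

    Assumption: PSD via a plain periodogram rather than the paper's
    Welch/Hann estimator.
    """
    x = np.asarray(x, dtype=float)
    psd = np.abs(np.fft.rfft(x - x.mean())) ** 2
    psd = psd[1:]                      # drop the (zeroed) DC bin
    p = psd / psd.sum()
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    H = -(p * np.log(p)).sum() / np.log(psd.size)
    return 1.0 - H

t = np.arange(1024)
f_sine = forecastability(np.sin(2 * np.pi * t / 64))    # concentrated spectrum, F near 1
f_noise = forecastability(
    np.random.default_rng(0).standard_normal(1024))      # flat spectrum, F near 0
```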

Since each series is multivariate with five variates, the authors compute $T$, $S$, and $F$ independently for each channel and aggregate them by averaging:

$$T_i = \frac{1}{5} \sum_{j=1}^{5} T_{i,j}, \quad S_i = \frac{1}{5} \sum_{j=1}^{5} S_{i,j}, \quad F_i = \frac{1}{5} \sum_{j=1}^{5} F_{i,j}.$$

Finally, each diagnostic is binarized using a fixed threshold of 0.4 to assign a label of HIGH or LOW. These three binary labels are combined to form one of eight distinct TSF regime cells (e.g., TREND × SEASON × FORECAST), ensuring a balanced distribution of difficulty levels across the benchmark.
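Putting the averaging and binarization together, a hypothetical labeling helper (the `regime_label` name, the `/` separator, and treating exactly 0.4 as HIGH are our choices, not specified by the paper):

```python
def regime_label(T, S, F, threshold=0.4):
    """Map per-series (T, S, F) diagnostics to one of 8 TSF regime cells.

    Each diagnostic is binarized at the paper's fixed 0.4 threshold;
    ties at exactly 0.4 are assigned HIGH here by assumption.
    """
    diags = {"T": T, "S": S, "F": F}
    return "/".join(f"{k}-{'HIGH' if v >= threshold else 'LOW'}"
                    for k, v in diags.items())

# Per-series diagnostics are the mean over the 5 variates, as in the
# averaging equation above:
T_chan = [0.9, 0.7, 0.8, 0.85, 0.75]
T = sum(T_chan) / len(T_chan)          # 0.8
label = regime_label(T, 0.1, 0.55)     # 'T-HIGH/S-LOW/F-HIGH'
```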

Experiment

  • Comprehensive Benchmark Evaluation: Ten models across deep learning, foundation, and statistical families were tested under 18 task configurations using dense rolling windows. This validates that deep learning models generally outperform foundation models in short-context scenarios, while foundation models excel with long historical context, and confirms that statistical baselines are insufficient for complex traffic forecasting.
  • Scaling and Efficiency Analysis: Experiments varying data volume and model size demonstrate that increasing training data yields significantly larger performance gains than increasing model parameters. This validates that task-specific deep learning models are far more parameter-efficient, achieving comparable or superior accuracy to massive foundation models with orders of magnitude fewer parameters.
  • Context Length and Horizon Sensitivity: Analysis of context length reveals a functional split where deep learning models are specialists for short histories, whereas foundation models leverage pre-training to exploit long-range dependencies. Forecast horizon tests further show that task-specific architectures maintain stability over long prediction windows better than foundation models, which degrade more rapidly as uncertainty accumulates.
  • TSF Regime Specialization: Evaluation across Trend, Seasonality, and Forecastability (TSF) regimes identifies forecastability as the primary driver of difficulty. Results validate that foundation models are more robust in low-forecastability (noisy) environments and high-seasonality regimes, while deep learning models dominate in trend-driven, low-seasonality scenarios.
  • Robustness and Generalization: Cross-metric and cross-benchmark comparisons confirm that model rankings are consistent regardless of the error metric (MAE vs. MSE) or the specific dataset provenance. This validates that the observed performance differences reflect intrinsic model capabilities rather than artifacts of specific evaluation settings or data sources.
