il y a un mois

Table des matières

Résumé

Les modèles linguistiques à grande échelle (LLM) ont démontré un potentiel remarquable en tant qu’agents autonomes, atteignant des performances proches de celles des experts humains grâce à des raisonnements avancés et à une orchestration efficace d’outils. Toutefois, la prise de décision dans des environnements entièrement dynamiques et en temps réel reste extrêmement complexe, nécessitant une intégration en temps réel d’informations et une capacité de réponse adaptative. Bien que des travaux existants aient exploré des mécanismes d’évaluation en temps réel dans des tâches structurées, un écart critique subsiste en matière de benchmarking systématique pour les applications du monde réel, en particulier dans le domaine financier, où les exigences en matière de réactivité stratégique en temps réel sont rigoureuses. Pour combler cet écart, nous introduisons AI-Trader, le premier benchmark d’évaluation entièrement automatisé, en temps réel et exempt de contamination des données pour les agents LLM dans la prise de décision financière. AI-Trader couvre trois marchés financiers majeurs : les actions américaines, les actions A (A-shares) chinoises et les cryptomonnaies, avec plusieurs niveaux de granularité de trading afin de simuler des environnements financiers réels. Notre benchmark met en œuvre un paradigme révolutionnaire d’information minimale entièrement autonome : les agents reçoivent uniquement un contexte essentiel et doivent rechercher, vérifier et synthétiser indépendamment des informations de marché en temps réel, sans intervention humaine. Nous évaluons six LLM couramment utilisés sur trois marchés et à différentes fréquences de trading. Nos analyses révèlent des résultats frappants : l’intelligence générale ne se traduit pas automatiquement par une capacité efficace au trading, la plupart des agents affichant des rendements médiocres et une gestion du risque faible. Nous démontrons que la capacité de contrôle du risque détermine la robustesse transversale aux marchés, et que les stratégies de trading par IA génèrent plus facilement des rendements excessifs sur les marchés hautement liquides que dans des environnements réglementés ou pilotés par des politiques. Ces résultats mettent en lumière des limites critiques des agents autonomes actuels et fournissent des orientations claires pour leurs améliorations futures.

One-sentence Summary

The authors from the University of Hong Kong propose AI-Trader, a fully automated, live benchmark for evaluating LLM agents in financial decision-making across U.S. stocks, A-shares, and cryptocurrencies, introducing a minimal-information paradigm that forces autonomous real-time data synthesis; their evaluation reveals that general intelligence does not ensure trading success, with risk control emerging as key to cross-market robustness, especially in liquid markets, and underscores the need for improved agent design in dynamic financial environments.

Key Contributions

We introduce AI-Trader, the first fully autonomous, live, and data-uncontaminated benchmark for evaluating LLM agents in real-world financial decision-making across three major markets: U.S. stocks, A-shares, and cryptocurrencies, with multiple trading granularities to simulate dynamic, real-time market conditions.
The benchmark enforces a minimal information paradigm where agents receive only essential context and must independently search, verify, and synthesize live market data through autonomous tool use, eliminating human intervention and enabling rigorous assessment of real-time reasoning and adaptability.
Evaluation of six mainstream LLMs reveals that general intelligence does not ensure trading effectiveness, with most agents showing poor returns and weak risk management, while risk control emerges as a key determinant of cross-market robustness, and excess returns are more achievable in highly liquid markets.

Introduction

Financial markets present a high-stakes, real-time environment where autonomous agents must integrate live information, reason under uncertainty, and make time-critical decisions—challenges that static benchmarks fail to capture. Prior evaluation frameworks often rely on fixed data, pre-defined workflows, or human-in-the-loop interventions, creating a disconnect between lab performance and real-world capability. These limitations hinder meaningful assessment of true autonomous decision-making, especially in dynamic domains like trading.

The authors introduce AI-Trader, the first fully autonomous, live, and data-uncontaminated benchmark for evaluating LLM agents across U.S. stocks, A-shares, and cryptocurrencies. Agents operate with minimal context—only current holdings, real-time prices, and access to tools—requiring them to independently search, verify, and synthesize live market data without any human guidance. This minimal information paradigm enforces rigorous demonstration of long-horizon reasoning, information retrieval, and adaptive strategy execution.

Their evaluation of six mainstream LLMs reveals that general intelligence does not imply trading competence: most agents deliver poor returns and exhibit weak risk management, with performance heavily dependent on market liquidity and structure. The study underscores that risk control is key to cross-market robustness and highlights the need for improved autonomous planning and adaptation in LLM agents. The open-sourced framework enables reproducible, high-fidelity assessment of agent capabilities in real financial environments.

Dataset

The dataset spans three distinct financial markets: the U.S. stock market, China's A-share market, and the cryptocurrency market, enabling evaluation of agent generalization across different regulatory environments, investor behaviors, and market dynamics.
Trading frequencies are supported at both hourly and daily intervals to capture diverse market behaviors and test agent responsiveness across time horizons.
For the U.S. stock market, the dataset includes all 100 constituents of the Nasdaq-100 Index, representing large-cap, non-financial firms in technology, semiconductors, biotech, internet services, and consumer discretionary sectors. Key companies include Apple, Microsoft, NVIDIA, Amazon, Alphabet, Tesla, and Meta.
The U.S. portfolio includes a cash asset with zero return to allow agents to practice risk-free capital allocation and demonstrate full portfolio management skills.
The A-share market subset comprises the 50 stocks in the SSE-50 Index, selected from the Shanghai Stock Exchange. These are major blue-chip firms across finance, consumer goods, industrials, IT, energy, and healthcare, including Ping An Insurance, Kweichow Moutai, and China Merchants Bank.
Both market environments are designed to reflect real-world conditions: the U.S. market emphasizes sensitivity to macroeconomic factors and technological innovation, while the A-share market highlights macro-driven volatility, sector rotation, and non-stationary behavior.
The authors use the data in a training and evaluation framework where agents are trained on a mixture of market subsets with dynamically adjusted ratios to simulate cross-market adaptability.
Data is processed to ensure consistent time alignment across markets, with no external data augmentation.
No cropping is applied; full historical price and volume data are used for each asset.
Metadata includes asset identifiers, sector classifications, market capitalization tiers, and index membership to support structured analysis and model interpretation.

Method

The authors leverage a modular architecture for AI-Trader, designed to support autonomous, adaptive trading agents within a unified and extensible framework. The system operates as a closed-loop process, where agents continuously observe market conditions, reason about potential actions, and execute trades, all while interacting with a live environment through a standardized toolkit. This architecture is structured around three primary components: the agent's observation and action spaces, the toolkit of available tools, and the live environment that simulates real-world trading constraints.

As shown in the figure below, the agent's observation space encompasses key market data such as current asset prices $\mathbf{p}$ and the agent's portfolio holdings $\mathbf{s}$ , forming the initial observation $o_0$ . This base information is augmented dynamically through tool invocations, which provide additional data such as detailed stock indicators $\pi_i$ and general market news $i$ , resulting in a comprehensive observation $o_t$ at each time step. The agent's reasoning process, guided by the ReAct paradigm, operates autonomously, generating natural language traces that articulate its decision-making logic. These traces are recorded to ensure transparency and reproducibility, enabling researchers to analyze the agent's behavior in complex financial contexts.

The action space is constrained to three discrete actions per asset: buy, sell, or hold. The agent's policy function $a_t = f(o_t, r_t)$ maps the current observation $o_t$ and reasoning $r_t$ to an executable action, ensuring decisions are both autonomous and compliant with real-world constraints such as liquidity and regulatory rules. If an action would violate these constraints, the system triggers a self-correction mechanism, requiring the agent to re-evaluate and generate a new, feasible decision.

The toolkit, which includes tools for checking prices, performing web searches, retrieving stock news, executing trades, and performing calculations, is designed to be modular and extensible. Each tool is built upon the Model Context Protocol (MCP), enabling seamless integration and adaptation across different asset classes and trading frequencies. The Trade tool, for instance, enforces market-specific rules such as lot-size requirements and updates portfolio holdings and cash balances in real time, ensuring accurate and auditable trade execution. The interaction between the agent and the live environment is bidirectional: the agent receives updated observations from the environment after each action, and the environment is updated with the agent's decisions, maintaining the integrity of the closed-loop system.

Experiment

Evaluated AI trading agents across U.S. equities, A-shares, and cryptocurrencies using daily (A-shares, crypto) and hourly (U.S. equities) strategies, assessing performance on cumulative return, Sortino ratio, volatility, and maximum drawdown.
MiniMax-M2 achieved the highest cumulative return of 9.56% in the U.S. market (vs. QQQ’s 1.87%) with a Sortino ratio of 4.42 and maximum drawdown of -4.92%, demonstrating superior risk control and robustness across markets.
DeepSeek-v3.1 outperformed the CD5 Index in the crypto market (-12.18% vs. -14.30%) by maintaining a high cash position (up to 41%) and executing strategic buy-the-dip trades, highlighting its adaptability in high-volatility environments.
GPT-5, Qwen3-Max, and Gemini-2.5-Flash underperformed across all markets, with GPT-5 recording a cumulative return of 1.56% in the U.S. market and -16.41% in crypto, underscoring the gap between general language capabilities and effective trading.
Model generalization is limited: DeepSeek-v3.1 excelled in the U.S. market (8.39% CR) but failed in A-shares (-1.23% CR), while MiniMax-M2 maintained consistent performance across all markets, indicating context-aware adaptability is critical.
Case studies revealed that agents can emulate human-like behavior—DeepSeek-v3.1 successfully avoided a major U.S. market crash by diversifying into defensive sectors and increasing cash, but later underperformed due to unverified news, exposing weaknesses in information verification.

The authors use a multi-market, multi-frequency experimental design to evaluate AI trading agents across U.S. equities, A-shares, and cryptocurrencies, with performance measured using cumulative return, Sortino ratio, volatility, and maximum drawdown. Results show that MiniMax-M2 achieves the highest cumulative return and Sortino ratio in the U.S. market and remains the only consistently profitable agent in the A-share market, while DeepSeek-v3.1 outperforms the baseline in cryptocurrencies due to effective cash management and strategic trading during downturns.

The authors use the table to compare the trading behavior of different AI models across three markets, showing that MiniMax-M2 executes trades less frequently than most other models, with the lowest proportion of no-trade executions in the U.S. and A-share markets. DeepSeek-v3.1 and Claude-3.7-Sonnet exhibit higher average trade counts in the U.S. and A-share markets, indicating more active trading strategies, while Gemini-2.5-Flash shows the highest average trades in cryptocurrencies, reflecting its more aggressive approach in that volatile environment.

PDF source Voir le code

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

il y a un mois

Domaine D'application

Approche/Framework

Tianyu Fan Yuhao Yang Yangqin Jiang Yifei Zhang Yuxuan Chen Chao Huang

Table des matières

Résumé

One-sentence Summary

Key Contributions

We introduce AI-Trader, the first fully autonomous, live, and data-uncontaminated benchmark for evaluating LLM agents in real-world financial decision-making across three major markets: U.S. stocks, A-shares, and cryptocurrencies, with multiple trading granularities to simulate dynamic, real-time market conditions.
The benchmark enforces a minimal information paradigm where agents receive only essential context and must independently search, verify, and synthesize live market data through autonomous tool use, eliminating human intervention and enabling rigorous assessment of real-time reasoning and adaptability.
Evaluation of six mainstream LLMs reveals that general intelligence does not ensure trading effectiveness, with most agents showing poor returns and weak risk management, while risk control emerges as a key determinant of cross-market robustness, and excess returns are more achievable in highly liquid markets.

Introduction

Dataset

The dataset spans three distinct financial markets: the U.S. stock market, China's A-share market, and the cryptocurrency market, enabling evaluation of agent generalization across different regulatory environments, investor behaviors, and market dynamics.
Trading frequencies are supported at both hourly and daily intervals to capture diverse market behaviors and test agent responsiveness across time horizons.
For the U.S. stock market, the dataset includes all 100 constituents of the Nasdaq-100 Index, representing large-cap, non-financial firms in technology, semiconductors, biotech, internet services, and consumer discretionary sectors. Key companies include Apple, Microsoft, NVIDIA, Amazon, Alphabet, Tesla, and Meta.
The U.S. portfolio includes a cash asset with zero return to allow agents to practice risk-free capital allocation and demonstrate full portfolio management skills.
The A-share market subset comprises the 50 stocks in the SSE-50 Index, selected from the Shanghai Stock Exchange. These are major blue-chip firms across finance, consumer goods, industrials, IT, energy, and healthcare, including Ping An Insurance, Kweichow Moutai, and China Merchants Bank.
Both market environments are designed to reflect real-world conditions: the U.S. market emphasizes sensitivity to macroeconomic factors and technological innovation, while the A-share market highlights macro-driven volatility, sector rotation, and non-stationary behavior.
The authors use the data in a training and evaluation framework where agents are trained on a mixture of market subsets with dynamically adjusted ratios to simulate cross-market adaptability.
Data is processed to ensure consistent time alignment across markets, with no external data augmentation.
No cropping is applied; full historical price and volume data are used for each asset.
Metadata includes asset identifiers, sector classifications, market capitalization tiers, and index membership to support structured analysis and model interpretation.

Method

Experiment

Evaluated AI trading agents across U.S. equities, A-shares, and cryptocurrencies using daily (A-shares, crypto) and hourly (U.S. equities) strategies, assessing performance on cumulative return, Sortino ratio, volatility, and maximum drawdown.
MiniMax-M2 achieved the highest cumulative return of 9.56% in the U.S. market (vs. QQQ’s 1.87%) with a Sortino ratio of 4.42 and maximum drawdown of -4.92%, demonstrating superior risk control and robustness across markets.
DeepSeek-v3.1 outperformed the CD5 Index in the crypto market (-12.18% vs. -14.30%) by maintaining a high cash position (up to 41%) and executing strategic buy-the-dip trades, highlighting its adaptability in high-volatility environments.
GPT-5, Qwen3-Max, and Gemini-2.5-Flash underperformed across all markets, with GPT-5 recording a cumulative return of 1.56% in the U.S. market and -16.41% in crypto, underscoring the gap between general language capabilities and effective trading.
Model generalization is limited: DeepSeek-v3.1 excelled in the U.S. market (8.39% CR) but failed in A-shares (-1.23% CR), while MiniMax-M2 maintained consistent performance across all markets, indicating context-aware adaptability is critical.
Case studies revealed that agents can emulate human-like behavior—DeepSeek-v3.1 successfully avoided a major U.S. market crash by diversifying into defensive sectors and increasing cash, but later underperformed due to unverified news, exposing weaknesses in information verification.

PDF source Voir le code

Table des matières

Créer de l'IA avec l'IA

De l'idée au lancement — accélérez votre développement IA avec le co-codage IA gratuit, un environnement prêt à l'emploi et le meilleur prix pour les GPU.

Codage assisté par IA

GPU prêts à l’emploi

Tarifs les plus avantageux

Commencer Voir les tarifs

HyperAI Newsletters

Abonnez-vous à nos dernières mises à jour

Nous vous enverrons les dernières mises à jour de la semaine dans votre boîte de réception à neuf heures chaque lundi matin

Propulsé par MailChimp

Command Palette

AI-Trader : Évaluation des agents autonomes sur des marchés financiers en temps réel

Tianyu Fan Yuhao Yang Yangqin Jiang Yifei Zhang Yuxuan Chen Chao Huang

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

AI-Trader : Évaluation des agents autonomes sur des marchés financiers en temps réel

Tianyu Fan Yuhao Yang Yangqin Jiang Yifei Zhang Yuxuan Chen Chao Huang

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters

Command Palette

AI-Trader : Évaluation des agents autonomes sur des marchés financiers en temps réel

Tianyu Fan Yuhao Yang Yangqin Jiang Yifei Zhang Yuxuan Chen Chao Huang

Résumé

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Créer de l'IA avec l'IA

HyperAI Newsletters