ASTRA: Automated Synthesis of Agentic Trajectories and Reinforcement Arenas

Abstract

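Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these problems, we propose ASTRA, a fully automated end-to-end framework for training tool-augmented agents through scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that exploits the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Building on this, we develop a unified training methodology that integrates SFT with online RL, using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks show that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. The full pipelines, environments, and trained models are released at https://github.com/LianjiaTech/astra.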

One-sentence Summary

Beike Language and Intelligence propose ASTRA, a fully automated framework for training tool-augmented LLM agents via scalable data synthesis and verifiable RL, enabling stable multi-turn learning and outperforming prior methods on agentic benchmarks while preserving core reasoning.

Key Contributions

  • ASTRA introduces a fully automated framework for training tool-augmented LLM agents by synthesizing multi-turn trajectories from tool-call graph topologies and converting QA traces into rule-verifiable, code-executable environments, eliminating manual intervention and enabling deterministic reinforcement learning.
  • The method combines supervised fine-tuning with online multi-turn RL using trajectory-level rewards to jointly optimize task completion and interaction efficiency, overcoming limitations of single-regime training and improving long-horizon decision-making stability.
  • Evaluated on agentic benchmarks including BFCL v3, ASTRA-trained models achieve state-of-the-art performance at comparable scales, rivaling closed-source systems while maintaining core reasoning capabilities, with all pipelines and models publicly released.

Introduction

The authors leverage large language models as tool-augmented agents for multi-step decision making, a capability critical for real-world applications like data analysis and interactive systems. Prior methods suffer from reliance on non-verifiable simulated environments, fragmented single-step training, and exclusive use of either supervised fine-tuning or reinforcement learning—limiting stable, long-horizon learning and generalization. ASTRA introduces a fully automated framework that synthesizes multi-turn, structurally grounded trajectories using tool-call graphs for supervised training, and generates code-executable, rule-verifiable environments from human reasoning traces to enable deterministic, multi-turn reinforcement learning. Their unified training methodology combines SFT with online RL using trajectory-level rewards, achieving state-of-the-art performance while preserving core reasoning and enabling scalable, reproducible agent development.

Dataset

The authors use a carefully curated dataset for training and evaluating tool-using LLMs, built from heterogeneous tool documentation and synthesized user tasks. Here’s how it’s structured and processed:

  • Dataset Composition and Sources
    Tool documents are collected from open MCP registries (e.g., Smithery, RapidAPI), internal specs, and public datasets. These are normalized into a unified OpenAI-style tool-calling schema and grouped by service (called “MCP servers”). Only 1,585 servers (19,036 tools across 41 domains) pass filtering for sufficient tool count, clear descriptions, and schema compatibility.

  • Task Construction and Augmentation
    For each server, tasks are synthesized via two modes:

    • Chain-conditioned: Generates tasks aligned with executable multi-step tool workflows.
    • Server-only: Broadens coverage by generating tasks from server specs alone.

    Augmentation expands tasks along three axes: diversity (paraphrasing), complexity (adding constraints), and persona (user style variations), while preserving language and intent. Tasks are scored on clarity, realism, and tool necessity; only those meeting all thresholds are retained.

  • SFT Dataset
    Contains 54,885 multi-turn conversations (580,983 messages total), averaging 10.59 messages and 4.42 tool calls per sample. 72.2% have 1–5 tool calls. Tool responses and assistant utterances dominate (41.8% and 39.3%). Covers 6,765 unique functions across reasoning, search, and computation.

  • RL Dataset
    Includes 6,596 bilingual samples (71.2% English, 28.8% Chinese) spanning domains like Real Estate (15.6%), E-commerce (10.6%), and Healthcare (8.1%). Average 4.37 reasoning hops (median 4.0), with 47.8% being Parallel Multi-Hop scenarios. 91.3% of 28,794 sub-questions require tool calls; average 3.98 tool calls per sample. 44.2% of steps can be parallelized.

  • Processing and Metadata
    Tasks are generated via structured templates specifying domain, knowledge corpus, and hop constraints (min/max). Scenarios follow four logical structures: Single-Hop, Parallel Single-Hop, Multi-Hop, and Parallel Multi-Hop. All samples are filtered for quality, realism, and tool dependency before inclusion.

Method

The authors leverage a dual-track methodology for synthesizing tool-augmented agent training data: one for supervised fine-tuning (SFT) via trajectory generation and another for reinforcement learning (RL) via verifiable environment construction. Each track is underpinned by distinct architectural components and validation mechanisms designed to ensure realism, diversity, and executability.

For SFT, the pipeline begins with tool document collection and normalization, followed by tool-chain synthesis and validation. As shown in the framework diagram, the process is structured into five sequential stages. Stage 1 filters raw tool specifications into qualified documents by extracting function names and parameter schemas. Stage 2 constructs a directed transition graph from synthesized tool-chains, where nodes represent tools and edges denote valid consecutive invocations. Candidate chains are sampled via length-bounded random walks over this graph, biased by edge weights derived from prior synthesis frequency. Stage 3 generates task-conditioned queries paired with these chains, using an LLM to evolve plausible user intents and validate their coherence. Stage 4 executes multi-turn interactions using Qwen-Agent, integrating both real MCP servers and stateful emulators for doc-only tools. The emulator injects failures at a 20% rate to simulate real-world unreliability. Finally, Stage 5 applies a multi-dimensional reward model to assess trajectory quality along seven axes: query understanding, planning, tool-response context understanding, tool-response context-conditioned planning, tool call status, tool conciseness, and final answer quality. These scores are averaged into a single scalar reward for SFT signal generation.
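The Stage 2 chain sampling can be sketched as a weighted random walk. This is a minimal illustration, not the authors' implementation: the function names, the example tool names, and the choice to use raw transition counts as edge weights are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_transition_graph(chains):
    """Count consecutive tool pairs; the counts serve as edge weights (assumed weighting)."""
    graph = defaultdict(lambda: defaultdict(int))
    for chain in chains:
        for src, dst in zip(chain, chain[1:]):
            graph[src][dst] += 1
    return graph


def sample_chain(graph, start, max_len, rng):
    """Length-bounded random walk from `start`, biased by edge weights."""
    chain = [start]
    while len(chain) < max_len:
        successors = graph.get(chain[-1])
        if not successors:  # dead end: this tool has no outgoing edge
            break
        tools = list(successors)
        weights = [successors[t] for t in tools]
        chain.append(rng.choices(tools, weights=weights, k=1)[0])
    return chain


# Hypothetical previously synthesized tool-chains.
chains = [
    ["search_flights", "book_flight", "send_receipt"],
    ["search_flights", "get_weather"],
    ["search_flights", "book_flight"],
]
graph = build_transition_graph(chains)
sample = sample_chain(graph, "search_flights", max_len=3, rng=random.Random(0))
```

Every sampled chain follows only edges that were observed in earlier chains, so each consecutive pair of tools is a valid invocation sequence by construction.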

For RL, the authors adopt a QA-based environment synthesis framework that models multi-turn tool use as navigation over a latent semantic topology. As illustrated in the figure below, the pipeline comprises four stages: Q-A instance synthesis, quality validation, environment synthesis, and sub-environment merging. In Stage 1, the system generates a main question $q_0$ and its answer $a_0$, along with a set of intermediate sub-questions $\mathcal{S} = \{(q_i, a_i)\}_{i=1}^{m}$ that form a dependency graph $\mathcal{G}$, such that $a_0 = \Phi(\{a_i\}, \mathcal{G})$. Synthesis is conditioned on a knowledge source $\mathcal{K}$ and a hop budget $H$, operating in either question-conditional or unconditional mode. Stage 2 validates each decomposed instance across four dimensions: dependency consistency (ensuring each sub-question’s dependencies are logically necessary), sub-question atomicity (verifying indivisibility), sequential rationality (checking execution order validity), and task completeness (confirming sufficiency to solve $q_0$). The overall quality score $\mathrm{QS}(\tau)$ is the mean of these four binary indicators. Stage 3 synthesizes executable sub-environments for each non-leaf sub-task node by generating tool specifications, augmenting their complexity, and implementing Python-based tool code validated in a sandbox. Stage 4 merges functionally equivalent sub-environments by identifying homogeneous sub-questions and expanding their underlying data structures to support multiple parameter variants while preserving correctness.
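The Stage 2 quality score is the mean of four binary checks, which the sketch below computes directly. The class and function names are hypothetical. The ordering helper is an illustrative rule-based stand-in for the "sequential rationality" check and is an assumption: the summary does not say how each indicator is actually judged.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    dependency_consistency: bool
    sub_question_atomicity: bool
    sequential_rationality: bool
    task_completeness: bool


def quality_score(v):
    """Mean of the four binary indicators, a value in {0, 0.25, 0.5, 0.75, 1}."""
    checks = [
        v.dependency_consistency,
        v.sub_question_atomicity,
        v.sequential_rationality,
        v.task_completeness,
    ]
    return sum(checks) / len(checks)


def order_respects_dependencies(order, deps):
    """True if every sub-question appears after all sub-questions it depends on."""
    position = {q: i for i, q in enumerate(order)}
    return all(
        d in position and position[d] < position[q]
        for q in order
        for d in deps.get(q, ())
    )


# Hypothetical instance: q3 depends on q1 and q2, which are independent of each other.
deps = {"q3": ["q1", "q2"]}
ok = order_respects_dependencies(["q1", "q2", "q3"], deps)
score = quality_score(ValidationResult(True, True, ok, True))  # 1.0
partial = quality_score(ValidationResult(True, True, False, True))  # 0.75
```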

The RL training infrastructure implements an online, multi-turn agentic paradigm. At each step, the policy model generates a tool invocation statement, which is executed in a code sandbox alongside its corresponding tool implementation. The sandbox returns the tool output, which is fed back as an observation to condition subsequent decisions. The interaction terminates upon reaching a maximum turn limit, sequence length, or when the model ceases tool calls. The resulting trajectory is used for policy optimization via a modified GRPO objective that omits KL and entropy terms for stability. To mitigate training instability from degenerate reward variance, the authors introduce Adaptive Batch Filling: a strategy that maintains a buffer of valid rollouts (those with non-zero reward variance) and selects the first $n$ valid samples for each optimization step, ensuring dense gradient signals. The reward function is F1-style, defined as the harmonic mean of sub-task recall $r = \hat{n}/n$ and tool usage precision $p = \hat{n}/(c + \epsilon)$, where $\hat{n}$ is the number of solved sub-tasks and $c$ is the total tool calls. This design incentivizes both task completion and interaction efficiency.
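A minimal sketch of the F1-style reward and the variance-based filtering follows. It rests on several assumptions: the function names are hypothetical, the value of `eps` is arbitrary, $n$ in the recall term is taken to be the total number of sub-tasks, and a "valid sample" is interpreted as a group of rollouts for one prompt whose rewards are not all identical.

```python
def f1_reward(solved, total_subtasks, tool_calls, eps=1e-6):
    """Harmonic mean of sub-task recall and tool-usage precision."""
    if total_subtasks <= 0:
        raise ValueError("total_subtasks must be positive")
    recall = solved / total_subtasks
    precision = solved / (tool_calls + eps)
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def fill_batch(reward_groups, n):
    """Keep the first n rollout groups whose rewards are not all equal.

    A group with identical rewards has zero variance and therefore yields
    no group-relative advantage signal, so it is skipped.
    """
    batch = []
    for rewards in reward_groups:
        if max(rewards) != min(rewards):
            batch.append(rewards)
            if len(batch) == n:
                break
    return batch


# 3 of 4 sub-tasks solved using 5 tool calls: recall 0.75, precision ~0.6.
reward = f1_reward(solved=3, total_subtasks=4, tool_calls=5)  # ~0.667

groups = [[0, 0, 0], [1, 0, 0.5], [1, 1, 1], [0.2, 0.8, 0.2], [0, 1, 0]]
batch = fill_batch(groups, n=2)  # [[1, 0, 0.5], [0.2, 0.8, 0.2]]
```

Solving the same three sub-tasks with more tool calls lowers precision and therefore the reward, which is how the harmonic mean penalizes tool overuse without rewarding underuse.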

To enhance robustness, the authors augment each training instance with task-irrelevant tools sampled from three semantic similarity bands (high, medium, low) derived from cosine similarity of tool documentation embeddings. This encourages the model to discriminate relevant tools rather than overfit to minimal sets. Training operates in a strictly online setting with batch size 256, fixed learning rate $2 \times 10^{-6}$, and long-context support (25,600-token prompts, 49,152-token responses) to accommodate multi-turn interactions.
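The banded distractor sampling could look like the sketch below. The band thresholds (0.7 and 0.4), the toy two-dimensional embeddings, the tool names, and the use of a single reference embedding per task are all assumptions made for illustration; the summary does not give these details.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def band_of(sim, high=0.7, low=0.4):
    """Map a similarity to a band; the thresholds are hypothetical."""
    if sim >= high:
        return "high"
    if sim >= low:
        return "medium"
    return "low"


def sample_distractors(task_emb, candidates, per_band, rng):
    """Pick up to `per_band` task-irrelevant tools from each similarity band."""
    bands = {"high": [], "medium": [], "low": []}
    for name, emb in candidates.items():
        bands[band_of(cosine(task_emb, emb))].append(name)
    picked = []
    for names in bands.values():
        picked.extend(rng.sample(names, min(per_band, len(names))))
    return picked


# Toy embeddings for hypothetical distractor tools.
task_emb = [1.0, 0.0]
candidates = {
    "get_listing_price": [0.9, 0.1],
    "get_listing_photos": [1.0, 0.2],
    "convert_currency": [0.5, 0.8],
    "play_music": [0.0, 1.0],
}
picked = sample_distractors(task_emb, candidates, per_band=1, rng=random.Random(0))
```

Sampling the same number of tools from each band keeps the similarity coverage balanced, so the distractor set mixes near-miss tools with clearly unrelated ones.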

The SFT infrastructure employs a modified checkpointing strategy in HuggingFace Transformers to reduce I/O and storage overhead: model weights are saved at high frequency, while training-state checkpoints (optimizer, scheduler) are retained only for the most recent 1–2 saves. This preserves fine-grained observability for analysis without compromising scalability.
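One way to approximate this retention policy is to prune training-state files after the fact, as sketched below. This is a swapped-in technique and not the authors' modification to the Transformers checkpointing code. It assumes the usual `checkpoint-<step>` directory layout, with optimizer and scheduler state stored as `optimizer.pt` and `scheduler.pt`; the function name and the `keep_last` parameter are hypothetical.

```python
from pathlib import Path

# Training-state files to drop from older checkpoints (assumed file names).
STATE_FILES = ("optimizer.pt", "scheduler.pt")


def prune_training_state(output_dir, keep_last=2):
    """Delete optimizer/scheduler state from all but the newest checkpoints.

    Model weights in every checkpoint directory are left untouched.
    """
    checkpoints = sorted(
        (
            p
            for p in Path(output_dir).glob("checkpoint-*")
            if p.is_dir() and p.name.split("-")[-1].isdigit()
        ),
        key=lambda p: int(p.name.split("-")[-1]),
    )
    stale = checkpoints[:-keep_last] if keep_last > 0 else checkpoints
    removed = []
    for ckpt in stale:
        for name in STATE_FILES:
            state_file = ckpt / name
            if state_file.exists():
                state_file.unlink()
                removed.append(state_file)
    return removed
```

Calling `prune_training_state("outputs/run1", keep_last=2)` after each save would leave every checkpoint usable for evaluation, while only the two most recent could be used to resume training.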

Experiment

  • Trained Qwen3-14B and Qwen3-32B via SFT and RL with context parallelism to handle long sequences; used cosine learning rate schedules with warmup.
  • Evaluated on agentic benchmarks (BFCL-MT, τ²-Bench, ACEBench) requiring multi-turn tool use and user simulation, plus non-agentic math benchmarks (AIME2024/2025) to assess core reasoning.
  • Achieved state-of-the-art or competitive results across model scales; RL stage delivered the largest performance gains over SFT and original models.
  • Irrelevant tool mixing during RL improved tool discrimination by forcing the model to reject incorrect options, with balanced similarity coverage yielding best results.
  • F1-style reward (balancing recall and precision) stabilized training and prevented extremes of tool overuse or underuse; recall-only and precision-only rewards led to training collapse.
  • SFT improved structured tool use and state tracking; RL enabled broader exploration via sub-QA supervision, supporting recovery from suboptimal decisions.
  • Output length decreased after SFT, then increased moderately after RL; interaction steps remained stable across stages, indicating performance gains were not due to length changes.
  • Method preserved strong reasoning on non-agentic tasks while enhancing agentic tool-use capability.

The authors use a two-stage training approach, supervised fine-tuning followed by reinforcement learning, to enhance tool-use capabilities in Qwen3 models, with RL delivering the largest performance gains across agentic benchmarks. Results show consistent improvements in multi-turn tool interaction and task completion, while maintaining strong performance on non-agentic reasoning tasks. The final RL-trained models outperform both their base and SFT counterparts, achieving state-of-the-art or competitive results relative to larger open-source and closed-source models.

The authors observe that while fine-tuning stages do not significantly alter the average number of interaction steps per subtask, they substantially reduce token usage per step, with the RL stage producing outputs of intermediate length between the original and SFT models. This suggests that performance gains stem from more efficient reasoning and tool-use patterns rather than changes in dialogue depth. The RL-trained models maintain higher token efficiency per subtask compared to the base models, indicating improved conciseness without sacrificing task coverage.

The authors use a two-stage training approach combining supervised fine-tuning and reinforcement learning to enhance agentic tool-use capabilities in Qwen3 models. Results show that their method achieves state-of-the-art performance on multi-turn tool-use benchmarks at matched parameter scales, with reinforcement learning delivering the largest performance gains. The approach also maintains strong performance on non-agentic reasoning tasks, indicating that tool-use optimization does not compromise core reasoning ability.

The authors evaluate their ASTRA models on the non-agentic mathematical reasoning benchmarks AIME2024 and AIME2025 under two decoding settings, showing that performance remains stable or slightly improves after training, with no significant degradation despite the focus on agentic tool use. Results indicate that the 32B variant consistently outperforms the 14B variant, and both ASTRA versions maintain competitive scores across both decoding configurations.

