Trace2Skill: Distilling Trajectory-Local Lessons into Transferable Agent Skills
Jingwei Ni Yihao Liu Xinpeng Liu Yutao Sun Mengyu Zhou Pengyu Cheng Dexin Wang Xiaoxi Jiang Guanjun Jiang
Abstract
Equipping large language model (LLM) agents with domain-specific skills is crucial for tackling complex tasks, yet manual skill creation imposes a severe scalability bottleneck. Automated skill generation, meanwhile, tends to produce fragile, fragmented results, either relying on shallow parametric knowledge or sequentially overfitting to trajectory-local lessons that do not generalize. To overcome this, we propose Trace2Skill, a framework that mimics how human experts create skills: holistically analyzing broad execution experience and condensing it into a single comprehensive guide. Rather than reacting sequentially to individual trajectories, Trace2Skill dispatches many sub-agents in parallel to analyze a diverse pool of executions, extracts trajectory-specific lessons, and, through inductive reasoning, hierarchically consolidates them into a unified, conflict-free skill directory. Trace2Skill supports both deepening existing human-written skills and creating new skills from scratch. Experiments on challenging domains including spreadsheets, VisionQA, and mathematical reasoning show that Trace2Skill substantially outperforms strong baselines, including Anthropic's official xlsx skill. Crucially, this trajectory-grounded evolution does more than memorize task instances or model-specific quirks: the evolved skills transfer across LLM scales and generalize to out-of-distribution (OOD) settings. For example, skills evolved by Qwen3.5-35B from its own trajectories improve a Qwen3.5-122B agent's performance on WikiTableQuestions by up to 57.65 percentage points. Ultimately, our results demonstrate that complex agentic experience can be packaged into highly transferable declarative skills that require no parameter updates or external retrieval modules and can be leveraged even by 35B-parameter open-source models.
One-sentence Summary
Researchers from Alibaba, ETH Zurich, and Peking University introduce Trace2Skill, a framework that parallelizes sub-agent analysis of execution trajectories to distill fragmented lessons into unified, transferable skills, outperforming sequential online updates and retrieval-based baselines across spreadsheet, math, and vision tasks without requiring parameter updates.
Key Contributions
- The paper introduces Trace2Skill, a framework that dispatches a parallel fleet of sub-agents to analyze diverse execution trajectories and hierarchically consolidate trajectory-specific lessons into a unified, conflict-free skill directory via inductive reasoning.
- This work demonstrates that skills evolved through holistic parallel analysis transfer effectively across different LLM scales and generalize to out-of-distribution settings, such as improving a 122B agent by up to 57.65 percentage points using skills generated by a 35B model.
- Experimental results confirm that the proposed parallel consolidation method outperforms both online sequential editing and retrieval-based experience banks while requiring no parameter updates or external retrieval modules.
Introduction
Equipping LLM agents with domain-specific skills is essential for handling complex tasks, yet manual creation creates a scalability bottleneck while automated methods often produce fragile results due to reliance on shallow parametric knowledge or sequential overfitting to isolated trajectory lessons. Prior approaches typically update skills sequentially as new data arrives or rely on retrieval-based memory banks, which leads to fragmented skill collections and poor generalization across different model scales or out-of-distribution settings. The authors introduce Trace2Skill, a framework that mimics human expertise by analyzing a diverse pool of execution trajectories in parallel to distill trajectory-local lessons into a single, comprehensive, and conflict-free skill directory. This approach leverages inductive reasoning to create transferable declarative skills that improve performance across varying LLM scales and task domains without requiring parameter updates or external retrieval modules.
Dataset
- Dataset Composition and Sources: The authors construct a dataset of 323 map patches derived from 122B-parameter model runs on the SpreadsheetBench-Verified benchmark. These patches capture Standard Operating Procedures (SoPs) distilled from agent trajectories, with the four most prevalent themes accounting for the majority of citations.
- Key Subset Details:
- Formula Recalculation and Verification: 178 patches focus on running recalculation scripts and reopening files with data_only=True to prevent stale cells.
- Tool Selection: 177 patches advocate using openpyxl for write-back operations instead of pandas.to_excel() to preserve formula relationships and named ranges.
- Explicit Read-back Verification: 138 patches emphasize reopening output files to confirm target cell values before submission.
- Structural-edit Safety: 53 patches address safe row deletion practices, such as deleting in descending order and copying input workbooks to prevent index-shift corruption.
- Niche Quirks: Low-support observations are routed into 13 supplementary reference files rather than the main skill document to handle edge cases like cell color extraction or specific business logic mismatches.
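The highest-support SoPs above can be sketched as a small openpyxl helper. This is a hedged illustration, not the paper's code: the function names are invented, and note that `data_only=True` only returns cached formula results, so formula cells still require an external recalculation pass before read-back.

```python
# Sketch of the distilled spreadsheet SoPs: write back with openpyxl (rather
# than pandas.to_excel, which drops formula relationships and named ranges),
# delete rows in descending order to avoid index-shift corruption, and reopen
# the output file to verify target cells before submission.
from openpyxl import Workbook, load_workbook

def delete_rows_safely(ws, rows):
    """Delete rows in descending order so earlier deletions
    do not shift the indices of rows still to be deleted."""
    for r in sorted(rows, reverse=True):
        ws.delete_rows(r)

def verify_written_values(path, expected):
    """Explicit read-back verification: reopen the saved file and
    confirm each target cell holds the expected value.
    (data_only=True yields cached values; plain values read back as-is,
    while formula cells need an external recalculation first.)"""
    wb = load_workbook(path, data_only=True)
    ws = wb.active
    return {cell: ws[cell].value == val for cell, val in expected.items()}

# Usage: write values with openpyxl, delete rows safely, then verify.
wb = Workbook()
ws = wb.active
for i in range(1, 6):
    ws.cell(row=i, column=1, value=i * 10)  # A1..A5 = 10..50
delete_rows_safely(ws, [2, 4])              # row 4 first, then row 2
wb.save("out.xlsx")
checks = verify_written_values("out.xlsx", {"A1": 10, "A2": 30, "A3": 50})
print(checks)
```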
- Model Usage and Processing: The pipeline automatically recovers a hierarchical skill structure from trajectory evidence without manual curation. General procedural guidance flows into the main SKILL.md file, while case-specific rules populate the references directory. This hierarchy mirrors established skill-design practices where universal workflow rules are separated from infrequent edge cases.
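The support-based routing described above can be sketched as follows; the threshold value and patch fields are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of support-based routing: patches with broad support flow
# into the main SKILL.md, while low-support niche quirks are written to
# separate files under references/.
def route_patches(patches, min_support=50):
    main_doc, references = [], {}
    for p in patches:
        if p["support"] >= min_support:
            main_doc.append(p["guidance"])  # universal workflow rule
        else:
            # niche quirk: gets its own reference file instead
            references[f"references/{p['theme']}.md"] = p["guidance"]
    return main_doc, references

patches = [
    {"theme": "recalc", "support": 178,
     "guidance": "Reopen with data_only=True after recalculation."},
    {"theme": "cell-color", "support": 4,
     "guidance": "Extract fill colors via the cell's fill attribute."},
]
main_doc, refs = route_patches(patches)
print(main_doc)    # high-support guidance only
print(list(refs))  # ['references/cell-color.md']
```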
- Patch Generation and Consolidation: Individual error analysts generate structured patches for single trajectories, such as identifying failures where agents delete rows outside specified ranges. These 323 individual patches undergo a four-level hierarchical merging process to produce final consolidated patches that encode robust safety checks and validation steps for row and column operations.
Experiment
- Spreadsheet experiments validate that distilling trajectory-grounded skills significantly outperforms both human-written priors and parametric knowledge alone, with error-driven analysis providing the most reliable improvements across in-distribution and out-of-distribution tasks.
- Math reasoning evaluations confirm that the skill synthesis approach generalizes beyond spreadsheets to competition-level problems, demonstrating domain-agnostic capabilities that transfer effectively across different model scales.
- Visual question answering results reveal a dissociation between task execution and skill authoring, showing that a model's ability to perform well on a benchmark does not guarantee the reflective capacity required to analyze failures and generate transferable skills.
- Comparisons of evolution strategies demonstrate that parallel consolidation of error lessons yields higher quality and greater efficiency than sequential editing by preventing context drift and enabling simultaneous inductive reasoning.
- Benchmarks against retrieval-based memory systems show that distilling observations into a compact skill document is superior to episodic retrieval, as it avoids sensitivity to surface-level query similarity and integrates guidance directly into the system prompt.
- Ablation studies on error analysis methods prove that an agentic loop with artifact access and fix validation produces more transferable patches than single-call LLM analysis, which often misidentifies root causes and hallucinates failure mechanisms.
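The agentic loop that the ablation favors can be sketched as below. All callables are hypothetical stand-ins for the paper's interface: the point is the structure, in which a diagnosis is grounded in inspected artifacts and a patch is emitted only after the proposed fix is validated by re-running the task.

```python
# Hedged sketch of an agentic error-analysis loop: instead of a single LLM
# call guessing a root cause, the analyst inspects output artifacts, proposes
# a fix, and only emits a patch once the fix passes validation.
def analyze_with_validation(trajectory, inspect_artifacts, propose_fix,
                            rerun_task, max_rounds=3):
    evidence = inspect_artifacts(trajectory)   # ground the diagnosis
    for _ in range(max_rounds):
        patch = propose_fix(trajectory, evidence)
        if rerun_task(trajectory, patch):      # fix-validation gate
            return patch                       # validated, transferable
        evidence.append(f"fix failed: {patch}")  # feed failure back in
    return None                                # refuse to emit an unvalidated patch

# Toy usage: the true root cause is "index-shift"; the first proposal fails.
proposals = iter(["stale-cache", "index-shift"])
patch = analyze_with_validation(
    trajectory="t1",
    inspect_artifacts=lambda t: ["rows shifted after delete"],
    propose_fix=lambda t, e: next(proposals),
    rerun_task=lambda t, p: p == "index-shift",
)
print(patch)  # index-shift
```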