Trace2Skill: Distilling Trajectory-Local Lessons into Transferable Agent Skills
Jingwei Ni Yihao Liu Xinpeng Liu Yutao Sun Mengyu Zhou Pengyu Cheng Dexin Wang Xiaoxi Jiang Guanjun Jiang
Abstract
Equipping Large Language Model (LLM) agents with domain-specific skills is crucial for tackling complex tasks. However, manually authoring such skills poses a significant scalability barrier. Conversely, automated skill generation often yields fragile or fragmented results, as it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable, trajectory-local lessons. To overcome this, we present Trace2Skill, a framework modeled on how human experts author skills: through a holistic analysis of a broad range of execution experiences, subsequently condensed into a single, comprehensive guide. Rather than reacting to individual trajectories sequentially, Trace2Skill deploys a parallel fleet of sub-agents to analyze a diverse set of executions. Via inductive reasoning, the system extracts trajectory-local lessons and hierarchically consolidates them into a unified, conflict-free skill directory. Trace2Skill supports both deepening existing, human-written skills and creating new skills from scratch. Experiments in demanding domains such as spreadsheet manipulation, VisionQA, and mathematical reasoning show that Trace2Skill significantly outperforms strong baselines, including Anthropic's official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: the evolved skills transfer across LLM scales and generalize to out-of-distribution (OOD) scenarios.
For example, skills evolved by a Qwen3.5-35B model on its own trajectories improved a Qwen3.5-122B agent on WikiTableQuestions by up to 57.65 absolute percentage points. Ultimately, our results demonstrate that complex agent experiences can be converted into highly transferable, declarative skills, without parameter updates, without external retrieval modules, and using open-source models with only 35 billion parameters.
One-sentence Summary
Researchers from Alibaba, ETH Zurich, and Peking University introduce Trace2Skill, a framework that parallelizes sub-agent analysis of execution trajectories to distill fragmented lessons into unified, transferable skills, outperforming sequential online updates and retrieval-based baselines across spreadsheet, math, and vision tasks without requiring parameter updates.
Key Contributions
- The paper introduces Trace2Skill, a framework that dispatches a parallel fleet of sub-agents to analyze diverse execution trajectories and hierarchically consolidate trajectory-specific lessons into a unified, conflict-free skill directory via inductive reasoning.
- This work demonstrates that skills evolved through holistic parallel analysis transfer effectively across different LLM scales and generalize to out-of-distribution settings, such as improving a 122B agent by up to 57.65 percentage points using skills generated by a 35B model.
- Experimental results confirm that the proposed parallel consolidation method outperforms both online sequential editing and retrieval-based experience banks while requiring no parameter updates or external retrieval modules.
Introduction
Equipping LLM agents with domain-specific skills is essential for handling complex tasks, yet manual authoring creates a scalability bottleneck, while automated methods often produce fragile results because they rely on shallow parametric knowledge or overfit sequentially to isolated trajectory lessons. Prior approaches typically update skills sequentially as new data arrives or rely on retrieval-based memory banks, which leads to fragmented skill collections and poor generalization across model scales and out-of-distribution settings. The authors introduce Trace2Skill, a framework that mimics human expertise by analyzing a diverse pool of execution trajectories in parallel and distilling trajectory-local lessons into a single, comprehensive, conflict-free skill directory. This approach leverages inductive reasoning to create transferable declarative skills that improve performance across LLM scales and task domains without parameter updates or external retrieval modules.
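The parallel analyze-then-consolidate pattern can be sketched as follows; the analyst function is a stand-in for an LLM sub-agent and all names here are illustrative, not the paper's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_trajectory(trajectory):
    # Stand-in for an LLM sub-agent: extract trajectory-local lessons,
    # here simply the lessons attached to failed steps.
    return {step["lesson"] for step in trajectory if step.get("failed")}

def distill_skills(trajectories, max_workers=8):
    # Parallel fleet: each sub-agent inspects one trajectory independently,
    # rather than editing a shared skill document sequentially.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        local_lessons = list(pool.map(analyze_trajectory, trajectories))
    # Consolidation: merge all trajectory-local lessons into one
    # conflict-free directory (modeled here as a sorted, de-duplicated list).
    return sorted(set().union(*local_lessons)) if local_lessons else []
```

Because the analysts never see each other's output, no single context window has to hold every trajectory; only the final consolidation step reasons over the pooled lessons.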
Dataset
- Dataset Composition and Sources: The authors construct a dataset of 323 map patches derived from 122B parameter model runs on the SpreadsheetBench-Verified benchmark. These patches capture Standard Operating Procedures (SoPs) distilled from agent trajectories, with the four most prevalent themes accounting for the majority of citations.
- Key Subset Details:
- Formula Recalculation and Verification: 178 patches focus on running recalculation scripts and reopening files with data_only=True to prevent stale cells.
- Tool Selection: 177 patches advocate using openpyxl for write-back operations instead of pandas.to_excel() to preserve formula relationships and named ranges.
- Explicit Read-back Verification: 138 patches emphasize reopening output files to confirm target cell values before submission.
- Structural-edit Safety: 53 patches address safe row deletion practices, such as deleting in descending order and copying input workbooks to prevent index-shift corruption.
- Niche Quirks: Low-support observations are routed into 13 supplementary reference files rather than the main skill document to handle edge cases like cell color extraction or specific business logic mismatches.
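The structural-edit safety lesson above can be illustrated with a minimal, library-free sketch of why descending-order deletion avoids index-shift corruption; worksheet rows in openpyxl behave analogously to the plain list used here:

```python
def delete_rows_naive(rows, indices):
    # Deleting in ascending order: each deletion shifts all later rows up,
    # so subsequent indices no longer point at the intended rows.
    for i in indices:
        del rows[i]
    return rows

def delete_rows_safe(rows, indices):
    # Distilled lesson: delete in descending order so the indices of
    # not-yet-deleted rows remain valid throughout.
    for i in sorted(indices, reverse=True):
        del rows[i]
    return rows
```

Deleting rows 1 and 3 from five rows, the naive version removes row 4 instead of row 3 once the earlier deletion has shifted the sheet.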
- Model Usage and Processing: The pipeline automatically recovers a hierarchical skill structure from trajectory evidence without manual curation. General procedural guidance flows into the main SKILL.md file, while case-specific rules populate the references directory. This hierarchy mirrors established skill-design practices where universal workflow rules are separated from infrequent edge cases.
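The routing of guidance between SKILL.md and the references directory can be sketched as a support-count split; the threshold and file-naming scheme below are illustrative assumptions, not the paper's actual cutoff:

```python
def route_patches(theme_support, threshold=20):
    """Split distilled themes between the main SKILL.md and reference files.

    theme_support maps a theme name to its citation count across patches;
    the threshold is a hypothetical cutoff for this sketch.
    """
    skill_md, references = [], {}
    for theme, support in sorted(theme_support.items(), key=lambda kv: -kv[1]):
        if support >= threshold:
            skill_md.append(theme)  # general procedural guidance
        else:
            # Niche, low-support quirks get a supplementary reference file.
            path = "references/" + theme.lower().replace(" ", "-") + ".md"
            references[path] = theme
    return skill_md, references
```

This keeps the main skill document compact and universally applicable while still preserving edge-case knowledge for retrieval when a matching situation arises.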
- Patch Generation and Consolidation: Individual error analysts generate structured patches for single trajectories, such as identifying failures where agents delete rows outside specified ranges. These 323 individual patches undergo a four-level hierarchical merging process to produce final consolidated patches that encode robust safety checks and validation steps for row and column operations.
Experiment
- Spreadsheet experiments validate that distilling trajectory-grounded skills significantly outperforms both human-written priors and parametric knowledge alone, with error-driven analysis providing the most reliable improvements across in-distribution and out-of-distribution tasks.
- Math reasoning evaluations confirm that the skill synthesis approach generalizes beyond spreadsheets to competition-level problems, demonstrating domain-agnostic capabilities that transfer effectively across different model scales.
- Visual question answering results reveal a dissociation between task execution and skill authoring, showing that a model's ability to perform well on a benchmark does not guarantee the reflective capacity required to analyze failures and generate transferable skills.
- Comparisons of evolution strategies demonstrate that parallel consolidation of error lessons yields higher quality and greater efficiency than sequential editing by preventing context drift and enabling simultaneous inductive reasoning.
- Benchmarks against retrieval-based memory systems show that distilling observations into a compact skill document is superior to episodic retrieval, as it avoids sensitivity to surface-level query similarity and integrates guidance directly into the system prompt.
- Ablation studies on error analysis methods prove that an agentic loop with artifact access and fix validation produces more transferable patches than single-call LLM analysis, which often misidentifies root causes and hallucinates failure mechanisms.