Trace2Skill: Distilling Trajectory-Local Lessons into Transferable Agent Skills
Jingwei Ni Yihao Liu Xinpeng Liu Yutao Sun Mengyu Zhou Pengyu Cheng Dexin Wang Xiaoxi Jiang Guanjun Jiang
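Abstract
Equipping large language model (LLM) agents with domain-specific skills is essential for solving complex tasks. Manual authoring, however, creates a severe scalability bottleneck. Automated skill generation, in turn, tends to produce fragile or fragmented results, either because it relies on shallow parametric knowledge or because it sequentially overfits to non-generalizable trajectory-local lessons. To address this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: it analyzes broad execution experience holistically and then distills it into a single, comprehensive guide. Instead of reacting to individual trajectories one at a time, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It then extracts trajectory-specific lessons and hierarchically consolidates them, via inductive reasoning, into a unified, conflict-free skill directory. Trace2Skill supports both deepening existing human-written skills and creating new skills from scratch. Experiments in challenging domains such as spreadsheets, VisionQA, and math reasoning show that Trace2Skill substantially improves over strong baselines, including Anthropic's official xlsx skill. Notably, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: the evolved skills transfer across LLM scales and generalize to out-of-distribution (OOD) settings. For example, skills that Qwen3.5-35B evolved from its own trajectories improved the WikiTableQuestions performance of a Qwen3.5-122B agent by up to 57.65 absolute percentage points. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable declarative skills without parameter updates or external retrieval modules, using open-source models as small as 35B parameters.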
One-sentence Summary
Researchers from Alibaba, ETH Zurich, and Peking University introduce Trace2Skill, a framework that parallelizes sub-agent analysis of execution trajectories to distill fragmented lessons into unified, transferable skills, outperforming sequential online updates and retrieval-based baselines across spreadsheet, math, and vision tasks without requiring parameter updates.
Key Contributions
- The paper introduces Trace2Skill, a framework that dispatches a parallel fleet of sub-agents to analyze diverse execution trajectories and hierarchically consolidate trajectory-specific lessons into a unified, conflict-free skill directory via inductive reasoning.
- This work demonstrates that skills evolved through holistic parallel analysis transfer effectively across different LLM scales and generalize to out-of-distribution settings, such as improving a 122B agent by up to 57.65 percentage points using skills generated by a 35B model.
- Experimental results confirm that the proposed parallel consolidation method outperforms both online sequential editing and retrieval-based experience banks while requiring no parameter updates or external retrieval modules.
Introduction
Equipping LLM agents with domain-specific skills is essential for handling complex tasks, yet manual authoring creates a scalability bottleneck, and automated methods often produce fragile results because they rely on shallow parametric knowledge or overfit sequentially to isolated trajectory lessons. Prior approaches typically update skills sequentially as new data arrives or rely on retrieval-based memory banks, which leads to fragmented skill collections and poor generalization across model scales and out-of-distribution settings. The authors introduce Trace2Skill, a framework that mimics human expertise by analyzing a diverse pool of execution trajectories in parallel and distilling trajectory-local lessons into a single, comprehensive, and conflict-free skill directory. This approach uses inductive reasoning to create transferable declarative skills that improve performance across LLM scales and task domains without requiring parameter updates or external retrieval modules.
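The parallel-analysis and hierarchical-consolidation idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `analyze_trajectory` and `merge_group` are hypothetical stand-ins for LLM sub-agents (here they only normalize and deduplicate strings), and the `fan_in` and `workers` parameters are not values taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_trajectory(trajectory: dict) -> list[str]:
    # Hypothetical stand-in for one analyst sub-agent: a real analyst would
    # call a model; here we only normalize the lessons already attached.
    return [lesson.strip().lower() for lesson in trajectory["lessons"]]


def merge_group(group: list[list[str]]) -> list[str]:
    # Hypothetical merge step: union the lessons of one group, dropping
    # duplicates and keeping first-seen order. The paper describes an
    # LLM performing inductive reasoning at this point instead.
    merged: dict[str, None] = {}
    for lessons in group:
        for lesson in lessons:
            merged.setdefault(lesson, None)
    return list(merged)


def consolidate(patches: list[list[str]], fan_in: int = 4) -> tuple[list[str], int]:
    """Merge patches level by level until one remains.

    Returns the final merged patch and the number of merge levels used.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    levels = 0
    while len(patches) > 1:
        patches = [
            merge_group(patches[i:i + fan_in])
            for i in range(0, len(patches), fan_in)
        ]
        levels += 1
    return (patches[0] if patches else []), levels


def build_skill(trajectories: list[dict], fan_in: int = 4, workers: int = 8):
    # Fan out: every trajectory is analyzed independently and in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        patches = list(pool.map(analyze_trajectory, trajectories))
    # Fan in: consolidate all patches into one unified lesson list.
    return consolidate(patches, fan_in)
```

Because each analyst sees only its own trajectory, no lesson can bias the analysis of a later trajectory; conflicts and duplicates are resolved only in the merge step, which is where a real system would need the model's judgment.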
Dataset
- Dataset Composition and Sources: The authors construct a dataset of 323 map patches derived from 122B-parameter model runs on the SpreadsheetBench-Verified benchmark. These patches capture Standard Operating Procedures (SoPs) distilled from agent trajectories, with the four most prevalent themes accounting for the majority of citations.
- Key Subset Details:
  - Formula Recalculation and Verification: 178 patches focus on running recalculation scripts and reopening files with data_only=True to prevent stale cells.
  - Tool Selection: 177 patches advocate using openpyxl for write-back operations instead of pandas DataFrame.to_excel() to preserve formula relationships and named ranges.
  - Explicit Read-back Verification: 138 patches emphasize reopening output files to confirm target cell values before submission.
  - Structural-edit Safety: 53 patches address safe row deletion practices, such as deleting in descending order and copying input workbooks to prevent index-shift corruption.
  - Niche Quirks: Low-support observations are routed into 13 supplementary reference files rather than the main skill document to handle edge cases like cell color extraction or specific business logic mismatches.
- Model Usage and Processing: The pipeline automatically recovers a hierarchical skill structure from trajectory evidence without manual curation. General procedural guidance flows into the main SKILL.md file, while case-specific rules populate the references directory. This hierarchy mirrors established skill-design practices where universal workflow rules are separated from infrequent edge cases.
- Patch Generation and Consolidation: Individual error analysts generate structured patches for single trajectories, such as identifying failures where agents delete rows outside specified ranges. These 323 individual patches undergo a four-level hierarchical merging process to produce final consolidated patches that encode robust safety checks and validation steps for row and column operations.
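The structural-edit safety lessons summarized above (delete in descending order, work on a copy, stay inside the specified range) can be shown with a small dependency-free sketch. It models a sheet as a plain list of rows rather than using openpyxl, and the function names and the `allowed` parameter are hypothetical, chosen only for this illustration.

```python
def delete_rows_ascending_naive(rows, targets):
    # Buggy on purpose: after each deletion the remaining rows shift up,
    # so later 1-based row numbers point at the wrong rows.
    result = [list(row) for row in rows]
    for r in sorted(targets):
        del result[r - 1]
    return result


def delete_rows(rows, targets, allowed):
    """Return a copy of `rows` with the 1-based row numbers in `targets` removed.

    `allowed` is an inclusive (first, last) range of rows the task permits
    editing; any target outside it is rejected before anything is deleted.
    """
    first, last = allowed
    if not 1 <= first <= last <= len(rows):
        raise ValueError(f"allowed range {allowed} does not fit the sheet")
    out_of_range = sorted(r for r in set(targets) if not first <= r <= last)
    if out_of_range:
        raise ValueError(f"rows outside allowed range: {out_of_range}")
    result = [list(row) for row in rows]  # copy, so the input stays untouched
    for r in sorted(set(targets), reverse=True):
        # Descending order: deleting a row never shifts a row still to be deleted.
        del result[r - 1]
    return result


sheet = [["r1"], ["r2"], ["r3"], ["r4"], ["r5"]]
print(delete_rows_ascending_naive(sheet, [2, 3]))  # [['r1'], ['r3'], ['r5']] (r4 lost, r3 kept)
print(delete_rows(sheet, [2, 3], allowed=(1, 5)))  # [['r1'], ['r4'], ['r5']]
```

The same ordering concern applies when deleting rows one at a time in a real workbook, for example with openpyxl's `Worksheet.delete_rows`; the list model is used here only to keep the example self-contained.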
Experiment
- Spreadsheet experiments validate that distilling trajectory-grounded skills significantly outperforms both human-written priors and parametric knowledge alone, with error-driven analysis providing the most reliable improvements across in-distribution and out-of-distribution tasks.
- Math reasoning evaluations confirm that the skill synthesis approach generalizes beyond spreadsheets to competition-level problems, demonstrating domain-agnostic capabilities that transfer effectively across different model scales.
- Visual question answering results reveal a dissociation between task execution and skill authoring, showing that a model's ability to perform well on a benchmark does not guarantee the reflective capacity required to analyze failures and generate transferable skills.
- Comparisons of evolution strategies demonstrate that parallel consolidation of error lessons yields higher quality and greater efficiency than sequential editing by preventing context drift and enabling simultaneous inductive reasoning.
- Benchmarks against retrieval-based memory systems show that distilling observations into a compact skill document is superior to episodic retrieval, as it avoids sensitivity to surface-level query similarity and integrates guidance directly into the system prompt.
- Ablation studies on error analysis methods show that an agentic loop with artifact access and fix validation produces more transferable patches than single-call LLM analysis, which often misidentifies root causes and hallucinates failure mechanisms.
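The sensitivity of episodic retrieval to surface-level query similarity, noted in the comparison against memory systems above, can be illustrated with a toy sketch. It is not the retrieval baseline used in the paper: the token-overlap (Jaccard) score, the `threshold` value, and all function names are assumptions chosen to keep the example small, and real memory systems typically use learned embeddings instead.

```python
def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    # Token-set overlap: |A & B| / |A | B|, a purely surface-level similarity.
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, bank: list[tuple[str, str]], threshold: float = 0.3) -> list[str]:
    """Return lessons whose stored task description overlaps the query enough."""
    return [lesson for task, lesson in bank if jaccard(query, task) >= threshold]


def with_skill(system_prompt: str, skill_doc: str) -> str:
    # Distilled guidance is always present, independent of how the task is phrased.
    return f"{system_prompt}\n\n{skill_doc}"


bank = [("delete empty rows in the sales sheet", "Delete rows in descending order.")]

# Same wording as the stored episode: the lesson is retrieved.
print(retrieve("delete empty rows in the sales sheet", bank))
# Same intent, different wording: only "the" overlaps, so nothing is retrieved.
print(retrieve("remove blank lines from the revenue tab", bank))
# The skill document reaches the agent either way.
print(with_skill("You are a spreadsheet agent.", "Delete rows in descending order."))
```

The second query asks for the same operation as the stored episode, yet the lesson is missed because the wording differs; a skill document placed in the system prompt does not depend on that match.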