Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, Guanjun Jiang

Abstract

Equipping Large Language Model (LLM) agents with domain-specific skills is critical for tackling complex tasks. Yet, manual authoring creates a severe scalability bottleneck. Conversely, automated skill generation often yields fragile or fragmented results because it either relies on shallow parametric knowledge or sequentially overfits to non-generalizable trajectory-local lessons. To overcome this, we introduce Trace2Skill, a framework that mirrors how human experts author skills: by holistically analyzing broad execution experience before distilling it into a single, comprehensive guide. Instead of reacting sequentially to individual trajectories, Trace2Skill dispatches a parallel fleet of sub-agents to analyze a diverse pool of executions. It extracts trajectory-specific lessons and hierarchically consolidates them into a unified, conflict-free skill directory via inductive reasoning. Trace2Skill supports both deepening existing human-written skills and creating new ones from scratch. Experiments in challenging domains, such as spreadsheet manipulation, VisionQA, and math reasoning, show that Trace2Skill significantly improves upon strong baselines, including Anthropic's official xlsx skills. Crucially, this trajectory-grounded evolution does not merely memorize task instances or model-specific quirks: evolved skills transfer across LLM scales and generalize to OOD settings. For example, skills evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 absolute percentage points on WikiTableQuestions. Ultimately, our results demonstrate that complex agent experience can be packaged into highly transferable, declarative skills -- requiring no parameter updates, no external retrieval modules, and using open-source models as small as 35B parameters.

One-sentence Summary

Researchers from Alibaba, ETH Zurich, and Peking University introduce Trace2Skill, a framework that parallelizes sub-agent analysis of execution trajectories to distill fragmented lessons into unified, transferable skills, outperforming sequential online updates and retrieval-based baselines across spreadsheet, math, and vision tasks without requiring parameter updates.

Key Contributions

  • The paper introduces Trace2Skill, a framework that dispatches a parallel fleet of sub-agents to analyze diverse execution trajectories and hierarchically consolidate trajectory-specific lessons into a unified, conflict-free skill directory via inductive reasoning.
  • This work demonstrates that skills evolved through holistic parallel analysis transfer effectively across different LLM scales and generalize to out-of-distribution settings, such as improving a 122B agent by up to 57.65 percentage points using skills generated by a 35B model.
  • Experimental results confirm that the proposed parallel consolidation method outperforms both online sequential editing and retrieval-based experience banks while requiring no parameter updates or external retrieval modules.
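The contributions above describe a map-then-consolidate flow: sub-agents analyze trajectories in parallel, and their lessons are merged hierarchically into one skill directory. A minimal sketch of that flow in Python follows; all function names, data shapes, and the toy merge rule are our own illustrative assumptions, not the authors' actual interface.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_trajectory(trajectory):
    """Sub-agent step (assumed): extract a trajectory-local lesson."""
    return {"lesson": f"avoid `{trajectory['failure']}`", "support": 1}

def merge_lessons(a, b):
    """Consolidation step (assumed): combine two lessons, summing support."""
    if a["lesson"] == b["lesson"]:
        return {"lesson": a["lesson"], "support": a["support"] + b["support"]}
    # This toy version simply concatenates distinct lessons; the paper
    # instead resolves conflicts via inductive reasoning.
    return {"lesson": a["lesson"] + "; " + b["lesson"],
            "support": a["support"] + b["support"]}

def trace2skill(trajectories):
    # 1) Parallel fleet: each sub-agent analyzes one trajectory.
    with ThreadPoolExecutor() as pool:
        lessons = list(pool.map(analyze_trajectory, trajectories))
    # 2) Hierarchical consolidation: pairwise merge until one remains.
    while len(lessons) > 1:
        pairs = [lessons[i:i + 2] for i in range(0, len(lessons), 2)]
        lessons = [merge_lessons(*p) if len(p) == 2 else p[0] for p in pairs]
    return lessons[0]

skill = trace2skill([{"failure": "stale cells"},
                     {"failure": "stale cells"},
                     {"failure": "index shift"}])
```

The key structural point the sketch captures is that no lesson ever edits the full skill document in sequence; each one enters through a shallow merge tree, which is what the paper credits for avoiding context drift.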

Introduction

Equipping LLM agents with domain-specific skills is essential for handling complex tasks, yet manual authoring creates a scalability bottleneck, while automated methods often produce fragile results due to reliance on shallow parametric knowledge or sequential overfitting to isolated trajectory lessons. Prior approaches typically update skills sequentially as new data arrives or rely on retrieval-based memory banks, which leads to fragmented skill collections and poor generalization across different model scales or out-of-distribution settings. The authors introduce Trace2Skill, a framework that mimics human expertise by analyzing a diverse pool of execution trajectories in parallel to distill trajectory-local lessons into a single, comprehensive, and conflict-free skill directory. This approach leverages inductive reasoning to create transferable declarative skills that improve performance across varying LLM scales and task domains without requiring parameter updates or external retrieval modules.

Dataset

  • Dataset Composition and Sources: The authors construct a dataset of 323 map patches derived from runs of the 122B-parameter model on the SpreadsheetBench-Verified benchmark. These patches capture Standard Operating Procedures (SoPs) distilled from agent trajectories, with the four most prevalent themes accounting for the majority of citations.

  • Key Subset Details:

    • Formula Recalculation and Verification: 178 patches focus on running recalculation scripts and reopening files with data_only=True to avoid reading stale cached cell values.
    • Tool Selection: 177 patches advocate using openpyxl for write-back operations instead of pandas.to_excel() to preserve formula relationships and named ranges.
    • Explicit Read-back Verification: 138 patches emphasize reopening output files to confirm target cell values before submission.
    • Structural-edit Safety: 53 patches address safe row deletion practices, such as deleting in descending order and copying input workbooks to prevent index-shift corruption.
    • Niche Quirks: Low-support observations are routed into 13 supplementary reference files rather than the main skill document to handle edge cases like cell color extraction or specific business logic mismatches.
  • Model Usage and Processing: The pipeline automatically recovers a hierarchical skill structure from trajectory evidence without manual curation. General procedural guidance flows into the main SKILL.md file, while case-specific rules populate the references directory. This hierarchy mirrors established skill-design practices where universal workflow rules are separated from infrequent edge cases.

  • Patch Generation and Consolidation: Individual error analysts generate structured patches for single trajectories, such as identifying failures where agents delete rows outside specified ranges. These 323 individual patches undergo a four-level hierarchical merging process to produce final consolidated patches that encode robust safety checks and validation steps for row and column operations.
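The structural-edit safety lesson above (delete rows in descending order, and work on a copy of the workbook) can be illustrated without any spreadsheet library. In this sketch a plain Python list stands in for worksheet rows; the same index-shift hazard applies when calling openpyxl's delete_rows repeatedly by index.

```python
def delete_rows(rows, indices, descending=True):
    """Delete the rows at the given 0-based indices.

    Deleting in ascending order shifts every later index after each
    deletion, so the wrong rows get removed; descending order is safe.
    """
    rows = list(rows)  # copy the "workbook" first, as the patches advise
    for i in sorted(indices, reverse=descending):
        del rows[i]
    return rows

rows = ["r0", "r1", "r2", "r3", "r4"]
safe = delete_rows(rows, [1, 3], descending=True)    # removes r1 and r3
unsafe = delete_rows(rows, [1, 3], descending=False) # index 3 now points at r4
```

Here `safe` is `['r0', 'r2', 'r4']` as intended, while the ascending-order variant yields `['r0', 'r2', 'r3']`: after r1 is removed, index 3 no longer refers to r3, so a row outside the specified range is deleted, exactly the failure mode the error analysts flag.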

Experiment

  • Spreadsheet experiments validate that distilling trajectory-grounded skills significantly outperforms both human-written priors and parametric knowledge alone, with error-driven analysis providing the most reliable improvements across in-distribution and out-of-distribution tasks.
  • Math reasoning evaluations confirm that the skill synthesis approach generalizes beyond spreadsheets to competition-level problems, demonstrating domain-agnostic capabilities that transfer effectively across different model scales.
  • Visual question answering results reveal a dissociation between task execution and skill authoring, showing that a model's ability to perform well on a benchmark does not guarantee the reflective capacity required to analyze failures and generate transferable skills.
  • Comparisons of evolution strategies demonstrate that parallel consolidation of error lessons yields higher quality and greater efficiency than sequential editing by preventing context drift and enabling simultaneous inductive reasoning.
  • Benchmarks against retrieval-based memory systems show that distilling observations into a compact skill document is superior to episodic retrieval, as it avoids sensitivity to surface-level query similarity and integrates guidance directly into the system prompt.
  • Ablation studies on error analysis methods show that an agentic loop with artifact access and fix validation produces more transferable patches than single-call LLM analysis, which often misidentifies root causes and hallucinates failure mechanisms.
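The context-drift argument in the parallel-versus-sequential comparison above can be made concrete with a back-of-the-envelope count of how many dependent rewrites separate any single lesson from the final skill document under each strategy. These counts are our own illustration, not numbers reported in the paper.

```python
import math

def sequential_depth(n):
    """Sequential editing: the skill document is rewritten once per new
    lesson, so the final version has passed through n - 1 dependent edits,
    and drift can compound along that whole chain."""
    return n - 1

def parallel_depth(n, fanout=2):
    """Hierarchical consolidation: any lesson passes through only about
    log_fanout(n) merge steps before reaching the final directory."""
    return math.ceil(math.log(n, fanout)) if n > 1 else 0

n = 323  # number of patches in the spreadsheet study
seq, par = sequential_depth(n), parallel_depth(n)
```

For the 323 patches of the spreadsheet study, a lesson survives 322 dependent rewrites under sequential editing versus roughly 9 merge levels under binary consolidation, which is one way to see why the merge tree both bounds drift and parallelizes cleanly.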
