In-Context Reinforcement Learning for Tool Use in Large Language Models
Yaoqi Ye Yiran Zhao Keyu Duan Zeyu Zheng Kenji Kawaguchi Cihang Xie Michael Qizhe Shieh
Abstract
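Large language models (LLMs) exhibit strong reasoning capabilities, but their performance on complex tasks is often constrained by the limits of their internal knowledge. A promising way to overcome this limitation is to augment models with external tools, such as Python interpreters for numerical computation or search engines for factual retrieval. However, enabling models to use these tools effectively remains a significant challenge. Existing methods typically rely on cold-start pipelines that begin with supervised fine-tuning (SFT) and then apply reinforcement learning (RL). These approaches require substantial amounts of labeled data for SFT, which is expensive to annotate or difficult to synthesize. In this work, we propose In-Context Reinforcement Learning (ICRL), an RL-only framework that learns to call external tools without SFT by leveraging few-shot prompting during the rollout stage of RL. Specifically, ICRL includes in-context examples in the rollout prompt to teach the model how to invoke external tools. Furthermore, as training progresses, the number of in-context examples is gradually reduced until a zero-shot setting is reached, so that the model acquires the ability to call tools on its own. We conduct extensive experiments across a range of reasoning and tool-use benchmarks. The results show that ICRL achieves state-of-the-art performance, demonstrating that it is a scalable and data-efficient alternative to traditional SFT-based pipelines.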
One-sentence Summary
Researchers from Yale, Stanford, and other institutions propose In-Context Reinforcement Learning (ICRL), an RL-only framework that places few-shot tool-use demonstrations in the rollout prompts and gradually removes them as training progresses, so that large language models learn to call external tools on their own without supervised fine-tuning.
Key Contributions
- Large language models often struggle with complex tasks due to limited internal knowledge, and existing tool-use training methods rely on expensive supervised fine-tuning to provide initial guidance.
- This work introduces In-Context Reinforcement Learning, an RL-only framework that uses few-shot prompting during rollouts to teach tool invocation without any supervised fine-tuning.
- Experiments on benchmarks like TriviaQA and AIME2024 show that this approach achieves state-of-the-art performance with significant accuracy gains over strong baselines while eliminating the need for costly labeled data.
Introduction
Large language models often struggle with complex tasks due to their reliance on static pretraining knowledge, making the integration of external tools like code interpreters and search engines essential for real-world applications. Current methods typically depend on a cold-start pipeline that combines supervised fine-tuning with reinforcement learning, a process that demands expensive and labor-intensive labeled data to teach models how to invoke tools effectively. The authors introduce In-Context Reinforcement Learning, an RL-only framework that eliminates the need for supervised fine-tuning by embedding few-shot demonstrations directly into the rollout prompts. This approach uses a curriculum that gradually reduces these examples to guide the model from imitation to autonomous tool use, achieving state-of-the-art performance while significantly improving data efficiency.
Method
The authors formalize tool use in Large Language Models (LLMs) as a Markov Decision Process (MDP). Given a query $q$ and an external tool $T$, the model generates a response $y$ where each token is conditioned on the query, previous tokens, and a history of interactions $H_t$. This interaction history includes the model's actions, such as internal reasoning, issuing search queries, or providing a final answer, as well as the observations returned by the tool. The conditional distribution is defined as:
$$\pi_\theta(y \mid q, T) = \prod_{t=1}^{|y|} \pi_\theta(y_t \mid y_{<t}, q, H_t)$$
To optimize this policy, the authors employ a reinforcement learning objective that maximizes the expected reward while constraining the divergence from a reference model $\pi_{\mathrm{ref}}$. They specifically adopt Group Relative Policy Optimization (GRPO) to train the policy $\pi_\theta$. A critical component of this training is a loss masking strategy. Since the rollout sequence includes retrieved content from external tools which the model did not generate, these tokens are excluded from the loss computation. This ensures that the policy gradient updates focus solely on the model's own decisions, such as tool invocation and reasoning steps, rather than the fixed content retrieved by the tool.
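The loss masking and group-relative scoring described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Segment` type, the whitespace tokenizer used in the usage note, and the choice of population standard deviation with a small `eps` are assumptions made for illustration. The only ideas taken from the text are that tool-returned tokens are excluded from the loss and that GRPO scores each rollout relative to the other rollouts sampled for the same query.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


@dataclass
class Segment:
    """One piece of a rollout (hypothetical helper type, not from the paper)."""
    text: str
    from_model: bool  # False for content returned by the external tool


def build_loss_mask(
    segments: Sequence[Segment],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Return the rollout tokens and a 0/1 mask; tool tokens get 0."""
    tokens: List[str] = []
    mask: List[int] = []
    for seg in segments:
        seg_tokens = tokenize(seg.text)
        tokens.extend(seg_tokens)
        mask.extend([1 if seg.from_model else 0] * len(seg_tokens))
    return tokens, mask


def masked_mean(values: Sequence[float], mask: Sequence[int]) -> float:
    """Average per-token values over model-generated tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / kept


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize each reward against its own rollout group.

    Assumption: population standard deviation plus eps in the denominator.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, with `str.split` standing in for a real tokenizer, a rollout made of a model-written reasoning segment, a tool-returned segment, and a model-written answer segment yields a mask that is 1 on the first and last segments and 0 on the tool output, so `masked_mean` ignores whatever per-token loss values fall on the retrieved text.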
The core innovation of the proposed method, In-Context Reinforcement Learning (ICRL), lies in its training curriculum. Rather than training from scratch or relying solely on static few-shot prompting, ICRL integrates the inductive bias of few-shot learning with the exploration capabilities of RL. The process begins by incorporating a small number of tool-use demonstrations into the rollout template to guide the model. As training progresses and the model acquires tool-use capabilities, the number of demonstration examples in the prompt is iteratively reduced. This transition allows the model to move from imitation to autonomous tool use.
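The curriculum can be pictured as a step-indexed schedule that decides how many demonstrations go into each rollout prompt. The sketch below is a hedged illustration only: the stage boundaries, the demonstration counts in `EXAMPLE_SCHEDULE`, the prompt layout, and the function names are all assumptions, since the text specifies only that the number of demonstrations shrinks during training until the zero-shot setting is reached.

```python
from typing import List, Sequence, Tuple

# (first training step of the stage, number of demonstrations in the prompt).
# These values are illustrative placeholders, not the paper's settings.
EXAMPLE_SCHEDULE: List[Tuple[int, int]] = [(0, 2), (100, 1), (200, 0)]


def demos_for_step(
    step: int, schedule: Sequence[Tuple[int, int]] = EXAMPLE_SCHEDULE
) -> int:
    """Return how many demonstrations to include at a given training step.

    Assumes the schedule is sorted by its starting step.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    count = schedule[0][1]
    for start, num_demos in schedule:
        if step >= start:
            count = num_demos
    return count


def build_rollout_prompt(
    instruction: str,
    demos: Sequence[str],
    query: str,
    step: int,
    schedule: Sequence[Tuple[int, int]] = EXAMPLE_SCHEDULE,
) -> str:
    """Assemble the rollout prompt with the scheduled number of demos."""
    k = min(demos_for_step(step, schedule), len(demos))
    parts = [instruction, *demos[:k], f"Question: {query}"]
    return "\n\n".join(parts)
```

With this placeholder schedule, early rollouts see two demonstrations, later ones see one, and from step 200 onward the prompt contains only the instruction and the question, which corresponds to the zero-shot end state described above.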
To provide a robust learning signal during this process, the authors design a composite reward function that combines answer accuracy and format correctness. The accuracy reward is based on exact match with the ground truth, while the format reward penalizes violations of the expected structured output, such as incorrect XML tags. The total reward is a weighted sum of these two components, guiding the model to produce both correct answers and valid tool-use sequences.
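A composite reward of this kind might look like the following sketch. The tag names (`think`, `search`, `answer`), the whitespace and case normalization applied before exact match, the 0/1 scoring of the format component, and the weights `w_acc` and `w_fmt` are assumptions chosen for illustration; the text states only that accuracy is judged by exact match, that format violations such as incorrect XML tags are penalized, and that the two components are combined as a weighted sum.

```python
import re
from typing import Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
TAGS = ("think", "search", "answer")  # hypothetical tag set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing (assumed rule)."""
    return " ".join(text.lower().split())


def extract_answer(response: str) -> Optional[str]:
    """Return the content of the last <answer> block, if any."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].strip() if matches else None


def format_reward(response: str) -> float:
    """1.0 if every tag is balanced and exactly one answer exists, else 0.0."""
    for tag in TAGS:
        if response.count(f"<{tag}>") != response.count(f"</{tag}>"):
            return 0.0
    return 1.0 if len(ANSWER_RE.findall(response)) == 1 else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """1.0 on exact match with the ground truth after normalization."""
    pred = extract_answer(response)
    if pred is None:
        return 0.0
    return 1.0 if normalize(pred) == normalize(gold) else 0.0


def total_reward(
    response: str, gold: str, w_acc: float = 1.0, w_fmt: float = 0.2
) -> float:
    """Weighted sum of accuracy and format terms (weights are placeholders)."""
    return w_acc * accuracy_reward(response, gold) + w_fmt * format_reward(response)
```

Under these assumptions, a well-formed response with the right answer earns both terms, a correct answer wrapped in unbalanced tags earns only the accuracy term, and a well-formed but wrong answer earns only the format term.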
Experiment
- Main experiments compare ICRL against direct prompting, retrieval-based, and fine-tuning baselines across diverse QA benchmarks, validating that ICRL achieves state-of-the-art performance in complex reasoning and multi-hop tasks without requiring supervised fine-tuning or labeled tool traces.
- Ablation studies on curriculum design demonstrate that a simpler three-stage rollout schedule yields superior accuracy compared to aggressive reduction strategies, which cause premature stopping and weaken multi-turn reasoning capabilities.
- Scaling experiments confirm that ICRL effectively leverages larger model capacities, with the 14B variant significantly outperforming prompting and chain-of-thought methods while maintaining data efficiency.
- Generalization tests on code-writing and math problem-solving tasks show that ICRL successfully transfers to new tool domains, offering a more data-efficient alternative to methods relying on costly cold-start supervised fine-tuning.
- Training process analysis reveals that the model learns to internalize structured tool-use behaviors and increase valid tool calls over time, even when trained solely on sparse rewards for format validity and final answer accuracy.