HyperAIHyperAI

Command Palette

Search for a command to run...

LLM-in-Sandbox를 통한 일반적인 에이전트 지능 유도

Daixuan Cheng Shaohan Huang Yuxian Gu Huatong Song Guoxin Chen Li Dong Wayne Xin Zhao Ji-Rong Wen Furu Wei

초록

우리는 LLM-in-Sandbox를 소개한다. 이는 대규모 언어 모델(LLM)이 코드 사안트(가상 컴퓨터) 내에서 탐색할 수 있도록 함으로써, 코드 외 영역에서도 일반 지능을 발현할 수 있도록 한다. 먼저, 추가 학습 없이도 강력한 LLM이 코드 사안트를 활용해 비코드 작업에 대해 일반화 능력을 보임을 실험적으로 입증한다. 예를 들어, LLM은 외부 자원에 자발적으로 접근하여 새로운 지식을 획득하고, 파일 시스템을 활용해 장문의 맥락을 처리하며, 서식 요구사항을 충족하기 위해 스크립트를 실행하는 등 자율적인 행동을 보인다. 또한, LLM-in-Sandbox 강화학습(LLM-in-Sandbox-RL)을 통해 이러한 에이전트적 능력을 향상시킬 수 있음을 보여주며, 이는 비에이전트 데이터만을 사용해 사안트 탐색을 위한 모델을 훈련하는 방식이다. 실험 결과, LLM-in-Sandbox는 학습 없이도 사후 훈련을 거친 경우에도 수학, 물리학, 화학, 생명의학, 장문 맥락 이해, 지시 따르기 등 다양한 분야에서 견고한 일반화 성능을 달성함을 확인했다. 마지막으로, 계산적 및 시스템적 관점에서 LLM-in-Sandbox의 효율성을 분석하고, 실세계 적용을 촉진하기 위해 파이썬 패키지 형태로 오픈소스로 공개한다.

One-sentence Summary

Researchers from Renmin University, Microsoft Research, and Tsinghua University propose LLM-in-Sandbox, enabling LLMs to autonomously use code sandboxes for non-code tasks—enhancing reasoning via RL without agentic data—yielding robust gains in science, long-context, and instruction-following tasks, with open-source deployment support.

Key Contributions

  • LLM-in-Sandbox enables large language models to leverage a virtual code sandbox for non-code tasks without additional training, demonstrating spontaneous use of file systems, external resources, and script execution to solve problems in domains like math, science, and long-context understanding.
  • The authors introduce LLM-in-Sandbox Reinforcement Learning, which trains models using only non-agenetic data to improve sandbox exploration, boosting performance across both agentic and vanilla modes while generalizing to out-of-domain tasks.
  • The system achieves up to 8x reduction in token usage for long contexts, maintains competitive throughput, and is open-sourced as a Python package compatible with major inference backends to support real-world deployment.

Introduction

The authors leverage a code sandbox—a virtual computer environment—to unlock general agentic intelligence in large language models (LLMs), enabling them to handle non-code tasks like math, science, and long-context reasoning without task-specific training. Prior approaches often treat LLMs as static text generators or limit tool use to predefined APIs, failing to exploit the full versatility of computational environments. The authors’ main contribution is LLM-in-Sandbox, which grants models open-ended access to external resources, file systems, and code execution, plus LLM-in-Sandbox-RL, a reinforcement learning method that trains models using only general non-agenetic data to master sandbox exploration—boosting performance across domains while reducing token usage and enabling real-world deployment via an open-source Python package.

Dataset

The authors use six diverse non-code evaluation benchmarks, each tailored for specific domains and evaluation methods:

  • Mathematics: 30 AIME25 problems (olympiad-level), repeated 16 times each. Prompt includes step-by-step reasoning and boxed final answers. Evaluated with Math-Verify (HuggingFace, 2025).

  • Physics: 650 UGPhysics problems (50 per of 13 undergraduate subjects). Evaluated via LLM judge (Qwen3-30B-A3B-Instruct-2507, Yang et al., 2025a).

  • Chemistry: 450 ChemBench questions (50 per of 9 sub-domains). Evaluated by exact match.

  • Biomedicine: 500 text-based MedXpertQA multiple-choice questions. Evaluated by exact match.

  • Long-Context Understanding: 100 AA-LCR questions requiring multi-document reasoning (avg. 100K tokens per doc set). Each problem repeated 4 times. Documents stored in sandbox at /testbed/documents/. Evaluated with LLM-based checker (Qwen3-235B-A22B-Instruct-2507, Yang et al., 2025a).

  • Instruction Following: 300 single-turn IFBench questions across 58 constraints. Evaluated in loose mode using official code to handle formatting variations.

  • Software Engineering: 500 verified SWE-bench tasks (code generation, debugging, comprehension). Evaluated with official rule-based script. Uses R2E-Gym sandbox and OpenHands-style prompts with toolset from Section 2.2.

For sandbox capability analysis, the authors classify model actions using pattern matching on Python and bash code blocks. Capability usage rate = (turns with matched patterns) / (total turns). Classification rules are detailed in Table 15.

Method

The authors leverage a code sandbox environment to enable large language models (LLMs) to perform general-purpose tasks through interactive, multi-turn reasoning and action cycles. The framework is built upon a lightweight, general-purpose sandbox implemented as an Ubuntu-based Docker container, which provides the model with terminal access and full system capabilities while ensuring isolation from the host system. This environment is equipped with a minimal toolset designed to realize core computational functions: executeBASH for executing arbitrary terminal commands, str_replace_editor for file manipulation, and submit for signaling task completion. The sandbox's design emphasizes generalizability and scalability by maintaining a constant footprint of approximately 1.1 GB, in contrast to task-specific environments that may require up to 6 TB of storage, enabling efficient large-scale inference and training.

The LLM-in-Sandbox workflow extends the ReAct framework, where the model iteratively generates tool calls based on the task prompt and environmental feedback. As shown in Algorithm 1, the process begins with sandbox configuration, followed by a loop in which the model generates a tool call, executes it within the sandbox, and receives an observation. This interaction continues until the model calls submit or reaches a maximum turn limit. The workflow supports flexible input/output handling by leveraging the sandbox's file system: inputs can be provided via the prompt or as files in the /testbed/documents/ directory, while outputs are directed to a designated location, such as /testbed/answer.txt, ensuring a clean separation between exploration and final results.

To train LLMs to effectively utilize the sandbox, the authors propose LLM-in-Sandbox-RL, a reinforcement learning approach that combines sandbox-based training with general-domain data. The training data consists of context-based tasks sourced from diverse domains, including encyclopedia, fiction, academic tests, and social media, where each instance includes background materials and a task objective. The sandbox configuration stores these contexts as files, with strategies to increase task difficulty, such as splitting multi-document contexts into separate files or supplementing single-file contexts with distractors. This setup encourages the model to actively explore the environment to locate relevant information, naturally learning to leverage its capabilities.

The task setup involves sampling one task as the testing task, using prior related tasks as in-context examples in the prompt. The model is instructed to write the final answer to a designated output file, such as /testbed/answer.txt, and the result is extracted upon completion. The RL training framework adopts outcome-based rewards, with the key difference being that trajectory generation occurs in LLM-in-Sandbox mode, allowing the model to learn from sandbox interactions. The system prompt guides the model to use computational tools, derive answers through program execution, and explore freely within the safe, isolated environment. The prompt specifies the working directory, input and output paths, and includes important notes and encouraged approaches to ensure proper interaction with the sandbox tools.

Experiment

  • LLM-in-Sandbox improves performance across general domains (Math, Physics, Chemistry, Biomedicine, Long-Context, Instruction Following), with gains up to +24.2% on Qwen3-Coder for Math; weaker models like Qwen3-4B-Instruct show no benefit or decline.
  • Strong models effectively use sandbox capabilities: computation (43.4% in Math), external resources (18.4% in Chemistry), and file management (high in Long-Context); weak models exhibit low usage (<3%) and inefficient exploration.
  • Storing long-context documents in sandbox (vs. prompt) boosts performance for Claude, DeepSeek, and Kimi; Qwen3-4B performs worse, indicating need for training in file-based navigation.
  • LLM-in-Sandbox-RL training generalizes across domains, model sizes, and inference modes: Qwen3-4B improves from 10.0 to 14.4 in Biomed and 33.7 to 37.7 in Instruction Following; even LLM mode benefits due to transferred reasoning patterns.
  • Training with context placed in sandbox (vs. prompt) yields superior generalization, as active exploration enhances capability utilization.
  • Post-training, models show increased sandbox usage (external resources, file ops, computation) and reduced turns (Qwen3-4B from 23.7 to 7.0), while vanilla LLM outputs gain structural organization and verification behaviors.
  • Computationally, LLM-in-Sandbox reduces token usage by up to 8x in long-context tasks and averages 0.5–0.8x total tokens vs. LLM mode; environment tokens (37–51%) are fast to process, enabling competitive throughput (up to 2.2x speedup for MiniMax).
  • Infrastructure is lightweight: single ~1.1GB Docker image, ~50MB idle / ~200MB peak memory per sandbox; 512 concurrent sandboxes consume only 5% of 2TB system RAM.
  • LLM-in-Sandbox enables non-text outputs: generates interactive maps (.html), posters (.png/.svg), videos (.mp4), and music (.wav/.mid) via autonomous tool installation and execution, demonstrating cross-modal and file-level capabilities beyond text generation.

Results show that LLM-in-Sandbox significantly improves performance across most domains compared to vanilla LLM generation, with the largest gains observed in Mathematics (+24.2%) and Instruction Following (+14.4%). However, the model Qwen3-4B-Instruct performs worse in most domains, indicating that weaker models fail to effectively leverage the sandbox environment.

The authors use different inference configurations for each model, with variations in temperature, top-p, min-p, top-k, repetition penalty, and backend. The configurations are tailored to each model's specifications, with some models using API-based backends and others using SGLang or vLLM, and the maximum generation length per turn is capped at 65,536 tokens except for Claude-Sonnet-4.5-Think, which is limited to 64,000 tokens due to API constraints.

Results show that LLM-in-Sandbox consistently improves performance across most domains compared to vanilla LLM generation, with the largest gains observed in Mathematics and Long-Context tasks. The improvements are particularly significant for strong models, while weaker models like Qwen3-4B-Instruct show little to no benefit and sometimes perform worse.

Results show that strong agentic models consistently benefit from LLM-in-Sandbox across all evaluated domains, with performance improvements ranging from +1.0% to +24.2% compared to vanilla LLM generation. However, weaker models like Qwen3-4B-Instruct-2507 show significant declines in performance, particularly in Long-Context and Chemistry tasks, indicating that effective sandbox utilization depends on the model's inherent capabilities.

Results show that LLM-in-Sandbox-RL training significantly improves reasoning patterns in both strong and weak models, increasing the use of verification and structural organization in vanilla LLM mode outputs. The Qwen3-Coder-30B-A3B model sees a modest gain in verification (0.77 to 0.88) and a larger increase in structure (10.30 to 16.12), while the Qwen3-4B-Instruct-2507 model shows substantial improvements in both metrics, with verification rising from 20.22 to 36.91 and structure from 19.13 to 20.64.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
LLM-in-Sandbox를 통한 일반적인 에이전트 지능 유도 | 문서 | HyperAI초신경