4달 전

Chenglei Si Zitong Yang Yejin Choi Emmanuel Candès Diyi Yang Tatsunori Hashimoto

초록

자동화된 AI 연구는 과학적 발견을 가속화할 잠재력을 지닌다. 그러나 현재의 대규모 언어 모델(LLM)은 신뢰할 만해 보이지만 실질적인 효과가 없는 아이디어를 자주 생성한다. 실행 기반(Execution grounding) 접근법은 이를 해결할 수 있을 것으로 기대되지만, 자동 실행이 실제로 가능할지, 그리고 LLM이 실행 피드백을 통해 학습할 수 있을지는 명확하지 않다. 이러한 문제를 탐구하기 위해 우리는 먼저 아이디어를 구현하고, 대규모 병렬 GPU 실험을 통해 그 효과성을 검증할 수 있는 자동 실행기(Automated Executor)를 구축하였다. 이후 실제 연구 과제인 LLM 사전학습과 사후학습을 실행 환경으로 변환하여, 본 연구의 자동 실행기가 전방위적인 LLM에서 샘플링한 아이디어의 대부분을 성공적으로 구현할 수 있음을 입증하였다. 실행 피드백을 통해 학습하는 두 가지 방법—진화 탐색(evolutionary search)과 강화학습(reinforcement learning)—을 분석하였다. 실행 유도형 진화 탐색은 샘플 효율성이 뛰어나며, 사후학습에서 GRPO 기준 대비 유의미하게 우수한 성능(69.4% 대 48.0%)을 달성하였고, 사전학습에서는 nanoGPT 기준보다 뛰어난 사전학습 레시피를 발견하여 학습 시간을 19.7분에서 35.9분으로 단축하는 성과를 거두었다. 이 모든 결과는 단 10회의 탐색 에포크 내에서 달성되었다. 전방위적 LLM은 탐색 과정에서 의미 있는 알고리즘 아이디어를 자주 생성하지만, 초기에 포화 상태에 이르며 일반적으로 스케일링 추세를 보이지 않는다. 반면, 실행 보상에서 강화학습을 수행하는 경우 모드 붕괴(mode collapse) 문제가 발생한다. 아이디어 생성 모델의 평균 보상은 향상되지만, 상한값(upper-bound)은 개선되지 않는데, 이는 모델들이 단순한 아이디어로 수렴하기 때문이다. 본 연구는 실행된 아이디어와 학습 동역학을 철저히 분석함으로써, 향후 실행 기반 자동화 AI 연구의 발전을 위한 기초를 마련하였다.

One-sentence Summary

Chenglei Si, Zitong Yang, and colleagues from Stanford propose an automated executor for AI research that tests LLM-generated ideas via GPU experiments, using evolutionary search to efficiently outperform baselines in LLM pre- and post-training, while revealing limitations in reinforcement learning and early saturation of frontier models.

Key Contributions

We introduce a scalable automated executor that implements and evaluates LLM-generated research ideas for open-ended problems like LLM pre-training and post-training, achieving over 90% execution success with frontier models such as Claude-4.5-Opus.
Execution-guided evolutionary search proves sample-efficient, discovering post-training and pre-training recipes that significantly outperform baselines (69.4% vs 48.0% and 19.7 vs 35.9 minutes) within ten epochs, though scaling trends remain limited for most models.
Reinforcement learning from execution reward improves average idea quality but suffers from mode collapse, converging to simple, low-diversity ideas and failing to enhance the upper-bound performance critical for scientific discovery.

Introduction

The authors leverage large language models to automate AI research by generating, implementing, and evaluating research ideas through an execution-grounded feedback loop. Prior work in AutoML and LLM-based research agents either operates in constrained search spaces or lacks mechanisms to learn from execution results—limiting their ability to improve idea quality over time. The authors’ main contribution is a scalable automated executor that implements and runs hundreds of LLM-generated ideas in parallel for open-ended problems like LLM pre-training and post-training, achieving over 90% execution rates. They use this system to train ideators via evolutionary search and reinforcement learning, finding that evolutionary search efficiently discovers high-performing ideas while RL suffers from diversity collapse and fails to improve peak performance. Their work demonstrates feasibility and exposes key limitations for future systems to address.

Dataset

The authors use two research environments to train and evaluate their automated idea executor:

Pre-Training Environment (nanoGPT)
- Source: Adapted from the nanoGPT speedrun (Jordan et al., 2024), using a 124M GPT-2 model trained on FineWeb corpus (Penedo et al., 2024).
- Objective: Optimize for validation loss (or its reciprocal, 1/loss) under a fixed 25-minute wall-clock budget on 8 H100 GPUs.
- Modifications:
  - Proxy reward (1/loss) replaces raw training time as the optimization target.
  - Evaluation hyperparameters are frozen; inference is restricted to single-token prediction to prevent attention-based reward hacking.
- Final validation uses a locked inference function to ensure fair comparison with human solutions on the original leaderboard.
Post-Training Environment (GRPO)
- Source: Baseline GRPO algorithm (Shao et al., 2024) finetuning Qwen2.5-Math-1.5B (Yang et al., 2024) on MATH dataset (Hendrycks et al., 2021).
- Objective: Maximize validation accuracy on MATH within a fixed wall-clock time.
- Safeguards: Validation code is isolated in a separate file; the executor cannot access or modify it to prevent reward manipulation.
General Setup
- Both environments allow unrestricted ideation scope—from hyperparameter tuning to novel architectures or training methods.
- No constraints are imposed on the types of improvements the ideator model can propose.
- The environments are designed to be open-ended yet measurable, combining innovation space with clear benchmarking.

Method

The system architecture consists of two primary components: the Automated Idea Executor and the Automated AI Researcher, which operate in a closed-loop feedback system. The Automated Idea Executor functions as a high-level API that transforms a batch of natural language ideas into benchmark performance metrics. This component is composed of three core modules: the Implementer, Scheduler, and Worker. The Implementer, hosted on a CPU machine with high I/O capacity, receives a batch of natural language ideas and generates executable code changes. It makes parallelized API calls to a code execution LLM, prompting it with both the idea and the baseline codebase to sample multiple code diff files. To ensure patchability, the model undergoes a sequential self-revision process up to two times, returning the first successfully applied diff. The patched codebase is then uploaded as a .zip file to a cloud bucket. The Scheduler, operating under a fixed clock frequency, periodically downloads new codebases from the cloud. For unexecuted codebases, it assesses the resource requirements of the research environment and prepares a job configuration. The Worker, a GPU-equipped cluster, connects to available resources upon receiving a job configuration from the Scheduler. It runs the experiment and uploads the results, including performance metrics and full metadata (idea content, code change, execution log), to a cloud bucket (e.g., wandb). The Automated AI Researcher, which includes the ideator model, receives the experiment results and uses them to update the ideator via reinforcement learning or evolutionary search, generating new natural language ideas to continue the cycle.

Experiment

Built automated executor to implement LLM-generated ideas and validate them via GPU experiments on LLM pre-training and post-training tasks.
Execution-guided evolutionary search found superior methods: 69.4% accuracy on GRPO (vs 48.0% baseline) and 19.7 min training time on nanoGPT (vs 35.9 min baseline) within 10 epochs.
Claude-4.5-Sonnet and Claude-4.5-Opus achieved high execution rates (up to 90%) and outperformed baselines in best-of-50 sampling; GPT-5 showed lower execution rates.
When using GPT-5 as executor, open-weight models like Qwen3-235B still achieved non-trivial completion rates and outperformed baselines.
Evolutionary search outperformed best-of-N sampling under equal budget, showing effective use of feedback across epochs.
Claude-4.5-Opus showed scaling trends; Claude-4.5-Sonnet saturated early but found optimal hyper-parameter combinations.
RL from execution reward improved average performance but caused mode collapse, converging on simple ideas (e.g., RMSNorm→LayerNorm, EMA) without improving upper-bound performance.
RL training reduced thinking trace length, correlating with higher execution rates but lower idea complexity.
Models generated algorithmic ideas resembling recent research papers, suggesting potential to support frontier AI research.
Top solutions from evolutionary search surpassed human expert benchmarks on GRPO (69.4% vs 68.8%) but lagged behind human speedrun on nanoGPT (19.7 min vs 2.1 min).

Results show that execution-guided search achieves a validation accuracy of 69.4% on the post-training task, significantly outperforming the baseline of 48.0% and surpassing the best human expert's result of 68.8%. On the pre-training task, the search reduces training time to 19.7 minutes, improving upon the baseline of 35.9 minutes and approaching the best human solution of 2.1 minutes.

The authors use an automated executor to evaluate and optimize ideas generated by large language models in two environments: GRPO for post-training and nanoGPT for pre-training. Results show that execution-guided evolutionary search enables models to discover high-performing solutions, with Claude-4.5-Sonnet achieving 69.4% accuracy on GRPO and Claude-4.5-Opus reaching a validation loss of 3.1407 on nanoGPT, both outperforming their respective baselines.

The authors use an automated executor to evaluate and refine ideas generated by large language models for improving LLM training methods. Results show that execution-guided evolutionary search enables models like Claude-4.5-Opus and Claude-4.5-Sonnet to discover methods that significantly outperform baseline approaches, with Claude-4.5-Sonnet achieving a validation accuracy of 69.4% on the GRPO task and Claude-4.5-Opus reducing training time by 45% on the nanoGPT task.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

4달 전

Chenglei Si Zitong Yang Yejin Choi Emmanuel Candès Diyi Yang Tatsunori Hashimoto

초록

One-sentence Summary

Key Contributions

We introduce a scalable automated executor that implements and evaluates LLM-generated research ideas for open-ended problems like LLM pre-training and post-training, achieving over 90% execution success with frontier models such as Claude-4.5-Opus.
Execution-guided evolutionary search proves sample-efficient, discovering post-training and pre-training recipes that significantly outperform baselines (69.4% vs 48.0% and 19.7 vs 35.9 minutes) within ten epochs, though scaling trends remain limited for most models.
Reinforcement learning from execution reward improves average idea quality but suffers from mode collapse, converging to simple, low-diversity ideas and failing to enhance the upper-bound performance critical for scientific discovery.

Introduction

Dataset

The authors use two research environments to train and evaluate their automated idea executor:

Pre-Training Environment (nanoGPT)
- Source: Adapted from the nanoGPT speedrun (Jordan et al., 2024), using a 124M GPT-2 model trained on FineWeb corpus (Penedo et al., 2024).
- Objective: Optimize for validation loss (or its reciprocal, 1/loss) under a fixed 25-minute wall-clock budget on 8 H100 GPUs.
- Modifications:
  - Proxy reward (1/loss) replaces raw training time as the optimization target.
  - Evaluation hyperparameters are frozen; inference is restricted to single-token prediction to prevent attention-based reward hacking.
- Final validation uses a locked inference function to ensure fair comparison with human solutions on the original leaderboard.
Post-Training Environment (GRPO)
- Source: Baseline GRPO algorithm (Shao et al., 2024) finetuning Qwen2.5-Math-1.5B (Yang et al., 2024) on MATH dataset (Hendrycks et al., 2021).
- Objective: Maximize validation accuracy on MATH within a fixed wall-clock time.
- Safeguards: Validation code is isolated in a separate file; the executor cannot access or modify it to prevent reward manipulation.
General Setup
- Both environments allow unrestricted ideation scope—from hyperparameter tuning to novel architectures or training methods.
- No constraints are imposed on the types of improvements the ideator model can propose.
- The environments are designed to be open-ended yet measurable, combining innovation space with clear benchmarking.

Method

Experiment

Built automated executor to implement LLM-generated ideas and validate them via GPU experiments on LLM pre-training and post-training tasks.
Execution-guided evolutionary search found superior methods: 69.4% accuracy on GRPO (vs 48.0% baseline) and 19.7 min training time on nanoGPT (vs 35.9 min baseline) within 10 epochs.
Claude-4.5-Sonnet and Claude-4.5-Opus achieved high execution rates (up to 90%) and outperformed baselines in best-of-50 sampling; GPT-5 showed lower execution rates.
When using GPT-5 as executor, open-weight models like Qwen3-235B still achieved non-trivial completion rates and outperformed baselines.
Evolutionary search outperformed best-of-N sampling under equal budget, showing effective use of feedback across epochs.
Claude-4.5-Opus showed scaling trends; Claude-4.5-Sonnet saturated early but found optimal hyper-parameter combinations.
RL from execution reward improved average performance but caused mode collapse, converging on simple ideas (e.g., RMSNorm→LayerNorm, EMA) without improving upper-bound performance.
RL training reduced thinking trace length, correlating with higher execution rates but lower idea complexity.
Models generated algorithmic ideas resembling recent research papers, suggesting potential to support frontier AI research.
Top solutions from evolutionary search surpassed human expert benchmarks on GRPO (69.4% vs 68.8%) but lagged behind human speedrun on nanoGPT (19.7 min vs 2.1 min).

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

실행 기반 자동 AI 연구로 나아가기

Chenglei Si Zitong Yang Yejin Choi Emmanuel Candès Diyi Yang Tatsunori Hashimoto

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

실행 기반 자동 AI 연구로 나아가기

Chenglei Si Zitong Yang Yejin Choi Emmanuel Candès Diyi Yang Tatsunori Hashimoto

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

실행 기반 자동 AI 연구로 나아가기

Chenglei Si Zitong Yang Yejin Choi Emmanuel Candès Diyi Yang Tatsunori Hashimoto

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters