The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.
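
For concreteness, the sketch below illustrates how a single speedrun task and its evaluation might be represented. The class, field, and function names (SpeedrunTask, HintFormat, reproduced, tolerance) are illustrative assumptions, not the benchmark's actual code or metric.

```python
# Hypothetical sketch of a speedrun task specification; names and the
# success criterion are assumptions made for illustration only.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HintFormat(Enum):
    """Three hint levels, from pseudocode up to a paper-like description."""
    PSEUDOCODE = "pseudocode"
    TEXT_DESCRIPTION = "text_description"
    PAPER_STYLE = "paper_style"


@dataclass
class SpeedrunTask:
    """One of the 19 reproduction tasks: start from the previous record's
    training script and try to match the new record's training time."""
    record_index: int                  # which speedrun record to reproduce
    previous_record_script: str        # path to the prior record's training script
    target_time_seconds: float         # wall-clock time achieved by the new record
    hint: Optional[HintFormat] = None  # optional hint describing the improvement


def reproduced(task: SpeedrunTask, achieved_time_seconds: float,
               tolerance: float = 1.05) -> bool:
    """Illustrative success check: the agent's run comes within a small
    tolerance of the new record's time (the paper's metric may differ)."""
    return achieved_time_seconds <= task.target_time_seconds * tolerance
```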