Qwen3-32B Agent Climbs the TerminalBench Ranking
Scale AI, a data-labeling startup, has confirmed a major investment from Meta that lifts its valuation to $29 billion. Under the deal, Meta acquires a 49% stake for $14.3 billion, and CEO Alexandr Wang steps down to join Meta's AI initiatives; Wang will remain on Scale's board while interim CEO Jason Droege oversees operations. The funding is intended to support growth and return capital to shareholders, while Scale maintains its independence. The investment highlights Meta's push to strengthen its AI capabilities amid competition from rivals such as OpenAI and Google; last year, Meta lost 4.3% of its top AI talent to other labs, underscoring the urgency of securing expertise. Scale AI has been a critical partner for leading AI labs, supplying labeled data for training large language models (LLMs), and has recently hired top talent, including PhD researchers and senior engineers, to meet rising demand for high-quality data.

A GitHub project, terminal-bench-rl, details training methods for long-horizon terminal agents using reinforcement learning (RL). Built on UC Berkeley's rLLM framework, the project includes custom environments and tools for coding tasks, with the goal of producing a sophisticated LLM agent capable of handling complex terminal-based workflows. The author trained Qwen3-32B across various hardware setups, from 32x H100 GPUs (hardware valued at roughly $1 million) down to smaller configurations such as 2x A100s. While full-scale training was limited by cost, the code and dataset have been tested to run reliably on everything from a single VM instance to multi-node clusters.

A key component is the reward system, which combines Answer Verification (65% weight) with LLM-as-a-Judge (35% weight). The judge role uses models such as Claude Sonnet 4 and Qwen3 Coder, with Claude Sonnet 4 ranking highest due to its strict, accurate scoring; the system also supports dynamic switching between judge models to handle compute limits (see the reward sketch below).

The training framework leverages Group Relative Policy Optimization (GRPO), which enhances learning by comparing responses within a group of rollouts for the same task, making it effective for structured reasoning tasks (see the GRPO sketch below). Each training task generates parallel rollouts in isolated Docker environments, with verification scripts ensuring outcomes meet the specified criteria (see the rollout sketch below).

The dataset includes 331 tasks spanning varying difficulty levels, each structured with a task ID, prompt, Dockerfile setup, and test functions (see the task-schema sketch below). A synthetic data pipeline, powered by Claude Code and Opus 4, automates task generation and validation. The project also includes tools for bash commands and file operations, enabling agents to execute complex workflows (see the tool sketch below).

Despite limited compute, the Qwen3-32B agent achieved a 13.75% score on Stanford's TerminalBench leaderboard, ranking 19th and outperforming models such as GPT-4.1 and DeepSeek R1. Planned improvements include curriculum learning, dataset expansion, and smarter data filtering. The project emphasizes scalability, with training configurations adaptable to different hardware setups, and while the author acknowledges the challenge of affording full-scale training, the infrastructure is designed to support both development and production environments. The work underscores the growing importance of training data and RL techniques in advancing AI capabilities for coding and terminal tasks.
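The weighted reward and judge fallback can be pictured with a minimal sketch. This illustrates the described 65/35 split and dynamic judge switching; it is not code from the terminal-bench-rl repository, and every name and signature here is an assumption.

```python
# Hypothetical sketch of the weighted reward; names and signatures are
# illustrative, not taken from the terminal-bench-rl codebase.
from dataclasses import dataclass
from typing import Callable

ANSWER_WEIGHT = 0.65   # Answer Verification weight reported by the project
JUDGE_WEIGHT = 0.35    # LLM-as-a-Judge weight

@dataclass
class RewardInputs:
    verification_passed: float  # fraction of test checks passed, in [0, 1]
    judge_score: float          # judge model's score, in [0, 1]

def combined_reward(inputs: RewardInputs) -> float:
    """Blend deterministic verification with an LLM judge's assessment."""
    return ANSWER_WEIGHT * inputs.verification_passed + JUDGE_WEIGHT * inputs.judge_score

# Dynamic judge switching: try the strictest judge first and fall back when
# compute or rate limits are hit (the fallback order is an assumption).
JUDGE_PREFERENCE = ["claude-sonnet-4", "qwen3-coder"]

def score_with_available_judge(judges: dict[str, Callable[[str], float]],
                               transcript: str) -> float:
    for name in JUDGE_PREFERENCE:
        try:
            return judges[name](transcript)
        except (KeyError, RuntimeError):  # judge unavailable or over budget
            continue
    return 0.0  # no judge available: rely on the verification signal alone
```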
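GRPO's core idea, scoring each rollout against its own group rather than against a learned value baseline, reduces to a few lines. The sketch below follows the standard GRPO formulation (reward minus group mean, divided by group standard deviation); the project's actual implementation lives inside the rLLM framework.

```python
# Group-relative advantage as used by GRPO: each rollout is reinforced only
# insofar as it beats its sibling rollouts for the same task.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four parallel rollouts of one task
print(group_relative_advantages([0.9, 0.4, 0.4, 0.1]))
```

Because advantages are normalized within the group, a task where every rollout fails (or every rollout succeeds) contributes essentially no gradient signal, which is one reason difficulty curation and the planned curriculum learning matter.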
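The rollout loop, one isolated Docker container per attempt followed by a verification script, might look roughly like the following. This sketch shells out to the docker CLI for clarity; the real project manages environments through rLLM, and the helper below is hypothetical.

```python
# Hedged sketch of an isolated rollout: spin up a container from the task's
# image, execute the agent's commands inside it, then run the verification
# command. Exit code 0 from verification means the task criteria are met.
import subprocess
import uuid

def run_rollout(image: str, agent_commands: list[str], test_cmd: str) -> bool:
    name = f"rollout-{uuid.uuid4().hex[:8]}"
    # Keep the container alive so we can exec multiple commands into it.
    subprocess.run(["docker", "run", "-d", "--name", name, image,
                    "sleep", "infinity"], check=True)
    try:
        for cmd in agent_commands:  # commands the agent chose, run in isolation
            subprocess.run(["docker", "exec", name, "bash", "-c", cmd], check=False)
        result = subprocess.run(["docker", "exec", name, "bash", "-c", test_cmd])
        return result.returncode == 0
    finally:
        subprocess.run(["docker", "rm", "-f", name], check=False)  # always clean up
```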
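A dataset entry, per the description, carries a task ID, prompt, Dockerfile setup, and test functions. The field names and values below are guesses at the schema, shown only to make that structure concrete.

```python
# Illustrative shape of one of the 331 dataset entries; every field name
# and value here is an assumption, not the project's actual schema.
task = {
    "task_id": "compress-logs-easy-001",  # hypothetical ID
    "difficulty": "easy",
    "prompt": "Compress all .log files under /var/app into logs.tar.gz.",
    "dockerfile": "FROM ubuntu:22.04\nRUN mkdir -p /var/app",
    "tests": [
        # verification check run inside the container after the rollout
        "test -f /var/app/logs.tar.gz",
    ],
}
```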
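Finally, the bash and file-operation tools the agent calls are presumably declared as a schema the model sees in context. The exact interface is not documented in this summary, so the following is a generic illustration.

```python
# Assumed shape of the agent's tool interface (names are illustrative):
# the project exposes bash execution and file operations to the policy.
import json

TOOLS = [
    {
        "name": "bash",
        "description": "Run a shell command in the task container.",
        "parameters": {"command": "string"},
    },
    {
        "name": "write_file",
        "description": "Create or overwrite a file at the given path.",
        "parameters": {"path": "string", "content": "string"},
    },
]

# A tool list like this is typically serialized into the model's context.
print(json.dumps(TOOLS, indent=2))
```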
By combining structured environments, robust reward systems, and scalable infrastructure, the project offers a framework for developing high-performing AI agents despite resource constraints.