HyperAI

LLM Skirmish Tournament Reveals Coding Strengths and Flaws in Frontier AI Models

LLM Skirmish is a competitive benchmark that evaluates large language models through real-time strategy gameplay. It builds on the mechanics of Screeps, a programming-focused MMO RTS in which players control their in-game units entirely through code. The goal is to test how well frontier LLMs can apply their coding abilities in a dynamic, adversarial environment.

Each tournament features five rounds of 1v1 matches. Five models each play every other model once per round, which gives 10 matches per round and 50 matches in total. Models write JavaScript scripts via OpenCode, an open-source agentic coding framework that runs in isolated Docker containers. An orchestrator manages the tournament, sending prompts and validating scripts, with up to three attempts allowed per script if validation fails.

Each model starts with a spawn building, one military unit, and three economic units, and the objective is to destroy the opponent’s spawn. Matches last up to 2,000 game frames, with each model limited to one second of computation per frame. If no spawn is destroyed within that limit, the winner is determined by score.

The benchmark emphasizes in-context learning: models can review prior match logs and adjust their strategies across rounds. Across all tournaments, each model submits 25 scripts, and 7,750 simulated matches are run to calculate robust average win rates per round.

Claude Opus 4.5 leads with 85 wins and an 85% win rate, earning an Elo rating of 1778. GPT 5.2 follows with 68 wins (68% win rate, Elo 1625), while Grok 4.1 Fast and GLM 4.7 trail with win rates of 39% and 32%, respectively.

Gemini 3 Pro shows an unusual pattern: a strong 70% win rate in round 1, dropping to just 15% in rounds 2–5. Its scripts are much shorter than those of the top performers, suggesting that it relies on simple strategies that work well early on. In later rounds, however, it appears to suffer from context rot, overloading its input with prior results, which degrades its performance.

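
The tournament arithmetic described above can be checked with a short sketch. It is only an illustration, not the benchmark's own orchestrator code: the function name `round_robin` is hypothetical, and the step from 25 scripts per model to 5 tournaments assumes one script per model per round, which the article does not state explicitly.

```python
from itertools import combinations

# Model names as listed in the article.
MODELS = ["Claude Opus 4.5", "GPT 5.2", "Grok 4.1 Fast", "GLM 4.7", "Gemini 3 Pro"]
ROUNDS_PER_TOURNAMENT = 5

# Assumption (not stated in the article): one script per model per round,
# so 25 scripts per model would correspond to 5 tournaments.
SCRIPTS_PER_MODEL = 25
TOURNAMENTS = SCRIPTS_PER_MODEL // ROUNDS_PER_TOURNAMENT


def round_robin(models):
    """Return every unordered pair of models: one 1v1 match per pair."""
    return list(combinations(models, 2))


pairings = round_robin(MODELS)
matches_per_round = len(pairings)
matches_per_tournament = matches_per_round * ROUNDS_PER_TOURNAMENT
# Each model faces the 4 other models once per round.
matches_per_model = (len(MODELS) - 1) * ROUNDS_PER_TOURNAMENT * TOURNAMENTS

print("matches per round:", matches_per_round)
print("matches per tournament:", matches_per_tournament)
print("matches per model across tournaments:", matches_per_model)
```

Under these assumptions the sketch gives 10 matches per round and 50 per tournament, as the article reports, and 100 matches per model overall, which is consistent with 85 wins corresponding to an 85% win rate.
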
Notably, four of the five models improve their average win rates from round 1 to round 5, indicating effective in-context learning. Claude Opus 4.5 shows the largest gain (+20%), followed by GLM 4.7 (+16%), GPT 5.2 (+7%), and Grok 4.1 Fast (+6%).

In head-to-head matchups, GLM 4.7 and Gemini 3 Pro each win exactly 50% of their matches against each other, making them true rivals. GPT 5.2 emerges as the key challenger to Claude Opus 4.5 and is often the only model able to defeat it in later rounds. Claude remains the dominant force, but GPT 5.2 consistently pushes it, preventing a clean sweep.

Cost efficiency varies widely. Claude Opus 4.5 achieves the highest Elo, but at a steep $4.12 per round. GPT 5.2 delivers nearly 1.7 times the Elo per dollar, making it the most cost-effective performer.

Overall, LLM Skirmish reveals that while coding prowess is a key strength of frontier models, real-world performance also depends on strategy refinement, context management, and cost efficiency. The results highlight both the promise and the limitations of current LLMs in complex, evolving environments.
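
To make the cost-efficiency comparison concrete, here is a back-of-the-envelope sketch of the Elo-per-dollar arithmetic. It uses only the figures quoted above. The article does not report GPT 5.2's cost per round, so the implied cost computed here is a rough estimate derived from the rounded "nearly 1.7 times" ratio, not a reported number. The helper `elo_per_dollar` is a hypothetical name.

```python
# Figures quoted in the article.
CLAUDE_ELO = 1778
CLAUDE_COST_PER_ROUND = 4.12  # USD
GPT_ELO = 1625
EFFICIENCY_RATIO = 1.7  # "nearly 1.7 times", so treat as an upper-bound approximation


def elo_per_dollar(elo, cost_per_round):
    """Elo rating divided by cost per round, in Elo points per USD."""
    return elo / cost_per_round


claude_eff = elo_per_dollar(CLAUDE_ELO, CLAUDE_COST_PER_ROUND)

# Estimate only: GPT 5.2's efficiency implied by the quoted ratio,
# and the cost per round that would produce it.
implied_gpt_eff = claude_eff * EFFICIENCY_RATIO
implied_gpt_cost = GPT_ELO / implied_gpt_eff

print(f"Claude Opus 4.5: {claude_eff:.1f} Elo per dollar")
print(f"GPT 5.2 (implied): {implied_gpt_eff:.1f} Elo per dollar")
print(f"GPT 5.2 (implied cost per round): ${implied_gpt_cost:.2f}")
```

With these inputs, Claude Opus 4.5 comes out at roughly 430 Elo per dollar, and the quoted ratio would put GPT 5.2 at a little over $2 per round. Because the ratio is rounded, the implied cost should be read as an order-of-magnitude check rather than a measured value.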
