Evaluating LLMs on the Daily Quartiles Puzzle: Which Model Performs Best?
The Quartiles puzzle has become a popular daily activity for many Apple News+ subscribers, thanks to its engaging word-forming challenge. The game presents a 4x5 grid of tiles, each containing 2 to 4 letters. Players form valid English words by combining 1 to 4 tiles under specific rules: each tile can be used only once per word, letters within a tile cannot be rearranged, and tiles must be selected in sequence from left to right. The scoring system rewards longer words, with a 40-point bonus for finding all five 4-tile words, known as "Quartiles."

To explore how effectively Large Language Models (LLMs) can solve this puzzle, I conducted a study using several leading models. The puzzle selected for this evaluation, dated 05 May 2025, features 25 possible words and a maximum score of 132 points. The goal of the test was to gauge each model's ability to identify valid words and to prioritize the five Quartiles. Here's a breakdown of the models and their performances:

OpenAI — ChatGPT 4o
Score: 1 point
Valid words identified: 1-tile word "CHE"
Performance: Despite being a multimodal flagship model, ChatGPT 4o struggled significantly. It produced mostly invalid outputs and managed only a single 1-tile word, scoring just 1 point. This poor showing likely reflects the model's difficulty in recognizing valid word combinations under strict puzzle rules.

OpenAI — ChatGPT o4-mini
Score: 16 points
Valid words identified: 1-tile word "Pro," 2-tile word "No," 3-tile words "Provider" and "Script," and one Quartile, "Provisions"
Performance: ChatGPT o4-mini performed best among the tested models. It identified a balanced mix of short and long words, including one Quartile, demonstrating reasonable accuracy and versatility in handling the puzzle's constraints.
Google — Gemini 2.5 Pro
Score: 2 points
Valid words identified: 1-tile words "PRO" and "NO"
Performance: Gemini 2.5 Pro, known for its advanced reasoning and coding capabilities, fared poorly. It took over a minute to respond and identified only two 1-tile words while hallucinating several invalid combinations. This indicates a significant weakness in pattern recognition and rule adherence for this specific task.

Qwen3-235B-A22B
Score: 1 point
Valid words identified: 1-tile word "NO"
Performance: Despite being a powerful model, Qwen3-235B-A22B also performed poorly. It took about 8 minutes to respond and identified only one valid 1-tile word. The high rate of incorrect and made-up words suggests a lack of precision in the model's word formation under puzzle-specific rules.

DeepSeek R1
Score: 2 points
Valid words identified: 1-tile words "Pro" and "No"
Performance: DeepSeek R1, an open-source model focused on efficient reasoning, similarly took a long time (around 7 minutes) to respond. It identified only two 1-tile words, missing longer combinations and the Quartiles. While designed for competitive reasoning performance, its inability to parse the puzzle effectively suggests limitations in contextual understanding and speed.

Anthropic — Claude 3.7 Sonnet
Score: 8 points
Valid words identified: 2-tile words "Script," "Holder," and "Place," and 3-tile word "Visions"
Performance: Claude 3.7 Sonnet missed several straightforward 1-tile words, such as "Pro" and "No," that most other models found easily. Despite this, it still scored higher than some of the other models by finding longer words, though none were Quartiles.

Observations and Evaluations

Several key observations emerged from the tests:

Common letter combinations: Many models incorrectly offered short segments like "ES" and "AR" as valid words, indicating a tendency toward common English letter combinations even when they do not form complete words.
Accuracy on simple words: 1-tile words like "Pro" and "No" were generally easy for models to find, except for Claude 3.7 Sonnet, which surprisingly missed them. These should have been a low bar for all models.

Hallucinations: All models generated made-up or incorrect words, suggesting difficulty in accurately combining tiles under the game's rules.

Industry Insights and Company Profiles

OpenAI: Known for its advanced LLMs, OpenAI's ChatGPT o4-mini stood out in this evaluation. Its ability to identify a mix of short and long words, including a Quartile, suggests it may have better fine-tuning or inherent word-recognition capability for such tasks.

Google: Despite Gemini 2.5 Pro's strengths in reasoning and coding, its poor performance on the Quartiles puzzle highlights room for improvement in rule-based, constrained tasks.

Anthropic: Claude 3.7 Sonnet's mixed performance, missing simple words while identifying longer ones, indicates a need for better calibration in handling word segmentation and combination.

DeepSeek: As an open-source model, DeepSeek R1's performance underscores the ongoing challenge of building LLMs that excel in niche, rule-constrained environments.

Next Steps

The Quartiles puzzle includes a shuffle feature that rearranges the tiles, potentially altering the difficulty level. To better understand the models' adaptability, I plan to simulate multiple configurations of the same puzzle and share the results in the coming days. This follow-up study should provide deeper insight into how LLMs handle variable puzzle conditions and may help identify more robust strategies for solving such word games.
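The hallucinations noted above are mechanically detectable: under the rules described earlier, a proposed word is valid for a grid only if it can be segmented into at most four unused tiles taken in left-to-right order. A minimal checker, sketched in Python against a hypothetical tile list (not the actual puzzle's tiles, which are not reproduced here):

```python
def formable(word, tiles, max_tiles=4):
    """Check whether `word` splits into 1 to `max_tiles` tiles taken in
    left-to-right order, with no tile reused within the word."""
    def search(pos, start, used):
        if pos == len(word):
            return used >= 1
        if used == max_tiles:
            return False
        for i in range(start, len(tiles)):
            tile = tiles[i]
            # A tile is usable only if it matches the next letters of the
            # word and sits to the right of every tile already used.
            if word.startswith(tile, pos) and search(pos + len(tile), i + 1, used + 1):
                return True
        return False
    return search(0, 0, 0)

# Fragments like "ES" fail unless they appear as a whole tile:
print(formable("script", ["scr", "ipt", "no"]))  # True  (scr + ipt)
print(formable("es", ["scr", "ipt", "no"]))      # False
print(formable("iptscr", ["scr", "ipt"]))        # False (violates tile order)
```

Running every word a model proposes through a check like this would separate genuine misses from outright hallucinations before scoring.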
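The planned follow-up can be prototyped directly: shuffle the tile order and re-solve by brute force, since a 20-tile layout yields only a few thousand 1-to-4-tile combinations. Below is a sketch under stated assumptions: the tile list and dictionary are hypothetical stand-ins (the real puzzle's tiles are not reproduced in this article), and the per-length point values are assumed, since the article states only that longer words score more; the 40-point bonus for finding all five Quartiles is omitted here.

```python
import random
from itertools import combinations

# Hypothetical tiles and dictionary, for illustration only.
TILES = ["pro", "vi", "sio", "ns", "scr", "ipt", "no", "che"]
DICTIONARY = {"pro", "no", "che", "script", "visions", "provisions"}

# Assumed per-word points by tile count (not the game's published values).
POINTS = {1: 1, 2: 2, 3: 4, 4: 8}

def solve(tiles):
    """Return {word: tiles_used} for every dictionary word formable from
    1-4 tiles taken in left-to-right order without reuse."""
    found = {}
    for n in range(1, 5):
        for combo in combinations(range(len(tiles)), n):
            word = "".join(tiles[i] for i in combo)
            if word in DICTIONARY:
                found.setdefault(word, n)
    return found

random.seed(1)  # reproducible shuffles
for _ in range(3):
    layout = TILES[:]
    random.shuffle(layout)
    words = solve(layout)
    score = sum(POINTS[n] for n in words.values())
    print(layout, sorted(words), score)
```

Under the stated left-to-right rule, a shuffle can change which words are formable at all (for example, reversing the hypothetical layout above leaves only the 1-tile words reachable), so difficulty genuinely varies per configuration.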
