
This paper presents a case study of coding tasks performed by the latest reasoning models from OpenAI, i.e., o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. To challenge them further, we introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, the performance of the o1 models declines significantly, falling behind Claude 3.5. Moreover, they consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that this performance variability is due to instruction comprehension: the reasoning mechanism boosts performance when all expectations are captured, but exacerbates errors when key expectations are missed, potentially influenced by input length. As such, we argue that the coding success of reasoning models hinges on a top-notch base model and SFT to ensure meticulous adherence to instructions.