REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu

Abstract

Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in two critical limitations: (1) vulnerability to data contamination and insufficient difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), which forces the costly and perpetual creation of new questions with substantial human effort; and (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems at once. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings. Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluation. Two key mechanistic insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; and (2) models trained with the "long2short" technique retain more of their single-problem accuracy under REST, outperforming their standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.
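
As a rough illustration of what asking multiple problems at once could look like in practice, the following minimal Python sketch concatenates several benchmark questions into one prompt and scores each answer separately. The abstract does not specify the prompt format or the answer-extraction rule, so the `Q<k>:` / `A<k>:` convention and the helper names `build_rest_prompt`, `extract_answers`, and `accuracy` are all assumptions made here for illustration, not the authors' implementation.

```python
import re
from typing import List, Optional


def build_rest_prompt(questions: List[str]) -> str:
    """Join several questions into one prompt (hypothetical format)."""
    instruction = (
        "Solve all of the following problems. "
        "Give each final answer on its own line as 'A<k>: <answer>'."
    )
    numbered = [f"Q{i}: {q}" for i, q in enumerate(questions, start=1)]
    return instruction + "\n\n" + "\n".join(numbered)


def extract_answers(response: str, n: int) -> List[Optional[str]]:
    """Pull the 'A<k>: ...' lines out of a model response.

    Problems the model never answered stay as None, so skipped
    questions count as wrong when scored.
    """
    answers: List[Optional[str]] = [None] * n
    for match in re.finditer(r"^A(\d+):\s*(.+)$", response, flags=re.MULTILINE):
        k = int(match.group(1))
        if 1 <= k <= n:
            answers[k - 1] = match.group(2).strip()
    return answers


def accuracy(predicted: List[Optional[str]], gold: List[str]) -> float:
    """Per-problem exact-match accuracy over one multi-problem prompt."""
    correct = sum(p is not None and p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

Comparing this per-problem accuracy against the accuracy obtained when the same questions are asked one at a time would give a simple measure of the degradation under stress testing that the abstract describes. Exact string matching is used only to keep the sketch short; a real evaluation of math answers would need a more tolerant comparison.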