
REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu
Abstract

Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) they are vulnerable to data contamination and are becoming less challenging (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly and perpetual creation of new questions with large human effort; (2) they fail to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST specifically evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key mechanistic insights emerge from our analysis: (1) the "overthinking trap" is a critical factor contributing to the performance degradation; (2) models trained with the "long2short" technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation.
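
To make the idea of simultaneous testing concrete, the following is a minimal sketch of how several benchmark questions might be packed into a single prompt and scored per question. It is an illustration under stated assumptions, not the authors' implementation: the prompt template, the `A<k>: <answer>` output convention, the exact-match scoring, and the function names `build_multi_problem_prompt`, `extract_answers`, and `accuracy` are all hypothetical and do not come from the abstract.

```python
import re
from typing import List, Optional


def build_multi_problem_prompt(questions: List[str]) -> str:
    """Pack several questions into one prompt (hypothetical template)."""
    instruction = (
        "Solve all of the following problems. "
        "Give each final answer on its own line as 'A<k>: <answer>'."
    )
    numbered = [f"Q{i}: {q}" for i, q in enumerate(questions, start=1)]
    return instruction + "\n\n" + "\n".join(numbered)


def extract_answers(response: str, n: int) -> List[Optional[str]]:
    """Read 'A<k>: <answer>' lines; questions with no such line stay None."""
    answers: List[Optional[str]] = [None] * n
    pattern = re.compile(r"^A(\d+):[ \t]*(.+?)[ \t]*$", flags=re.MULTILINE)
    for match in pattern.finditer(response):
        k = int(match.group(1))
        if 1 <= k <= n:
            answers[k - 1] = match.group(2)
    return answers


def accuracy(predicted: List[Optional[str]], gold: List[str]) -> float:
    """Exact-match accuracy (an assumed metric); unanswered counts as wrong."""
    correct = sum(
        p is not None and p.strip() == g.strip()
        for p, g in zip(predicted, gold)
    )
    return correct / len(gold)
```

In use, the prompt returned by `build_multi_problem_prompt` would be sent to the model once, and the accuracy computed from its single response could then be compared with the accuracy obtained when the same questions are asked one at a time. The gap between the two is the kind of degradation the abstract describes.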