Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models such as Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. Consequently, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, whereas noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.
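
To make the idea of a leakage-free benchmark concrete, below is a minimal sketch of how a synthetic arithmetic problem generator of this kind might look. The function name, parameters, and expression format are illustrative assumptions, not the paper's actual RandomCalculation implementation; the point is only that problem length (and thus difficulty) is controllable and that freshly sampled expressions cannot appear verbatim in any pretraining corpus.

```python
import random

def random_calculation(num_operands: int, max_value: int = 100, seed: int | None = None) -> dict:
    """Generate one synthetic arithmetic problem and its exact answer.

    num_operands controls the problem length (a proxy for difficulty);
    operands and operators are sampled uniformly at random, so the
    resulting expression is effectively guaranteed to be unseen.
    (Illustrative sketch, not the authors' implementation.)
    """
    rng = random.Random(seed)
    operators = ["+", "-", "*"]
    tokens = [str(rng.randint(1, max_value))]
    for _ in range(num_operands - 1):
        tokens.append(rng.choice(operators))
        tokens.append(str(rng.randint(1, max_value)))
    expression = " ".join(tokens)
    # eval is safe here: the string contains only integers and + - * operators.
    answer = eval(expression)
    return {"question": f"Compute: {expression} = ?", "answer": answer}

if __name__ == "__main__":
    # Sample problems at three difficulty levels (3, 5, and 8 operands).
    for difficulty in (3, 5, 8):
        print(random_calculation(num_operands=difficulty))
```

With a generator like this, the ground-truth answer is known exactly for every sampled problem, so one can construct accurate, noisy, or deliberately incorrect reward signals under identical conditions and compare their effect on RL training without any risk of benchmark contamination.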