The Invisible Leash: Why RLVR May Not Escape Its Origin

Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI's capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model's reasoning boundary or merely amplifies high-reward outputs that the base model already knows, yielding improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective: RLVR is constrained by the base model's support, since it cannot sample solutions with zero initial probability, and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs its expansion under larger sampling budgets, with the trained model failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that although RLVR sometimes increases token-level entropy, producing greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
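
As a minimal sketch of the support argument (the notation below is ours and assumes a KL-regularized RLVR objective with penalty coefficient \beta, not necessarily the paper's exact formulation): maximizing expected verifiable reward r(x, y) under a KL penalty toward the base policy \pi_{\text{base}} has a closed-form optimum that merely reweights the base distribution, so any answer the base model assigns zero probability keeps zero probability after training.

    \pi^{*}(y \mid x) = \frac{\pi_{\text{base}}(y \mid x)\, \exp\big(r(x, y)/\beta\big)}{\sum_{y'} \pi_{\text{base}}(y' \mid x)\, \exp\big(r(x, y')/\beta\big)},
    \qquad
    \pi_{\text{base}}(y \mid x) = 0 \;\Rightarrow\; \pi^{*}(y \mid x) = 0.

Purely on-policy RLVR without an explicit KL term faces the same practical restriction: answers with (near-)zero base probability are essentially never sampled, so they receive no reward signal that could reinforce them.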
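
To make the empirical quantities concrete, the following is a small illustrative sketch of how pass@k, answer-level entropy, and empirical support could be computed from sampled final answers; the toy data, verifier, and function names are hypothetical placeholders, not the paper's code.

    import math
    from collections import Counter


    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: probability that at least one of k
        samples drawn without replacement from n completions (c correct)
        is correct."""
        if n - c < k:
            return 1.0
        return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))


    def answer_entropy(answers: list[str]) -> float:
        """Shannon entropy (nats) of the empirical distribution over final
        answers, a proxy for answer-level diversity at a sampling budget."""
        counts = Counter(answers)
        total = len(answers)
        return -sum((c / total) * math.log(c / total) for c in counts.values())


    def empirical_support(answers: list[str], verify) -> set[str]:
        """Distinct correct answers recovered within the sampling budget."""
        return {a for a in answers if verify(a)}


    def is_correct(answer: str) -> bool:
        """Placeholder verifier for a toy problem with two valid answers."""
        return answer in {"12", "15"}


    # Toy samples standing in for k generations from a base and an RLVR model.
    base_samples = ["12", "12", "15", "7", "12", "15"]
    rlvr_samples = ["12", "12", "12", "12", "12", "12"]

    for name, samples in [("base", base_samples), ("RLVR", rlvr_samples)]:
        n, c = len(samples), sum(is_correct(a) for a in samples)
        print(f"{name}: pass@1={pass_at_k(n, c, 1):.2f} "
              f"pass@4={pass_at_k(n, c, 4):.2f} "
              f"answer_entropy={answer_entropy(samples):.2f} "
              f"support={sorted(empirical_support(samples, is_correct))}")

On these toy samples, the RLVR-style model attains the higher pass@1 but zero answer-level entropy and a smaller set of distinct correct answers, mirroring the precision-versus-coverage tradeoff described in the abstract.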