Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards