RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) possess remarkable reasoning capabilities and that their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively, with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method for rule-based reasoning that combines a wide collection of curated tasks with a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of the different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for the pre-hoc, human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks show that RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1% average points on eight ID tasks and $\Delta$10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach is also more computationally efficient than prior dynamic sampling methods for RL.
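To make the sampling mechanism concrete, the sketch below shows one plausible instantiation in Python. The abstract does not specify the weight-update rule, so the softmax over negative mean historical rewards, the reward window size, and the names (`DomainAwareSampler`, `update`, `sample_batch`) are illustrative assumptions rather than the paper's actual implementation.

```python
import math
import random
from collections import defaultdict


class DomainAwareSampler:
    """Sketch of domain-aware dynamic sampling: keep a window of recent
    rewards per domain and resample each training batch so that low-reward
    (harder) domains are drawn more often."""

    def __init__(self, domains, temperature=1.0, window=128):
        self.domains = list(domains)
        self.temperature = temperature
        self.window = window
        self.rewards = defaultdict(list)  # domain -> recent rollout rewards

    def update(self, domain, reward):
        """Record the reward of a rollout from `domain`, keeping a fixed window."""
        hist = self.rewards[domain]
        hist.append(reward)
        if len(hist) > self.window:
            hist.pop(0)

    def weights(self):
        """Softmax over the negative mean historical reward per domain
        (an assumed form of the update; the paper's exact rule may differ)."""
        means = [
            sum(self.rewards[d]) / len(self.rewards[d]) if self.rewards[d] else 0.0
            for d in self.domains
        ]
        logits = [-m / self.temperature for m in means]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample_batch(self, pools, batch_size):
        """Draw a batch: pick a domain per slot by the current weights,
        then sample an example uniformly from that domain's pool.
        `pools` maps each domain name to its list of training examples."""
        picked = random.choices(self.domains, weights=self.weights(), k=batch_size)
        return [random.choice(pools[d]) for d in picked]
```

In an RL loop, one would call `update(domain, reward)` after each rollout and `sample_batch(...)` before each training step; the temperature then controls how aggressively sampling shifts toward low-reward domains.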