
Abstract

Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.
