CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, Kai Chen
Abstract

Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also for serving as the reward model that guides LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization of regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblem, formula, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns, to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at https://github.com/open-compass/CompassVerifier.
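In practice, a model-based verifier of this kind is prompted with the question, the gold answer, and a candidate response; its verdict can serve both as an evaluation judgment and as a binary outcome reward. The sketch below illustrates that pattern with Hugging Face transformers. The checkpoint name, prompt template, and verdict parsing are illustrative assumptions, not the repository's actual interface; see the linked code for the real usage.

```python
# Minimal sketch of model-based answer verification, in the spirit of
# CompassVerifier. The model id and prompt below are hypothetical; consult
# https://github.com/open-compass/CompassVerifier for the actual interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "open-compass/CompassVerifier-3B"  # hypothetical checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def verify(question: str, gold_answer: str, response: str) -> bool:
    """Ask the verifier whether `response` matches `gold_answer`.

    Returns True for a correct response; the boolean verdict doubles as a
    binary outcome reward (1.0 / 0.0) for reinforcement learning.
    """
    prompt = (
        "Judge whether the candidate response answers the question "
        "correctly with respect to the gold answer. Reply with a single "
        f"word, Correct or Incorrect.\n\nQuestion: {question}\n"
        f"Gold answer: {gold_answer}\nCandidate response: {response}\nVerdict:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    # Decode only the newly generated tokens, then parse the verdict word.
    verdict = tokenizer.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    verdict = verdict.strip().lower()
    return verdict.startswith("correct")

# Usage: the verdict can feed an accuracy metric or an RL reward signal.
reward = 1.0 if verify("What is 7 * 8?", "56", "7 times 8 equals 56.") else 0.0
```

A single verifier interface like this replaces per-benchmark regex rules: the same call handles free-form math derivations, multi-subproblem answers, and formula or sequence outputs, which is the unification the paper targets.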