BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs

Lu, Guilong ; Guo, Xuntao ; Zhang, Rongjunchen ; Zhu, Wenqiao ; Liu, Ji

Publication date: 5/27/2025

Abstract

Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
