
HardTests: Synthesizing High-Quality Test Cases for LLM Coding

He, Zhongmou; Choi, Yee Man; Zhang, Kexun; Ji, Jiabao; Zhou, Junting; Xu, Dejia; Bercovich, Ivan; Zhang, Aidan; Li, Lei
Published: 6/2/2025
Abstract

Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate a comprehensive competitive programming dataset, HARDTESTS, with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
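To make the reported metrics concrete: a test suite acts as a binary verifier over candidate solutions, so its precision is the fraction of accepted solutions that are truly correct, and its recall is the fraction of truly correct solutions that it accepts. The sketch below is a minimal illustration of that measurement, not the paper's actual pipeline; the function names and data layout (solutions paired with ground-truth labels, and a passes_all_tests callback that runs the synthesized tests) are assumptions made for this example.

    # Minimal sketch (illustrative only, not the HARDTESTGEN pipeline) of measuring
    # a synthesized test suite's precision and recall as a correctness verifier.
    from typing import Callable, List, Tuple

    def evaluate_test_suite(
        solutions: List[Tuple[str, bool]],        # (solution_code, ground_truth_correct)
        passes_all_tests: Callable[[str], bool],  # runs the synthesized tests on one solution
    ) -> Tuple[float, float]:
        """Return (precision, recall) of the test suite over the given solutions."""
        accepted = [(code, ok) for code, ok in solutions if passes_all_tests(code)]
        truly_correct = [code for code, ok in solutions if ok]

        # Precision: of the solutions the tests accept, how many are actually correct?
        precision = sum(ok for _, ok in accepted) / len(accepted) if accepted else 0.0
        # Recall: of the actually correct solutions, how many do the tests accept?
        recall = (
            sum(passes_all_tests(code) for code in truly_correct) / len(truly_correct)
            if truly_correct else 0.0
        )
        return precision, recall

Under this framing, a higher-precision test suite rejects more well-disguised wrong solutions, which is the failure mode the abstract highlights for hard problems.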