AI Generates Verifiable Code Automatically, Reducing Dependence on Human Prior Knowledge
Can AI-generated code be automatically verified for reliability? Scientists from the Shanghai Artificial Intelligence Laboratory's Veri-Code team, led by young scientist Jie Fu, have proposed a novel approach to reduce AI systems' dependence on human prior knowledge in software verification, a critical step toward ensuring the safety of AI-generated code in high-stakes domains such as nuclear power control systems, autonomous vehicles, and operating systems.

Current AI programming tools face a fundamental challenge: they rely on human-written test cases to validate code, a method that becomes impractical as complexity grows. The number of test cases needed grows exponentially with code complexity (even basic programming problems often require dozens of cases), and testing still fails to catch deep logical flaws because path coverage is inherently incomplete. This limitation becomes especially dangerous once AI begins writing mission-critical software.

To address this, Fu's team leverages formal verification languages such as Dafny, which allow code to be written alongside mathematically rigorous specifications. These specifications can be proven true or false by a logical engine, offering a level of certainty akin to mathematical theorems regardless of the AI's intelligence level; as Turing Award winner Yoshua Bengio noted, "Even with superintelligence, a theorem remains true." (A minimal Dafny example appears below.)

Traditional formal verification methods, however, depend heavily on human experts to write specifications and provide supervision signals, which is unsustainable at scale. Writing formal specifications for 50 beginner-level programs took two computer scientists 220 hours, and verifying the seL4 operating system kernel required 20 person-years of effort.

The team's solution, detailed in their paper "Re:Form—Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny," published on arXiv [1], uses reinforcement learning (RL) to train large language models (LLMs) to generate code and specifications directly in Dafny, minimizing reliance on human input. The approach eliminates the need for manually annotated chain-of-thought (CoT) reasoning and subjective human judgments; instead, the objective, mathematically sound feedback from Dafny's verification engine serves as the reward signal. This aligns with Rich Sutton's "Bitter Lesson": long-term AI progress comes not from embedding human reasoning into models but from scaling up computation and learning-based methods. By relying on formal verification as the feedback mechanism, the team avoids the pitfalls of human bias and cognitive limits, which matters most as AI surpasses human performance.

The method is grounded in the idea that software, being inherently formal, is an ideal domain for applying mathematical reasoning. The team argues that if the universe can be described mathematically, as MIT's Max Tegmark has proposed, then software, a structured and logical system, can certainly be modeled and verified precisely using formal languages.

To make this scalable, the team developed a pipeline that converts existing Python code into Dafny code with verified specifications. They also created DafnyComp, a benchmark that tests whether models generalize across modular code components. After fine-tuning Qwen2.5-Coder to 80% syntactic correctness, they applied reinforcement learning with a novel reward metric called the Specification Superiority Rate (SSR).
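To make the Dafny idea concrete, here is a minimal, hypothetical sketch; the method and its specification are invented for this article and are not drawn from the team's dataset. The ensures clauses are the formal specification, and Dafny's verifier proves that they hold for every possible input rather than for a handful of hand-written test cases. In the team's training loop, it is this kind of machine-checkable pass/fail verdict that supplies the reward signal.

    // Hypothetical illustration: a Dafny method whose specification the
    // verifier proves for all inputs, unlike unit tests, which only check
    // the cases a human happened to write down.
    method Max(a: int, b: int) returns (m: int)
      ensures m >= a && m >= b   // the result is at least as large as both inputs
      ensures m == a || m == b   // and it is one of the inputs
    {
      if a >= b {
        m := a;
      } else {
        m := b;
      }
    }

If the body were changed so that some input violated an ensures clause, the verifier would reject the program outright; no test suite has to be written or maintained.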
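The SSR reward, explained next, compares a generated specification against a reference specification by logical strength. The following sketch, again hypothetical rather than taken from the paper, shows two specifications of an absolute-value routine in Dafny: the second is stronger because it demands less of callers and promises more about the result.

    // Both methods verify, but their specifications differ in strength.
    method AbsWeakSpec(x: int) returns (y: int)
      requires x != 0    // an unnecessary restriction on callers
      ensures y >= 0     // a loose guarantee about the result
    {
      if x < 0 { y := -x; } else { y := x; }
    }

    method AbsStrongSpec(x: int) returns (y: int)
      // no requires clause: every integer input is accepted
      ensures y >= 0 && (y == x || y == -x)   // the result is pinned down exactly
    {
      if x < 0 { y := -x; } else { y := x; }
    }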
SSR evaluates whether a model's generated specification is logically stronger than the ground truth, for example by imposing fewer constraints on inputs or providing tighter guarantees on outputs, thereby encouraging higher-quality, more precise code.

Results show that Re:Form models, ranging from 0.5B to 14B parameters, outperform both supervised fine-tuned models and closed-source models such as GPT-4o and Claude 4 in verification pass rate and SSR. The 14B model surpassed existing models on DafnyBench and DafnyComp, demonstrating strong performance on both in-domain and compositional tasks.

The team also found, surprisingly, that CoT reasoning is not necessary for the initial task of specification generation. While CoT may still help on complex problems, the model can produce correct specifications without it, suggesting that formal verification can guide learning without mimicking human thought.

Looking ahead, Fu's team aims to explore whether models can learn to generate CoT themselves, and to scale verification to much longer codebases, eventually reaching tens of thousands of lines. They envision integrating this technology into tools such as a CodeRabbit-style plugin that automatically verifies AI-generated code against user intent, ensuring correctness at the mathematical level.

The team has open-sourced their code, data, and models [2,3]. Yan Chuanhao, a PhD candidate from Tsinghua University's Yao Class, is the paper's first author, and Fu is the corresponding author. Their work represents a foundational step toward building AI systems that are not just smart, but inherently safe and verifiable.
