ScarfBench Evaluates AI Agents on Enterprise Java Framework Migration
IBM Research has released ScarfBench, an open benchmark for evaluating AI agents on enterprise-grade Java framework migration. As organizations modernize legacy systems to improve cloud readiness and developer productivity, AI-assisted coding tools have drawn significant attention. ScarfBench addresses an evaluation gap by measuring whether a migration actually works rather than whether the generated code looks right, giving a standardized way to assess whether AI can reliably handle complex enterprise modernization.

The benchmark focuses on cross-framework migrations among three major enterprise Java ecosystems: Spring, Jakarta EE, and Quarkus. Unlike traditional benchmarks that compare generated snippets against reference implementations, ScarfBench requires migrated applications to build, deploy, and pass behavioral validation. The dataset comprises 34 enterprise applications, 102 framework implementations, and 204 migration tasks, totaling approximately 151,000 lines of code and more than 2,000 expert-written verification tests.

Evaluations of state-of-the-art coding agents reveal significant limitations. Compile success rates remain relatively high, but deployment and behavioral success rates drop sharply, with top-tier agents achieving less than 10 percent overall behavioral accuracy. This drop-off from one stage to the next underscores a mismatch between traditional code-generation metrics and actual modernization requirements: agents frequently generate syntactically correct code that fails to integrate with build systems, runtime dependencies, or configuration frameworks.

Further analysis highlights systematic gaps in AI reasoning during complex migrations. Self-assessment proves unreliable: agents incorrectly report successful builds for the majority of whole-application migrations.
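To make the staged scoring and the self-assessment gap concrete, here is a minimal Python sketch. It is not ScarfBench's actual harness: the `Attempt` record, its field names, the task labels, and the four outcomes are all hypothetical, chosen only to show how a success rate can fall from compile to deploy to behavior, and how an agent's self-reported build status can be checked against what really happened.

```python
from dataclasses import dataclass


# Hypothetical record of one migration attempt. The field names are
# illustrative assumptions and are not taken from ScarfBench.
@dataclass(frozen=True)
class Attempt:
    task: str
    compiled: bool            # the build finished without errors
    deployed: bool            # the application started in its runtime
    tests_passed: bool        # all behavioral tests passed
    agent_claims_build: bool  # what the agent itself reported


def stage_rates(attempts):
    """Cumulative success rates: a stage counts only if every earlier stage succeeded."""
    total = len(attempts)
    if total == 0:
        raise ValueError("need at least one attempt")
    compiled = [a for a in attempts if a.compiled]
    deployed = [a for a in compiled if a.deployed]
    behaved = [a for a in deployed if a.tests_passed]
    return {
        "compile": len(compiled) / total,
        "deploy": len(deployed) / total,
        "behavior": len(behaved) / total,
    }


def false_build_claims(attempts):
    """Tasks where the agent reported a successful build that did not actually compile."""
    return [a.task for a in attempts if a.agent_claims_build and not a.compiled]


# Made-up outcomes for four tasks, for illustration only.
attempts = [
    Attempt("app-a: spring->quarkus", True, True, True, True),
    Attempt("app-b: spring->jakarta", True, True, False, True),
    Attempt("app-c: jakarta->quarkus", True, False, False, True),
    Attempt("app-d: quarkus->spring", False, False, False, True),
]

rates = stage_rates(attempts)
overclaimed = false_build_claims(attempts)
print(rates)        # {'compile': 0.75, 'deploy': 0.5, 'behavior': 0.25}
print(overclaimed)  # ['app-d: quarkus->spring']
```

Because each stage is gated on the previous one, the rates can only stay flat or fall, which is why a high compile rate says little about whether the migrated application behaves correctly.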
The migration process itself is iterative rather than linear: agents repeatedly revisit configuration artifacts and dependency layers to resolve framework-specific conflicts. Configuration management emerges as the primary bottleneck, consuming the majority of agent effort and frequently triggering cascading errors across services, databases, and web components. Environmental and tooling misconfigurations also consistently delay validation, even when source-code translation is largely complete.

The findings demonstrate that enterprise modernization extends far beyond syntax translation. Migrating between frameworks means handling infrastructure, dependency injection, and runtime behavior together. Current AI models can automate substantial portions of the code transformation but lack the architectural reasoning and independent validation needed for production-ready deployment. ScarfBench establishes a rigorous methodology for measuring this gap, shifting the focus from compilable output to functional, deployable systems.

The benchmark is publicly available. Researchers and practitioners can access the full dataset, evaluation leaderboard, and implementation guides through open-source repositories and hosting platforms, including GitHub, Hugging Face, and dedicated project portals. By providing a realistic stress test for AI-assisted modernization, ScarfBench aims to guide the development of more reliable automation tools and establish clear performance baselines for future enterprise software engineering workflows.
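As a quick check on the dataset figures reported for the benchmark, the short Python sketch below shows one reading under which they fit together. The reading is an assumption, not something stated in the announcement: each application is taken to exist in all three frameworks, and each ordered pair of distinct frameworks is taken to define one migration task per application.

```python
from itertools import permutations

FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")
NUM_APPLICATIONS = 34  # figure reported for the benchmark

# Assumption: every application is implemented in all three frameworks.
implementations = NUM_APPLICATIONS * len(FRAMEWORKS)

# Assumption: one task per ordered (source, target) pair of distinct frameworks.
directions = list(permutations(FRAMEWORKS, 2))
tasks = NUM_APPLICATIONS * len(directions)

print(implementations)   # 102
print(len(directions))   # 6
print(tasks)             # 204
```

Under these assumptions the arithmetic reproduces the reported 102 implementations and 204 migration tasks.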
