HyperAIHyperAI

Command Palette

Search for a command to run...

ABC-Bench: 실세계 개발에서 에이전트 기반 백엔드 코드 작성 성능 평가

초록

대규모 언어 모델(Large Language Models, LLMs)이 자율 에이전트로 진화하면서 인공지능 기반 코드 작성의 범위는 국지적인 코드 생성을 넘어, 복잡한 레포지토리 수준의 문제 해결 및 실행 기반의 작업으로 확장되었다. 그러나 현재의 평가 벤치마크는 주로 정적 환경에서의 코드 논리만을 평가하고 있어, 실제 엔지니어링 작업에서 요구되는 동적이고 전 과정에 걸친 요구사항, 특히 환경 구성과 서비스 배포에 엄격한 요구가 있는 백엔드 개발의 특성을 간과하고 있다. 이 격차를 보완하기 위해, 현실적인 실행 가능한 워크플로우 내에서 에이전트 기반 백엔드 코드 작성 능력을 명시적으로 평가할 수 있도록 설계된 ABC-Bench를 제안한다. 확장 가능한 자동화된 파이프라인을 활용해 오픈소스 레포지토리에서 8개의 프로그래밍 언어와 19개의 프레임워크를 대상으로 총 224개의 실용적인 과제를 수집하였다. 기존 평가와 달리, ABC-Bench는 에이전트가 레포지토리 탐색부터 컨테이너화된 서비스 생성, 외부 엔드투엔드 API 테스트까지 전체 개발 생애 주기(Development Lifecycle)를 관리하고 완료해야 하는 요구사항을 포함한다. 본 연구의 광범위한 평가 결과, 최첨단 모델들조차 이러한 통합적 과제에서 신뢰할 수 있는 성능을 제공하기 어려운 것으로 나타났으며, 현재 모델의 능력과 실제 백엔드 엔지니어링의 요구사항 사이에 상당한 격차가 존재함을 시사한다. 관련 코드는 https://github.com/OpenMOSS/ABC-Bench 에서 공개되어 있다.

One-sentence Summary

The authors from Fudan University, Shanghai Qiji Zhifeng Co., Ltd., and Shanghai Innovation Institute propose ABC-Bench, a novel benchmark evaluating agentic back-end coding through end-to-end, executable workflows that require full lifecycle management from code exploration to containerized service deployment across 8 languages and 19 frameworks, revealing significant gaps in current LLMs' real-world engineering capabilities despite state-of-the-art performance in static code tasks.

Key Contributions

  • Current benchmarks for AI coding agents focus on static, localized code edits and fail to capture the dynamic, end-to-end requirements of real-world back-end development, which involves environment configuration, service deployment, and integration testing.
  • ABC-Bench introduces a scalable, automated pipeline to generate 224 realistic backend tasks across 8 languages and 19 frameworks, requiring agents to complete the full development lifecycle—from repository exploration to deploying containerized services and passing external API tests.
  • Evaluation shows even top-performing models achieve only a 63.2% pass@1 rate, with environment setup and deployment emerging as critical bottlenecks, revealing a significant gap between current agent capabilities and practical engineering demands.

Introduction

The authors leverage the growing trend of LLMs as autonomous agents in software engineering to address a critical gap in evaluating backend development capabilities. While prior benchmarks focus on isolated code edits within static or sandboxed environments, they fail to capture the full, dynamic workflow of real-world backend engineering—where code changes must be integrated with environment setup, containerization, deployment, and end-to-end API validation. This disconnect limits the reliability of current evaluations, as models often succeed in logic but fail in execution due to misconfigurations or deployment issues. To bridge this gap, the authors introduce ABC-Bench, a scalable, production-like benchmark that requires agents to complete the entire backend development lifecycle—from repository analysis to running containerized services and passing external API tests. Built via an automated pipeline from 2,000 real open-source repositories, ABC-Bench includes 224 tasks across 8 languages and 19 frameworks, ensuring realistic diversity. Evaluation reveals that even top models achieve only a 63.2% pass rate, with environment configuration and deployment emerging as dominant failure points, highlighting a substantial capability gap between current agents and practical engineering demands.

Dataset

  • ABC-Bench consists of 224 tasks derived from 127 public GitHub repositories licensed under the MIT license, ensuring compliance with open-source terms and avoiding proprietary or unclear-licensed code.
  • The dataset spans 8 programming languages and 19 backend frameworks, with tasks covering real-world domains such as data analytics, search systems, commerce platforms, payment gateways, and developer tooling.
  • Of the 224 tasks, 132 focus on logic implementation within a pre-provisioned runtime, while 92 require autonomous environment configuration and containerized service startup, testing end-to-end operational capability.
  • Tasks were initially generated as 600 candidates and filtered to ensure balanced representation across languages, frameworks, and task types, resulting in a high-quality, diverse benchmark.
  • During construction, automated pipelines process repository snapshots, with sensitive credentials (e.g., API keys, tokens, private keys) detected and redacted to protect privacy. Personal or demographic data, logs, and offensive content are excluded.
  • The dataset is constructed without human annotation or crowdwork; only internal manual verification by the research team was performed on task titles and descriptions for quality assurance.
  • The authors use the dataset in a training and evaluation setup where tasks are split into training and test sets, with mixture ratios designed to reflect the distribution of languages, frameworks, and task types.
  • For tasks requiring containerization, the model must generate a Dockerfile in the project root directory that successfully builds and runs the application, with specific requirements including dependency installation and correct file structure.
  • Metadata is constructed automatically from repository and task-level information, with no external human involvement, and all data processing is designed to preserve technical fidelity while ensuring privacy and ethical compliance.

Method

The authors leverage the ABC-Pipeline, an automated workflow designed to generate realistic backend development tasks at scale by transforming open-source repositories into full-lifecycle benchmark tasks. The pipeline operates in three distinct phases: Repository Exploration, Environment Setup & Verification, and Task Instantiation.

In the first phase, Repository Exploration, the pipeline begins by filtering a pool of 2,000 open-source, MIT-licensed repositories to identify high-quality backend candidates. An autonomous agent then explores each repository to identify functional API groups, avoiding reliance on potentially incomplete or outdated existing tests. Instead, the agent proactively generates dedicated verification suites that assess both service connectivity and functional logic, which are later used to validate the model's solution. This phase also involves generating API test scripts, such as test_connect.py and test_func.py, to ensure the correctness of the service endpoints.

As shown in the figure below, the second phase, Environment Setup & Verification, focuses on synthesizing a deployable runtime environment. The agent analyzes the repository structure to resolve dependencies and generates the necessary container configuration files, including a Dockerfile. It then attempts to build the runtime image and launch the service within an isolated container. This phase ensures that the service can start up and listen on the expected ports, establishing a functional infrastructure for subsequent validation. The pipeline verifies the environment by executing a series of end-to-end API tests, such as health checks, to confirm that the service is operational and responsive.

In the final phase, Task Instantiation, the pipeline constructs the benchmark task using a masking-based strategy. For a selected API group, the agent formulates a natural language task instruction and synthesizes a solution patch representing the correct implementation. The pipeline then applies a reverse operation, selectively masking the implementation logic of the target endpoint to simulate a pre-implementation state. The resulting ABC-Task package encapsulates the masked repository, the task instructions, the environment setup files, and the verification suite. For tasks designated as environment configuration challenges, the pipeline removes all synthesized environment setup files from the final package and supplements the task instructions with explicit requirements for the model to autonomously configure the runtime environment. This ensures that the benchmark evaluates both code implementation and environment setup capabilities.

Experiment

  • ABC-Bench evaluates 224 full-lifecycle backend tasks, validating autonomous repository exploration, environment configuration, deployment, and API-based verification; ABC-Pipeline enables scalable task extraction from GitHub, lowering barriers to realistic dataset construction.
  • On ABC-Bench, Claude Sonnet 4.5 achieves 63.2% pass@1 across 224 tasks, while smaller models like Qwen3-8B score below 10%, highlighting the challenge of end-to-end software engineering.
  • Environment configuration is the primary bottleneck: models like GPT-5 and DeepSeek-V3.2 achieve S₁ (build success) below 50% despite strong S₂ (functional execution) scores above 80%, indicating a critical gap in setup and dependency management.
  • Performance varies significantly by language stack: Rust tasks are particularly challenging, with most models failing (e.g., DeepSeek-V3.2 at 0.0%), while only top proprietary models (Claude Sonnet 4.5, GPT-5) exceed 30% success.
  • A strong positive correlation (r = 0.87) exists between interaction depth and success, with top models averaging over 60 turns versus under 10 for weaker models, underscoring the need for sustained, iterative reasoning.
  • Agent framework choice significantly impacts performance: OpenHands enables peak performance (~50%) for DeepSeek-V3.2 and GPT-5, while mini-SWE-agent reduces GPT-5’s success to below 20%.
  • Agentic post-training via supervised fine-tuning on Nex-N1 data boosts pass@1 from 8.3% to 13.9% (Qwen3-8B) and from 8.9% to 33.8% (Qwen3-32B), demonstrating the value of high-quality agentic training data.
  • Task category analysis reveals domain-specific challenges: DevTools tasks are consistently difficult (top model at 47.8%), while Analytics and Specialized tasks are easier (Claude Sonnet 4.5 at 86.7%).
  • Error analysis shows environment-related failures (e.g., Path Missing, Dependency Missing) dominate, especially for smaller models; larger models exhibit more Logic Errors, indicating a shift in failure mode from syntax to high-level reasoning.

The authors use a standardized evaluation framework to assess models on ABC-Bench, measuring performance across programming languages and identifying environment configuration as a primary bottleneck. Results show that while proprietary models like Claude Sonnet 4.5 achieve high overall pass rates, open-source models struggle significantly, particularly in Rust and with smaller parameter scales, highlighting persistent challenges in full-lifecycle software development.

The authors analyze error types across models, revealing that smaller models like Qwen3-8B primarily fail due to basic issues such as syntax and path errors, while larger models like GPT-5 and Qwen3-Coder exhibit more complex logic errors, indicating a shift in failure modes as model scale increases.

Results show a strong positive correlation between the average number of agent interaction turns and the average pass@1 performance across models, with a correlation coefficient of 0.87. The authors use this relationship to highlight that successful full-lifecycle software development requires sustained, iterative interaction, as top-performing models like Claude Sonnet 4.5 exhibit significantly longer execution trajectories compared to weaker models.

The authors introduce ABC-Bench, a benchmark that evaluates models on full-lifecycle backend tasks, including exploration, code generation, environment configuration, deployment, and end-to-end verification. Unlike existing benchmarks, ABC-Bench covers all these stages comprehensively, as demonstrated by the inclusion of all four phases—Exploration, Code, Environment, and Deployment—in its evaluation framework.

The authors evaluate models on ABC-Bench, a benchmark of full-lifecycle backend tasks, using a standardized sandbox environment that requires agents to autonomously configure environments, deploy services, and pass functional API tests. Results show that environment configuration is the primary bottleneck, with models like GPT-5 and DeepSeek-V3.2 achieving high functional success rates but failing at the initial build stage, while smaller models like Qwen3-8B struggle significantly at both stages.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp
ABC-Bench: 실세계 개발에서 에이전트 기반 백엔드 코드 작성 성능 평가 | 문서 | HyperAI초신경