QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
Ali Slim Haydar Hamieh Jawad Kotaich Yehya Ghosn Mahdi Chehimi Ammar Mohanna Hasan Abed Al Kader Hammoud Bernard Ghanem
Abstract
Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within a single framework, which makes it difficult to separate quantum reasoning ability from familiarity with a particular framework. This work introduces QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq. QuanBench+ consists of 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. Models are evaluated with executable functional tests, reporting Pass@1 and Pass@5, with KL-divergence-based acceptance for probabilistic outputs. The study additionally examines Pass@1 after feedback-based repair, in which a model revises its code after a runtime error or a wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane. With feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but they also indicate that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
One-sentence Summary
To evaluate LLM-based quantum code generation, the authors introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq that uses 42 aligned tasks and executable functional tests to show that, although feedback-based repair substantially improves Pass@1 scores, reliable multi-framework quantum code generation remains an unsolved challenge.
Key Contributions
- This work introduces QuanBench+, a unified benchmark that evaluates quantum code generation across three distinct frameworks: Qiskit, PennyLane, and Cirq. The benchmark consists of 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation to differentiate between portable quantum reasoning and framework-specific knowledge.
- The researchers implement an executable functional evaluation method that defines correctness based on task success through measurement statistics. This approach utilizes Pass@k metrics and KL-divergence-based acceptance for probabilistic outputs to ensure that functionally equivalent but syntactically different circuits are correctly identified as valid.
- The study provides a comprehensive analysis of model performance through both one-shot generation and feedback-based repair. Results demonstrate that while one-shot scores reach up to 59.5% in Qiskit, the application of feedback-based repair significantly improves performance, with the highest scores rising to 83.3% in Qiskit, 76.2% in Cirq, and 66.7% in PennyLane.
Introduction
As quantum computing moves toward practical software applications, Large Language Models (LLMs) are increasingly used to automate code generation across various ecosystems like Qiskit, PennyLane, and Cirq. Current benchmarks typically focus on a single framework, which makes it difficult to determine if a model's failure stems from poor quantum reasoning or a simple lack of familiarity with a specific API. The authors introduce QuanBench+, a unified multi-framework benchmark that holds task intent constant across 42 aligned tasks to isolate these two failure modes. By utilizing executable functional tests and KL-divergence based acceptance for probabilistic outputs, the authors provide a standardized way to evaluate whether models possess portable quantum reasoning or merely framework-specific knowledge.
Dataset

The authors utilize QuanBench+, a dataset derived from the original QuanBench task set. The dataset is structured as follows:
- Composition and Categories: The benchmark consists of 42 tasks organized into three distinct categories: Quantum Algorithms, Gate Decomposition, and State Preparation.
- Sources and Adaptation: The authors adapted the original tasks to three specific quantum computing frameworks: Qiskit, PennyLane, and Cirq. This adaptation involved modifying prompts to align with framework-specific APIs and library conventions.
- Filtering and Refinement: To ensure reliable cross-framework grading, the authors removed two tasks from the original benchmark that did not meet the necessary criteria for consistent evaluation.
- Prompt Engineering and Processing:
  - All prompts were modified to ensure the correct libraries are imported for each respective framework.
  - A strict constraint was added to the beginning of each prompt requiring models to return code only. This removes accompanying explanations to improve execution efficiency and grading consistency.
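The prompt adaptation described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual prompt text: the constraint wording, the `FRAMEWORK_IMPORTS` table, and the `build_prompt` helper are all hypothetical names introduced here. Only the import statements themselves are standard for each library.

```python
# Hypothetical per-framework import hints; the benchmark's real prompts are not shown in this summary.
FRAMEWORK_IMPORTS = {
    "qiskit": "from qiskit import QuantumCircuit",
    "pennylane": "import pennylane as qml",
    "cirq": "import cirq",
}

# Assumed wording of the code-only constraint placed at the start of every prompt.
CODE_ONLY_CONSTRAINT = "Return only Python code, with no explanations or commentary."


def build_prompt(task_description: str, framework: str) -> str:
    """Prepend the code-only constraint and the framework's imports to a task description."""
    key = framework.lower()
    if key not in FRAMEWORK_IMPORTS:
        raise ValueError(f"unsupported framework: {framework!r}")
    return "\n\n".join(
        [
            CODE_ONLY_CONSTRAINT,
            f"Use the following imports:\n{FRAMEWORK_IMPORTS[key]}",
            task_description.strip(),
        ]
    )


prompt = build_prompt("Prepare a Bell state on two qubits.", "Cirq")
```

Keeping the task description identical and varying only the import hint is one way to hold task intent constant across the three frameworks.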
Method
The authors use a modular pipeline for evaluating quantum code generation across multiple quantum computing frameworks. The process begins with the selection of a framework to test (Qiskit, PennyLane, or Cirq), each of which is accessed via a unified API interface managed by OpenRouter. This setup allows for consistent interaction with different quantum development environments while abstracting low-level implementation differences.

As shown in the figure below, the selected framework is used to generate quantum code, which is then sent as API requests to the backend system. The responses are parsed to extract the generated code, which is subsequently executed in an isolated sandbox environment to ensure safety and reproducibility. The output is validated against canonical solutions to assess correctness. For deterministic tasks, validation involves checking whether the generated program satisfies a fixed correctness criterion under a predefined harness. For probabilistic tasks, correctness is determined by the agreement of measurement outcome distributions with a reference distribution.
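The response-parsing step can be illustrated with a short sketch. The summary does not say how the authors extract code from a model response, so the `extract_code` helper and its fallback behavior are assumptions. Sandboxed execution is deliberately left out of this example.

```python
import re

# The fence marker is built programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Matches a fenced block with an optional language tag and captures its body.
_BLOCK = re.compile(FENCE + r"[a-zA-Z0-9_+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the body of the first fenced block, or the whole response if none is found.

    Falling back to the raw response is an assumption that suits prompts
    asking the model to return code only.
    """
    match = _BLOCK.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()


reply = (
    "Here is the circuit:\n"
    + FENCE + "python\nimport cirq\nq = cirq.LineQubit(0)\n" + FENCE
    + "\nHope this helps."
)
code = extract_code(reply)
```

The extracted string would then be handed to whatever isolated runner the harness uses.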
To handle probabilistic tasks, the authors calibrate a global acceptance threshold based on the inherent variability of canonical circuit executions. For each task, the reference distribution is computed as the normalized mean of 1000 repeated executions of the canonical circuit. The within-canonical variability is quantified using the Kullback-Leibler (KL) divergence between the empirical distributions and the reference distribution. A global threshold is then derived from the 99.7th percentile of the pooled KL divergence values across all tasks, resulting in a threshold of τ=0.05, which is used consistently across the evaluation. This calibration ensures that the acceptance criterion accounts for natural shot noise and variability in quantum measurements.
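The calibration procedure can be sketched for a single task as follows. Several details are assumptions, because the summary does not specify them: the KL direction (empirical relative to reference), the shot count of 1024, the nearest-rank percentile, and the use of a classical sampler over a Bell-state distribution as a stand-in for the canonical circuit. The paper pools KL values across all tasks, so the threshold computed here for one toy task is not expected to equal τ=0.05.

```python
import math
import random
from collections import Counter


def normalize(counts, outcomes):
    """Turn raw counts into a probability distribution over the given outcomes."""
    total = sum(counts.get(o, 0) for o in outcomes)
    return {o: counts.get(o, 0) / total for o in outcomes}


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q); eps guards against log(0) when q lacks an outcome (assumption)."""
    return sum(
        pv * math.log(pv / max(q.get(o, 0.0), eps)) for o, pv in p.items() if pv > 0
    )


def sample_counts(probs, shots, rng):
    """Classical stand-in for executing a circuit and measuring `shots` times."""
    outcomes = list(probs)
    return Counter(rng.choices(outcomes, weights=[probs[o] for o in outcomes], k=shots))


def reference_distribution(runs, outcomes):
    """Normalized mean of the per-run empirical distributions."""
    dists = [normalize(r, outcomes) for r in runs]
    mean = {o: sum(d[o] for d in dists) / len(dists) for o in outcomes}
    z = sum(mean.values())
    return {o: v / z for o, v in mean.items()}


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


def accepts(candidate_counts, reference, tau):
    """Accept a candidate if its KL divergence from the reference is within tau."""
    outcomes = sorted(set(reference) | set(candidate_counts))
    return kl_divergence(normalize(candidate_counts, outcomes), reference) <= tau


rng = random.Random(0)
bell = {"00": 0.5, "11": 0.5}  # ideal Bell-state measurement distribution
runs = [sample_counts(bell, 1024, rng) for _ in range(1000)]
ref = reference_distribution(runs, list(bell))
kls = [kl_divergence(normalize(r, list(bell)), ref) for r in runs]
tau_single_task = percentile(kls, 99.7)
```

With a threshold in hand, `accepts(counts, ref, 0.05)` applies the global criterion to a candidate's measurement counts. A candidate that places all probability on a single outcome is rejected, while a fresh sample from the same distribution is accepted with high probability.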
Experiment
The QuanBench+ benchmark evaluates a diverse set of frontier and open-weight large language models on their ability to generate functional quantum code across three frameworks: Qiskit, Cirq, and PennyLane. The evaluation utilizes Pass@k metrics and compares one-shot generation against settings involving prompt prefilling and iterative feedback-based repair. Results reveal a significant asymmetry in framework difficulty, with Qiskit being the easiest and PennyLane the most challenging, suggesting that model performance is heavily tied to framework-specific API familiarity. While feedback loops effectively recover many surface-level implementation and interface errors, they do not fully close the performance gap or resolve the deeper semantic and reasoning mistakes that persist across all frameworks.
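The feedback-based repair setting can be pictured as a loop in which the harness returns the error message or failure verdict to the model. The sketch below is illustrative only: `generate` and `evaluate` are hypothetical callables, the repair prompt wording is invented, and `max_rounds` is an assumed parameter, since the summary does not state how many repair rounds are allowed. The stub evaluator uses `exec` for demonstration and is not sandboxed.

```python
from typing import Callable, Tuple


def evaluate_with_feedback(
    prompt: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 2,
) -> Tuple[bool, str, int]:
    """Generate code, then let the model revise it after each failed evaluation.

    Returns (passed, final_code, number_of_repairs_used).
    """
    code = generate(prompt)
    for round_idx in range(max_rounds + 1):
        passed, feedback = evaluate(code)
        if passed:
            return True, code, round_idx
        if round_idx < max_rounds:
            repair_prompt = (
                f"{prompt}\n\nPrevious attempt:\n{code}\n\n"
                f"Feedback:\n{feedback}\n\nReturn corrected code only."
            )
            code = generate(repair_prompt)
    return False, code, max_rounds


def stub_generate(prompt: str) -> str:
    """Stand-in model: fails on the first try, succeeds once it sees feedback."""
    return "result = 42" if "Feedback:" in prompt else "result = 1 / 0"


def stub_evaluate(code: str) -> Tuple[bool, str]:
    """Stand-in harness: reports runtime errors and wrong answers as feedback."""
    namespace = {}
    try:
        exec(code, namespace)  # demo only; a real harness would run this in a sandbox
    except Exception as exc:
        return False, f"Runtime error: {exc!r}"
    if namespace.get("result") != 42:
        return False, "Wrong answer: expected result == 42"
    return True, ""


passed, final_code, repairs = evaluate_with_feedback(
    "Set result to 42.", stub_generate, stub_evaluate
)
```

Feedback of this kind carries interface-level information such as exceptions and failed checks, which fits the observation that repair recovers surface errors more readily than deeper reasoning mistakes.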
The bar chart shows Pass@1 scores across different models and frameworks, with Qiskit achieving the highest performance and PennyLane the lowest. Feedback-based repair significantly improves scores across all frameworks, but the relative ranking and performance gap between frameworks remain consistent.
- Qiskit consistently achieves the highest Pass@1 scores across models.
- PennyLane consistently has the lowest Pass@1 scores across models.
- Feedback-based repair improves performance across all frameworks and reduces the gap between models.

The authors compare model performance across three quantum computing frameworks, showing consistent differences in difficulty. Results indicate that Qiskit is the easiest framework, PennyLane the hardest, and that performance varies significantly by model and framework. Qiskit consistently yields the highest performance across models, while PennyLane shows the lowest scores. Model rankings shift across frameworks, with no single model dominating all environments. Performance differences between frameworks persist even after feedback repair, indicating framework-specific challenges.

The heatmap visualizes the one-shot correctness of various models across tasks, showing that Qiskit generally yields higher accuracy than PennyLane and Cirq. Models exhibit varying performance across tasks, with some achieving broad success and others showing more scattered results.
- Qiskit consistently shows higher accuracy than PennyLane and Cirq across models.
- Model performance varies significantly across tasks.
- Stronger models demonstrate more consistent success across tasks.

The bar chart compares the Pass@1 accuracy of various models with the additional correctness achieved by Pass@5. Results show that most models gain significantly from generating multiple samples, with GPT 5.1 achieving the highest overall accuracy and Qwen 2.5 7B showing the smallest improvement from Pass@5.
- Pass@5 substantially improves accuracy for most models compared to Pass@1.
- GPT 5.1 achieves the highest Pass@1 and Pass@5 scores.
- Qwen 2.5 7B shows the smallest gain from Pass@5.
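For readers unfamiliar with Pass@k, the sketch below shows the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a task and c is the number that pass. The summary does not state which estimator the authors use, so treat this as a common convention and not as their exact procedure. The function names are introduced here for illustration.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn from n with c correct, passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_task: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, each given as (n_samples, n_correct)."""
    scores = [pass_at_k(n, c, k) for n, c in per_task]
    return sum(scores) / len(scores)


# Toy example: three tasks with five samples each and 0, 1, and 5 correct samples.
toy = [(5, 0), (5, 1), (5, 5)]
toy_pass1 = benchmark_pass_at_k(toy, 1)
toy_pass5 = benchmark_pass_at_k(toy, 5)
```

With a single sample per task, Pass@1 reduces to the fraction of tasks solved. On a 42-task benchmark, the reported 59.5% is consistent with 25 solved tasks (25/42 ≈ 0.595), although the summary does not report raw counts.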

The authors compare Pass@1 performance between no-prefill and prefill settings across multiple models. Results show that prefill consistently improves performance, with larger gains observed for models and frameworks where boilerplate code is more error-prone. The strongest models benefit less from prefill, suggesting it primarily reduces setup-related errors rather than core reasoning challenges.
- Prefill improves Pass@1 across all models, with larger gains for weaker models and frameworks with complex setup.
- The largest performance gains occur in frameworks where boilerplate is easy to miss.
- Stronger models benefit less from prefill, indicating it mainly addresses surface-level coding errors rather than reasoning.

These experiments evaluate model performance across different quantum computing frameworks, repair strategies, and prompting settings to identify key drivers of coding accuracy. The results demonstrate that Qiskit is consistently the most accessible framework while PennyLane presents the greatest difficulty, with performance gaps persisting regardless of feedback-based repairs. Additionally, while multiple sampling and prefill techniques significantly enhance accuracy by mitigating boilerplate errors and setup challenges, the most capable models show less sensitivity to these improvements.