RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation
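Abstract
Vision-Language Models (VLMs) have demonstrated impressive code generation capabilities across diverse domains. However, their ability to replicate complex, multi-panel visualizations built from real-world data remains largely unassessed. To address this gap, this paper introduces RealChart2Code, a new large-scale benchmark with over 2,800 instances that is grounded in authentic datasets and features tasks with clear analytical intent. Notably, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and to measure iterative code refinement in a multi-turn conversational setting. A comprehensive evaluation of 14 leading VLMs shows significant performance degradation compared to simpler existing benchmarks, exposing the limits of VLMs in handling complex plot structures and authentic data. The analysis reveals a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings offer important insight into the current limitations of VLMs and help guide future research directions. The benchmark and code are publicly available at https://github.com/Speakn0w/RealChart2Code.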
One-sentence Summary
Researchers from USTC, THU, CUHK, UCAS, CASIA, and other institutions introduce RealChart2Code, a large-scale benchmark evaluating Vision-Language Models on generating code for complex, multi-panel charts from authentic data. This work uniquely assesses iterative refinement in conversational settings, revealing significant performance gaps between proprietary and open-weight models.
Key Contributions
- The paper introduces RealChart2Code, a large-scale benchmark containing over 2,800 instances grounded in authentic datasets to systematically evaluate chart generation from raw data and iterative code refinement in multi-turn conversations.
- A comprehensive evaluation of 14 leading Vision-Language Models on this benchmark reveals significant performance degradation compared to simpler benchmarks, highlighting specific struggles with complex plot structures and authentic data.
- The analysis uncovers a substantial performance gap between proprietary and open-weight models, confirming that even state-of-the-art systems often fail to accurately replicate intricate, multi-panel charts.
Introduction
Vision-Language Models (VLMs) are increasingly used to generate code for data visualizations, a capability that allows users to recover and edit logic from static images. However, existing benchmarks rely on synthetic data or simple single-panel charts, failing to assess how well models handle complex multi-panel layouts derived from authentic, large-scale datasets. To address this gap, the authors introduce RealChart2Code, a large-scale benchmark featuring over 2,800 instances grounded in real-world data that evaluates both initial code generation and iterative refinement in a conversational setting. Their evaluation of 14 leading VLMs reveals that while models excel at simple tasks, they struggle significantly with intricate structures and real data, exposing a substantial performance gap between proprietary and open-weight systems.
Dataset
RealChart2Code Dataset Overview
The authors introduce RealChart2Code, a benchmark designed to evaluate Vision Language Models on complex, real-world chart-to-code generation tasks. The dataset moves beyond simple synthetic plots to challenge models with intricate multi-panel layouts and high information density derived from authentic data sources.
Dataset Composition and Sources
- The foundation consists of open-source datasets collected from Kaggle, strictly adhering to scientific research licensing.
- The curation process began with over 8,000 candidate datasets containing more than 100,000 files and 30 billion data rows.
- A two-stage filtering pipeline reduced this pool to 1,036 high-quality datasets, resulting in a final collection of 3,271 raw data files with approximately 860 million rows.
- The data spans eight high-level domains including Finance, Health, Research, and Technology, covering 35 fine-grained sub-topics.
Key Details for Each Subset
- Chart Replication (1,016 instances): The model receives only the chart image and must generate code to replicate it without access to the underlying data.
- Chart Reproduction (1,016 instances): The model is provided with both the chart image and the corresponding raw CSV data files to generate the code.
- Chart Refinement (864 instances): This subset involves a multi-turn dialogue where the model must debug and modify code to fix errors in a "flawed" chart based on user feedback.
- The benchmark includes 50 distinct chart types and 7 high-level visualization intents, ensuring a mix of common plots (e.g., bar charts) and specialized techniques (e.g., Sankey diagrams).
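As a quick cross-check of the subset sizes listed above, the three tasks can be written down as plain records and summed. This is only a bookkeeping sketch: the `Subset` class and the wording of its `inputs` field are illustrative paraphrases of the bullets, not part of the benchmark's released code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Subset:
    """One RealChart2Code task subset (illustrative record, not an official schema)."""
    name: str
    instances: int
    inputs: Tuple[str, ...]  # paraphrased from the subset descriptions above


SUBSETS = [
    Subset("Chart Replication", 1016, ("chart image",)),
    Subset("Chart Reproduction", 1016, ("chart image", "raw CSV data files")),
    Subset("Chart Refinement", 864, ("flawed chart and its code", "user feedback over multiple turns")),
]

# 1,016 + 1,016 + 864 instances, consistent with the "over 2,800" figure.
total_instances = sum(s.instances for s in SUBSETS)

for s in SUBSETS:
    print(f"{s.name}: {s.instances} instances, inputs = {', '.join(s.inputs)}")
print(f"Total: {total_instances}")
```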
Data Usage and Processing
- The authors constructed 1,016 unique visualization scenarios from the curated datasets, which serve as the basis for the Replication and Reproduction tasks.
- Ground-truth code was manually implemented by a team of five expert Python developers using Matplotlib, pandas, and NumPy to ensure high-quality, idiomatic, and executable solutions.
- For the Refinement subset, the authors manually injected diverse errors into the ground-truth code, including visual styling issues, data mapping mistakes, and incorrect chart types.
- The dataset is used to evaluate models on their ability to perceive visual details, interpret data, and perform iterative code editing.
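To make the three error categories concrete, the sketch below applies one hypothetical edit per category to a toy ground-truth script and lists the lines that change. The authors injected errors manually; the toy script, the column names (`region`, `revenue`, `cost`), and the `inject` helper are all assumptions made for illustration. The scripts are handled as strings only and are never executed.

```python
import difflib

# Toy ground-truth plotting script (hypothetical; assumes a DataFrame `df` exists).
GROUND_TRUTH = '''\
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(df["region"], df["revenue"], color="tab:blue")
ax.set_ylabel("Revenue")
'''

# One illustrative edit per error category named in the text: (old, new).
INJECTIONS = {
    "visual styling": ('color="tab:blue"', 'color="tab:red"'),
    "data mapping": ('df["revenue"]', 'df["cost"]'),
    "chart type": ("ax.bar(", "ax.plot("),
}


def inject(code: str, old: str, new: str) -> str:
    """Replace the first occurrence of `old`; fail loudly if it is absent."""
    if old not in code:
        raise ValueError(f"pattern not found: {old!r}")
    return code.replace(old, new, 1)


def changed_lines(original: str, flawed: str) -> list:
    """Return only the removed/added lines of a unified diff, without headers."""
    diff = difflib.unified_diff(
        original.splitlines(), flawed.splitlines(), lineterm="", n=0
    )
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


flawed_variants = {
    kind: inject(GROUND_TRUTH, old, new) for kind, (old, new) in INJECTIONS.items()
}

for kind, flawed in flawed_variants.items():
    print(kind)
    for line in changed_lines(GROUND_TRUTH, flawed):
        print("   ", line)
```

Each variant differs from the ground truth in a single line, which is the kind of localized, visible flaw a refinement instruction could then ask the model to fix.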
Cropping, Metadata, and Quality Control
- No specific image cropping strategy is mentioned; the focus is on preserving the full complexity of multi-panel layouts and composite charts.
- Metadata construction involves strict adherence to data schemas, ensuring that column names, data types, and file paths in the prompts match the provided CSV files exactly.
- A rigorous multi-stage quality control protocol was applied, including automated execution checks in a sandbox environment and visual fidelity reviews by independent experts.
- For refinement tasks, a triple-verification strategy ensured that injected errors were clearly visible in the rendered images and that the correction instructions were logically solvable.
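Two of the checks described above, schema matching and automated execution, can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions: the function names, the timeout, and the sample file are hypothetical, and running a script in a subprocess inside a temporary directory is not a real sandbox, only a stand-in for one.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_cleanly(code: str, workdir: Path, timeout_s: float = 30.0) -> bool:
    """Run `code` as a script in `workdir`; True if it exits with status 0 in time."""
    script = workdir / "candidate.py"
    script.write_text(code, encoding="utf-8")
    try:
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def missing_columns(csv_path: Path, expected: list) -> list:
    """Return the expected column names that are absent from the CSV header."""
    with csv_path.open(newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return [name for name in expected if name not in header]


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    # Hypothetical data file standing in for one of the benchmark's CSV files.
    (workdir / "sales.csv").write_text("region,revenue\nEU,10\nUS,12\n", encoding="utf-8")

    ok = runs_cleanly("print(sum([10, 12]))", workdir)
    broken = runs_cleanly("raise RuntimeError('boom')", workdir)
    missing = missing_columns(workdir / "sales.csv", ["region", "revenue", "year"])

print(ok, broken, missing)
```

A production setup would add real isolation (containers, resource limits, no network) and would also compare data types and file paths, which this sketch leaves out.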
Method
The authors define the chart-to-code task as a conditional code generation problem. Formally, given a source chart image V and an accompanying prompt P, a Large Language Model (LLM), denoted by F(⋅), must generate an executable code snippet C. This code must render a visualization that accurately reproduces the visual and structural elements of V while adhering to any requirements in P. The task is formulated as C=F(V,P).
The framework evaluates models on three distinct variants of this core task, as illustrated in the figure below.
The first variant, Chart Replication, represents the fundamental chart-to-code task where the model must reverse-engineer the visualization from the image alone. This setup measures the core visual-to-code translation ability without external data support. The second variant, Chart Reproduction, provides the model with the chart image, raw data, and metadata. This assesses the capability to generate the correct plot using large-scale, real-world data sources. For this task, the Data Pattern Consistency metric is replaced with Data Alignment, which performs code-level verification to ensure computational correctness rather than visual similarity.
The third variant, Chart Refinement, requires the model to correct a chart with predefined errors through a multi-turn dialogue. This assesses the ability to perform iterative debugging based on user instructions. The process involves analyzing the chart image, interpreting specific refinement instructions, and generating corrected code to produce a refined chart. The model must identify the chart structure, understand the current state including errors, and apply corrections precisely as instructed while maintaining all other visual properties.
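The formulation C=F(V,P) and its three variants can be sketched as a single callable plus a prompt builder. Everything named here (`Model`, `build_prompt`, `refine`, `stub_model`, and the prompt wording) is a hypothetical illustration rather than the benchmark's actual harness, and passing the same image on every refinement turn is a simplifying assumption.

```python
from typing import Callable, List, Optional

# F in C = F(V, P): maps (chart image bytes V, prompt P) to generated code C.
Model = Callable[[bytes, str], str]


def build_prompt(
    task: str,
    data_description: Optional[str] = None,  # Reproduction: raw data and metadata
    current_code: Optional[str] = None,      # Refinement: code to be corrected
    instruction: Optional[str] = None,       # Refinement: user feedback for this turn
) -> str:
    """Assemble the prompt P from whichever inputs the task variant provides."""
    parts = [f"Task: {task}"]
    if data_description is not None:
        parts.append(f"Data files and schema:\n{data_description}")
    if current_code is not None:
        parts.append(f"Current code:\n{current_code}")
    if instruction is not None:
        parts.append(f"Instruction: {instruction}")
    return "\n\n".join(parts)


def refine(model: Model, image: bytes, flawed_code: str, instructions: List[str]) -> List[str]:
    """Multi-turn refinement: each turn edits the code produced by the previous turn."""
    code = flawed_code
    history = []
    for instruction in instructions:
        prompt = build_prompt("refinement", current_code=code, instruction=instruction)
        code = model(image, prompt)
        history.append(code)
    return history


def stub_model(image: bytes, prompt: str) -> str:
    """Stand-in for a real VLM: returns a placeholder naming the prompt's first line."""
    return "# code for: " + prompt.splitlines()[0]


replication_code = stub_model(b"", build_prompt("replication"))
reproduction_code = stub_model(
    b"", build_prompt("reproduction", data_description="sales.csv: region (str), revenue (int)")
)
refinement_turns = refine(stub_model, b"", "x = 1", ["fix the legend", "use a bar chart"])
print(replication_code)
print(len(refinement_turns))
```

Keeping the model behind a plain callable means the same loop can be reused for any system under test; only `stub_model` would be swapped for a real model call.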
Experiment
- Evaluating 14 models on the RealChart2Code benchmark shows that proprietary models such as Claude-4.5-Opus lead in performance and that a significant capability gap separates them from open-source models on complex, real-world visualization tasks.
- Cross-benchmark analysis demonstrates that high scores on simpler existing benchmarks do not guarantee success on RealChart2Code, revealing a "Complexity Gap" where model performance drops drastically when facing authentic data-driven scenarios.
- Reliability testing confirms that the proposed multi-agent judging framework achieves high consistency and strong alignment with human expert evaluations, ensuring a robust and discriminative assessment of visual quality.
- Error analysis identifies distinct failure patterns where open-weight models frequently suffer from syntax hallucinations and spatial reasoning deficits, whereas proprietary models primarily struggle with data mapping accuracy and maintaining global consistency during iterative refinement.
- Case studies highlight systematic weaknesses in handling hierarchical layouts, composite chart structures, and global canvas scaling, indicating that current models lack the advanced spatial planning and semantic grouping required for professional-grade visualization generation.