HyperAI

RealChart2Code: Advancing Chart-to-Code Generation with Real-World Data and Multi-Task Evaluation

Abstract

Vision-Language Models (VLMs) have demonstrated impressive code generation capabilities across diverse domains. However, their ability to recreate complex, multi-panel visualizations grounded in real-world data remains largely unevaluated. To bridge this gap, we present RealChart2Code, a new large-scale benchmark comprising over 2,800 instances grounded in authentic datasets, with tasks carrying clear analytical intent. Notably, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and to assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on RealChart2Code reveals a marked performance drop compared to simpler benchmarks, highlighting their difficulties with complex plot structures and authentic data. Our analyses uncover a substantial performance gap between proprietary and open-weight models, confirming that even state-of-the-art VLMs often fail to accurately recreate complex multi-panel charts. These findings offer valuable insights into the current limitations of VLMs and point to future research directions. We release the benchmark and code at: https://github.com/Speakn0w/RealChart2Code.

One-sentence Summary

Researchers from USTC, THU, CUHK, UCAS, CASIA, and other institutions introduce RealChart2Code, a large-scale benchmark evaluating Vision-Language Models on generating code for complex, multi-panel charts from authentic data. This work uniquely assesses iterative refinement in conversational settings, revealing significant performance gaps between proprietary and open-weight models.

Key Contributions

  • The paper introduces RealChart2Code, a large-scale benchmark containing over 2,800 instances grounded in authentic datasets to systematically evaluate chart generation from raw data and iterative code refinement in multi-turn conversations.
  • A comprehensive evaluation of 14 leading Vision-Language Models on this benchmark reveals significant performance degradation compared to simpler benchmarks, highlighting specific struggles with complex plot structures and authentic data.
  • The analysis uncovers a substantial performance gap between proprietary and open-weight models, confirming that even state-of-the-art systems often fail to accurately replicate intricate, multi-panel charts.

Introduction

Vision-Language Models (VLMs) are increasingly used to generate code for data visualizations, a capability that allows users to recover and edit logic from static images. However, existing benchmarks rely on synthetic data or simple single-panel charts, failing to assess how well models handle complex multi-panel layouts derived from authentic, large-scale datasets. To address this gap, the authors introduce RealChart2Code, a large-scale benchmark featuring over 2,800 instances grounded in real-world data that evaluates both initial code generation and iterative refinement in a conversational setting. Their evaluation of 14 leading VLMs reveals that while models excel at simple tasks, they struggle significantly with intricate structures and real data, exposing a substantial performance gap between proprietary and open-weight systems.

Dataset

RealChart2Code Dataset Overview

The authors introduce RealChart2Code, a benchmark designed to evaluate Vision Language Models on complex, real-world chart-to-code generation tasks. The dataset moves beyond simple synthetic plots to challenge models with intricate multi-panel layouts and high information density derived from authentic data sources.

  • Dataset Composition and Sources

    • The foundation consists of open-source datasets collected from Kaggle, strictly adhering to scientific research licensing.
    • The curation process began with over 8,000 candidate datasets containing more than 100,000 files and 30 billion data rows.
    • A two-stage filtering pipeline reduced this pool to 1,036 high-quality datasets, resulting in a final collection of 3,271 raw data files with approximately 860 million rows.
    • The data spans eight high-level domains including Finance, Health, Research, and Technology, covering 35 fine-grained sub-topics.
  • Key Details for Each Subset

    • Chart Replication (1,016 instances): The model receives only the chart image and must generate code to replicate it without access to the underlying data.
    • Chart Reproduction (1,016 instances): The model is provided with both the chart image and the corresponding raw CSV data files to generate the code.
    • Chart Refinement (864 instances): This subset involves a multi-turn dialogue where the model must debug and modify code to fix errors in a "flawed" chart based on user feedback.
    • The benchmark includes 50 distinct chart types and 7 high-level visualization intents, ensuring a mix of common plots (e.g., bar charts) and specialized techniques (e.g., Sankey diagrams).
  • Data Usage and Processing

    • The authors constructed 1,016 unique visualization scenarios from the curated datasets, which serve as the basis for the Replication and Reproduction tasks.
    • Ground-truth code was manually implemented by a team of five expert Python developers using Matplotlib, pandas, and NumPy to ensure high-quality, idiomatic, and executable solutions.
    • For the Refinement subset, the authors manually injected diverse errors into the ground-truth code, including visual styling issues, data mapping mistakes, and incorrect chart types.
    • The dataset is used to evaluate models on their ability to perceive visual details, interpret data, and perform iterative code editing.
  • Cropping, Metadata, and Quality Control

    • No specific image cropping strategy is mentioned; the focus is on preserving the full complexity of multi-panel layouts and composite charts.
    • Metadata construction involves strict adherence to data schemas, ensuring that column names, data types, and file paths in the prompts match the provided CSV files exactly.
    • A rigorous multi-stage quality control protocol was applied, including automated execution checks in a sandbox environment and visual fidelity reviews by independent experts.
    • For refinement tasks, a triple-verification strategy ensured that injected errors were clearly visible in the rendered images and that the correction instructions were logically solvable.
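The automated execution checks mentioned in the quality-control protocol can be illustrated with a minimal sketch. The function name and details below are hypothetical (the paper does not publish its sandbox implementation); the idea is simply to run each candidate script in an isolated subprocess with a timeout and treat a clean exit as passing:

```python
import os
import subprocess
import sys
import tempfile

def passes_execution_check(code: str, timeout: float = 30.0) -> bool:
    """Run a candidate code snippet in a subprocess and report whether it exits cleanly.

    A minimal sandbox sketch: write the snippet to a temporary file, execute it
    with the current Python interpreter, and fail on any non-zero exit or timeout.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```

A real harness would additionally restrict filesystem and network access and verify that the script actually saves a rendered figure, but the pass/fail skeleton is the same.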

Method

The authors define the chart-to-code task as a conditional code generation problem. Formally, given a source chart image $V$ and an accompanying prompt $P$, a Large Language Model (LLM), denoted by $\mathcal{F}(\cdot)$, must generate an executable code snippet $C$. This code must render a visualization that accurately reproduces the visual and structural elements of $V$ while adhering to any requirements in $P$. The task is formulated as $C = \mathcal{F}(V, P)$.
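The formulation above can be sketched as a simple Python interface. All names here (`ChartToCodeInstance`, `generate_chart_code`) are illustrative assumptions, not the authors' code; the model is abstracted as any callable that maps an image, prompt, and optional data files to a code string:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ChartToCodeInstance:
    """One benchmark instance: the chart image V and prompt P (plus raw CSVs for Reproduction)."""
    image_path: str                                   # source chart image V
    prompt: str                                       # accompanying prompt P
    data_files: List[str] = field(default_factory=list)  # raw CSV paths (Reproduction only)

def generate_chart_code(model: Callable[..., str], instance: ChartToCodeInstance) -> str:
    """C = F(V, P): query the model for an executable code snippet."""
    return model(
        image=instance.image_path,
        prompt=instance.prompt,
        data=instance.data_files,
    )
```

For Chart Replication the `data_files` list is empty (image only), while Chart Reproduction also supplies the raw CSVs and metadata.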

The framework evaluates models on three distinct variants of this core task, as illustrated in the figure below.

The first variant, Chart Replication, represents the fundamental chart-to-code task where the model must reverse-engineer the visualization from the image alone. This setup measures the core visual-to-code translation ability without external data support. The second variant, Chart Reproduction, provides the model with the chart image, raw data, and metadata. This assesses the capability to generate the correct plot using large-scale, real-world data sources. For this task, the Data Pattern Consistency metric is replaced with Data Alignment, which performs code-level verification to ensure computational correctness rather than visual similarity.

The third variant, Chart Refinement, requires the model to correct a chart with predefined errors through a multi-turn dialogue. This assesses the ability to perform iterative debugging based on user instructions. The process involves analyzing the chart image, interpreting specific refinement instructions, and generating corrected code to produce a refined chart. The model must identify the chart structure, understand the current state including errors, and apply corrections precisely as instructed while maintaining all other visual properties.
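The multi-turn refinement loop can be sketched as follows. This is a hypothetical harness (the function name and message schema are assumptions, not the paper's implementation): each turn appends the user's correction instruction to the dialogue history, queries the model for revised code, and records the model's reply before the next turn:

```python
from typing import Callable, Dict, List

def refinement_dialogue(
    model: Callable[[List[Dict[str, str]]], str],
    flawed_code: str,
    feedback_turns: List[str],
) -> str:
    """Iteratively revise chart code over a multi-turn dialogue.

    `model` takes the full message history and returns a corrected code snippet.
    """
    messages = [{"role": "user", "content": "Here is the current chart code:\n" + flawed_code}]
    code = flawed_code
    for feedback in feedback_turns:
        messages.append({"role": "user", "content": feedback})   # user's correction instruction
        code = model(messages)                                   # model proposes corrected code
        messages.append({"role": "assistant", "content": code})  # keep history for next turn
    return code
```

The key point the benchmark tests is that each correction must be applied precisely while all other visual properties are preserved across turns.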

Experiment

  • Evaluation of 14 leading VLMs on the RealChart2Code benchmark shows that while proprietary models like Claude-4.5-Opus lead in performance, a significant capability gap separates them from open-weight models on complex, real-world visualization tasks.
  • Cross-benchmark analysis demonstrates that high scores on simpler existing benchmarks do not guarantee success on RealChart2Code, revealing a "Complexity Gap" where model performance drops drastically when facing authentic data-driven scenarios.
  • Reliability testing confirms that the proposed multi-agent judging framework achieves high consistency and strong alignment with human expert evaluations, ensuring robust and discriminatory assessment of visual quality.
  • Error analysis identifies distinct failure patterns where open-weight models frequently suffer from syntax hallucinations and spatial reasoning deficits, whereas proprietary models primarily struggle with data mapping accuracy and maintaining global consistency during iterative refinement.
  • Case studies highlight systematic weaknesses in handling hierarchical layouts, composite chart structures, and global canvas scaling, indicating that current models lack the advanced spatial planning and semantic grouping required for professional-grade visualization generation.
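The consistency property of the multi-agent judging framework can be illustrated with a small aggregation sketch. This is an assumption-labeled toy (the paper does not specify its aggregation rule): several judges score the same output, the scores are averaged, and large disagreement between judges is flagged for review:

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple

def aggregate_judge_scores(
    scores: Sequence[float],
    consistency_threshold: float = 1.0,
) -> Tuple[float, bool]:
    """Average multiple judges' scores and flag low inter-judge consistency.

    Returns (mean score, True if the population std. dev. is within the threshold).
    """
    avg = mean(scores)
    consistent = len(scores) < 2 or pstdev(scores) <= consistency_threshold
    return avg, consistent
```

A real framework would also measure alignment against human expert ratings; this sketch only captures the inter-judge agreement check.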
