HyperAIHyperAI

Command Palette

Search for a command to run...

AutoFigure: توليد وتحسين الرسوم التوضيحية العلمية القابلة للنشر في المجلات العلمية التوضيحات:

  • "AutoFigure": تم الاحتفاظ بالاسم الإنجليزي كما طلبت.
  • "Generating and Refining": "توليد وتحسين" هي الترجمة الأكاديمية الدقيقة والمباشرة لهذه الأفعال في سياق معالجة الصور والرسومات.
  • "Publication-Ready": "القابلة للنشر" هو المصطلح المعتمد في الأوساط الأكاديمية للإشارة إلى الجاهزية للنشر في دوريات محكمة.
  • "Scientific Illustrations": "الرسوم التوضيحية العلمية" هو المصطلح القياسي في الأدبيات العلمية.

Minjun Zhu Zhen Lin Yixuan Weng Panzhong Lu Qiujie Xie Yifan Wei Sifan Liu Qiyao Sun Yue Zhang

الملخص

تُعد التصورات العلمية عالية الجودة أمراً بالغ الأهمية لنقل المفاهيم العلمية والتقنية المعقدة بفعالية. ومع ذلك، لا يزال إنشاؤها يدوياً يُعد نقطة ضعف معترفاً بها على نطاق واسع في كل من الأوساط الأكاديمية والصناعية. نقدم في هذا العمل "فيغوربينش" (FigureBench)، وهو أول معيار واسع النطاق لتوليد التصورات العلمية من نصوص علمية طويلة. يحتوي المعيار على 3,300 زوجاً من النصوص والشكولات العلمية عالية الجودة، وتغطي مهام متنوعة من تحويل النص إلى صور مستمدة من الأوراق البحثية، والمقالات الاستعراضية، والمدونات، والكتب الدراسية. علاوة على ذلك، نقترح إطار عمل "أوتو فيغور" (AUTOFIGURE)، وهو أول إطار عمل قائم على الوكلاء (Agentic) لتوليد تلقائي لتصورات علمية عالية الجودة استناداً إلى نصوص علمية طويلة. بشكل أكثر تحديداً، قبل عرض النتيجة النهائية، ينخرط "أوتو فيغور" في عمليات تفكير عميقة، وإعادة تركيب، وتحقق دقيق لإنتاج تخطيط يتميز بالسلامة الهيكلية والرقة الجمالية، مما ينتج عنه تصور علمي يحقق كلًا من الاكتمال الهيكلي والجاذبية البصرية. بالاستفادة من البيانات عالية الجودة المقدمة في "فيغوربينش" (FigureBench)، أجرينا تجارب مكثفة لاختبار أداء "أوتو فيغور" (AUTOFIGURE) مقارنةً بأساليب الأساس المختلفة (baseline methods). تُظهر النتائج أن "أوتو فيغور" (AUTOFIGURE) يتفوق باستمرار على جميع أساليب الأساس، منتجاً تصورات علمية جاهزة للنشر. تم إطلاق الكود البرمجي، ومجموعة البيانات، والحيز على منصة Huggingface عبر الرابط: https://github.com/ResearAI/AutoFigure.

One-sentence Summary

The authors present AUTOFIGURE, the first agentic framework generating publication-ready scientific illustrations from long-form scientific text through extensive thinking, recombination, and validation, alongside FigureBench, a large-scale benchmark containing 3,300 high-quality scientific text-figure pairs, where experiments demonstrate that AUTOFIGURE consistently surpasses all baseline methods, with code and resources publicly released.

Key Contributions

  • This work introduces FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form texts, which contains 3,300 high-quality text-figure pairs from diverse sources. The dataset supports various text-to-illustration tasks and provides a foundation for training and evaluating automated scientific visualization tools.
  • The paper presents AUTOFIGURE, an agentic framework that automatically generates high-quality scientific illustrations by engaging in extensive thinking and validation before rendering the final result. This process ensures the output achieves both structural completeness and aesthetic appeal required for academic publishing.
  • Extensive experiments grounded in VLM-as-a-judge paradigms and human expert assessments show that the method consistently surpasses baseline methods in producing publication-ready illustrations. The associated code, dataset, and Hugging Face space are released to enable further research into AI-driven scientific communication.

Introduction

High-quality scientific illustrations are essential for communicating complex concepts, yet their manual creation remains a major bottleneck for both human researchers and emerging AI scientists. Existing automated approaches typically rely on short captions or rearrange existing content, while mainstream text-to-image models struggle to maintain structural fidelity when processing long-form scientific texts. To overcome these challenges, the authors present FigureBench, the first large-scale benchmark designed for generating illustrations from long-form scientific documents. They further develop AutoFigure, an agentic framework that employs a Reasoned Rendering paradigm to separate layout planning from aesthetic rendering, ensuring the final output meets publication standards for both accuracy and visual appeal.

Dataset

  • Dataset Composition and Sources
    • The authors introduce FigureBench, a benchmark constructed from four distinct sources: research papers, surveys, technical blogs, and textbooks.
    • All data strictly adheres to open-source licenses, with papers sourced from the Research-14K dataset and textbooks drawn from platforms like OpenStax.
  • Key Details for Each Subset
    • The Test Set consists of 300 high-quality samples.
    • 200 samples originate from the Research-14K dataset. GPT-5 selected illustrations representing core methodology, and two independent annotators retained only conceptual figures with unanimous approval.
    • An additional 100 samples were manually curated from AI surveys on arXiv, technical blogs, and educational textbooks to ensure domain diversity.
    • The Development Set contains 3,000 samples. The authors fine-tuned a vision-language model on the Test Set to serve as an automated filter for the larger Research-14K corpus.
  • Data Usage and Training Splits
    • The Test Set is reserved strictly for evaluation purposes.
    • The Development Set is designed for training and development, though the proposed AUTOFIGURE pipeline operates as an inference-only system.
    • The authors provide the Development Set to facilitate future exploration of end-to-end or trainable methods by the community.
  • Processing and Metadata Construction
    • The curation process filters out data-driven charts to focus on conceptual illustrations where visual elements are explicitly described in the text.
    • Statistical metadata includes text density, color counts, and structural components, calculated automatically using the InternVL 3.5 model.
    • A human sanity check on 21 random samples verified the reliability of these automated measurements against manual annotations.

Method

The authors introduce AutoFigure, a decoupled generative paradigm designed to produce high-fidelity scientific illustrations by separating the reasoning and rendering processes. This approach addresses the challenge of creating semantically accurate and visually coherent figures through a three-stage pipeline.

The overall framework is illustrated in the architecture diagram below, which outlines the transition from unstructured text to a final publication-ready illustration.

Stage I: Conceptual Grounding and Layout Generation Given a long-form scientific document TTT, the first stage produces a machine-readable symbolic layout S0S_0S0 (e.g., SVG/HTML) and a style descriptor A0A_0A0. A Concept-Extraction agent processes the input text to distill a methodology summary TmethodT_{\text{method}}Tmethod and identifies entities and relations to be visualized as nodes and directed edges. This structure is serialized into a markup-based symbolic layout S0S_0S0, where S0S_0S0 encodes a directed graph G0=(V0,E0)G_0 = (V_0, E_0)G0=(V0,E0).

Stage II: Critique-and-Refine The core innovation of the method lies in a self-refinement loop that simulates a dialogue between an AI "designer" and an AI "critic" to optimize the blueprint for structural coherence and logical consistency. The initial layout (S0,A0)(S_0, A_0)(S0,A0) is evaluated to get an initial score q0q_0q0, setting the current best version (Sbest,Abest)(S_{best}, A_{best})(Sbest,Abest). In each iteration iii, the system generates a superior solution through the following process:

Fbest(i)=Feedback(Φcritic(Sbest,Abest))(Scand(i),Acand(i))=Φgen(Tmethod,Fbest(i)),\begin{array} { r } { F _ { b e s t } ^ { ( i ) } = \mathrm { F e e d b a c k } \big ( \Phi _ { c r i t i c } ( S _ { b e s t } , A _ { b e s t } ) \big ) } \\ { ( S _ { c a n d } ^ { ( i ) } , A _ { c a n d } ^ { ( i ) } ) = \Phi _ { g e n } \big ( T _ { m e t h o d } , F _ { b e s t } ^ { ( i ) } \big ) , } \end{array}Fbest(i)=Feedback(Φcritic(Sbest,Abest))(Scand(i),Acand(i))=Φgen(Tmethod,Fbest(i)),

The critic Φcritic\Phi_{\text{critic}}Φcritic evaluates the best-performing layout for alignment, balance, and overlap avoidance, producing textual feedback Fbest(i)F_{best}^{(i)}Fbest(i). The generator Φgen\Phi_{gen}Φgen then uses this feedback to reinterpret TmethodT_{method}Tmethod and produce a candidate layout (Scand(i),Acand(i))(S_{cand}^{(i)}, A_{cand}^{(i)})(Scand(i),Acand(i)). If the candidate score qcand(i)q_{cand}^{(i)}qcand(i) exceeds the current best, it replaces the best version. This loop continues until convergence or a preset iteration limit, yielding a logically consistent and aesthetically balanced conditioning layout (Sfinal,Afinal)(S_{final}, A_{final})(Sfinal,Afinal).

Stage III: Rendering Strategy The final stage translates the optimized symbolic blueprint into a high-fidelity illustration IfinalI_{final}Ifinal. A transformation function Φprompt\Phi_{prompt}Φprompt converts the layout and style into an exhaustive text-to-image prompt, which is fed into a multimodal generative model to render an image IpolishedI_{polished}Ipolished. To ensure textual accuracy, the system employs an erase-and-correct process. A non-LLM eraser Φerase\Phi_{\text{erase}}Φerase removes text pixels from IpolishedI_{\text{polished}}Ipolished to create a clean background IerasedI_{\text{erased}}Ierased. An OCR engine Φocr\Phi_{\text{ocr}}Φocr extracts preliminary strings, and a multimodal verifier Φverify\Phi_{\text{verify}}Φverify aligns these with ground-truth labels TgtT_{\text{gt}}Tgt parsed from SfinalS_{\text{final}}Sfinal. Finally, the corrected text map TcorrT_{\text{corr}}Tcorr is rendered as vector-text overlays on top of IerasedI_{\text{erased}}Ierased to obtain the final illustration.

The effectiveness of this pipeline is demonstrated by comparing the generated output against other methods. As shown in the figure below, AutoFigure produces a high-fidelity scientific diagram with crisp details and a pleasing color scheme, outperforming alternatives like SVG generation or direct diagram agents.

Experiment

This study evaluates AUTOFIGURE through automated benchmarks on FigureBench, a domain expert study involving paper authors, and controlled ablation experiments to isolate key modules. The framework consistently outperforms baseline methods by effectively balancing scientific accuracy with visual aesthetics, resulting in figures that experts consider suitable for publication. Qualitative case studies highlight superior handling of complex scientific structures compared to existing approaches, while ablation analyses validate the necessity of symbolic layouts and refinement loops for achieving high quality outputs.

The study evaluates the system's ability to adapt to different visual styles while maintaining overall quality. Results indicate that while overall performance remains stable across styles, specific dimensions like visual design appeal and communication effectiveness vary depending on the chosen aesthetic prompt. The 'Comic' style prompt achieves the highest visual design ratings but results in lower scores for communication clarity and flow. The 'Minimalist' style prompt outperforms other styles in communication effectiveness and content sophistication, despite lower visual design scores. Overall quality scores remain consistent across all tested style variations, demonstrating the system's robustness to aesthetic changes.

The authors evaluate the AUTOFIGURE framework using different open-source reasoning backbones, specifically Qwen3-VL-235B, GLM-4.5V, and ERNIE-4.5-VL. The results indicate that GLM-4.5V achieves the highest overall win rate, demonstrating superior performance in communication and content fidelity compared to the other models. In contrast, Qwen3-VL-235B shows strong capabilities in visual design aspects like aesthetics and polish, while ERNIE-4.5-VL underperforms significantly across all measured dimensions. GLM-4.5V achieves the highest overall win rate, outperforming Qwen3-VL-235B and ERNIE-4.5-VL in blind pairwise comparisons. Qwen3-VL-235B demonstrates superior visual design quality, achieving the highest scores in aesthetics, expressiveness, and polish. ERNIE-4.5-VL exhibits the lowest performance across all metrics, particularly in communication clarity and content fidelity.

The the the table details the statistical composition of the FigureBench dataset across four distinct document categories. It compares categories based on sample volume, text volume, text density, and visual complexity metrics such as the average number of components, colors, and shapes. The data reveals that academic papers are the most prevalent and text-heavy category, while textbooks represent the least text-dense and visually simplest group. Academic papers dominate the dataset in sample count and average text token volume. Textbook entries exhibit the lowest text density and the simplest visual structure with fewer components and colors. Survey documents demonstrate the highest visual complexity, containing the most components and colors on average.

The study compares AutoFigure against multiple baseline methods, including code-generation and end-to-end models, to evaluate figure generation quality. Results demonstrate that AutoFigure achieves the highest overall performance, particularly excelling in visual design dimensions such as aesthetics and expressiveness. While code-based baselines show moderate strength in communication clarity, AutoFigure dominates in pairwise comparisons, indicating a strong preference for its outputs. AutoFigure achieves the highest overall score and wins the majority of pairwise comparisons against other models. Visual design metrics, including aesthetics and polish, are significantly higher for AutoFigure compared to baselines. Code-based baselines like Gemini-HTML perform better on communication clarity but lag significantly in visual appeal and overall quality.

The authors conduct an ablation study to isolate the impact of the Text Refinement module on the final output quality. Results demonstrate that the full model outperforms the variant without refinement, particularly in visual design dimensions such as aesthetics and professional polish. This confirms that the refinement stage is essential for eliminating artifacts and achieving publication-ready standards. The complete model achieves a higher overall score than the version lacking the text refinement module. Visual design metrics including aesthetics and expressiveness show the most notable improvement with the refinement step. The refinement process effectively enhances professional polish, ensuring the figures meet high-quality standards.

The study evaluates the AUTOFIGURE framework through assessments of visual style adaptation, backbone model performance, and comparisons against baseline methods. Results demonstrate that while overall quality remains consistent across aesthetic variations, specific styles present trade-offs between visual design appeal and communication clarity. The full framework significantly outperforms baselines and ablated versions, confirming that the GLM-4.5V backbone and Text Refinement module are essential for achieving high visual fidelity and professional polish.


بناء الذكاء الاصطناعي بالذكاء الاصطناعي

من الفكرة إلى الإطلاق — سرّع تطوير الذكاء الاصطناعي الخاص بك مع المساعدة البرمجية المجانية بالذكاء الاصطناعي، وبيئة جاهزة للاستخدام، وأفضل أسعار لوحدات معالجة الرسومات.

البرمجة التعاونية باستخدام الذكاء الاصطناعي
وحدات GPU جاهزة للعمل
أفضل الأسعار

HyperAI Newsletters

اشترك في آخر تحديثاتنا
سنرسل لك أحدث التحديثات الأسبوعية إلى بريدك الإلكتروني في الساعة التاسعة من صباح كل يوم اثنين
مدعوم بواسطة MailChimp