HyperAIHyperAI

Command Palette

Search for a command to run...

Qwen-Image-Agent: سد فجوة السياق في توليد الصور في العالم الحقيقي

الملخص

على الرغم من التقدم الملحوظ الذي أحرزته نماذج تحويل النص إلى صور (T2I)، إلا أنها تواجه تحديات في التعامل مع الطلبات الواقعية التي غالباً ما تكون غير محددة بدقة، أو ضمنية، أو تعتمد على معرفة حديثة. نُعرّف هذا التحدي بـ "فجوة السياق" (Context Gap)، وهي عدم التطابق بين سياق المستخدم والسياق الكافي اللازم لعملية التوليد في نماذج تحويل النص إلى صور (T2I). ولتجاوز هذه الفجوة، نقترح Qwen-Image-Agent، وهو إطار عمل موحد قائم على الـ agents يدمج التخطيط، والاستدلال، والبحث، والذاكرة، والتغذية الراجعة بطريقة تركز على السياق. يعالج Qwen-Image-Agent مدخلات المستخدم على أنها سياق جزئي، ويقوم ببناء سياق التوليد تدريجياً عبر "التخطيط الواعي بالسياق" (Context-Aware Planning) و"التأريض السياقي" (Context Grounding). وتحديداً، يحدد "التخطيط الواعي بالسياق" السياق المفقود ويضع خطة لكيفية اكتسابه واستخدامه، في حين يقوم "التأريض السياقي" بجمع هذا السياق من الاستدلال، والبحث، والذاكرة، والتغذية الراجعة. ولتقييم توليد الصور القائم على الـ agents، نقدم أيضاً Image Agent Bench (IA-Bench)، وهو معيار تقييم يغطي أربع قدرات أساسية للـ agent في توليد الصور: التخطيط، والاستدلال، والبحث، والذاكرة. وتُظهر النتائج التجريبية على معايير IA-Bench وMindbench وWISE-Verified أن Qwen-Image-Agent يتفوق على النماذج الأساسية القوية، ويحقق أداءً متفوقاً (state-of-the-art).

One-sentence Summary

Qwen-Image-Agent bridges the context gap in real-world text-to-image generation through a unified agentic framework that employs Context-Aware Planning and Context Grounding to progressively integrate reasoning, search, memory, and feedback, thereby achieving state-of-the-art performance on the IA-Bench, Mindbench, and WISE-Verified benchmarks.

Key Contributions

  • Qwen-Image-Agent is a training-free agentic framework that bridges the context gap in text-to-image generation by unifying planning, reasoning, search, memory, and feedback. The system progressively constructs sufficient generation context through Context-Aware Planning and Context Grounding to handle underspecified or implicit user requests.
  • IA-Bench systematically evaluates agentic image generation across plan, reason, search, and memory capabilities using 17 real-world tasks, 730 test instances, and 1801 fine-grained checklist items. A structured vision-language model evaluation protocol ensures reliable assessment of these core agent functions.
  • Experiments on IA-Bench, MindBench, and WISE-Verified demonstrate that Qwen-Image-Agent substantially outperforms strong agentic baselines and achieves state-of-the-art performance. Ablation studies confirm that integrating multiple grounded context sources yields complementary improvements in generation quality and task completion.

Introduction

Text-to-image models have advanced rapidly, yet their deployment in practical fields like marketing and product design is hindered by a fundamental mismatch between explicit training prompts and the underspecified, knowledge-dependent requests users actually submit. Prior work has explored isolated agentic capabilities such as planning, reasoning, or web search, but these efforts remain fragmented and fail to systematically acquire and integrate the missing context required for real-world generation. The authors bridge this divide with Qwen-Image-Agent, a unified framework that progressively constructs complete generation context through context-aware planning and multi-source grounding. The authors leverage reasoning, web search, memory, and iterative feedback to transform incomplete user inputs into fully specified generation conditions. To validate this approach, the authors also introduce IA-Bench for systematically evaluating agentic capabilities and demonstrate state-of-the-art performance across multiple benchmarks.

Dataset

  • Dataset Composition and Sources: The authors introduce IA-Bench, a benchmark designed to holistically evaluate agentic capabilities in image generation. It consists of 730 instances organized into four primary tasks and seventeen subtasks, supported by 1,801 evaluation checklist items. The data draws from domain knowledge, physical commonsense, logical and spatio-temporal reasoning, external world knowledge, and interactive conversational context.
  • Key Details for Each Subset:
    • Planning tasks (Composition, Enumeration, Multi-Panel) focus on decomposing high-level goals into structured layouts and satisfying multi-object constraints.
    • Reasoning tasks (Math, Science, Commonsense, Maze, Map, Geometry) test logical, commonsense, and visual reasoning to infer latent constraints before rendering.
    • Search tasks (IP entities like games and celebrities, plus Information like stocks and weather) evaluate the ability to retrieve and ground external or real-time knowledge.
    • Memory tasks (User Profile and Conversation History) assess cross-turn consistency and the retention of persistent user attributes or prior dialogue.
  • How the Paper Uses the Data: The authors use IA-Bench strictly as an evaluation framework to measure planning, reasoning, search, and memory capabilities. The dataset is not applied to model training, fine-tuning, or data mixing, and no training splits or mixture ratios are utilized.
  • Processing and Construction Details: The benchmark relies on careful human annotation to balance quality and difficulty. The authors filter out instances solvable by memorization or pretrained visual priors, such as removing highly iconic characters that models can generate without external search. Evaluation checklists are initially generated by LLMs and then manually refined for accuracy and necessity. For memory-focused tasks, the authors implement dynamic evaluation checklists where reference targets adapt based on previously generated images rather than static ground truth. Feasibility is verified and evaluation ambiguity is minimized across all subtasks.

Method

The authors leverage a conditional rendering formulation that explicitly accounts for the discrepancy between incomplete user inputs and the complete context required by image generators. To resolve this context gap, the framework treats the image generator as a renderer and introduces an iterative context construction process. At each step, the agent maintains a state st=(ct,Ot1)s_t = (c_t, O_{t-1})st=(ct,Ot1) representing the current context and accumulated observations, selects an action ata_tat from a defined operation space, and receives an observation oto_tot. This interaction forms a trajectory τ={(st,at,ot)}t=1T\tau = \{ (s_t, a_t, o_t) \}_{t=1}^Tτ={(st,at,ot)}t=1T. The overall generation process is mathematically expressed as:

pagent(ycu)=τp(τcu)pgen(ycg=c(τ))p_{\text{agent}}(y \mid c_u) = \sum_{\tau} p(\tau \mid c_u) \, p_{\text{gen}}(y \mid c_g = c(\tau))pagent(ycu)=τp(τcu)pgen(ycg=c(τ))

Under this formulation, the agent progressively builds the generation context before executing the final rendering pass.

The architecture is organized around two core modules: Context-Aware Planning and Context Grounding. Context-Aware Planning operates across three hierarchical levels to systematically manage information flow. Information-level planning identifies missing context elements and routes specific queries to appropriate grounding strategies. Content-level planning assembles the retrieved signals and rewrites the initial prompt into a detailed specification that explicitly defines subjects, attributes, layout, style, and textual elements. Generation-level planning handles context allocation for multi-turn and multi-image workflows, mitigating content drift by filtering relevant historical information and distributing context across dependent images.

Context Grounding collects and structures missing information through four distinct mechanisms. Reasoning grounding employs vision-language models to infer implicit intents through commonsense, logical, and visual reasoning, transforming underspecified requests into concrete context items. Search grounding retrieves external factual knowledge or visual references by extracting keywords, performing web queries, and utilizing vision-language models to rank and retain the most relevant results. Memory grounding preserves continuity in extended interactions by incorporating conversation history, updating user profiles, and querying external multimodal knowledge bases via a dedicated retriever. Feedback grounding closes the iterative loop by evaluating generated outputs against a predefined attribute checklist. Vision-language models assess each result, and any discrepancies are converted into feedback signals that are merged with existing context to refine the prompt for subsequent generation rounds.

Experiment

The evaluation employs a checklist-based VLM protocol across three distinct benchmarks to systematically assess planning, reasoning, search, and memory capabilities against a wide range of proprietary and open-source baselines. Quantitative and qualitative analyses demonstrate that the proposed agentic framework substantially outperforms direct generation models by effectively bridging context gaps through progressive information grounding and iterative refinement. Ablation studies further validate that sustained performance relies on a powerful multimodal language backbone, carefully structured contextual inputs, and a capable image renderer. Ultimately, the findings underscore the necessity of intelligent context management for complex image synthesis while highlighting ongoing challenges in handling implicit requirements, multiturn context limits, and computational efficiency.

The authors evaluate image generation models on the IA-Bench benchmark, comparing closed-source, open-source, and agentic methods. Results show that the proposed Qwen-Image-Agent achieves the highest overall IA-score, significantly outperforming both direct generation baselines like Qwen-Image-2.0 and other agentic models. The agentic framework demonstrates substantial improvements in Plan, Reason, and Search capabilities, while closed-source models retain a noticeable advantage in Memory performance. Qwen-Image-Agent achieves the highest overall IA-score among all evaluated models. The agentic framework significantly improves performance over the direct generation baseline Qwen-Image-2.0. Agentic models generally outperform direct generation models in Plan, Reason, and Search dimensions.

The evaluation of Qwen-Image-Agent on the IA-Bench demonstrates its superior performance compared to a wide range of proprietary and open-source baselines. The model achieves the highest overall scores, significantly outperforming strong closed-source models like Nano Banana Pro and GPT-Image-1.5 across both knowledge-driven and reasoning-driven tasks. Qwen-Image-Agent achieves the highest overall IA-score, surpassing leading proprietary models. The model demonstrates substantial improvements in knowledge-driven and reasoning-driven capabilities. The agentic framework significantly boosts performance over the direct generation baseline Qwen-Image-2.0.

The authors conduct an ablation study to validate the contribution of specific components within the Qwen-Image-Agent framework. The results demonstrate that the full system significantly outperforms ablated versions, with the removal of any grounded context leading to a distinct decline in its corresponding evaluation dimension. Furthermore, substituting the MLLM or generation backbones with weaker alternatives consistently degrades performance across all metrics. Removing the memory context leads to a complete collapse in the memory dimension performance. The search component is critical for search capabilities, as removing it causes a drastic decline in that specific metric. Replacing the MLLM or generation backbones with weaker models results in consistent performance degradation across all evaluated capabilities.

The authors present an evaluation of Qwen-Image-Agent against various proprietary and open-source baselines to measure semantic understanding and world knowledge. The results demonstrate that Qwen-Image-Agent achieves the highest performance across all categories, including culture, time, space, biology, physics, and chemistry. It significantly outperforms strong closed-source models like Nano Banana Pro and GPT-Image-1.5, validating the effectiveness of the proposed agentic framework. Qwen-Image-Agent achieves the highest overall score, outperforming all other listed models including Nano Banana Pro and GPT-Image-1.5. The model demonstrates consistent superiority across all specific knowledge domains such as biology, physics, and chemistry. The agentic framework shows substantial improvements over direct generation baselines like Qwen-Image-2.0.

Evaluated on the IA-Bench benchmark, Qwen-Image-Agent is compared against direct generation baselines, other agentic approaches, and leading proprietary models, while an ablation study validates the necessity of its core architectural components. The benchmark evaluation demonstrates that the agentic framework substantially enhances planning, reasoning, search, and domain-specific knowledge capabilities, establishing a clear qualitative advantage over direct generation methods. Component analysis confirms that each module is essential for maintaining robust performance across all evaluated dimensions. Ultimately, the model achieves consistent superiority across knowledge and reasoning tasks, though proprietary baselines retain a relative edge in memory retention.


بناء الذكاء الاصطناعي بالذكاء الاصطناعي

من الفكرة إلى الإطلاق — سرّع تطوير الذكاء الاصطناعي الخاص بك مع المساعدة البرمجية المجانية بالذكاء الاصطناعي، وبيئة جاهزة للاستخدام، وأفضل أسعار لوحدات معالجة الرسومات.

البرمجة التعاونية باستخدام الذكاء الاصطناعي
وحدات GPU جاهزة للعمل
أفضل الأسعار

HyperAI Newsletters

اشترك في آخر تحديثاتنا
سنرسل لك أحدث التحديثات الأسبوعية إلى بريدك الإلكتروني في الساعة التاسعة من صباح كل يوم اثنين
مدعوم بواسطة MailChimp