UltraFlux: Data-Model Co-Design for High-Quality Native 4K Text-to-Image Generation Across Diverse Aspect Ratios
Tian Ye, Song Fei, Lei Zhu

Abstract
Diffusion Transformers (DiTs) have recently achieved strong text-to-image results at roughly 1K resolution. In this work, however, we show that scaling them to native 4K across diverse aspect ratios exposes a tightly coupled failure pattern spanning positional encoding, VAE compression, and optimization, and that fixing any one factor in isolation leaves much of the attainable quality unrealized. We therefore take a data-model co-design perspective and present UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a corpus of one million 4K images with controlled multi-aspect-ratio coverage, bilingual captions, and rich VLM and IQA metadata that enables resolution- and aspect-ratio-aware sampling. On the model side, UltraFlux integrates: (1) Resonance 2D RoPE with YaRN for training-window-, frequency-, and aspect-ratio-aware positional encoding at 4K; (2) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (3) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (4) a Stage-wise Aesthetic Curriculum Learning strategy that focuses high-aesthetic supervision on the high-noise timesteps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall aspect ratios. On Aesthetic-Eval at 4096 and on multi-aspect 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and with an LLM prompt refiner it matches or surpasses the proprietary Seedream 4.0.
Summary
Researchers from HKUST(GZ) and HKUST introduce UltraFlux, a native 4K diffusion transformer that leverages the MultiAspect-4K-1M dataset and integrates Resonance 2D RoPE with YaRN, VAE post-training, and an SNR-Aware Huber Wavelet objective to overcome optimization bottlenecks and achieve high-fidelity text-to-image generation rivaling proprietary models.
Introduction
Diffusion Transformers (DiTs) have achieved impressive fidelity at 1K resolution, but scaling these models to generate native 4K images across diverse aspect ratios presents unique engineering hurdles. Simply increasing the resolution often causes standard 2D rotary embeddings to drift or alias, while aggressive VAE compression tends to erase the fine high-frequency details that are critical for 4K perception. Furthermore, standard optimization objectives struggle with the statistical imbalance of 4K latents, where low-frequency data dominates the gradients and obscures fine textures.
Prior approaches have attempted to mitigate these issues through training-free upscaling or tiled diffusion, but these methods often introduce coherence gaps or fail to address the underlying instability of positional encodings at extreme resolutions. Progress has also been stalled by data limitations; existing public 4K datasets are relatively small, biased toward landscape orientations, and lack the rich, structured metadata required for modern generative training.
The authors address these coupled challenges by introducing UltraFlux, a system that co-designs the dataset and the model architecture for native 4K generation. They contribute MultiAspect-4K-1M, a curated dataset of 1 million high-quality 4K images with comprehensive VLM-generated captions and aesthetic scores. By training a Flux-based backbone on this specialized corpus, they achieve state-of-the-art fidelity and alignment without relying on super-resolution cascades.
Key innovations in the UltraFlux framework include:
- Resonance 2D RoPE with YaRN: A specialized positional encoding scheme that prevents phase drift and aliasing, allowing the model to maintain structural stability across native 4K resolutions and widely varying aspect ratios.
- SNR-Aware Huber Wavelet Objective: A novel loss function designed to handle the heavy-tailed statistics of 4K latents, balancing the optimization so that dominant low-frequency energy does not suppress the learning of high-frequency details.
- Stage-wise Aesthetic Curriculum Learning (SACL): A two-stage training strategy that concentrates high-aesthetic supervision specifically on high-noise timesteps, effectively refining the model's global prior while allowing standard data to guide local detail generation (see the sketch after this list).
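Since SACL is not elaborated further in the sections below, a minimal sketch of the timestep-conditioned sampling idea is given here; the split point and pool names are illustrative assumptions, not values from the paper.

```python
import random

def sacl_sample(t: float, high_aes_pool: list, general_pool: list, t_split: float = 0.7):
    """Minimal sketch of stage-wise aesthetic curriculum sampling.

    t is the diffusion timestep in [0, 1], with larger t meaning more noise.
    High-noise timesteps (t >= t_split) draw training images from the
    high-aesthetic subset so that aesthetic supervision shapes the global
    prior, while lower-noise timesteps use the general corpus to guide local
    detail. t_split and the two-pool split are illustrative assumptions.
    """
    pool = high_aes_pool if t >= t_split else general_pool
    return random.choice(pool)
```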
Dataset
Dataset Composition and Sources
The authors introduce MultiAspect-4K-1M, a corpus designed to address gaps in aspect ratio coverage and subject balance found in existing public 4K datasets.
- Source Pool: The dataset is curated from an initial pool of approximately 6 million high-resolution images, which originally skewed heavily toward landscapes.
- Final Composition: The resulting dataset comprises 1 million images featuring native 4K resolution, diverse aspect ratios (including 1:1, 16:9, 3:2, and 9:16), and a balanced mix of landscapes, objects, and human subjects.
Subsets and Filtering Pipeline
To curate the final dataset, the authors employ a dual-channel pipeline that merges a general curation path with a targeted human-centric augmentation path.
- General AR-Aware Channel: This subset enforces native 4K resolution (minimum 3840x2160 total pixels) and broad aspect ratio coverage.
- Human-Centric Augmentation: To correct the under-representation of people, this path retrieves person-related images and validates them with YOLOE, a promptable open-vocabulary detector that provides structured evidence of human presence.
- VLM-Driven Filtering: Both channels undergo rigorous filtering using Q-Align for semantic quality (retaining scores > 4.0) and ArtiMuse for aesthetics (keeping the top 30%).
- Texture Guardrails: Classical signal processing is used to remove low-texture images, combining a Sobel-based flatness detector (removing images where >50% of patches are flat) and a Shannon entropy filter (removing images with values < 7.0); a minimal sketch follows this list.
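The texture guardrails can be sketched roughly as follows; the 50% flat-patch ratio and the 7.0-bit entropy cutoff come from the text, while the patch size and per-patch gradient threshold are assumed for illustration.

```python
import numpy as np
from PIL import Image
from scipy import ndimage

# The 0.50 flat-patch ratio and 7.0-bit entropy cutoff are from the text;
# PATCH and FLAT_GRAD_THRESH are illustrative assumptions.
FLAT_PATCH_RATIO = 0.50
ENTROPY_MIN_BITS = 7.0
PATCH = 256
FLAT_GRAD_THRESH = 4.0

def shannon_entropy(gray: np.ndarray) -> float:
    """Entropy (in bits) of the 8-bit grayscale histogram."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def passes_texture_guardrails(path: str) -> bool:
    gray = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
    # Sobel gradient magnitude as a cheap texture signal.
    mag = np.hypot(ndimage.sobel(gray, axis=0), ndimage.sobel(gray, axis=1))
    # Tile the image into patches and count "flat" (low mean-gradient) patches.
    flats = total = 0
    for y in range(0, gray.shape[0] - PATCH + 1, PATCH):
        for x in range(0, gray.shape[1] - PATCH + 1, PATCH):
            total += 1
            flats += int(mag[y:y + PATCH, x:x + PATCH].mean() < FLAT_GRAD_THRESH)
    if total and flats / total > FLAT_PATCH_RATIO:
        return False                      # too many low-texture patches
    return shannon_entropy(gray) >= ENTROPY_MIN_BITS
```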
Processing and Metadata
The authors prioritize native resolution and rich metadata to facilitate flexible training and analysis.
- No Cropping Strategy: The pipeline preserves each image's native aspect ratio without cropping or resizing, keeping the data free of cropping and rescaling artifacts.
- Bilingual Captioning: Detailed English captions are generated using Gemini-2.5-Flash, followed by translation into Chinese using Hunyuan-MT-7B.
- Metadata Construction: Each image is tagged with resolution details, VLM quality/aesthetic scores, classical texture signals, and a dedicated character flag for human subjects.
- Usage: These metadata fields serve as analysis tags and keys for stratified sampling during text-to-image training, allowing for transparent auditing and data-model co-design (an illustrative sampling sketch follows this list).
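As an illustration of how such metadata can key stratified sampling, the sketch below buckets records by aspect-ratio class and a coarse aesthetic bin and draws round-robin across buckets; the field names are hypothetical stand-ins, not the dataset's actual schema.

```python
import random
from collections import defaultdict

def stratified_sample(records: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Sketch of metadata-keyed stratified sampling.

    Records are grouped by (aspect-ratio class, coarse aesthetic bin) and the
    batch is drawn round-robin across groups so that no single stratum
    dominates. Field names here are illustrative, not the real schema.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for r in records:
        ar = "wide" if r["aspect_ratio"] > 1.2 else "tall" if r["aspect_ratio"] < 0.8 else "square"
        aes = int(r["artimuse"])          # coarse aesthetic bin
        buckets[(ar, aes)].append(r)
    for bucket in buckets.values():
        rng.shuffle(bucket)
    batch, keys = [], list(buckets)
    while len(batch) < n and keys:
        for k in list(keys):
            if not buckets[k]:
                keys.remove(k)
                continue
            batch.append(buckets[k].pop())
            if len(batch) == n:
                break
    return batch
```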
Method
The authors leverage the Flux transformer architecture as the foundation for UltraFlux, focusing on three key components to enable efficient and high-fidelity native 4K image generation: the VAE, the positional representation, and the training objective. The overall framework is designed to scale the model effectively while maintaining performance across diverse resolutions and aspect ratios. The data pipeline, which underpins the model's training, begins with a large pool of internet data that is filtered through a series of stages to produce the curated MultiAspect-4K-1M dataset. This dataset is then used to train the model components, with the final output being high-resolution images that are both aesthetically pleasing and structurally coherent.

The VAE component is optimized for high-resolution reconstruction fidelity. The authors adopt an F16 VAE, which reduces the latent resolution compared to the original F8 VAE, thereby improving computational efficiency. To enhance the decoder's ability to reconstruct fine details at 4K resolution, a post-training phase is conducted on the MultiAspect-4K-1M corpus. This phase focuses on improving high-frequency content through a combination of wavelet reconstruction loss and feature-space perceptual loss, while avoiding the use of adversarial terms due to their tendency to induce optimization instability. The data curation process is also critical, as it allows for significant reconstruction gains with a relatively small number of carefully selected, detail-rich images, making the post-training stage both efficient and effective.
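A hedged sketch of such a non-adversarial post-training objective is shown below; the Haar decomposition and loss weights are illustrative choices, and feat_extractor stands for any frozen perceptual backbone rather than the authors' specific network.

```python
import torch
import torch.nn.functional as F

def haar_bands(x: torch.Tensor):
    """Single-level Haar split of an NCHW tensor into low- and high-frequency bands."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    low = (a + b + c + d) / 2
    high = torch.cat([(a - b + c - d) / 2, (a + b - c - d) / 2, (a - b - c + d) / 2], dim=1)
    return low, high

def vae_post_training_loss(recon, target, feat_extractor, w_wav=1.0, w_perc=0.1):
    """Non-adversarial reconstruction objective in the spirit of the VAE
    post-training described above: pixel + wavelet high-frequency +
    feature-space terms, with no GAN loss. The weights and the Haar transform
    are illustrative; feat_extractor is any frozen perceptual backbone.
    """
    pix = F.l1_loss(recon, target)
    _, hf_r = haar_bands(recon)
    _, hf_t = haar_bands(target)
    wav = F.l1_loss(hf_r, hf_t)                       # emphasizes fine detail
    perc = F.l1_loss(feat_extractor(recon), feat_extractor(target))
    return pix + w_wav * wav + w_perc * perc
```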
The positional representation is addressed through the introduction of Resonance 2D RoPE, a modified rotary positional embedding that enhances stability during inference at higher resolutions and different aspect ratios. The baseline Flux model uses a fixed per-axis rotary spectrum, which can lead to phase drift and artifacts when extrapolating to larger resolutions. Resonance 2D RoPE reinterprets the 2D rotary spectrum on a finite training window, snapping the number of cycles completed by each frequency component to the nearest nonzero integer. This ensures that the positional encoding is training-window aware and prevents the accumulation of fractional-cycle phase errors, which manifest as spatial drift and striping artifacts. The method is further enhanced with a YaRN-style extrapolation scheme, which makes the positional encoding band-aware and aspect-ratio aware by using the resonant cycle count to determine the scaling of each frequency band for a given extrapolation factor.
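The cycle-snapping and band-aware scaling can be sketched per axis as follows; the function names and the low/high cycle cutoffs are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def resonance_axis_freqs(axis_dim: int, train_len: int, base: float = 10000.0) -> np.ndarray:
    """Conceptual sketch of cycle snapping for one axis of a 2D RoPE.

    Baseline rotary frequencies complete a fractional number of cycles over
    the training window; snapping that count to the nearest nonzero integer
    keeps every component in phase at the window edge, which is the core of
    the "resonance" idea described in the text.
    """
    i = np.arange(axis_dim // 2)
    theta = base ** (-2.0 * i / axis_dim)            # standard RoPE frequencies
    cycles = theta * train_len / (2.0 * np.pi)       # cycles over the training window
    snapped = np.maximum(np.round(cycles), 1.0)      # nearest nonzero integer
    return 2.0 * np.pi * snapped / train_len

def yarn_scale(freqs: np.ndarray, train_len: int, target_len: int,
               low: float = 1.0, high: float = 32.0) -> np.ndarray:
    """YaRN-style band-aware rescaling for inference beyond the training window.

    Bands completing few cycles are interpolated (scaled by train_len/target_len),
    bands with many cycles are left untouched, and the transition is blended
    linearly. The low/high cycle cutoffs are assumptions for illustration.
    """
    cycles = freqs * train_len / (2.0 * np.pi)
    s = train_len / target_len
    ramp = np.clip((cycles - low) / (high - low), 0.0, 1.0)
    return freqs * (s * (1.0 - ramp) + ramp)
```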

The training objective is designed to address the challenges of frequency imbalance, timestep imbalance, and cross-scale energy coupling that are common in standard L2-based training at native 4K resolution. The authors introduce the SNR-Aware Huber Wavelet (SAHW) objective, which combines a robust Pseudo-Huber penalty with an adaptive threshold that is small under high noise and grows as signal dominates. This objective is measured in a wavelet space, which decouples low and high-frequency bands, allowing for more effective handling of high-frequency residuals. The loss is further balanced across timesteps using Min-SNR weighting, which emphasizes mid-SNR timesteps for stable and faster optimization. The final objective is a drop-in replacement for standard flow-matching losses, tailored to the specific demands of native 4K generation.
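A minimal sketch of such an objective, under assumed constants and a Haar wavelet space, might look like the following; the exact adaptive-threshold schedule and the Min-SNR form used by the authors are only approximated here.

```python
import torch

def haar_bands(x: torch.Tensor):
    """Single-level Haar split (as in the VAE sketch above) into low/high bands."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    low = (a + b + c + d) / 2
    high = torch.cat([(a - b + c - d) / 2, (a + b - c - d) / 2, (a - b - c + d) / 2], dim=1)
    return low, high

def sahw_loss(pred, target, snr, gamma=5.0, c_min=1e-3, c_max=1e-1):
    """Sketch of an SNR-aware Huber wavelet objective as described in the text.

    The residual is measured in a wavelet space so low- and high-frequency
    errors are decoupled; the Pseudo-Huber threshold c is small at high noise
    (low SNR) and grows as signal dominates; a Min-SNR-style weight rebalances
    timesteps. gamma, c_min, c_max and the schedules are illustrative
    assumptions, not the paper's values.
    """
    low_p, high_p = haar_bands(pred)
    low_t, high_t = haar_bands(target)
    res = torch.cat([low_p - low_t, high_p - high_t], dim=1)
    c = c_min + (c_max - c_min) * (snr / (snr + 1.0)).view(-1, 1, 1, 1)
    huber = torch.sqrt(res ** 2 + c ** 2) - c          # Pseudo-Huber penalty
    per_sample = huber.mean(dim=(1, 2, 3))
    w = snr.clamp(max=gamma) / (snr + 1.0)             # Min-SNR-style weight
    return (w * per_sample).mean()
```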

Experiment
- Quantitative comparison with open-source methods: Evaluations on the Aesthetic-Eval@4096 benchmark show UltraFlux matches or surpasses baselines (ScaleCrafter, FouriScale, Sana, Diffusion-4K) across metrics such as FID, HPSv3, PickScore, and Q-Align.
- Gemini-based preference study: In pairwise comparisons using Gemini-2.5-Flash as a judge, UltraFlux is preferred over baselines in 70–82% of cases for visual appeal and 60–89% for prompt alignment.
- Comparison with proprietary models: When equipped with a GPT-4o prompt refiner, UltraFlux achieves a slightly higher HPSv3 score (12.03 vs. 11.98) than the closed-source Seedream 4.0 and surpasses it on the Q-Align and MUSIQ metrics.
- Ablation study results: The SNR-Aware Huber Wavelet Training (SNR-HW) and Resonance 2D RoPE with YaRN provide complementary gains, delivering the best overall configuration with monotonically improved perceptual metrics and reduced FID.
- VAE reconstruction analysis: The UltraFlux-F16-VAE demonstrates substantially better reconstruction quality and high-frequency detail preservation compared to the Flux-VAE-F16 baseline on the Aesthetic-4K@4096 set.
- Geometric stability and efficiency: Analysis of Resonance 2D RoPE confirms it eliminates phase mismatch and geometric drift seen in baselines. Additionally, the model maintains inference speeds comparable to Sana while outperforming upsampling-based methods in wide aspect ratios (e.g., 2:1, 2.39:1).
The authors use the MultiAspect-4K-1M dataset, which contains 1.007 million images with an average resolution of 4,521×4,703, to train their model. This dataset is distinguished by its significantly longer average caption length of 125.1 tokens and the inclusion of bilingual captions, compared to the smaller PixArt-30k and Aesthetic-4K datasets.

Results show that UltraFlux outperforms all compared open-source methods across multiple metrics, achieving the lowest FID and the highest scores in HPSv3, PickScore, ArtiMuse, CLIP Score, Q-Align, and MUSIQ. These comparisons demonstrate that UltraFlux consistently surpasses baselines such as ScaleCrafter, FouriScale, Sana, and Diffusion-4K in both quantitative and qualitative evaluations.

Results show that UltraFlux outperforms Sana on the 1:2 aspect ratio across all metrics, achieving lower FID and higher HPSv3, ArtiMuse, and Q-Align scores. On the 2:1 aspect ratio, UltraFlux also surpasses Sana in HPSv3 and ArtiMuse while maintaining competitive performance on FID and Q-Align.

Results show that UltraFlux outperforms Sana across multiple metrics at the 2.39:1 aspect ratio, achieving lower FID and higher HPSv3, ArtiMuse, and Q-Align scores. This indicates superior image quality and better alignment with prompts in challenging ultra-wide formats.

The authors compare UltraFlux with Prompt Refiner against Seedream 4.0, a proprietary 4K model, under the same 4096×4096 evaluation protocol. Results show that UltraFlux achieves a slightly higher HPSv3 score and surpasses Seedream 4.0 on Q-Align and MUSIQ, indicating competitive performance in semantic alignment and perceptual quality despite using a stage-wise SFT pipeline without large-scale RL post-training.
