Tian Ye, Song Fei, Lei Zhu

Abstract
Diffusion Transformers have recently demonstrated strong text-to-image generation at the 1K resolution level, but scaling them to native 4K across diverse aspect ratios exposes tightly coupled failure modes spanning positional encoding, VAE compression, and optimization; addressing these factors in isolation is insufficient to reach acceptable quality. We therefore adopt a data-model co-design perspective and propose UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a one-million-image 4K corpus with controlled multi-aspect-ratio coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and aspect-ratio-aware sampling. On the model side, UltraFlux combines (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and aspect-ratio-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-quality aesthetic supervision on high-noise timesteps in line with the model's prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes well across wide, square, and tall aspect ratios. On the Aesthetic-Eval at 4096 benchmark and in multi-aspect-ratio 4K settings, UltraFlux consistently outperforms strong open-source baselines on fidelity, aesthetic, and alignment metrics, and, when paired with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
Summarization
Researchers from HKUST(GZ) and HKUST introduce UltraFlux, a native 4K diffusion transformer that leverages the MultiAspect-4K-1M dataset and integrates Resonance 2D RoPE with YaRN, VAE post-training, and an SNR-Aware Huber Wavelet objective to overcome optimization bottlenecks and achieve high-fidelity text-to-image generation rivaling proprietary models.
Introduction
Diffusion Transformers (DiTs) have achieved impressive fidelity at 1K resolution, but scaling these models to generate native 4K images across diverse aspect ratios presents unique engineering hurdles. Simply increasing the resolution often causes standard 2D rotary embeddings to drift or alias, while aggressive VAE compression tends to erase the fine high-frequency details that are critical for 4K perception. Furthermore, standard optimization objectives struggle with the statistical imbalance of 4K latents, where low-frequency data dominates the gradients and obscures fine textures.
Prior approaches have attempted to mitigate these issues through training-free upscaling or tiled diffusion, but these methods often introduce coherence gaps or fail to address the underlying instability of positional encodings at extreme resolutions. Progress has also been stalled by data limitations; existing public 4K datasets are relatively small, biased toward landscape orientations, and lack the rich, structured metadata required for modern generative training.
The authors address these coupled challenges by introducing UltraFlux, a system that co-designs the dataset and the model architecture for native 4K generation. They contribute MultiAspect-4K-1M, a curated dataset of 1 million high-quality 4K images with comprehensive VLM-generated captions and aesthetic scores. By training a Flux-based backbone on this specialized corpus, they achieve state-of-the-art fidelity and alignment without relying on super-resolution cascades.
Key innovations in the UltraFlux framework include:
- Resonance 2D RoPE with YaRN: A specialized positional encoding scheme that prevents phase drift and aliasing, allowing the model to maintain structural stability across native 4K resolutions and widely varying aspect ratios.
- SNR-Aware Huber Wavelet Objective: A novel loss function designed to handle the heavy-tailed statistics of 4K latents, balancing the optimization so that dominant low-frequency energy does not suppress the learning of high-frequency details.
- Stage-wise Aesthetic Curriculum Learning (SACL): A two-stage training strategy that concentrates high-aesthetic supervision specifically on high-noise timesteps, effectively refining the model's global prior while allowing standard data to guide local detail generation.
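The curriculum itself is simple to express. Below is a hypothetical sketch of timestep-conditioned data routing in a flow-matching training loop; the two data pools and the noise threshold `t_switch` are illustrative assumptions, not the authors' exact schedule.

```python
import random

# Hypothetical sketch of Stage-wise Aesthetic Curriculum Learning:
# route high-aesthetic images to high-noise timesteps (which shape the
# model's global prior) and standard data to low-noise timesteps (which
# govern local detail). The pools and t_switch are assumptions.

def sample_batch_item(high_aesthetic_pool, standard_pool, t_switch=0.6):
    t = random.random()  # flow-matching time in [0, 1]; t -> 1 is high noise
    pool = high_aesthetic_pool if t > t_switch else standard_pool
    return random.choice(pool), t
```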
Dataset
Dataset Composition and Sources
The authors introduce MultiAspect-4K-1M, a corpus designed to address gaps in aspect ratio coverage and subject balance found in existing public 4K datasets.
- Source Pool: The dataset is curated from an initial pool of approximately 6 million high-resolution images, which originally skewed heavily toward landscapes.
- Final Composition: The resulting dataset comprises 1 million images featuring native 4K resolution, diverse aspect ratios (including 1:1, 16:9, 3:2, and 9:16), and a balanced mix of landscapes, objects, and human subjects.
Subsets and Filtering Pipeline
To curate the final dataset, the authors employ a dual-channel pipeline that merges a general curation path with a targeted human-centric augmentation path.
- General AR-Aware Channel: This subset enforces native 4K resolution (a minimum of 3840×2160 total pixels) and broad aspect ratio coverage.
- Human-Centric Augmentation: To correct the under-representation of people, this path retrieves person-related images and validates them with YOLOE, a promptable open-vocabulary detector, to obtain structured evidence of human presence.
- VLM-Driven Filtering: Both channels undergo rigorous filtering using Q-Align for semantic quality (retaining scores > 4.0) and ArtiMuse for aesthetics (keeping the top 30%).
- Texture Guardrails: Classical signal processing is used to remove low-texture images. This includes a Sobel-based flatness detector (removing images where >50% of patches are flat) and a Shannon entropy filter (removing images with values < 7.0).
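As a rough illustration, the two guardrails above could be implemented along the following lines. The patch size and gradient threshold are assumptions; the text specifies only the 50% flat-patch cutoff and the 7.0-bit entropy floor.

```python
import numpy as np
import cv2  # OpenCV, for Sobel gradients

def is_low_texture(gray: np.ndarray, patch: int = 64, grad_thresh: float = 10.0) -> bool:
    """Return True if a grayscale image should be discarded as low-texture."""
    # Sobel-based flatness: a patch is "flat" if its mean gradient
    # magnitude falls below grad_thresh; reject if >50% of patches are flat.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    h, w = gray.shape
    flat, total = 0, 0
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            total += 1
            if mag[y:y + patch, x:x + patch].mean() < grad_thresh:
                flat += 1
    if total and flat / total > 0.5:
        return True
    # Shannon entropy of the 8-bit intensity histogram; reject below 7.0 bits.
    hist = np.bincount(gray.astype(np.uint8).ravel(), minlength=256) / gray.size
    entropy = -np.sum(hist[hist > 0] * np.log2(hist[hist > 0]))
    return entropy < 7.0
```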
Processing and Metadata
The authors prioritize native resolution and rich metadata to facilitate flexible training and analysis.
- No-Cropping Strategy: The pipeline preserves each image at its native resolution and aspect ratio, with no cropping or resizing, keeping the data free of resampling artifacts.
- Bilingual Captioning: Detailed English captions are generated using Gemini-2.5-Flash, followed by translation into Chinese using Hunyuan-MT-7B.
- Metadata Construction: Each image is tagged with resolution details, VLM quality/aesthetic scores, classical texture signals, and a dedicated character flag for human subjects.
- Usage: These metadata fields serve as analysis tags and keys for stratified sampling during text-to-image training, allowing for transparent auditing and data-model co-design.
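To make the metadata concrete, a single record might look like the following; the field names are illustrative assumptions, not the dataset's published schema.

```python
# Hypothetical metadata record; field names are illustrative only.
example_record = {
    "width": 4096, "height": 2304, "aspect_ratio": "16:9",
    "qalign_quality": 4.3,       # Q-Align semantic quality (retained if > 4.0)
    "artimuse_aesthetic": 0.82,  # ArtiMuse aesthetic score (top 30% retained)
    "flat_patch_ratio": 0.12,    # Sobel-based texture signal
    "entropy_bits": 7.6,         # Shannon entropy (retained if >= 7.0)
    "character": True,           # flag for human subjects
    "caption_en": "...",         # detailed English caption
    "caption_zh": "...",         # Chinese translation
}
```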
Method
The authors build UltraFlux on the Flux transformer architecture, focusing on three components that determine efficient, high-fidelity native 4K generation: the VAE, the positional representation, and the training objective. The framework is designed to scale to 4K while remaining stable across diverse resolutions and aspect ratios. Training rests on the data pipeline described above: a large pool of internet images is filtered through successive stages into the curated MultiAspect-4K-1M dataset, which then drives training of each model component toward high-resolution outputs that are both aesthetically pleasing and structurally coherent.

The VAE component is optimized for high-resolution reconstruction fidelity. The authors adopt an F16 VAE, which reduces the latent resolution compared to the original F8 VAE, thereby improving computational efficiency. To enhance the decoder's ability to reconstruct fine details at 4K resolution, a post-training phase is conducted on the MultiAspect-4K-1M corpus. This phase focuses on improving high-frequency content through a combination of wavelet reconstruction loss and feature-space perceptual loss, while avoiding the use of adversarial terms due to their tendency to induce optimization instability. The data curation process is also critical, as it allows for significant reconstruction gains with a relatively small number of carefully selected, detail-rich images, making the post-training stage both efficient and effective.
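As a rough sketch of this non-adversarial recipe, the loss below combines a pixel term, a one-level Haar wavelet reconstruction term, and a VGG feature-space perceptual term. The wavelet depth, loss weights, and choice of VGG layers are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
import torchvision

def haar_dwt(x: torch.Tensor):
    # One-level 2D Haar transform via strided slicing; x is (B, C, H, W).
    x00, x01 = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    x10, x11 = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    ll = (x00 + x01 + x10 + x11) / 2  # low-frequency band
    lh = (x00 + x01 - x10 - x11) / 2  # horizontal detail
    hl = (x00 - x01 + x10 - x11) / 2  # vertical detail
    hh = (x00 - x01 - x10 + x11) / 2  # diagonal detail
    return ll, lh, hl, hh

# Frozen VGG16 feature extractor for the perceptual term (the layer cut
# and the omitted input normalization are simplifications of this sketch).
vgg = torchvision.models.vgg16(weights="DEFAULT").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def vae_post_train_loss(recon, target, w_wave=1.0, w_perc=0.1):
    pixel = F.l1_loss(recon, target)
    wave = sum(F.l1_loss(a, b) for a, b in zip(haar_dwt(recon), haar_dwt(target)))
    perc = F.mse_loss(vgg(recon), vgg(target))
    return pixel + w_wave * wave + w_perc * perc  # no adversarial term
```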
The positional representation is addressed through the introduction of Resonance 2D RoPE, a modified rotary positional embedding that enhances stability during inference at higher resolutions and different aspect ratios. The baseline Flux model uses a fixed per-axis rotary spectrum, which can lead to phase drift and artifacts when extrapolating to larger resolutions. Resonance 2D RoPE reinterprets the 2D rotary spectrum on a finite training window, snapping the number of cycles completed by each frequency component to the nearest nonzero integer. This ensures that the positional encoding is training-window aware and prevents the accumulation of fractional-cycle phase errors, which manifest as spatial drift and striping artifacts. The method is further enhanced with a YaRN-style extrapolation scheme, which makes the positional encoding band-aware and aspect-ratio aware by using the resonant cycle count to determine the scaling of each frequency band for a given extrapolation factor.
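The cycle-snapping step can be sketched for a single rotary axis as follows, assuming the standard RoPE spectrum base^(-2i/d); the YaRN-style per-band scaling that UltraFlux adds on top is omitted here.

```python
import numpy as np

def resonance_freqs(dim: int, train_window: int, base: float = 10000.0) -> np.ndarray:
    """Snap each RoPE frequency so it completes an integer number of cycles
    over the training window, avoiding fractional-cycle phase drift."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # standard rotary spectrum
    cycles = freqs * train_window / (2 * np.pi)     # cycles over the window
    snapped = np.maximum(np.round(cycles), 1.0)     # nearest nonzero integer
    return 2 * np.pi * snapped / train_window       # resonant frequencies
```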

The training objective is designed to address the challenges of frequency imbalance, timestep imbalance, and cross-scale energy coupling that are common in standard L2-based training at native 4K resolution. The authors introduce the SNR-Aware Huber Wavelet (SAHW) objective, which combines a robust Pseudo-Huber penalty with an adaptive threshold that is small under high noise and grows as signal dominates. This objective is measured in a wavelet space, which decouples low and high-frequency bands, allowing for more effective handling of high-frequency residuals. The loss is further balanced across timesteps using Min-SNR weighting, which emphasizes mid-SNR timesteps for stable and faster optimization. The final objective is a drop-in replacement for standard flow-matching losses, tailored to the specific demands of native 4K generation.
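One plausible reading of this objective is sketched below: a Pseudo-Huber penalty applied per wavelet band, with a threshold that grows with SNR, and Min-SNR weighting across timesteps. The SNR-to-threshold mapping and the exact weighting form are assumptions; haar_dwt is the one-level Haar helper from the VAE sketch above, repeated for self-containment.

```python
import torch

def haar_dwt(x: torch.Tensor):
    # One-level 2D Haar transform via strided slicing; x is (B, C, H, W).
    x00, x01 = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    x10, x11 = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    return ((x00 + x01 + x10 + x11) / 2, (x00 + x01 - x10 - x11) / 2,
            (x00 - x01 + x10 - x11) / 2, (x00 - x01 - x10 + x11) / 2)

def sahw_loss(pred, target, snr, gamma=5.0):
    """SNR-Aware Huber Wavelet loss sketch; snr has shape (B,)."""
    # Adaptive threshold: near 0 when noise dominates (loss behaves like
    # robust L1), larger as signal dominates (loss behaves like L2).
    c = (snr / (snr + 1.0)).view(-1, 1, 1, 1)
    per_sample = 0.0
    for p, t in zip(haar_dwt(pred), haar_dwt(target)):
        r = p - t  # residual in one wavelet band
        per_sample = per_sample + (torch.sqrt(r * r + c * c) - c).mean(dim=(1, 2, 3))
    w = torch.clamp(snr, max=gamma) / snr  # Min-SNR timestep weighting
    return (w * per_sample).mean()
```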

Experiment
- Quantitative comparison with open-source methods: Evaluations on the Aesthetic-Eval@4096 benchmark show UltraFlux matches or surpasses baselines (ScaleCrafter, FouriScale, Sana, Diffusion-4K) across metrics such as FID, HPSv3, PickScore, and Q-Align.
- Gemini-based preference study: In pairwise comparisons using Gemini-2.5-Flash as a judge, UltraFlux is preferred over baselines in 70–82% of cases for visual appeal and 60–89% for prompt alignment.
- Comparison with proprietary models: When equipped with a GPT-4o prompt refiner, UltraFlux achieves a slightly higher HPSv3 score (12.03 vs. 11.98) than the closed-source Seedream 4.0 and surpasses it on the Q-Align and MUSIQ metrics.
- Ablation study results: SNR-Aware Huber Wavelet training (SNR-HW) and Resonance 2D RoPE with YaRN provide complementary gains; their combination yields the best overall configuration, with monotonic improvements in perceptual metrics and reduced FID.
- VAE reconstruction analysis: The UltraFlux-F16-VAE demonstrates substantially better reconstruction quality and high-frequency detail preservation compared to the Flux-VAE-F16 baseline on the Aesthetic-4K@4096 set.
- Geometric stability and efficiency: Analysis of Resonance 2D RoPE confirms it eliminates phase mismatch and geometric drift seen in baselines. Additionally, the model maintains inference speeds comparable to Sana while outperforming upsampling-based methods in wide aspect ratios (e.g., 2:1, 2.39:1).
The authors use the MultiAspect-4K-1M dataset, which contains 1.007 million images with an average resolution of 4,521×4,703, to train their model. This dataset is distinguished by its significantly longer average caption length of 125.1 tokens and the inclusion of bilingual captions, compared to the smaller PixArt-30k and Aesthetic-4K datasets.

Results show that UltraFlux outperforms all compared open-source methods across multiple metrics, achieving the lowest FID and the highest scores on HPSv3, PickScore, ArtiMuse, CLIP Score, Q-Align, and MUSIQ. The authors use this table to demonstrate that UltraFlux consistently surpasses baselines such as ScaleCrafter, FouriScale, Sana, and Diffusion-4K.

Results show that UltraFlux outperforms Sana on the 1:2 aspect ratio across all metrics, achieving lower FID and higher HPSv3, ArtiMuse, and Q-Align scores. On the 2:1 aspect ratio, UltraFlux also surpasses Sana in HPSv3 and ArtiMuse while maintaining competitive performance on FID and Q-Align.

Results show that UltraFlux outperforms Sana across multiple metrics at the 2.39:1 aspect ratio, achieving lower FID and higher HPSv3, ArtiMuse, and Q-Align scores. This indicates superior image quality and better alignment with prompts in challenging ultra-wide formats.

The authors compare UltraFlux with Prompt Refiner against Seedream 4.0, a proprietary 4K model, under the same 4096×4096 evaluation protocol. Results show that UltraFlux achieves a slightly higher HPSv3 score and surpasses Seedream 4.0 on Q-Align and MUSIQ, indicating competitive performance in semantic alignment and perceptual quality despite using a stage-wise SFT pipeline without large-scale RL post-training.
