EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Yuxuan Zhang Yirui Yuan Yiren Song Haofan Wang Jiaming Liu

Abstract

Recent advances in UNet-based diffusion models, such as ControlNet and IP-Adapter, have introduced effective mechanisms for spatial and subject control. Nevertheless, the DiT (Diffusion Transformer) architecture still suffers from inefficient and inflexible control. To address this problem, we present EasyControl, a novel framework that unifies condition-guided diffusion transformers with high efficiency and flexibility. Our framework rests on three central innovations. First, we introduce a lightweight Condition Injection LoRA Module. This module processes conditional signals in isolation and acts as a plug-and-play solution. It avoids modifying the base model weights, thereby ensuring compatibility with customized models and enabling the flexible injection of diverse conditions. Notably, this module supports harmonious and robust zero-shot generalization to multiple conditions, even when trained only on single-condition data. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, making it possible to generate images with arbitrary aspect ratios and flexible resolutions, while also optimizing computational efficiency and making the framework more practical for real-world applications. Third, we develop a Causal Attention mechanism combined with the KV Cache technique, specifically adapted to conditional generation tasks. This innovation substantially reduces the latency of image synthesis and improves the overall efficiency of the framework. Through extensive experiments, we show that EasyControl achieves exceptional performance across a variety of application scenarios. Taken together, these innovations make our framework highly efficient, flexible, and suitable for a broad range of tasks.

One-sentence Summary

The authors, affiliated with Tiamat AI, ShanghaiTech University, National University of Singapore, and Liblib AI, propose EasyControl—a lightweight, plug-and-play framework for diffusion transformers that enables efficient spatial and subject/face control via a Condition Injection LoRA Module, Position-Aware Training, and a Causal Attention Mechanism with KV Cache, achieving zero-shot multi-condition generalization and reduced latency, making it highly suitable for real-world image generation applications.

Key Contributions

  • EasyControl addresses the inefficiency and inflexibility of condition-guided diffusion transformers by introducing a lightweight Condition Injection LoRA Module that processes conditional signals in isolation through a parallel branch, enabling plug-and-play integration without modifying base model weights and supporting robust zero-shot multi-condition generalization even with single-condition training.

  • The framework enhances computational efficiency and resolution flexibility via a Position-Aware Training Paradigm that normalizes input conditions to fixed resolutions and employs Position-Aware Interpolation, allowing consistent generation across arbitrary aspect ratios and resolutions while reducing sequence length and inference overhead.

  • By replacing full attention with a Causal Attention Mechanism integrated with KV Cache, EasyControl achieves significant latency reduction through precomputed and reused condition feature key-value pairs, marking the first application of KV Cache in conditional generation and substantially improving inference speed.

Introduction

The authors leverage the growing adoption of Diffusion Transformers (DiT) in image generation, which offer higher quality and resolution than traditional UNet-based models but face challenges in efficiency, multi-condition control, and plug-and-play flexibility. Prior methods suffer from quadratic computational costs due to full attention over long token sequences, struggle with stable coordination across multiple conditions—especially in zero-shot combinations—and often introduce parameter conflicts that degrade performance during style transfer or customization. To address these issues, the authors introduce EasyControl, a lightweight, plug-and-play framework that enables efficient and flexible condition-guided generation. It achieves this through three core innovations: a Condition Injection LoRA module that isolates condition signals in a parallel branch, preserving the frozen backbone while enabling seamless integration; a Position-Aware Training Paradigm that normalizes input resolution and interpolates tokens to maintain spatial consistency across resolutions; and a Causal Attention mechanism with KV Cache that precomputes and reuses condition features, drastically reducing inference latency. Together, these advances enable high-efficiency, zero-shot multi-condition generalization, robust resolution flexibility, and strong compatibility with custom models—advancing the practical deployment of DiT-based generation systems.

Dataset

  • The dataset is composed of multiple specialized subsets tailored to different control tasks: MultiGen-20M for spatial control (depth, canny, OpenPose), Subject200K for subject control, and a curated subset of LAION-Face combined with a private multi-view human dataset for face control.
  • MultiGen-20M contains 20 million images and serves as the primary source for spatial control tasks. Subject200K provides 200,000 images focused on subject consistency. The LAION-Face subset is filtered for high-quality face images, augmented with a private collection of multi-view human images.
  • All human images in the private multi-view dataset are preprocessed using InsightFace to ensure precise cropping and facial alignment, enhancing input consistency and accuracy (a minimal preprocessing sketch follows this list).
  • The authors use these datasets to train their model by combining them into a training mixture, with specific ratios optimized for each control type. The data is processed to align inputs with corresponding control signals, and cropping strategies are applied uniformly to maintain spatial and semantic coherence across all subsets.
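The paper does not release its preprocessing script, but the cropping-and-alignment step mentioned above can be approximated with InsightFace's standard detection and alignment utilities. The sketch below is our own assumption of such a pipeline; the model pack, crop size, and file paths are illustrative and not taken from the paper.

```python
import cv2
from insightface.app import FaceAnalysis
from insightface.utils import face_align

# Illustrative sketch: detect the largest face, then align-and-crop it.
# The "buffalo_l" pack, 512-pixel crop, and paths are assumptions.
app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

img = cv2.imread("raw/human_view_001.jpg")      # hypothetical multi-view frame
faces = app.get(img)
if faces:
    # pick the largest detected face by bounding-box area
    face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
    # warp to a canonical, landmark-aligned crop
    aligned = face_align.norm_crop(img, landmark=face.kps, image_size=512)
    cv2.imwrite("processed/human_view_001.jpg", aligned)
```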

Method

The authors leverage the FLUX.1 diffusion transformer architecture as the foundation for EasyControl, extending it with a modular framework designed for efficient and flexible condition-guided image generation. The overall framework integrates several key components: a Condition Injection LoRA Module, a Position-Aware Training Paradigm, a Causal Attention mechanism, and a KV Cache for inference. Refer to the framework diagram for a visual overview of the system.

The core of the method is the Condition Injection LoRA Module, which enables the efficient and plug-and-play integration of conditional signals into the pre-trained DiT model. This module operates by introducing a dedicated Condition Branch that processes the input condition independently. The authors apply Low-Rank Adaptation (LoRA) to adaptively enhance the query, key, and value (QKV) features of the Condition Branch, while leaving the text and noise branches unmodified. This targeted adaptation allows the model to inject conditional information without disrupting the pre-trained representations of text and noise, ensuring high-fidelity generation. The LoRA transformation is defined as $\Delta Q_c, \Delta K_c, \Delta V_c = B_Q A_Q Z_c,\; B_K A_K Z_c,\; B_V A_V Z_c$, where $A_i, B_i$ are low-rank matrices, and the updated QKV features are $Q_c' = Q_c + \Delta Q_c$, and analogously for $K_c'$ and $V_c'$. This design ensures that the model can flexibly integrate diverse conditions while maintaining compatibility with customized models.
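As a concrete illustration of this design, the sketch below applies a low-rank update only to a condition-branch projection while the base weights stay frozen. The class name, rank, and dimensions are our own choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class ConditionLoRALinear(nn.Module):
    """Sketch of a frozen projection with a low-rank update, applied only to
    condition-branch tokens: W z_c + B A z_c (the text/noise branches keep the
    original frozen projection)."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # keep pre-trained weights frozen
        self.A = nn.Linear(base.in_features, rank, bias=False)   # A_i: down-projection
        self.B = nn.Linear(rank, base.out_features, bias=False)  # B_i: up-projection
        nn.init.zeros_(self.B.weight)                     # start as an identity update

    def forward(self, z_c: torch.Tensor) -> torch.Tensor:
        # Q_c' = Q_c + ΔQ_c, with ΔQ_c = B_Q A_Q Z_c (same pattern for K and V)
        return self.base(z_c) + self.B(self.A(z_c))

# usage: wrap only the condition branch's Q/K/V projections
hidden = 3072
q_proj = ConditionLoRALinear(nn.Linear(hidden, hidden), rank=16)
z_c = torch.randn(1, 256, hidden)        # condition tokens
q_c = q_proj(z_c)
```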

To manage the flow of information between the different input modalities, the framework employs a Causal Attention mechanism. This unidirectional attention restricts each position in the sequence to attend only to previous positions and itself, enforcing a causal structure. The authors design two specialized causal attention mechanisms to handle different scenarios. For single-condition training, Causal Conditional Attention is used, which blocks attention from the condition branch to the denoising (text and noise) branch, allowing only the reverse flow. This isolation enables decoupled Key-Value (KV) Cache states for each branch during inference, reducing redundant computation. For multi-condition inference, Causal Mutual Attention is employed. This mechanism allows all conditions to interact normally with the denoising tokens but prevents cross-condition interactions by applying a mask that blocks attention between tokens from different condition blocks. This ensures that while multiple conditions are integrated, they do not interfere with each other during generation.
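One way to picture both masking rules is as a boolean attention mask over a token sequence ordered as [denoising (text + noise) tokens, condition block 1, condition block 2, ...]. The helper below is a simplified construction under our reading of the paragraph above: denoising tokens attend to everything, while each condition block attends only to itself, which blocks both condition-to-denoising attention and cross-condition attention. It is not the authors' implementation.

```python
import torch

def causal_condition_mask(n_denoise: int, cond_lens: list[int]) -> torch.Tensor:
    """Build a boolean attention mask (True = attention allowed).

    Token order assumed: [denoising tokens | cond_1 | cond_2 | ...].
    - Denoising tokens may attend to all tokens.
    - Condition tokens may not attend to denoising tokens (single-condition rule).
    - Condition tokens may not attend to other condition blocks (multi-condition rule).
    """
    total = n_denoise + sum(cond_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)

    mask[:n_denoise, :] = True                    # denoising branch sees everything
    start = n_denoise
    for length in cond_lens:
        end = start + length
        mask[start:end, start:end] = True         # each condition only sees itself
        start = end
    return mask

# single condition: one isolated block; multi-condition: mutually isolated blocks
single = causal_condition_mask(n_denoise=1024, cond_lens=[256])
multi = causal_condition_mask(n_denoise=1024, cond_lens=[256, 256])
```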

The Position-Aware Training Paradigm is designed to improve computational efficiency and resolution flexibility. It involves downscaling high-resolution control signals to a fixed target resolution (e.g., $512 \times 512$) before encoding them into latent space. To preserve spatial alignment, especially for spatial conditions like canny maps, the authors introduce Position-Aware Interpolation (PAI). This strategy interpolates position encodings during the resizing process, ensuring that the spatial relationships between patches in the original and resized images are maintained. For subject conditions, a PE Offset Strategy is applied, which adds a fixed displacement to the position encodings in the height dimension to separate them from spatial conditions. The loss function used for training is a flow-matching loss, defined as $\mathcal{L}_{RF} = \mathbb{E}_{t,\, \epsilon \sim N(0, I)} \left\| v_\theta(z, t, c_i) - (\epsilon - x_0) \right\|_2^2$, which guides the model to predict the correct velocity field for denoising.
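The sketch below illustrates the two ideas in this paragraph under our own simplifying assumptions: position ids are treated as a 2D row/column grid that is interpolated back onto the original coordinate range after resizing, and the flow-matching objective is written as a plain MSE. Function names, the patch size, and the offset handling are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def position_aware_ids(orig_hw, target_hw=(512, 512), patch=16, pe_offset=0.0):
    """Assumed sketch of Position-Aware Interpolation (PAI).

    The condition image is resized to a fixed resolution, but its patch position
    ids are interpolated so they still span the *original* coordinate range,
    keeping the condition tokens spatially aligned with the full-resolution latent.
    `pe_offset` mimics the PE Offset Strategy for subject conditions.
    """
    oh, ow = orig_hw[0] // patch, orig_hw[1] // patch        # original patch grid
    th, tw = target_hw[0] // patch, target_hw[1] // patch    # resized patch grid
    row_ids = torch.linspace(0, oh - 1, th) + pe_offset      # interpolated row ids
    col_ids = torch.linspace(0, ow - 1, tw)                  # interpolated column ids
    grid = torch.stack(torch.meshgrid(row_ids, col_ids, indexing="ij"), dim=-1)
    return grid.reshape(-1, 2)          # one (row, col) position per condition token

def flow_matching_loss(v_pred, x0, eps):
    # L_RF = E || v_theta(z, t, c_i) - (eps - x0) ||^2  (mean over elements)
    return F.mse_loss(v_pred, eps - x0)

ids = position_aware_ids((1024, 1536))  # e.g. a high-res canny map resized to 512x512
```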

Finally, the framework achieves efficient inference by leveraging the KV Cache technique. The unique design of the Causal Attention mechanism, which isolates the condition branch from the timestep-dependent denoising branch, allows the Key-Value pairs of all conditional features to be precomputed and stored once at the initial timestep. These cached pairs are then reused across all subsequent denoising steps, eliminating the need for $N$-fold recomputation and significantly reducing inference latency.
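A minimal way to realize this caching behavior, assuming the causal mask keeps condition keys and values independent of the denoising timestep, is sketched below. The class, loop, and shapes are illustrative rather than the released implementation.

```python
import torch

class ConditionKVCache:
    """Sketch of the KV-cache idea for condition tokens: because condition tokens
    never attend to the timestep-dependent denoising tokens, their keys/values are
    identical at every step and only need to be computed once."""

    def __init__(self):
        self.kv = None

    def get(self, cond_tokens, k_proj, v_proj):
        if self.kv is None:                         # first denoising step: compute and store
            self.kv = (k_proj(cond_tokens), v_proj(cond_tokens))
        return self.kv                              # later steps: reuse without recomputation

# usage inside a denoising loop (dimensions are illustrative)
hidden = 3072
k_proj = torch.nn.Linear(hidden, hidden)
v_proj = torch.nn.Linear(hidden, hidden)
cache = ConditionKVCache()
cond = torch.randn(1, 256, hidden)                  # condition tokens
for t in range(30):                                 # 30 denoising steps
    k_c, v_c = cache.get(cond, k_proj, v_proj)      # condition KV computed only at t == 0
```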

Experiment

  • Single-condition generation: Validates robust text consistency and controllability across Canny, Depth, and Subject conditions. On COCO 2017 and Concept-101 benchmarks, achieved state-of-the-art performance in controllability, text consistency (CLIP-Score), and generation quality (FID, MAN-IQA), outperforming ControlNet, OminiControl, and Uni-ControlNet.
  • Multi-condition integration: Demonstrates superior identity preservation and controllability under face + OpenPose conditions. On a custom dataset, achieved the best Face Similarity, lowest MJPE (controllability), lowest FID, highest MAN-IQA, and highest CLIP-Score, surpassing ControlNet+IP-Adapter, ControlNet+Redux, Uni-ControlNet, and ID customization methods.
  • Resolution adaptability: Maintains strong controllability and image quality across resolutions from low to high (up to 2560×3520), outperforming ControlNet and OminiControl, which exhibit distortion and degradation at extreme resolutions.
  • Efficiency: On a single A100 GPU, achieved 16.3 seconds inference time for single-condition generation (58% faster than ablated version) and 18.3 seconds for dual-condition tasks (75% faster than ablated version), with only 15M parameters (vs. 3B for ControlNet), demonstrating high efficiency and compactness.

Results show that the proposed method achieves competitive identity preservation with CLIP-I and DINO-I scores, while outperforming baseline methods in generative quality as measured by FID and MAN-IQA, and achieving the highest text consistency with a CLIP-Score of 0.283. The authors use this table to demonstrate superior performance in subject control tasks compared to IP-Adapter, OminiControl, and Uni-ControlNet.

Results show that the proposed method achieves the fastest inference time of 16.3 seconds in single-condition settings and 18.3 seconds in double-condition settings, with a parameter count of 15M and 30M respectively, outperforming baseline methods in efficiency while maintaining a significantly smaller model size. The full model demonstrates a 58% reduction in inference time compared to the ablated version without PATP and KV Cache in single-condition tasks, and a 75% reduction in double-condition tasks, highlighting the effectiveness of these mechanisms in improving inference speed without compromising model compactness.

Results show that the proposed method achieves the best performance across all metrics in multi-condition generation with OpenPose and face inputs. It attains the highest face similarity, the lowest mean joint position error, the best generative quality, and the strongest text consistency compared to baseline methods.

Results show that the proposed method achieves the highest controllability and text consistency under Canny conditions, with an F1 score of 0.311 and a CLIP-Score of 0.286, while also achieving the best generative quality with a MAN-IQA score of 0.503. Under depth conditions, the method demonstrates superior controllability with a score of 1092 and maintains strong text consistency with a CLIP-Score of 0.289, while achieving competitive generative quality with a MAN-IQA score of 0.469.

