Diffusers Integrates FLUX.2: A New Era in Open Image Generation with Advanced Features and Efficient Inference
Diffusers has officially welcomed FLUX.2, the latest generation of open image generation models from Black Forest Labs. Building on the foundation of the earlier FLUX.1 series, FLUX.2 introduces a completely new architecture trained from scratch, marking a significant leap forward in image synthesis. FLUX.2 supports both text-guided and image-guided generation, and uniquely allows up to ten reference images in a single request. This enables more complex, context-rich image creation: users can reference specific visual elements by name or index, combining natural language with multiple visual inputs for precise control.

A key change in FLUX.2 is the use of a single text encoder, Mistral Small 3.1, replacing the dual-encoder setup in FLUX.1. This simplifies prompt embedding computation and streamlines the pipeline, with a maximum sequence length of 512 tokens.

The model also features a redesigned DiT (Diffusion Transformer) based on a multimodal diffusion transformer (MM-DiT) with parallel blocks, now with 8 double-stream and 48 single-stream blocks, compared with 19 and 38 in FLUX.1. This shift places a much higher proportion of parameters in the single-stream blocks, enhancing the model's ability to process combined image and text information. Other notable technical updates: time and guidance modulation are shared across all transformer blocks, bias parameters are removed from all layers, and attention QKV projections are fused with the feedforward input projection, yielding a fully parallel block structure. The feedforward uses a SwiGLU activation in place of GELU, again without bias, improving efficiency and performance.

Inference with FLUX.2 is demanding, requiring over 80GB of VRAM when run without optimization. However, Diffusers provides multiple strategies to make it accessible on consumer hardware.
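As a concrete illustration of the SwiGLU feedforward described above, here is a minimal pure-Python sketch. The toy dimensions and hand-rolled matrix-vector products are illustrative only, not the model's actual implementation; the point is the structure: a gated branch passed through SiLU, an ungated "up" branch, an elementwise product, and a down projection, all without bias terms.

```python
import math

def silu(v):
    # SiLU (a.k.a. swish): v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def matvec(w, x):
    # w: list of rows; computes w @ x with no bias term
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feedforward: down( silu(gate(x)) * up(x) )
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# toy 2 -> 3 -> 2 example with hand-picked weights
x = [1.0, -1.0]
w_gate = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.25]]
w_up   = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

The gating is what distinguishes SwiGLU from a plain GELU feedforward: one projection decides how much of the other projection's signal passes through.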
These strategies include CPU offloading, 4-bit quantization with bitsandbytes, and remote text encoding. With 4-bit quantization, the model can run on a 24GB GPU, while combining a local DiT with a remote text encoder allows inference on systems with as little as 18GB of VRAM. For even lower memory usage, group offloading enables running the model on GPUs with as little as 8GB of VRAM, though it requires 32GB of system RAM; enabling low_cpu_mem_usage can further reduce the RAM requirement to 10GB, putting the model within reach of a much wider range of users.

FLUX.2 also supports LoRA fine-tuning, a powerful method for customizing the model to specific styles or subjects. Thanks to memory-saving techniques like gradient checkpointing, remote text encoding, and quantization, training is feasible on lower-end hardware. The model can be fine-tuned with either FP8 training or QLoRA with 4-bit quantization, and example scripts are available for both text-to-image and image-to-image training. Custom training has already shown strong results, such as generating unique tarot card designs, with fine-tuned versions producing more consistent and stylized outputs than the base model.

With its advanced architecture, flexible input options, and support for multiple optimization techniques, FLUX.2 represents a major step forward in open, accessible, and powerful generative AI. The integration with Diffusers ensures that developers and creators can experiment with the model using familiar tools and workflows.
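To make the 4-bit path described above concrete, here is a hedged loading sketch using the bitsandbytes NF4 backend together with model CPU offloading. The class names Flux2Pipeline and Flux2Transformer2DModel and the checkpoint id black-forest-labs/FLUX.2-dev are assumptions; check the Diffusers documentation and the model card for the exact identifiers before running this.

```python
import torch
from diffusers import BitsAndBytesConfig, Flux2Pipeline, Flux2Transformer2DModel

MODEL_ID = "black-forest-labs/FLUX.2-dev"  # assumed checkpoint id

# NF4 4-bit quantization for the DiT weights (bitsandbytes backend)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize only the transformer, the dominant memory cost
transformer = Flux2Transformer2DModel.from_pretrained(
    MODEL_ID,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = Flux2Pipeline.from_pretrained(
    MODEL_ID,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep idle components in system RAM

image = pipe("a lighthouse at dusk, 35mm film look").images[0]
image.save("flux2_nf4.png")
```

This sketch requires the model weights and a GPU in the roughly-24GB class mentioned above; it is a configuration outline, not a tested recipe.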
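The group offloading idea discussed above, streaming groups of transformer blocks onto the accelerator so that only one group's weights are resident at a time, can be sketched with a toy scheduler. This is pure Python with no real tensors or transfers; the names and the budget check are illustrative only.

```python
class Block:
    def __init__(self, name):
        self.name = name
        self.device = "cpu"   # all weights start in system RAM

    def to(self, device):
        self.device = device  # stand-in for an actual weight transfer

def run_with_group_offloading(blocks, group_size):
    """Run blocks in order, keeping at most one group on the 'gpu'."""
    peak = 0
    order = []
    for start in range(0, len(blocks), group_size):
        group = blocks[start:start + group_size]
        for b in group:
            b.to("gpu")           # prefetch this group's weights
        resident = sum(bl.device == "gpu" for bl in blocks)
        peak = max(peak, resident)
        for b in group:
            order.append(b.name)  # "execute" the block
            # (a real forward pass would happen here)
        for b in group:
            b.to("cpu")           # evict before the next group loads
    return order, peak

blocks = [Block(f"single_stream_{i}") for i in range(8)]
order, peak = run_with_group_offloading(blocks, group_size=2)
```

Because only group_size blocks are ever resident at once, peak VRAM scales with the group size rather than the full model, which is why the DiT can run on an 8GB GPU while system RAM holds the rest of the weights.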
