
Enhancing Diversity via On-the-Fly Repulsion in the Contextual Space of Diffusion Transformers

Omer Dahary Benaya Koren Daniel Garibi Daniel Cohen-Or

Abstract

Modern text-to-image (T2I) diffusion models achieve remarkable semantic alignment, yet for any given prompt they converge on a narrow range of visual solutions, leaving their outputs strikingly short on diversity. This typicality bias is a major obstacle for creative applications that demand varied generations. We identify a fundamental trade-off in current diversity-enhancing approaches: modifying the model's inputs to incorporate feedback from the generative path requires costly optimization, whereas acting on spatially rigid intermediate latents tends to disrupt the visual structure as it forms and introduces artifacts. In this work, we propose applying repulsion in the Contextual Space as a new framework for achieving rich diversity in Diffusion Transformers. We intervene in the multimodal attention channels, applying repulsion on the fly during the transformer's forward pass, between blocks where the text conditioning is enriched with emergent image structure. This makes it possible to redirect the guidance trajectory after structural information is available but before the composition is fixed. Experiments demonstrate that repulsion in the Contextual Space yields far richer diversity without sacrificing visual fidelity or semantic adherence. Moreover, the method is highly efficient, incurring negligible computational overhead, and it remains effective on recent "Turbo" and distilled models, where conventional trajectory-based interventions typically fail.

One-sentence Summary

Researchers from Tel Aviv University and Snap Research propose Contextual Space repulsion, a framework that injects on-the-fly diversity into Diffusion Transformers by intervening in multimodal attention channels. This technique overcomes typicality bias in models like Flux-dev by steering generative intent before visual commitment, delivering rich variety with minimal computational overhead.

Key Contributions

  • The paper introduces a Contextual Space repulsion framework that applies on-the-fly intervention within the multimodal attention channels of Diffusion Transformers to steer generative intent after structural information emerges but before composition is fixed.
  • This method injects repulsion between transformer blocks where text conditioning is enriched with emergent image structure, allowing the model to explore diverse paths while preserving samples within the learned data manifold to avoid visual artifacts.
  • Experiments on the COCO benchmark across multiple DiT architectures demonstrate that the approach produces significantly richer diversity without sacrificing visual fidelity or semantic adherence, even in efficient "Turbo" and distilled models where traditional interventions fail.

Introduction

Modern Text-to-Image diffusion models excel at semantic alignment but often suffer from typicality bias, converging on a narrow set of visual solutions that limits their utility for creative applications. Prior attempts to restore diversity face a critical trade-off: upstream methods that modify inputs require costly optimization to incorporate structural feedback, while downstream interventions on image latents often disrupt the formed visual structure and introduce artifacts. The authors leverage the Contextual Space within Diffusion Transformers to apply on-the-fly repulsion during the forward pass, intervening in multimodal attention channels where text conditioning is enriched with emergent image structure. This approach redirects the guidance trajectory after the model is structurally informed but before the composition is fixed, achieving rich diversity with minimal computational overhead while preserving visual fidelity and semantic adherence.

Method

The authors leverage the inherent structure of Multimodal Diffusion Transformers (DiTs) to introduce a novel intervention strategy for generative diversity. Unlike U-Net architectures that rely on static text embeddings, DiTs facilitate a bidirectional exchange between text features $f_T$ and image features $f_I$ within Multimodal Attention (MM-Attention) blocks. As shown in the framework diagram, the standard processing flow involves dual-stream blocks where text prompts $p^{(i)}$ and noisy latents $z_t^{(i)}$ are processed to generate the next state $z_{t-1}^{(i)}$.
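For intuition, the bidirectional exchange can be sketched in a few lines of NumPy. This is a deliberately simplified, hypothetical single-head block: shared projection matrices `W_q`, `W_k`, `W_v` stand in for the per-stream projections of real dual-stream DiT blocks, and normalization and MLP sublayers are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mm_attention(f_T, f_I, W_q, W_k, W_v):
    """Joint attention over the concatenated text and image token
    sequences, so each stream can attend to the other."""
    f = np.concatenate([f_T, f_I], axis=0)          # (N_T + N_I, d)
    q, k, v = f @ W_q, f @ W_k, f @ W_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # joint attention map
    out = attn @ v
    n_T = f_T.shape[0]
    return out[:n_T], out[n_T:]                     # enriched text / image tokens
```

The enriched text tokens returned by such a block are what the paper refers to as the Contextual Space.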

The core difficulty with existing methods lies in the timing and location of the repulsion. Upstream methods act on uninformed noise, while downstream methods act on a rigid latent manifold. The authors identify the Contextual Space, formed by the enriched text tokens $\hat{f}_T^{(l)}$ after MM-Attention, as an effective environment for diversity interventions because it is structurally informed yet conceptually flexible.

To achieve this, the authors adopt a particle guidance framework that treats a batch of samples as interacting particles. However, unlike prior work that applies guidance to the image latents $z_t$, as illustrated in the figure below where repulsion is applied to the output latent, the proposed method applies repulsive forces directly to the Contextual Space tokens $\hat{f}_T$.

By enforcing distance between batch samples in this space, the model's high-level planning is steered before it commits to a specific visual mode. As shown in the figure below, the intervention is applied within the transformer blocks, indicated by the red arrows on the contextual stream, allowing for the manipulation of generative intent without requiring backpropagation through the model layers.

The updated state of the contextual tokens for a sample $i$ after each iteration is given by:

$$
\hat{f}_{T,i}^{(l)\,\prime} = \hat{f}_{T,i}^{(l)} + \frac{\eta}{M}\, \nabla_{\hat{f}_{T,i}^{(l)}} \mathcal{L}_{\mathrm{div}}\!\left(\{\hat{f}_{T,j}^{(l)}\}_{j=1}^{B}\right)
$$

where $\eta$ is the overall repulsion scale and $\mathcal{L}_{\mathrm{div}}$ is a diversity loss defined over the batch. To maintain diversity throughout the trajectory, this repulsion is applied across all transformer MM-blocks, but restricted to the first few timesteps, where guidance signals are strongest.
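A minimal runnable sketch of this update step, assuming a simple pairwise squared-distance objective as a stand-in for the paper's diversity loss; its gradient has a closed form, which illustrates why no backpropagation through the model layers is needed. `eta` and `M` mirror the scale factors in the equation above.

```python
import numpy as np

def repulsion_step(F, eta=0.05, M=1):
    """One on-the-fly repulsion update on contextual tokens F of shape
    (B, N, d). Stand-in objective: L_div = sum_{i<j} ||f_i - f_j||^2
    over flattened samples; its gradient w.r.t. f_i is
    2 * sum_j (f_i - f_j) = 2 * (B * f_i - sum_j f_j), evaluated in
    closed form rather than by backpropagation."""
    B = F.shape[0]
    flat = F.reshape(B, -1)                        # (B, D) one vector per sample
    grad = 2.0 * (B * flat - flat.sum(axis=0))     # (B, D) ascent direction
    return (flat + (eta / M) * grad).reshape(F.shape)
```

Gradient ascent on this objective pushes the batch samples apart in the Contextual Space; in the paper, the objective is the Vendi Score rather than raw pairwise distance.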

For the diversity objective, the authors utilize the Vendi Score, which provides a principled way to measure the effective number of distinct samples in a batch. This is computed by analyzing the eigenvalues of a similarity matrix constructed from flattened contextual vectors. The Contextual Space encodes global semantic intent shared across the batch, making diversity objectives based on batch-level similarity more appropriate. As shown in the figure below, this approach allows for diverse interpolations and extrapolations while maintaining semantic alignment in the Contextual Space, preventing the semantic collapse typically induced by standard guidance.
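The Vendi Score itself can be sketched as follows. This version assumes a cosine-similarity kernel over the flattened contextual vectors (the specific kernel is an assumption here); the score is the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix.

```python
import numpy as np

def vendi_score(X):
    """Effective number of distinct samples in a batch.
    X: (B, D) matrix of flattened contextual vectors, one row per sample."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    K = Xn @ Xn.T                                      # (B, B) similarity matrix
    lam = np.linalg.eigvalsh(K / K.shape[0])           # eigenvalues sum to 1
    lam = lam[lam > 1e-12]                             # drop numerical zeros
    return float(np.exp(-(lam * np.log(lam)).sum()))   # exp(Shannon entropy)
```

A batch of identical samples scores 1, while a batch of B mutually orthogonal samples scores B, which is why maximizing it as $\mathcal{L}_{\mathrm{div}}$ drives the batch toward distinct modes.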

Experiment

  • Interpolation and extrapolation experiments in the Contextual Space versus VAE Latent Space demonstrate that the Contextual Space enables smooth semantic transitions and maintains high visual fidelity, whereas the Latent Space suffers from structural blurring and artifacts due to spatial misalignment.
  • Qualitative evaluations across Flux-dev, SD3.5-Turbo, and SD3.5-Large architectures show that the proposed method generates diverse compositions and styles without the visual artifacts common in downstream latent interventions or the semantic drift seen in some upstream baselines.
  • Quantitative analysis reveals a superior trade-off between semantic diversity and image quality, with the method achieving higher human preference and prompt alignment scores while incurring significantly lower computational overhead than optimization-based approaches.
  • Ablation studies confirm that intervening in the Contextual Space is more effective than in image token spaces, as it allows for varied global compositions without the spatial rigidity that leads to local texture artifacts.
  • Integration tests on image editing models validate that the approach generalizes beyond text-to-image generation, producing diverse yet coherent edits while preserving the original image integrity.
