
D-OPSD: On-Policy Self-Distillation for Continual Fine-Tuning of Step-Distilled Diffusion Models

Abstract

High-performance image generation models are currently undergoing a major shift, moving from less efficient multi-step models to their more efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models pose significant challenges for direct, continual supervised fine-tuning: applying common fine-tuning techniques can damage their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm designed specifically for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first show that modern diffusion models, which use large language models (LLMs) or vision-language models (VLMs) as encoders, can inherit the in-context capabilities of those encoders. This allows us to turn training into an on-policy self-distillation process. Specifically, during training the model plays the roles of teacher and student simultaneously, but under different contexts: the student is conditioned only on text features, while the teacher is conditioned on multimodal features combining the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student's own roll-outs. By optimizing on its own trajectory under its own supervision, D-OPSD enables the model to learn new concepts and styles without compromising its original few-step capability.

One-sentence Summary

The authors propose D-OPSD, a training paradigm for step-distilled diffusion models that enables continual supervised fine-tuning via on-policy self-distillation: it minimizes the discrepancy between predicted distributions over the student's own roll-outs, conditioning the student on text features alone and the teacher on multimodal features of both the text prompt and the target image, thereby preserving the model's inherent few-step inference capability.

Key Contributions

  • The paper introduces D-OPSD, a training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning.
  • The framework utilizes the emergent in-context capabilities of modern encoders to facilitate self-distillation, where the model acts as both teacher and student under different multimodal and text-only conditions.
  • Experiments across LoRA adaptation and full fine-tuning demonstrate that the method effectively learns new concepts and styles while preserving the original few-step generation ability.

Introduction

Text-to-image diffusion models have advanced significantly, yet their iterative sampling processes incur high computational costs that step-distillation techniques aim to mitigate. However, continually fine-tuning these efficient models poses significant challenges because conventional supervised fine-tuning disrupts learned few-step dynamics through a train-test mismatch. Online reinforcement learning offers a solution but demands reward functions that are often impractical for developers. The authors address these issues with D-OPSD, a novel on-policy self-distillation framework that leverages the emergent in-context capabilities of modern LLM-based encoders. By enabling the model to act as both a student and a teacher during training, this method allows for supervised adaptation on the model's own rollouts without external rewards. Consequently, new concepts are learned while preserving the original few-step inference capability.

Method

The authors propose D-OPSD, a training paradigm designed to enable on-policy learning for step-distilled diffusion models. This approach addresses the challenge where standard supervised fine-tuning compromises the inherent few-step inference capability of modern efficient models. The core idea leverages on-policy self-distillation, where the model acts as both a teacher and a student under different contextual conditions.

Refer to the framework diagram for a visual overview of the method.

The process begins with an encoder that processes the input text prompt and the target image. For each training pair, the system constructs two distinct conditioning vectors. The student condition $c_s$ is derived solely from the text prompt, ensuring the student branch follows the original text-to-image generation pathway. In contrast, the teacher condition $c_t$ incorporates multimodal features from both the text prompt and the target image. This multimodal context allows the teacher to provide stronger supervision regarding the target concept or style without disrupting the student's sampling trajectory.
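The two conditions can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature tensors and the shared hidden size are hypothetical stand-ins for the encoder's token features, and the teacher condition is formed by simple sequence concatenation.

```python
import torch

def build_conditions(text_feats: torch.Tensor, image_feats: torch.Tensor):
    """Build the two conditioning sequences used in the D-OPSD setup.

    text_feats:  (L_txt, d) token features of the prompt (illustrative shape).
    image_feats: (L_img, d) token features of the target image from the same
                 multimodal encoder (illustrative shape).
    """
    c_s = text_feats                                   # student: text only
    c_t = torch.cat([text_feats, image_feats], dim=0)  # teacher: text + image
    return c_s, c_t

# Toy features with a shared hidden size d = 8.
txt = torch.randn(4, 8)
img = torch.randn(6, 8)
c_s, c_t = build_conditions(txt, img)
print(c_s.shape, c_t.shape)  # torch.Size([4, 8]) torch.Size([10, 8])
```

The key point is that $c_s$ keeps the original text-to-image pathway untouched, while $c_t$ simply extends the same token sequence with image features, relying on the encoder's in-context ability to exploit them.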

During training, the student model generates an on-policy trajectory by sampling from Gaussian noise using a few-step solver. Let $x_{t_k}^s$ denote the latent state at step $k$. The student predicts the velocity field $u_k^s = v_{\theta}(x_{t_k}^s, t_k, c_s)$. Simultaneously, the teacher model, parameterized by an exponential moving average (EMA) of the student weights, predicts the velocity $u_k^t = v_{\bar{\theta}}(x_{t_k}^s, t_k, c_t)$ on the exact same states generated by the student. This setup ensures that the supervision is computed on the model's own trajectory rather than an external offline distribution.
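The rollout above can be sketched as a few-step Euler loop. Everything here is an assumption for illustration: `TinyVelocityNet` is a toy stand-in for the backbone $v_\theta$, the teacher is a frozen copy rather than a true EMA, and the Euler solver and tensor shapes are chosen for brevity.

```python
import copy
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Toy stand-in for the velocity backbone v_theta(x, t, c)."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.net = nn.Linear(dim + 1 + dim, dim)  # input: [x, t, pooled cond]

    def forward(self, x, t, cond):
        c = cond.mean(dim=0, keepdim=True).expand(x.shape[0], -1)  # pool tokens
        tt = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, tt, c], dim=-1))

def rollout(student, teacher, c_s, c_t, steps: int = 4, dim: int = 8):
    """Few-step Euler rollout on the student's own trajectory.

    The teacher is evaluated on the exact states x_{t_k}^s the student visits,
    which is what makes the supervision on-policy. (In training, only the
    student pass needs gradients; the teacher output is stop-gradiented later.)
    """
    x = torch.randn(1, dim)                  # start from Gaussian noise
    records = []
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for k in range(steps):
        t = ts[k].view(1, 1)
        u_s = student(x, t, c_s)             # student velocity, text-only cond
        u_t = teacher(x, t, c_t)             # teacher velocity, multimodal cond
        records.append((x, u_s, u_t))
        x = x + (ts[k + 1] - ts[k]) * u_s    # Euler step along student velocity
    return records

student = TinyVelocityNet()
teacher = copy.deepcopy(student)             # EMA teacher, here just a copy
recs = rollout(student, teacher, torch.randn(4, 8), torch.randn(10, 8))
print(len(recs))  # 4
```

Note that the trajectory is advanced with the student's velocity only; the teacher merely annotates each visited state with its own prediction.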

The optimization objective minimizes the mean squared error between the student's velocity predictions and the teacher's predictions on these shared states. The loss function is formulated as:

$$\mathcal{L}_{\text{D-OPSD}} = \mathbb{E}_{(x_0, y)} \left[ \frac{1}{K} \sum_{k=1}^{K} \left\| u_k^s - \text{sg}(u_k^t) \right\|_2^2 \right]$$

where $\text{sg}(\cdot)$ denotes the stop-gradient operation. By aligning the student's conditional generation dynamics with the teacher's stronger multimodal guidance, the model learns new concepts or styles while preserving the original few-step sampling behavior. After training, the teacher branch is discarded, and inference proceeds using the standard few-step pipeline conditioned only on text.
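The objective above maps directly to a few lines of code. This is a sketch under assumed tensor shapes: velocities are stacked as `(K, B, d)`, and `detach()` plays the role of $\text{sg}(\cdot)$ so gradients flow only through the student branch.

```python
import torch

def d_opsd_loss(u_s: torch.Tensor, u_t: torch.Tensor) -> torch.Tensor:
    """Sketch of the D-OPSD objective on one rollout.

    u_s, u_t: (K, B, d) student/teacher velocities evaluated on the same
    student-generated states x_{t_k}^s. detach() implements sg(.), so the
    teacher prediction is treated as a fixed target.
    """
    per_step = ((u_s - u_t.detach()) ** 2).sum(dim=-1)  # squared L2 per step
    return per_step.mean()                              # average over K and batch

u_s = torch.randn(4, 2, 8, requires_grad=True)
u_t = torch.randn(4, 2, 8)
loss = d_opsd_loss(u_s, u_t)
loss.backward()  # gradients reach u_s only; u_t is stop-gradiented
```

Because the teacher is stop-gradiented, the update pushes only the text-conditioned student toward the multimodally guided prediction, which is what lets new concepts transfer without altering the sampling procedure.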

Experiment

The evaluation utilizes Z-Image-Turbo and FLUX.2-klein models to compare the proposed method against baselines like Vanilla SFT and Dreambooth across small-scale LoRA and large-scale full finetuning scenarios. Experimental results demonstrate that while standard training approaches often compromise few-step generation quality or suffer from overfitting, the proposed method effectively learns new concepts and adapts to new domains without catastrophic forgetting. Furthermore, ablation studies validate that on-policy self-distillation is essential for maintaining high generation quality and achieving faster convergence compared to off-policy variants.

The authors compare the proposed D-OPSD method against baselines like Vanilla SFT and PSO across Z-Image-Turbo and FLUX.2-klein architectures. Results indicate that D-OPSD achieves superior alignment with target images while preserving the model's original few-step generation quality and general knowledge retention. In contrast, baseline methods suffer from significant degradation in image quality and capability retention. D-OPSD achieves lower error rates in image similarity metrics compared to Vanilla SFT and PSO. The proposed method preserves few-step sampling capacity with quality scores close to or exceeding the base model. Baseline methods exhibit significant drops in general knowledge benchmarks, whereas D-OPSD retains these capabilities.

The authors compare their proposed D-OPSD method against several baselines including Vanilla SFT, Dreambooth, and PSO using Z-Image-Turbo and FLUX.2-klein models. The results demonstrate that D-OPSD effectively balances learning new concepts with maintaining high image quality and aesthetic standards, outperforming methods that degrade generation capabilities. D-OPSD achieves the highest Quality-S and Aesthetic-S scores in both model configurations, indicating it preserves few-step sampling capacity better than SFT and Dreambooth. The proposed method attains the highest CLIP-S scores and ties for the top VLM-J score, demonstrating strong generalization and subject consistency. While PSO achieves low distance metrics suggesting strong target alignment, D-OPSD avoids overfitting by maintaining significantly higher generalization and quality scores.

The table contrasts the proposed D-OPSD method with standard SFT, offline RL, and online RL baselines based on their supervision signals and training properties. It demonstrates that D-OPSD uniquely combines on-policy training with self-distilled velocity to ensure the training distribution matches the inference distribution without needing an external reward model. D-OPSD is the only method listed that achieves a match between training and inference conditions. The approach utilizes on-policy sampling with self-distilled velocity rather than ground truth or external rewards. Unlike online RL methods, D-OPSD does not require a separate reward model for supervision.

The authors evaluate D-OPSD against various baselines including Vanilla SFT and PSO across Z-Image-Turbo and FLUX.2-klein architectures to assess alignment and capability retention. Results indicate that the proposed method effectively balances learning new concepts with maintaining high image quality and general knowledge, whereas baseline methods suffer from significant degradation or overfitting. Additionally, the method uniquely aligns training and inference distributions through on-policy sampling without requiring external reward models.
