FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Quanjian Song Yefeng Shen Mengting Chen Hao Sun Jinsong Lan Xiaoyong Zhu Bo Zheng Liujuan Cao

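Abstract

Human-centric video customization, particularly at the garment level, holds significant commercial value. However, existing approaches do not support the low-latency, interactive garment control that applications such as e-commerce and content creation require. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence, using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, which lets users switch garments interactively during generation. FashionChameleon comprises three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference image and the garment image, the model is encouraged to implicitly preserve coherence under single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency through gradient-reweighted distribution matching distillation. (iii) To extend the model to interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which comprises garment KV refresh, historical KV withdraw, and reference KV disentangle, enabling garment switching while preserving motion coherence. FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, and achieves real-time generation at 23.8 frames per second (FPS) on a single GPU, 30 to 180 times faster than existing baselines.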

One-sentence Summary

FashionChameleon is a real-time, interactive framework for autoregressive human-garment video customization that enables dynamic garment switching during generation by leveraging in-context learning on single-reference data, enforcing a mismatched reference-image training paradigm to preserve motion coherence, and applying streaming distillation with in-context teacher forcing to ensure low-latency performance for e-commerce and content creation.

Key Contributions

  • Introduces FashionChameleon, a real-time interactive framework for human-garment customization in autoregressive video generation that enables dynamic garment switching during synthesis while preserving motion coherence.
  • Achieves multi-garment control using only single-garment reference pairs by training a teacher model with in-context learning and enforcing a deliberate mismatch between reference and target garment images to implicitly maintain temporal consistency.
  • Integrates streaming distillation with in-context teacher forcing and gradient-reweighted distribution matching distillation to reduce inference latency and improve extrapolation consistency across generated video sequences.

Introduction

The authors build upon recent advances in diffusion-based video generation, where subject-to-video customization enables users to inject reference concepts into generated content. Garment-level control is particularly valuable for e-commerce and filmmaking, yet existing methods suffer from high inference latency, limited interactivity, and difficulty maintaining motion consistency while dynamically switching clothing. To address these gaps, the authors introduce FashionChameleon, a real-time interactive framework that adapts hybrid autoregressive generation for streaming human-garment customization. The authors leverage a teacher network with in-context learning to generalize from single-garment data, employ streaming distillation to balance efficiency and long-video consistency, and utilize a training-free KV cache rescheduling mechanism to enable seamless, dynamic garment transitions during generation.

Dataset

Dataset Composition and Sources: The authors curate a primary training dataset and an evaluation benchmark called HGC-Bench. Both are built around triplets consisting of a reference image, a garment image, and a video sequence paired with structured prompts. Raw videos are collected from the internet, garment images are drawn from a dedicated database, and reference images are algorithmically constructed to improve training robustness.

Subset Details and Filtering Rules: The training pipeline initially yields approximately 82K triplets, which are reduced to 62K after manual verification. Videos undergo a four-stage coarse-to-fine filtering process: scene segmentation into 3 to 5 second clips, single-person retention using YOLOv8-Seg, motion filtering via optical flow thresholds, and quality assessment with Q-Align and FAST-VQA-M. The HGC-Bench subset contains 240 high-aesthetic samples where faces are anonymized through swapping, paired with database garments, and accompanied by prompts generated under strict movement and formatting guidelines.
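The four-stage video filter can be pictured as a chain of predicates applied in order. The sketch below is a hypothetical illustration only: the `Clip` fields, the threshold values, and the function names are assumptions, and the actual models (YOLOv8-Seg, optical flow, Q-Align, FAST-VQA-M) are represented by precomputed numbers rather than real calls.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """Precomputed statistics for one candidate clip (all fields hypothetical)."""
    duration_s: float   # clip length after scene segmentation
    num_people: int     # person count, e.g. from a segmentation model
    mean_flow: float    # average optical-flow magnitude
    quality: float      # combined quality score, assumed to lie in [0, 1]


# Illustrative thresholds only; the source does not report the actual values.
MIN_FLOW, MAX_FLOW = 0.5, 20.0
MIN_QUALITY = 0.6


def keep_clip(clip: Clip) -> bool:
    """Apply the four coarse-to-fine checks in order."""
    if not 3.0 <= clip.duration_s <= 5.0:            # stage 1: 3 to 5 second clips
        return False
    if clip.num_people != 1:                         # stage 2: single-person retention
        return False
    if not MIN_FLOW <= clip.mean_flow <= MAX_FLOW:   # stage 3: motion filtering
        return False
    return clip.quality >= MIN_QUALITY               # stage 4: quality assessment


clips = [
    Clip(4.0, 1, 3.2, 0.8),   # passes every stage
    Clip(7.5, 1, 3.2, 0.9),   # too long
    Clip(4.0, 2, 3.2, 0.9),   # two people
    Clip(4.0, 1, 0.1, 0.9),   # almost static
    Clip(4.0, 1, 3.2, 0.3),   # low quality
]
kept = [c for c in clips if keep_clip(c)]
print(f"kept {len(kept)} of {len(clips)} clips")
```

Ordering the checks from cheap to expensive means most clips are rejected before the costlier quality models would need to run, which is one plausible reason for a coarse-to-fine design.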

Model Usage and Training Configuration: For both pre-training and streaming distillation post-training, the authors sample 81-frame sequences and resize videos and reference images to 1280 by 704 pixels while preserving aspect ratios. Garment images are center-padded to match this resolution. During pre-training, the authors mix dynamic-only captions and full static-dynamic captions at a 70:30 ratio to reduce reliance on text. Post-training switches to full captions for improved performance. The pipeline uses Fully Sharded Data Parallelism with a global batch size of 64 and the AdamW optimizer. The VAE is kept in float32 during pre-training, and both components switch to bfloat16 during post-training.
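Two of these preprocessing choices are easy to illustrate: center-padding a garment image to the 1280 by 704 canvas, and sampling captions at the 70:30 ratio. The following is a minimal sketch under stated assumptions: scaling the garment to fit before padding is a guess (the source only says it is center-padded), and the function names and caption dictionary keys are hypothetical.

```python
import random

TARGET_W, TARGET_H = 1280, 704
DYNAMIC_ONLY_PROB = 0.7  # 70:30 mixture used during pre-training


def fit_and_center_pad(w: int, h: int) -> dict:
    """Scale (w, h) to fit inside the target canvas, then pad symmetrically.

    Scaling before padding is an assumption, not a detail from the source.
    """
    scale = min(TARGET_W / w, TARGET_H / h)
    new_w = min(TARGET_W, round(w * scale))
    new_h = min(TARGET_H, round(h * scale))
    pad_x, pad_y = TARGET_W - new_w, TARGET_H - new_h
    return {
        "size": (new_w, new_h),
        "left": pad_x // 2, "right": pad_x - pad_x // 2,
        "top": pad_y // 2, "bottom": pad_y - pad_y // 2,
    }


def pick_caption(sample: dict, rng: random.Random) -> str:
    """Return the dynamic-only caption 70% of the time, else the full caption."""
    if rng.random() < DYNAMIC_ONLY_PROB:
        return sample["dynamic"]
    return sample["static"] + " " + sample["dynamic"]


layout = fit_and_center_pad(768, 1024)
print(layout)

caption_sample = {"static": "A woman in a studio.", "dynamic": "She turns slowly."}
print(pick_caption(caption_sample, random.Random(0)))
```

For post-training, the same sampler would simply always return the full caption.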

Metadata Construction and Processing Details: The authors implement a static-dynamic decoupling strategy using Gemini-3.1 to generate bilingual Chinese and English captions formatted in JSON. Garment extraction relies on Qwen-Image-Edit followed by a three-stage VLM validation check for semantic, textural, and contextual consistency. Reference images are dynamically constructed by classifying garment types, retrieving compatible items from the database, and applying image try-on models, with VLM verification ensuring non-edited regions remain unchanged. During interactive inference, garment-related terms are explicitly excluded from prompts to prevent conflicts with the visual input.

Method

The authors leverage a three-component framework to achieve real-time and interactive garment customization in autoregressive video generation. The overall architecture, as illustrated in the framework diagram, consists of a Teacher Model trained with in-context learning, a streaming distillation process for efficient inference, and a training-free KV cache rescheduling mechanism for dynamic garment switching while preserving motion coherence.

The foundation of the system is the Teacher Model, which is trained using in-context learning on a single reference-garment pair. This model operates within a unified backbone network that processes discrete reference and garment images without requiring auxiliary encoders. The training retains the image-to-video (I2V) paradigm, ensuring the first generated frame remains consistent with the reference frame, except for the garment information. To achieve this, the reference image and garment image are separately encoded into latent representations using a shared VAE encoder. These latent representations, along with the noisy video latent, are concatenated and passed through a multi-modal attention mechanism within the transformer. This shared attention mechanism enables global interaction between the conditional and video latents without introducing additional parameters, allowing the model to implicitly learn single-garment switching while maintaining coherence.
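The concatenate-then-attend idea can be shown with a toy example. This is a sketch, not the authors' architecture: it uses a single attention head with identity projections and two-dimensional tokens, whereas the real transformer has learned query, key, and value projections. All names and values are made up for illustration.

```python
import math


def self_attention(tokens):
    """Single-head self-attention with identity projections (toy, no learned weights)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]   # numerically stable softmax
        total = sum(exps)
        outputs.append([sum(e / total * value[d] for e, value in zip(exps, tokens))
                        for d in range(dim)])
    return outputs


reference_tokens = [[1.0, 0.0]]            # latent of the reference image
garment_tokens = [[0.0, 1.0]]              # latent of the garment image
video_tokens = [[0.5, 0.5], [0.2, 0.8]]    # noisy video latent

# Concatenate along the token axis so one attention call covers all three inputs.
sequence = reference_tokens + garment_tokens + video_tokens
mixed = self_attention(sequence)

# Only the video positions are kept as the prediction; the conditions act as context.
num_condition = len(reference_tokens) + len(garment_tokens)
video_out = mixed[num_condition:]
print(video_out)
```

Because the conditions enter as extra tokens in the same sequence, every video token can read from both images through the existing attention layers, with no separate encoder or cross-attention branch.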

To enable real-time generation, the pretrained teacher model is distilled into a few-step autoregressive student model. This distillation process, known as Streaming Distillation with In-Context Learning, employs an in-context teacher forcing mask to stabilize training. This mask allows the model to condition on ground-truth historical frames and conditional signals during generation, which is essential for the in-context learning setup. Following teacher forcing, gradient-reweighted distribution matching distillation is applied to improve extrapolation consistency. This technique uses an aesthetic reward model to estimate frame quality during distillation, normalizing the scores into frame-wise gradient weights. This adaptive reweighting increases the influence of low-quality frames and decreases that of high-quality ones, mitigating error accumulation and drift in later frames during self-rolling generation.
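One way to turn per-frame quality scores into gradient weights is a temperature-scaled softmax over negated scores, rescaled so the weights average to one. The sketch below assumes that form; the source does not give the formula, so `frame_weights`, `reweighted_loss`, the score range of [0, 1], and the way the temperature enters are all assumptions. The default of 0.2 simply echoes the best value reported in the ablation.

```python
import math


def frame_weights(scores, temperature=0.2):
    """Map per-frame quality scores to gradient weights with mean 1.

    Lower-scoring frames receive larger weights. The softmax-over-negated-scores
    form is an illustrative assumption, not the paper's formula.
    """
    logits = [-s / temperature for s in scores]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]   # numerically stable softmax
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


def reweighted_loss(per_frame_losses, weights):
    """Average the per-frame losses after applying the frame-wise weights."""
    return sum(w * l for w, l in zip(weights, per_frame_losses)) / len(weights)


# Hypothetical aesthetic scores: quality drifts downward in later frames.
scores = [0.9, 0.8, 0.6, 0.4]
weights = frame_weights(scores)
print([round(w, 3) for w in weights])
print(reweighted_loss([0.10, 0.12, 0.20, 0.35], weights))
```

In this formulation a smaller temperature concentrates the gradient on the worst frames, while a larger one approaches uniform weighting.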

For interactive multi-garment video customization, the system employs Training-Free KV Cache Rescheduling. This mechanism manages the key-value (KV) cache to enable stable long-video extrapolation. It consists of three key operations: Garment KV Refresh, which updates the garment's KV entry in the cache to switch the outfit; Historical KV Withdraw, which removes historical KV entries to reduce the model's reliance on old garment context and allow the new garment to take effect; and Reference KV Disentangle, which replaces the old reference KV with a new one derived from the last historical frame to maintain temporal coherence across the switching point. The framework diagram illustrates how these operations work together to achieve seamless garment switching while preserving coherent human motion.
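The three cache operations can be sketched as updates to a small data structure. This is a toy model under stated assumptions: KV entries are plain strings rather than tensors, `encode` is a hypothetical stand-in for a forward pass, and clearing the entire history is a simplification, since the source does not say how many historical entries are withdrawn.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    reference: str                                # KV of the reference image
    garment: str                                  # KV of the current garment image
    history: list = field(default_factory=list)   # KV of previously generated chunks


def encode(kind: str, name: str) -> str:
    """Hypothetical stand-in for running the model to produce a KV entry."""
    return f"KV[{kind}:{name}]"


def switch_garment(cache: KVCache, new_garment: str, last_frame: str) -> KVCache:
    """Apply the three rescheduling steps at a garment switch (no training involved)."""
    # 1. Garment KV Refresh: overwrite the garment entry with the new outfit.
    cache.garment = encode("garment", new_garment)
    # 2. Historical KV Withdraw: drop history that still encodes the old garment.
    #    Clearing everything is a simplification for this sketch.
    cache.history = []
    # 3. Reference KV Disentangle: rebuild the reference from the last generated
    #    frame so pose and motion carry across the switching point.
    cache.reference = encode("reference", last_frame)
    return cache


cache = KVCache(
    reference=encode("reference", "ref.png"),
    garment=encode("garment", "red_dress.png"),
    history=[encode("chunk", "0"), encode("chunk", "1")],
)
switch_garment(cache, new_garment="blue_coat.png", last_frame="frame_0042")
print(cache)
```

Note that the last generated frame has to be captured before the history is withdrawn, because the new reference entry is derived from it.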

Experiment

The evaluation establishes a comprehensive benchmark comparing the proposed autoregressive streaming distillation framework against leading multi-reference video generation baselines, validating its ability to maintain character identity, garment fidelity, and motion coherence through both automated assessments and human preference studies. Qualitative analyses demonstrate that the method consistently preserves fine-grained clothing details and natural movement across complex poses and extended sequences, effectively mitigating the appearance degradation and temporal incoherence observed in competing approaches. Ablation studies further validate that specific training strategies and cache rescheduling mechanisms are essential for preventing motion collapse during long-video extrapolation and enabling real-time interactive garment switching. Overall, the experiments confirm that the framework delivers superior visual quality and temporal consistency while unlocking interactive customization capabilities that existing bidirectional models cannot achieve.

The table presents a quantitative ablation study comparing variants of Gradient-Reweighted Distribution Matching Distillation (GR-DMD) with different temperature coefficients. The variant with a temperature of 0.2 achieves the best performance across multiple metrics, including temporal smoothness, visual quality, and garment consistency. Other variants perform worse, and higher temperature values lead to a noticeable decline in motion magnitude and temporal smoothness.

The authors compare methods for human-garment video customization on identity consistency, text alignment, motion magnitude, temporal smoothness, visual quality, and garment consistency. FashionChameleon outperforms all baselines in temporal smoothness, visual quality, and garment consistency. It also achieves the highest inference efficiency, enabling real-time generation at a significantly higher frame rate than other methods. The method ranks second in identity consistency and motion magnitude, closely following the top-performing baselines.

The authors compare methods on the performance-efficiency trade-off. The proposed method achieves the highest average performance among all compared methods and a significantly higher inference speed, placing it in a superior region of the performance-speed trade-off space.

The authors compare teacher model training strategies in an ablation study, focusing on their impact on video generation quality. The proposed method, which uses in-context learning with full fine-tuning, outperforms alternatives that use different fine-tuning methods or concatenation-based approaches, particularly in temporal smoothness, visual quality, and garment consistency. The best variant achieves the highest scores in garment consistency and non-target garment preservation. Full fine-tuning consistently outperforms attention or LoRA fine-tuning across most evaluation metrics.

The evaluation framework comprises ablation studies and comparative analyses that validate the optimal temperature settings for distribution matching distillation and the effectiveness of different teacher model training strategies. These experiments demonstrate that a moderate temperature coefficient combined with full fine-tuning and in-context learning consistently enhances temporal coherence, visual fidelity, and garment preservation. Comparative assessments against existing baselines further validate the method's superiority in video customization quality while confirming its significantly faster inference speeds. Ultimately, the approach establishes a highly efficient solution that successfully balances computational performance with high-fidelity human-garment video synthesis.

