Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
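Abstract
Recent advances in image generation and editing have opened new possibilities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. This paper presents Tstars-Tryon 1.0, a commercial-scale virtual try-on system that combines robustness, realism, versatility, and high efficiency. First, the system maintains a high success rate under extreme poses, severe illumination changes, motion blur, and other challenging in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained detail, faithfully preserving garment texture, material properties, and structural characteristics while largely avoiding the artifacts (unnatural noise and distortion) typical of AI generation. Third, beyond garment try-on, the model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to remove the latency bottleneck in commercial deployment, the system is heavily optimized for inference speed, enabling near real-time generation and a seamless user experience. These capabilities come from an integrated system design that spans an end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves very strong overall performance. The authors also release a comprehensive benchmark to support future research. The model is being deployed at industrial scale in the Taobao App, where it has served tens of millions of requests for millions of users.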
One-sentence Summary
The authors propose Tstars-Tryon 1.0, a commercial-scale virtual try-on system that utilizes a multi-stage training paradigm and a scalable data engine to deliver robust, photorealistic, and near real-time results across eight fashion categories, including tops, pants, skirts, dresses, coats, shoes, bags, and hats.
Key Contributions
- The paper introduces Tstars-Tryon 1.0, a commercial-scale virtual try-on system that utilizes an integrated design of end-to-end architecture, a scalable data engine, and a multi-stage training paradigm to achieve high success rates in challenging in-the-wild conditions such as extreme poses and motion blur.
- This work presents a general-purpose framework capable of flexible multi-image composition using up to six reference images across eight fashion categories, including tops, shoes, and bags, while maintaining photorealistic garment textures and coordinated control over person identity and background.
- The research provides a highly optimized system for near real-time inference and introduces a comprehensive benchmark to support future development, with successful large-scale deployment on the Taobao App serving millions of users.
Introduction
Virtual try-on technology is essential for modern e-commerce, yet existing academic models often fail to meet the demands of real-world commercial deployment. Current benchmarks are limited by simplistic studio backgrounds, a narrow focus on basic clothing categories, and a reliance on pristine flat-lay garment images that do not reflect the complex, unconstrained photos provided by actual users. The authors introduce Tstars-Tryon 1.0, a commercial-scale system designed to handle extreme poses, diverse lighting, and multi-item compositions across eight fashion categories including accessories. By integrating a scalable data engine with a multi-stage training paradigm, the authors achieve a robust balance between high-fidelity photorealism and the near real-time inference speeds required for large-scale industrial applications.
Dataset
The authors introduce the Tstars-VTON Benchmark, a large-scale dataset designed to evaluate virtual try-on models against commercial-grade standards.
- Dataset Composition and Sources: The authors collect data from the internet and proprietary e-commerce domains. The benchmark consists of 1,780 refined paired samples that cover 5 garment categories (tops, dresses, coats, pants, and skirts) and 3 accessory categories (shoes, hats, and bags). These are further divided into 465 fine-grained subcategories.
- Key Details and Diversity: The dataset is designed to support complex multi-item scenarios, where samples can include between 1 and 6 layered items. It features diverse model characteristics, including a gender distribution of 74.9% female and 25.1% male, and age groups ranging from children to seniors. To increase difficulty, the authors include complex poses (29.6%) and intricate in-the-wild backgrounds (over 40%).
- Processing and Metadata Construction: The authors employ a three-stage pipeline for construction:
  - Collection: A hybrid retrieval strategy combines automated platform extraction with manual collection guided by a multi-dimensional tag system.
  - Refinement and Annotation: Data undergoes a two-stage tag retrieval and refinement process. Tags are initially derived from SKU metadata, refined by a VLM-based pipeline, and finalized through manual verification to ensure accuracy across 11 model tag dimensions and 13 garment tag dimensions.
  - Anonymization: To ensure privacy, all model portraits undergo a face-swapping process in which faces are matched to licensed surrogates based on skin tone, gender, and age.
- Pairing and Usage: The authors use a structured layering logic for the try-on pairing strategy. This ensures that outfit combinations follow realistic physical and semantic rules, such as gender matching and proper clothing layers. The benchmark supports both single-garment and multi-garment evaluation, including a fully unpaired setting that decouples the model and garment databases to maximize combinatorial diversity.
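The pairing rules described above (gender matching, 1 to 6 items, proper layering) can be sketched as a small validity check. This is a minimal illustration, not the authors' pairing policy: the `Item` type, the gender tag values, and the layer ranks in `LAYER_ORDER` are hypothetical, since the summary does not publish the actual rules.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical layer ranks (inner to outer). The real ordering is not given in the source.
LAYER_ORDER = {
    "tops": 0, "dresses": 0,
    "pants": 1, "skirts": 1,
    "coats": 2,
    "shoes": 3, "bags": 3, "hats": 3,
}


@dataclass(frozen=True)
class Item:
    category: str  # one of the eight benchmark categories
    gender: str    # assumed tag values: "female", "male", or "unisex"


def is_valid_outfit(model_gender: str, items: list[Item]) -> bool:
    """Check an outfit against the illustrative pairing rules."""
    # Benchmark samples contain between 1 and 6 items.
    if not 1 <= len(items) <= 6:
        return False
    # Every item must belong to a known category.
    if any(item.category not in LAYER_ORDER for item in items):
        return False
    # Gender matching: each item must suit the model or be unisex.
    return all(item.gender in (model_gender, "unisex") for item in items)


def layering_sequence(items: list[Item]) -> list[Item]:
    """Order items from the innermost to the outermost layer (stable sort)."""
    return sorted(items, key=lambda item: LAYER_ORDER[item.category])
```

In a fully unpaired setting, a check like this would be applied to candidate combinations drawn independently from the model and garment databases, keeping only the combinations that pass.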
Method
The authors describe the try-on model in two parts: training and inference. Training begins with pre-training on general editing tasks, followed by progressive-resolution continued training that strengthens the model's handling of high-resolution outputs. Next comes high-quality vertical-domain supervised fine-tuning, in which the model is optimized on carefully curated clothing-domain data. Reinforcement learning with multi-reward signals then further improves performance. Training concludes with few-step and CFG distillation, which the authors use to improve inference efficiency and output quality.
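To see why few-step and CFG distillation matter for latency, it helps to count network evaluations. The sketch below uses the standard classifier-free guidance (CFG) combination, which needs a conditional and an unconditional prediction at every sampling step. A CFG-distilled student is trained to produce the guided prediction in a single pass, and few-step distillation reduces the number of steps. The step counts and guidance scale in the usage note are illustrative assumptions, not figures from the paper.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard CFG: move the unconditional prediction toward the conditional one.

    This guided output is the target a CFG-distilled student learns to
    reproduce with a single forward pass.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def forward_passes(num_steps, uses_cfg):
    """Network evaluations per image: CFG doubles the cost of every step."""
    return num_steps * (2 if uses_cfg else 1)
```

For example, a hypothetical 50-step sampler with CFG costs `forward_passes(50, True)` = 100 evaluations, while a hypothetical 4-step student with guidance distilled in costs `forward_passes(4, False)` = 4.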
The inference stage is initiated by a user prompt, which is processed through a prompt rewriter to generate an optimized prompt. This optimized prompt is then encoded by a text encoder, which feeds into a unified multi-image editing DiT (Diffusion Transformer) model to produce the final output image. 
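The inference flow above can be sketched as three composed steps. This is a schematic only: `TryOnPipeline`, its field names, and the placeholder `Image` type are hypothetical, and the three callables stand in for the prompt rewriter, text encoder, and multi-image editing DiT, whose real interfaces are not described in the source.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence

Image = bytes  # placeholder image type for this sketch


@dataclass
class TryOnPipeline:
    rewrite_prompt: Callable[[str], str]                         # prompt rewriter
    encode_text: Callable[[str], list[float]]                    # text encoder
    edit_model: Callable[[list[float], Sequence[Image]], Image]  # multi-image editing DiT

    def __call__(self, user_prompt: str, images: Sequence[Image]) -> Image:
        if not images:
            raise ValueError("at least one input image is required")
        optimized = self.rewrite_prompt(user_prompt)  # step 1: optimize the prompt
        embedding = self.encode_text(optimized)       # step 2: encode the optimized prompt
        return self.edit_model(embedding, images)     # step 3: generate the output image
```

Keeping each stage behind a plain callable makes it straightforward to swap in stubs for testing, as in `TryOnPipeline(rewrite_prompt=str.strip, encode_text=lambda p: [float(len(p))], edit_model=lambda emb, imgs: imgs[0])`.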
The model's training process is underpinned by a three-stage data pipeline. Stage 1 focuses on data collection, where raw data is gathered from internet sources and e-commerce platforms. This data undergoes expert-tagged retrieval and a combination of automated and manual filtering to produce a large set of garment and accessory data, as well as model data. Stage 2 involves data filtering, refinement, and anonymization. This stage includes quality filtering using hierarchical policies, such as a model-domain policy that filters based on pose and human presence, and a cloth-domain policy that checks for incomplete or multi-subject images. A VLM judge and expert check are used to validate the data. Tagging refinement ensures accurate attribute labeling, and privacy protection mechanisms, including face library matching and quality checks, are applied to safeguard user data. Stage 3 is the try-on pairing strategy, which pairs model and cloth data based on a pairing policy to generate diverse and intricate try-on benchmarks, including multi-cloth and multi-layered combinations. 
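The hierarchical quality filtering in Stage 2 can be sketched as domain-specific policies followed by a judge. This is an assumption-laden illustration: the `Sample` fields, the policy functions, and the `vlm_judge` callable are hypothetical names, and in a real pipeline the boolean flags would come from detectors or annotations rather than being set by hand.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    domain: str                # "model" or "cloth"
    has_person: bool = False   # model domain: a person is present
    pose_ok: bool = False      # model domain: pose passes the policy
    is_complete: bool = False  # cloth domain: the item is fully visible
    subject_count: int = 1     # cloth domain: number of subjects in the image


def model_policy(sample: Sample) -> bool:
    """Model-domain policy: filter on human presence and pose."""
    return sample.has_person and sample.pose_ok


def cloth_policy(sample: Sample) -> bool:
    """Cloth-domain policy: reject incomplete or multi-subject images."""
    return sample.is_complete and sample.subject_count == 1


POLICIES: dict[str, Callable[[Sample], bool]] = {
    "model": model_policy,
    "cloth": cloth_policy,
}


def filter_samples(samples: list[Sample],
                   vlm_judge: Callable[[Sample], bool]) -> list[Sample]:
    """Keep samples that pass their domain policy and then the judge."""
    kept = []
    for sample in samples:
        policy = POLICIES.get(sample.domain)
        if policy is None or not policy(sample):
            continue  # unknown domain or failed the rule-based check
        if vlm_judge(sample):
            kept.append(sample)
    return kept
```

Running the rule-based policies before the judge means the more expensive judgment is only made on samples that already pass the basic checks. The expert check mentioned above would follow as a manual step.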
Experiment
The evaluation utilizes the Tstars-VTON Benchmark, covering single-garment and complex multi-garment scenarios, alongside academic benchmarks and human preference studies to validate commercial readiness. Results demonstrate that the model excels in maintaining identity consistency, background preservation, and intricate garment textures even under extreme poses or lighting. Notably, the system shows superior stability in multi-item coordination and cross-domain applications, such as dressing 3D avatars or anime characters, while maintaining significantly higher inference speeds than existing proprietary and open-source models.
The authors evaluate Tstars-Tryon 1.0 against open-source and closed-source models on a comprehensive benchmark across five dimensions: overall quality, identity consistency, garment fidelity, background preservation, and physical and structural logic. Tstars-Tryon 1.0 achieves the highest score in every evaluated dimension, outperforming both specialized academic models and leading proprietary systems, particularly in complex multi-garment scenarios.
In a comparison against state-of-the-art academic and commercial models, Tstars-Tryon 1.0, which the authors describe as a foundation model for virtual try-on, achieves superior or competitive performance across multiple metrics, particularly garment fidelity and identity consistency. It maintains high performance in complex multi-garment scenarios that involve intricate layering and coordination. It also generalizes well, handling diverse inputs and complex instructions while preserving identity and background details.
The authors compare Tstars-Tryon 1.0 against two leading proprietary models, Nano Banana Pro and Seedream5 lite, using human evaluation across varying numbers of garments. Tstars-Tryon 1.0 consistently outperforms both competitors, and the gap widens as the number of garments increases: the competitors show a marked decline in quality in multi-garment scenarios, while Tstars-Tryon 1.0 maintains high stability and fidelity.
The authors report state-of-the-art results for Tstars-Tryon 1.0 on both single-garment and multi-garment try-on, outperforming specialized academic models and leading commercial systems, with particular strength in complex multi-item generation. The model renders with high fidelity and maintains identity, pose, and background consistency even in complex multi-garment scenarios. It also achieves superior performance on academic benchmarks despite not being trained on them, which indicates that it handles unseen data distributions effectively.
Tstars-Tryon 1.0 was evaluated against various open-source academic models and leading proprietary systems through comprehensive benchmarks and human assessments. The experiments validate the model's ability to maintain identity consistency, garment fidelity, and structural logic across both single and multi-garment scenarios. The findings demonstrate that Tstars-Tryon 1.0 provides superior robustness and generalization, particularly as task complexity increases, whereas competing models show significant performance declines in multi-item coordination.