3 years ago

Personalized Representation from Personalized Generation

Shobhita Sundaram Julia Chae Yonglong Tian Sara Beery Phillip Isola


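Abstract

Modern vision models excel at general-purpose downstream tasks. It remains unclear, however, how they can be used for personalized vision tasks, which are both fine-grained and data-scarce. Recent work has successfully applied synthetic data to general-purpose representation learning, and advances in text-to-image (T2I) diffusion models have made it possible to generate personalized images from just a few real examples. In this work, we explore a potential connection between these ideas and formalize the challenge of using personalized synthetic data to learn personalized representations, which encode knowledge about an object of interest and can be flexibly applied to any downstream task related to that object. We introduce an evaluation suite for this challenge, including reformulations of two existing datasets and a novel dataset explicitly constructed for this purpose. We further propose a contrastive learning approach that makes creative use of image generators. We show that our method improves personalized representation learning across diverse downstream tasks, from recognition to segmentation, and we analyze the characteristics of image generation approaches that are key to this gain.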

One-sentence Summary

The authors propose a contrastive learning framework that leverages text-to-image diffusion models to generate personalized synthetic data from a few real examples, addressing data-scarce vision tasks through an evaluation suite of reformulated and novel datasets while demonstrating improved personalized representation learning across recognition and segmentation benchmarks.

Key Contributions

  • The study formalizes the challenge of learning personalized representations from data-scarce scenarios and introduces an evaluation suite featuring a novel instance-level dataset, PODS, alongside reformulated splits and annotations for two existing benchmarks.
  • A contrastive learning framework adapts general-purpose representation spaces by conditioning text-to-image diffusion models on few-shot real examples, enabling synthetic data generation that operates without external real negatives.
  • Empirical evaluations across recognition and segmentation tasks demonstrate consistent performance gains, while the analysis identifies critical image generation characteristics and assesses computationally efficient multi-method synthesis alternatives.

Introduction

Modern vision models excel at broad recognition tasks but struggle with fine-grained personalization, which requires learning instance-specific representations from minimal real examples. This capability matters because it enables private, localized model training without relying on centralized data repositories or extensive user annotations. Prior approaches typically depend on large-scale labeled datasets, require external data sharing, or treat image generation and representation learning as disconnected pipelines, making them impractical for few-shot scenarios. The authors leverage text-to-image diffusion models to synthesize diverse training examples from just a handful of real images and apply contrastive fine-tuning to adapt a general-purpose vision backbone. This strategy produces robust personalized representations that consistently boost performance across classification, retrieval, detection, and segmentation tasks. To advance the field, the authors also release a dedicated evaluation suite and a new instance-level dataset designed specifically for benchmarking personalized representation learning.

Dataset

  1. Dataset Composition and Sources

    • The authors evaluate personalized representation learning using three distinct datasets: DeepFashion2 (DF2), DogFaceNet, and a newly introduced benchmark called PODS (Personal Object Discrimination Suite).
    • DF2 provides large-scale commercial and consumer fashion imagery focused on shirts, DogFaceNet supplies dog re-identification footage, and PODS features 100 everyday personal objects captured across five categories (mugs, screwdrivers, shoes, bags, and water bottles) using an iPhone 15 Pro and the PolyCam app.
  2. Key Details for Each Subset

    • DeepFashion2: The authors extract 169 unique shirt instances from the validation split after filtering for categories with sufficient gallery images. The final subset contains 507 training images and 1,271 test images, strictly allocating three training images per instance and four to twenty-four test images per instance.
    • DogFaceNet: From the DogFaceNet_large split, the authors retain only classes with more than ten images per instance. They perform a random train-test split, manually inspect all sequences to eliminate train-test leakage from overlapping footage, and discard instances with fewer than four remaining test images. The cleaned subset holds 80 dogs, yielding 240 training images and 1,218 test images.
    • PODS: This custom dataset includes 100 objects (20 per category) recorded across four controlled scenes: a canonical training scene, a distractor-heavy scene, a pose-variation scene, and a combined variation scene. Each object is captured at three vantage points, resulting in 300 training images and 10,888 test images, with 1,200 test images receiving dense annotations.
  3. Data Usage and Training Configuration

    • The authors treat the three-image training split as positive examples for contrastive fine-tuning, pairing them with a large pool of negative samples to learn instance-level representations.
    • They assess model performance across classification, retrieval, detection, and segmentation tasks, deliberately structuring test sets to include both in-distribution and out-of-distribution scenarios for robustness evaluation.
    • The data supports comparative experiments between standard DreamBooth fine-tuning and computationally lighter alternatives like Cut-and-Paste and Masked DreamBooth, allowing the authors to measure how real versus generated data impacts representation quality.
  4. Annotation, Metadata, and Processing Strategies

    • Metadata Construction: Each object receives a unique identifier, and all test images are labeled for classification and retrieval. The authors generate 100 instance-specific prompts per object using an LLM, replacing object names with a <new1> placeholder to standardize training inputs.
    • Mask Generation and Cropping: Ground truth segmentation masks are manually annotated for DF2 and Dogs. For PODS, the authors generate initial mask proposals with Grounding-SAM and refine them manually using TORAS, then extract bounding boxes directly from these masks.
    • DreamBooth Processing: To prevent background overfitting, the authors apply gradient masking during DreamBooth training, zeroing out gradients for background pixels. They also implement an automatic filtering step that uses DreamSim and perSAM to embed masked training and generated images. They compute cosine similarity between embeddings and discard generated samples falling below a threshold of 0.6 for DF2 and PODS, and 0.55 for Dogs.
    • Cut-and-Paste Synthesis: When masks are available, the authors extract foreground objects and composite them onto LLM-generated backgrounds. They remove the <new1> placeholder from background prompts, randomly resize the foreground to 0.3 to 1.3 times its original scale, and paste it at random coordinates.
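The filtering and Cut-and-Paste steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the embeddings have already been computed (for example by DreamSim on masked images), that a generated sample is scored by its mean cosine similarity to the training embeddings, and that the paste position is the top-left corner sampled anywhere on the background. The names `filter_generated`, `sample_paste`, and `SIM_THRESHOLDS` are hypothetical.

```python
import math
import random

# Similarity thresholds quoted in the text; the dictionary keys are illustrative.
SIM_THRESHOLDS = {"DF2": 0.6, "PODS": 0.6, "Dogs": 0.55}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_generated(train_embs, gen_embs, threshold):
    """Return the indices of generated embeddings that pass the filter.

    Assumption: each sample is scored by its mean similarity to the training
    embeddings; the text does not say how scores are aggregated.
    """
    kept = []
    for i, g in enumerate(gen_embs):
        score = sum(cosine_similarity(g, t) for t in train_embs) / len(train_embs)
        if score >= threshold:
            kept.append(i)
    return kept


def sample_paste(fg_size, bg_size, rng):
    """Sample a scale in [0.3, 1.3] and a top-left paste position.

    Assumption: the corner may land anywhere on the background, so the
    foreground can be partially cropped by the image border.
    """
    scale = rng.uniform(0.3, 1.3)
    new_w = max(1, round(fg_size[0] * scale))
    new_h = max(1, round(fg_size[1] * scale))
    x = rng.randrange(bg_size[0])
    y = rng.randrange(bg_size[1])
    return scale, (new_w, new_h), (x, y)


# Toy 2-D embeddings standing in for real image embeddings.
train = [[1.0, 0.0], [0.9, 0.1]]
generated = [[1.0, 0.05], [0.0, 1.0], [0.7, 0.7]]
kept = filter_generated(train, generated, SIM_THRESHOLDS["PODS"])

scale, size, pos = sample_paste((200, 100), (512, 512), random.Random(0))
```

In the toy example, the second generated embedding is nearly orthogonal to the training embeddings and is discarded, while the other two are kept.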

Method

The authors leverage a three-stage method to achieve personalized visual representation using generative models. The overall framework begins with a small set of real images of a target instance $c$, denoted as $\mathcal{D}_R$, and the generic category $c_{pr}$ associated with the object. The goal is to adapt a general-purpose vision encoder $f_{\phi}$ to produce a personalized representation for $c$ by training on synthetic data generated from a pretrained generative model.

Refer to the framework diagram, which illustrates the transformation from a pretrained representation space to a personalized one. The process starts with the personal instance data, which is used to train a generator. The generator then produces personalized synthetic data, which is used to train the representation model. The figure shows that real images of the target instance are fed into the system, and the generator learns to produce new images of the instance. The personalized representation space is then trained to distinguish the target instance from other data, using both real and generated images.

In the first stage, the authors generate personalized data from $\mathcal{D}_R$ using Stable Diffusion 1.5, a text-to-image (T2I) model, as the generator $g_{\theta}$. They adapt $g_{\theta}$ using DreamBooth, which fine-tunes the model to generate novel images of $c$ when conditioned on an identifier token. The T2I diffusion model $g_{\theta}$ generates images given an initial noise latent $\epsilon \sim \mathcal{N}(0, 1)$ and a conditioning text embedding $\hat{y} = \Gamma_{\omega}(y)$, where $\Gamma_{\omega}$ is a text encoder and $y$ is a user-provided prompt. DreamBooth fine-tunes $g_{\theta}$ using a loss function that includes a reconstruction loss on the real image $x$ and a prior preservation loss on a synthesized image $x_{pr}$ conditioned on the generic category $c_{pr}$. The loss is defined as:

$$
\mathbb{E}_{x, \hat{y}, \epsilon, \epsilon', t} \left[ w_t \left\| g_{\theta}(\alpha_t x + \sigma_t \epsilon, \hat{y}) - x \right\|_2^2 + \lambda \, w_{t'} \left\| g_{\theta}(\alpha_{t'} x_{pr} + \sigma_{t'} \epsilon', \hat{c}_{pr}) - x_{pr} \right\|_2^2 \right],
$$

where $x_{pr}$ is an image synthesized with the pre-trained generator conditioned on $c_{pr}$, $t$ is the timestep, and the variables $\alpha_t$, $\sigma_t$, and $w_t$ relate to the noise schedule and sampling quality. The first term is a reconstruction loss on $x$, and the second term is a prior preservation loss on $x_{pr}$, weighted by $\lambda$. The text encoder $\Gamma_{\omega}$ is also fine-tuned with the same loss.
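To make the two-term objective concrete, the sketch below evaluates a single-sample estimate of it on toy vectors. It is an illustration under stated assumptions rather than the DreamBooth training code: images are flat lists of floats, `toy_denoiser` is a hypothetical stand-in for $g_{\theta}$ that ignores its conditioning, and the schedule values passed as `(alpha, sigma, w)` tuples are arbitrary.

```python
def add_noise(x, eps, alpha, sigma):
    """Forward-noise a sample: alpha * x + sigma * eps (element-wise)."""
    return [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def dreambooth_loss(g, x, y_hat, eps, sched, x_pr, c_pr_hat, eps_pr, sched_pr, lam):
    """Single-sample estimate of reconstruction + lam * prior preservation.

    `sched` and `sched_pr` are (alpha, sigma, w) tuples for timesteps t and t'.
    """
    alpha_t, sigma_t, w_t = sched
    alpha_p, sigma_p, w_p = sched_pr
    # Reconstruction term on the real image x, conditioned on the instance prompt.
    recon = w_t * squared_l2(g(add_noise(x, eps, alpha_t, sigma_t), y_hat), x)
    # Prior preservation term on the class image x_pr, conditioned on the class prompt.
    prior = w_p * squared_l2(g(add_noise(x_pr, eps_pr, alpha_p, sigma_p), c_pr_hat), x_pr)
    return recon + lam * prior


def toy_denoiser(z, cond):
    """Hypothetical stand-in for g_theta: returns its noisy input unchanged."""
    return z


loss = dreambooth_loss(
    toy_denoiser,
    x=[1.0, 2.0], y_hat="instance prompt", eps=[1.0, 0.0], sched=(1.0, 0.5, 1.0),
    x_pr=[0.0, 0.0], c_pr_hat="class prompt", eps_pr=[1.0, 1.0], sched_pr=(0.5, 1.0, 1.0),
    lam=0.5,
)
```

With these toy inputs the reconstruction term is 0.25 and the prior term is 2.0, so the weighted total is 1.25. In real training the expectation would be approximated by averaging this quantity over sampled images, noise, and timesteps.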

As shown in the figure below, the personalized training pipeline consists of three stages. Stage 1 involves training the generator $g_{\theta}$ on real images of the target instance. Stage 2 involves generating synthetic data using the trained generator. Stage 3 involves fine-tuning the vision encoder $f_{\phi}$ on the generated synthetic data using a contrastive objective.

In the second stage, the authors generate synthetic data $\mathcal{D}_S$ using the trained generator $g_{\theta}$. The generated data is used to train the personalized representation model. The authors control the attributes of the generated dataset by using classifier-free guidance (CFG) to inject diversity into the generated outputs, experimenting with CFG values of 4.0, 5.0, and 7.5. They also use LLM-generated captions to provide rich context descriptions for the target object, ensuring that the generated data is diverse and realistic.
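For readers unfamiliar with classifier-free guidance, the sketch below shows the standard way a guidance scale combines unconditional and text-conditioned noise predictions at each denoising step. It is a generic illustration on toy vectors, not code from the paper; `cfg_combine` is a hypothetical name, and only the three scale values come from the text.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by the guidance weight."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions; a real model would produce latent-sized tensors.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]

# CFG values explored in the text.
guided = {s: cfg_combine(eps_uncond, eps_cond, s) for s in (4.0, 5.0, 7.5)}
```

A scale of 1.0 recovers the conditional prediction unchanged. Larger scales push samples further toward the prompt, which generally trades some diversity for prompt adherence.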

The figure below illustrates the inference pipelines for global and local tasks. For global tasks, such as classification and retrieval, the model uses cosine similarity between CLS embeddings. For local tasks, such as detection and segmentation, the model extracts patch features with spatial information. The figure shows that the training images are used to generate feature maps, which are then processed to produce dense predictions and confidence scores.
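The global-task pipeline can be sketched as a similarity ranking. The example below is a hedged illustration, assuming CLS embeddings are already available as plain vectors and that a test image is scored by its maximum cosine similarity to the few training embeddings; the text does not specify this aggregation rule. The function names and the toy gallery are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def instance_score(query_emb, train_embs):
    """Confidence that a query shows the target instance.

    Assumption: the maximum over training images is used as the score.
    """
    return max(cosine(query_emb, t) for t in train_embs)


def rank_gallery(train_embs, gallery):
    """Rank gallery images by instance score, highest first (retrieval)."""
    scored = [(name, instance_score(emb, train_embs)) for name, emb in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-D embeddings standing in for CLS embeddings.
train_embs = [[1.0, 0.0]]
gallery = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_gallery(train_embs, gallery)
ranked_names = [name for name, _ in ranking]
```

Thresholding the same score would give a binary decision for classification, while the sorted list serves retrieval directly.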

In the third stage, the authors train a personalized representation on the generated synthetic data using a contrastive objective. Given the real images $\mathcal{D}_R$ and the synthetic data $\mathcal{D}_S$, they obtain positive examples from $\mathcal{D}_S$ and negative examples $\tilde{\mathcal{D}}_S$ by prompting the pretrained generator with the generic object category: "a photo of $c_{pr}$". They extract features from the vision encoder $f_{\phi}$ as a concatenation of the CLS token and average-pooled final-layer patch embeddings. They then fine-tune $f_{\phi}$ using the InfoNCE loss:

$$
\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(\mathbf{x}_0, \mathbf{x}_+) / \tau)}{\sum_{i=1}^{N} \exp(\mathrm{sim}(\mathbf{x}_0, \mathbf{x}_i) / \tau)}.
$$

This loss pulls together the representations of real and synthetic images of $c$, and pushes apart the representations of $c$ and other instances. The authors fine-tune via Low-Rank Adaptation (LoRA), which is more parameter-efficient than full fine-tuning. They experiment with several state-of-the-art backbones, including DINOv2-ViT B/14, CLIP-ViT B/16, and MAE-ViT B/16. Each dataset is randomly divided into validation and test sets, and the authors sweep over key training parameters on the validation set to determine the optimal configuration. Based on their validation experiments, they fine-tune with LoRA and the InfoNCE loss for 2 epochs over 4500 anchor-positive pairs, drawn from 450 synthetic positives and 1000 synthetic negatives.
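The feature construction and contrastive objective described above can be sketched as follows. This is a minimal illustration, not the authors' training code: it assumes cosine similarity for $\mathrm{sim}$, that the $N$ candidates in the denominator are the positive plus the negatives, and a temperature of 0.07, which is a common default and not a value reported in the text. The names `pooled_feature` and `info_nce` are hypothetical.

```python
import math


def pooled_feature(cls_token, patch_tokens):
    """Concatenate the CLS token with the mean of the patch embeddings."""
    n = len(patch_tokens)
    mean_patch = [sum(column) / n for column in zip(*patch_tokens)]
    return list(cls_token) + mean_patch


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor.

    Assumption: the denominator runs over the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, neg) / tau for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


feature = pooled_feature([1.0, 2.0], [[0.0, 2.0], [2.0, 4.0]])

anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(anchor, [0.9, 0.1], negatives)
loss_misaligned = info_nce(anchor, [0.0, 1.0], negatives)
```

The loss is small when the positive lies close to the anchor and grows when the positive is no closer than the negatives, which is the behavior fine-tuning exploits.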

Experiment

The experiments evaluate personalized visual representations trained on minimal real examples augmented with synthetic data against standard pretrained models across classification, retrieval, segmentation, and detection tasks. Results demonstrate that personalization consistently enhances both global semantic understanding and precise object localization, with combined synthetic generation strategies proving most effective by balancing visual fidelity and pose diversity. While different data generation techniques introduce distinct inductive biases, such as DreamBooth’s superior pose generalization versus Cut-and-Paste’s strict object fidelity, the findings confirm that carefully curated synthetic augmentation significantly elevates representation quality. Furthermore, these personalized features seamlessly integrate into downstream pipelines and maintain robust performance as real training data scales, highlighting the enduring practical value of synthetic data for few-shot personalization.

The authors evaluate personalized representations trained with synthetic data against pretrained models across multiple backbones and tasks. The main findings are:

  • Personalized representations consistently outperform pretrained models on classification and retrieval for the DINOv2 and CLIP backbones, with improvements observed across datasets.
  • Synthetic data generated with higher CFG values and LLM-generated prompts leads to better classification and retrieval performance.
  • MAE-based personalized representations show smaller improvements than DINOv2 and CLIP, indicating that effectiveness varies across backbones.

The authors evaluate different loss functions for personalized representations, comparing InfoNCE, InfoNCE with multiple positives, hinge, and cross-entropy against a baseline. The main findings are:

  • InfoNCE achieves the highest performance on both classification and retrieval.
  • The personalized model (DINOv2-P) notably outperforms the pretrained DINOv2 baseline on retrieval.
  • Classification performance varies more across loss functions than retrieval does, with InfoNCE still giving the best results.

The authors compare different methods for generating synthetic data to personalize vision models, evaluating them on classification, retrieval, detection, and segmentation. The main findings are:

  • Combining multiple data augmentation strategies gives the best overall performance, improving on real images alone and on single augmentation methods across most tasks and datasets.
  • Using synthetic data with real backgrounds and negatives significantly outperforms using only real images for most tasks.
  • Different augmentation methods show distinct strengths, with some excelling in classification and others in dense prediction tasks.

The authors compare the runtime efficiency of different synthetic data generation methods for training personalized representations. The main findings are:

  • Cut/Paste with real backgrounds is the fastest method, requiring minimal generation time.
  • DreamBooth methods are substantially slower, and filtering increases their generation time significantly.
  • The total runtime for DreamBooth with filtering is much higher because of the time-intensive generation process.

The authors compare personalized representations against pretrained models on classification and retrieval while varying the number of synthetic images and anchor-positive pairs. The main findings are:

  • Personalized models generally outperform pretrained models on both classification and retrieval, with improvements observed across both global and dense tasks.
  • Increasing the number of anchor-positive pairs improves performance, and the best results come from pairing a large number of anchor-positive pairs with a fixed number of synthetic images.
  • Performance improvements are consistent across datasets and tasks, indicating that the personalized approach is robust.

The experiments evaluate personalized vision representations trained with synthetic data against pretrained baselines to validate the impact of backbone architecture, loss functions, generation strategies, training data ratios, and computational efficiency across classification, retrieval, detection, and segmentation tasks. Results consistently demonstrate that personalized models outperform pretrained counterparts, with DINOv2 and CLIP benefiting most from synthetic data generated using high CFG values, LLM prompts, and combined augmentation strategies. The analysis further reveals that InfoNCE loss functions yield the strongest representation quality, while increasing anchor-positive pair ratios enhances performance and robustness across diverse datasets. Although generation methods like Cut/Paste offer superior runtime efficiency compared to computationally intensive approaches like DreamBooth, the overall findings establish that carefully constructed synthetic data significantly improves model personalization and generalization.

