SetFit: Efficient Few-Shot Learning Without Prompts
One-sentence Summary
The authors propose a framework that infers latent class statistics by predicting the mean and covariance of visual feature distributions for each class from text, enriching the latent space to improve the cross-domain robustness and few-shot classification performance of foundation models like CLIP across various datasets.
Key Contributions
- This work introduces a predictive framework that leverages text-derived statistics to estimate the mean and covariance of visual feature distributions for each class in foundation models such as CLIP.
- By treating textual inputs as statistical summaries rather than auxiliary prompts or generative seeds, the method enriches the latent feature space to enhance cross-domain robustness.
- Comprehensive evaluations across multiple datasets demonstrate that incorporating these predicted distribution statistics consistently improves few-shot classification performance relative to prior adaptation techniques.
Introduction
Foundation models like CLIP have significantly advanced few-shot learning, yet they frequently struggle with cross-domain robustness when training data is limited. Prior approaches that incorporate text typically treat it as a simple prompt or an auxiliary data source, overlooking its potential to model the statistical structure of visual features. The authors leverage text-derived statistics to predict the mean and covariance of class-wise visual feature distributions. By mapping these text-informed parameters into the latent space, they enrich the model's representational capacity and deliver more robust few-shot classifiers across multiple benchmarks.
Dataset
- The authors use ImageNet and iNaturalist as the foundational base datasets, extracting visual and text features with a CLIP ResNet50 model pre-trained on the strictly disjoint LAION400M corpus.
- iNaturalist is chosen for its hierarchical layout and fine-grained species classes, which require precise covariance matrix modeling, while evaluation is performed on either a held-out base subset or nine cross-domain benchmarks: Caltech, EuroSAT, Food, Flowers, SUN397, DTD, Pets, Cars, and UCF101.
- The few-shot learning experiments rely entirely on features from the pre-trained CLIP backbone; no custom cropping or metadata construction pipelines are implemented.
- Strict distributional separation is maintained across all stages, with LAION400M reserved for pre-training, the base datasets used for model adaptation, and disjoint or cross-domain splits reserved for evaluation to validate robust generalization.
Method
The authors leverage a two-phase framework to predict the mean and covariance of visual feature distributions for a class directly from textual descriptions, aiming to enhance few-shot learning performance through text-guided statistical modeling. The overall pipeline consists of a learning phase and an adaptation phase, as illustrated in the figure below.
During the learning phase, the method trains two separate mapping networks to estimate the mean and covariance of visual features from text. The framework begins with a pre-trained image encoder $f_v$ and a text encoder $f_t$, which extract features from images and their corresponding textual descriptions, respectively. The image encoder is frozen during this phase, while the text encoder may be either frozen or learnable depending on the setup. The textual descriptions are derived either from predefined templates such as "a photo of a {class}" or generated using GPT-3 to produce richer visual descriptions. For each class $i$, the empirical mean $\tilde{\mu}_i$ and covariance $\tilde{\Sigma}_i$ are computed from the visual features of images in the base dataset. These statistics are then used to train two lightweight Multi-Layer Perceptrons (MLPs), denoted $g_\mu$ and $g_\Sigma$, which map text features $s_i$ to the predicted mean and diagonal covariance, respectively. The training objective for each network is an L2 loss between the predicted and empirical statistics, augmented with a regularization term such as weight decay.
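The regression targets and objective described for the learning phase can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' implementation: the names `class_statistics` and `l2_loss` are hypothetical, the covariance is reduced to its diagonal (per-dimension variance), and the population estimate (dividing by n) is an assumption, since the summary does not say which estimator is used.

```python
from collections import defaultdict


def class_statistics(features, labels):
    """Return {label: (mean, variance)} using per-dimension (diagonal) statistics."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)

    stats = {}
    for y, xs in grouped.items():
        n, d = len(xs), len(xs[0])
        mean = [sum(x[j] for x in xs) / n for j in range(d)]
        # Assumption: population variance per dimension (diagonal of the covariance).
        var = [sum((x[j] - mean[j]) ** 2 for x in xs) / n for j in range(d)]
        stats[y] = (mean, var)
    return stats


def l2_loss(predicted, target, weights=(), weight_decay=0.0):
    """Squared L2 error between two vectors plus an optional weight-decay penalty."""
    error = sum((p - t) ** 2 for p, t in zip(predicted, target))
    penalty = weight_decay * sum(w ** 2 for w in weights)
    return error + penalty


# Toy visual features standing in for image-encoder outputs.
features = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 2.0]]
labels = ["cat", "cat", "dog", "dog"]
stats = class_statistics(features, labels)
```

In a full pipeline, the outputs of the two mapping networks for each class's text feature would be compared against `stats[label]` with `l2_loss`, and the network parameters would be passed as `weights`.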
In the adaptation phase, the learned mappings are applied to new downstream few-shot tasks. Given a few labeled examples from a target class, the empirical mean $\hat{\mu}_{i,v}$ is computed from the available images. The mean $g_\mu(s_i)$ predicted from the text features is then interpolated with this empirical mean using a learned coefficient $\alpha$: $\hat{\mu}_i = (1-\alpha)\,\hat{\mu}_{i,v} + \alpha\, g_\mu(s_i)$. This interpolation balances the reliability of the text-derived estimate with the observed data. For the covariance, a shrinkage approach is used because estimates from few samples have high variance and are unreliable. The predicted covariance $g_\Sigma(s_i)$ is combined with the identity matrix $I$ using a shrinkage coefficient $\beta$: $\hat{\Sigma}_i = (1-\beta)\, I + \beta\, g_\Sigma(s_i)$. This ensures a stable and well-conditioned covariance matrix. The resulting mean and covariance estimates are then used for classification, typically via a Gaussian model, enabling the system to generalize effectively even with limited visual data.
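The two adaptation formulas can be made concrete with a short sketch. It assumes a diagonal covariance stored as a list of per-dimension variances, so the identity matrix contributes 1.0 on every dimension. The function names are hypothetical, and alpha and beta are passed in as plain numbers here, whereas the summary describes alpha as learned.

```python
def interpolate_mean(visual_mean, text_mean, alpha):
    """Blend the few-shot empirical mean with the text-predicted mean."""
    return [(1.0 - alpha) * v + alpha * t for v, t in zip(visual_mean, text_mean)]


def shrink_diag_covariance(text_var, beta):
    """Shrink the text-predicted diagonal covariance toward the identity matrix."""
    # The diagonal of the identity matrix is 1.0 in every dimension.
    return [(1.0 - beta) * 1.0 + beta * s for s in text_var]


# Toy values: a 2-D feature space with hand-picked coefficients.
adapted_mean = interpolate_mean([0.0, 2.0], [4.0, 2.0], alpha=0.25)
adapted_var = shrink_diag_covariance([3.0, 0.5], beta=0.5)
```

Setting alpha to 0 recovers the purely visual mean and setting beta to 0 recovers the identity covariance, so both coefficients control how much the text prior is trusted.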
Experiment
The experiments evaluate a framework that maps textual class descriptions to visual feature statistics, validating that CLIP-aligned text representations can reliably predict both the mean and covariance of visual distributions across in-domain and cross-domain settings. When integrated into a Mahalanobis classifier for few-shot tasks, these text-derived statistics consistently enhance performance over visual-only baselines, with the covariance matrix proving especially vital for one-class classification and low-shot regimes. Ultimately, combining both predicted means and covariances delivers robust improvements across diverse datasets, demonstrating that textual priors can effectively compensate for limited visual samples and bridge the gap between zero-shot and few-shot learning.
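To illustrate why the covariance matters in a Mahalanobis classifier, the sketch below assigns a query to the class with the smallest squared Mahalanobis distance. It assumes diagonal covariances given as per-dimension variances; `mahalanobis_sq` and `classify` are hypothetical names, and the class parameters are toy values rather than results from the paper.

```python
def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance under a diagonal covariance."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def classify(x, class_params):
    """Return the label whose (mean, variance) gives the smallest distance to x."""
    return min(class_params, key=lambda c: mahalanobis_sq(x, *class_params[c]))


# Toy classes: "dog" has a much larger variance along the first dimension.
class_params = {
    "cat": ([0.0, 0.0], [1.0, 1.0]),
    "dog": ([4.0, 0.0], [16.0, 1.0]),
}
query = [1.5, 0.0]
predicted = classify(query, class_params)
```

The query is closer to the "cat" mean in Euclidean terms (1.5 versus 2.5), yet the wide spread of "dog" along the first dimension makes "dog" the nearer class under the Mahalanobis distance. For one-class classification, the same distance could instead be compared against a threshold.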
The authors compare few-shot classification methods with and without text-derived mean and covariance predictions across multiple datasets. Using both predictions consistently improves performance over the baseline, with the largest gains in one-class settings and on datasets where baseline performance is low. Covariance predictions help most in one-class tasks and give consistent gains across shot levels, while mean predictions matter most in very low-shot regimes.
The authors evaluate how well text-based models predict the mean and covariance of visual features for few-shot classification. Incorporating the text-predicted statistics improves classification performance across settings, with larger improvements in low-shot regimes and on datasets where baseline performance is low. Covariance predictions yield consistent gains in both one-class and multi-class settings, especially in one-class scenarios. The method achieves gains comparable to using a few labeled examples, even enabling effective zero-shot classification.
The authors evaluate the impact of text-derived mean and covariance predictions on few-shot classification across multiple datasets. Incorporating both consistently improves performance over the baseline in one-class and multi-class settings. The covariance contributes more than the mean, especially in one-class classification, while the mean provides greater gains in low-shot regimes. The largest gains appear in low-shot regimes and on datasets where baseline performance is low relative to zero-shot performance.
The authors evaluate the impact of predicting visual feature means and covariances from text for few-shot and zero-shot classification across multiple datasets and training regimes. The experiments validate that incorporating these text-derived statistical predictions consistently improves classification accuracy over standard baselines, with covariance providing the largest gains in one-class settings and mean predictions driving notable improvements in extremely low-shot scenarios. These benefits are most evident on datasets with weak baseline performance, indicating that textual priors stabilize feature distributions under data scarcity. Overall, the approach shows that text-based statistical modeling can approximate the advantages of a few labeled examples while enabling robust zero-shot transfer.