ZeroDiff: Solidified Visual-Semantic Correlation in Zero-Shot Learning

Zero-shot Learning (ZSL) aims to enable classifiers to identify unseen classes. This is typically achieved by generating visual features for unseen classes based on visual-semantic correlations learned from seen classes. However, most current generative approaches heavily rely on having a sufficient number of samples from seen classes. Our study reveals that a scarcity of seen-class samples results in a marked decrease in performance across many generative ZSL techniques. We argue, quantify, and empirically demonstrate that this decline is largely attributable to spurious visual-semantic correlations. To address this issue, we introduce ZeroDiff, an innovative generative framework for ZSL that incorporates diffusion mechanisms and contrastive representations to enhance visual-semantic correlations. ZeroDiff comprises three key components: (1) Diffusion augmentation, which naturally transforms limited data into an expanded set of noised data to mitigate generative model overfitting; (2) Supervised-contrastive (SC)-based representations that dynamically characterize each limited sample to support visual feature generation; and (3) Multiple feature discriminators employing a Wasserstein-distance-based mutual learning approach, evaluating generated features from various perspectives, including pre-defined semantics, SC-based representations, and the diffusion process. Extensive experiments on three popular ZSL benchmarks demonstrate that ZeroDiff not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Our code is available at https://github.com/FouriYe/ZeroDiff_ICLR25.
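
The diffusion augmentation component can be pictured as a standard forward diffusion process applied to visual features: each scarce seen-class sample is expanded into many noised copies at random timesteps. The following is a minimal sketch of that idea only; the step count, noise schedule, feature dimension, and function names are illustrative assumptions, not the paper's exact configuration.

```python
import torch

# Minimal sketch of diffusion-style feature augmentation:
# scarce seen-class visual features are expanded into noised copies
# via the forward process q(x_t | x_0).
# T, the beta schedule, and the feature dimension are assumed values.

T = 1000                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 2e-2, T)       # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffuse_features(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) for a batch of visual features x0 at steps t."""
    a_bar = alphas_bar[t].unsqueeze(-1)              # (B, 1) cumulative alphas
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# Usage: turn a small batch of real features into an augmented noised batch.
features = torch.randn(8, 2048)                      # e.g., backbone features
t = torch.randint(0, T, (features.size(0),))         # random timestep per sample
noised = diffuse_features(features, t)               # same shape, noised copies
```

In this reading, training a generator against many noised versions of each feature, rather than the handful of clean samples alone, is what mitigates overfitting when seen-class data is scarce.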