RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor
Published: 4/28/2025
Abstract

Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability -- ranging from enhanced personalization in image generation to consistent character representation in video rendering -- progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.
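To make the dual-output design concrete, below is a minimal sketch of an evaluation harness for a metric that, like RefVNLI, returns a textual-alignment score and a subject-preservation score from a single prediction. The `EvalExample` structure, the `Scorer` signature, and the dummy scorer are all hypothetical illustrations, not the paper's actual interface; a real scorer would run the trained model on the prompt, reference image, and generated image.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class EvalExample:
    prompt: str            # target textual description
    reference_image: str   # path to the subject reference image
    generated_image: str   # path to the generated image under evaluation

# Hypothetical scorer: one call yields both scores, mirroring the paper's
# "single prediction" framing (the real RefVNLI API may differ).
Scorer = Callable[[EvalExample], Tuple[float, float]]

def evaluate(examples: List[EvalExample], score: Scorer) -> dict:
    """Average (textual_alignment, subject_preservation) over a batch."""
    ta_scores, sp_scores = [], []
    for ex in examples:
        ta, sp = score(ex)
        ta_scores.append(ta)
        sp_scores.append(sp)
    n = max(len(examples), 1)
    return {
        "textual_alignment": sum(ta_scores) / n,
        "subject_preservation": sum(sp_scores) / n,
    }

if __name__ == "__main__":
    # Stub scorer with fixed values, for illustration only.
    dummy_scorer: Scorer = lambda ex: (0.90, 0.85)
    batch = [EvalExample("a corgi surfing at sunset", "ref.png", "gen.png")]
    print(evaluate(batch, dummy_scorer))
```

Reporting the two scores separately, as above, matches how the abstract quantifies gains (up to 6.4 points in textual alignment and 8.5 points in subject consistency), even though the underlying model produces both in one pass.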