
What If We Recaption Billions of Web Images with LLaMA-3?

Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
Abstract

Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching the textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially when following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/
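To make the recaptioning step concrete, below is a minimal sketch of what such an inference loop could look like using the Hugging Face transformers LLaVA interface. The checkpoint name, prompt template, and generation settings here are illustrative assumptions, not the authors' released configuration; the paper's actual captioner is a LLaVA-1.5 fine-tuned on a LLaMA-3-8B backbone.

```python
# Hedged sketch: caption a single web image with a LLaVA-style model,
# approximating the per-image recaptioning step described in the abstract.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical checkpoint path; substitute the actual fine-tuned model.
MODEL_ID = "path/to/llava-1.5-llama3-8b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the `accelerate` package
)

def recaption(image_path: str) -> str:
    """Generate a dense caption for one image."""
    image = Image.open(image_path).convert("RGB")
    # LLaVA-1.5 chat format: the <image> token marks where visual
    # features are injected. The instruction text is an assumption.
    prompt = "USER: <image>\nDescribe the image in detail. ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    text = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    # Keep only the model's reply, dropping the echoed prompt.
    return text.split("ASSISTANT:")[-1].strip()
```

At the paper's scale (1.3 billion images), this per-image call would in practice be batched and sharded across many GPUs, but the core inference step is the same.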
