
What If We Recaption Billions of Web Images with LLaMA-3?

Abstract

Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/
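The pipeline described above, fine-tuning a LLaVA-style captioner and mapping it over every image-text pair to replace noisy alt-text, can be sketched as below. This is a minimal illustration, not the authors' released code: the `llava_captioner` helper, its prompt, and the `model_id` default are assumptions (the paper's actual captioner is a LLaMA-3-8B powered LLaVA-1.5, which is not assumed to live at that Hugging Face address).

```python
# Illustrative sketch of a recaptioning pipeline: a LLaVA-style captioner is
# applied to each web-crawled image, replacing its noisy alt-text with a
# model-generated caption while keeping the original text for reference.

from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ImageTextPair:
    image_url: str
    alt_text: str        # original noisy web caption
    recaption: str = ""  # model-generated caption, filled in below


def recaption_dataset(
    pairs: Iterable[ImageTextPair],
    captioner: Callable[[str], str],
) -> List[ImageTextPair]:
    """Run the captioner over each image URL; both the original alt-text
    and the new caption stay available for downstream training."""
    return [replace(p, recaption=captioner(p.image_url)) for p in pairs]


def llava_captioner(model_id: str = "llava-hf/llava-1.5-7b-hf") -> Callable[[str], str]:
    """Build a captioner from a Hugging Face LLaVA checkpoint.
    NOTE: `model_id` is a stand-in; the paper fine-tunes its own
    LLaMA-3-8B-based LLaVA-1.5 rather than using this public checkpoint."""
    import requests
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id)

    def caption(image_url: str) -> str:
        image = Image.open(requests.get(image_url, stream=True).raw)
        prompt = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
        inputs = processor(images=image, text=prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=128)
        return processor.decode(output_ids[0], skip_special_tokens=True)

    return caption
```

In practice the interesting engineering is in `recaption_dataset`: at DataComp-1B scale (1.3 billion images) this loop would be sharded across many GPUs with batched generation, but the structure, a pure map from noisy pairs to recaptioned pairs, is the same.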
