ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias results in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leads to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy that continuously enriches and calibrates the caption as the inference budget increases. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic bias. With increased inference cost, ScaleCap raises more heuristic questions to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap demonstrates the richness and fidelity of its generated captions on two additional tasks: replacing images with captions in VQA, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.
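
The abstract describes a two-stage debiasing pipeline (heuristic question answering, then contrastive sentence rating). The sketch below illustrates that control flow only; it is not the authors' implementation. The interfaces `lvlm_generate` and `lm_logprob` are hypothetical placeholders for an LVLM call and a (image-conditioned or text-only) log-likelihood scorer, and the prompts, sentence splitting, and margin threshold are assumptions for illustration.

```python
from typing import List


def lvlm_generate(image, prompt: str) -> str:
    """Placeholder for an LVLM call (image + text in, text out)."""
    raise NotImplementedError


def lm_logprob(prompt: str, continuation: str, image=None) -> float:
    """Placeholder: log-probability of `continuation` given `prompt`
    (and optionally the image)."""
    raise NotImplementedError


def heuristic_question_answering(image, caption: str, num_questions: int) -> str:
    """Raise content-specific questions about the image and fold the answers
    back into the caption; more questions (a larger inference budget) yields
    a richer, more balanced description."""
    questions = lvlm_generate(
        image,
        f"Here is a draft caption:\n{caption}\n"
        f"Ask {num_questions} questions about visual details the caption misses.",
    ).splitlines()
    answers = [lvlm_generate(image, q) for q in questions if q.strip()]
    return lvlm_generate(
        image,
        "Rewrite the caption, integrating these observations:\n"
        + caption + "\n" + "\n".join(answers),
    )


def contrastive_sentence_rating(image, caption: str, margin: float = 0.0) -> str:
    """Sentence-level offline contrastive check: keep a sentence only if
    conditioning on the image raises its likelihood relative to the text-only
    score, filtering hallucinations driven by linguistic bias."""
    kept: List[str] = []
    context = ""
    for sent in caption.split(". "):
        with_image = lm_logprob(context, sent, image=image)
        text_only = lm_logprob(context, sent, image=None)
        if with_image - text_only > margin:
            kept.append(sent)
            context += sent + ". "
    return ". ".join(kept)


def scalecap_style_caption(image, budget: int = 8) -> str:
    """End-to-end sketch: draft caption -> enrich via heuristic QA -> debias."""
    caption = lvlm_generate(image, "Describe this image in detail.")
    caption = heuristic_question_answering(image, caption, num_questions=budget)
    return contrastive_sentence_rating(image, caption)
```

Increasing `budget` corresponds to the paper's inference-time scaling knob: more questions are asked and answered, so more visual detail is injected before the contrastive filter removes language-prior hallucinations.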