
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
Abstract

Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
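
The abstract describes a plug-and-play flow in which a frozen base captioner produces an initial caption and a lightweight module rewrites it, conditioned on the user instruction and modality features. The sketch below only illustrates that data flow; every name in it (CaptionRequest, base_captioner, caption_refiner, controllable_caption) is a hypothetical placeholder and not the AnyCap API, and the refiner here is a stub rather than a real model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical illustration of the plug-and-play flow described in the abstract:
# the base foundation model stays frozen, and a small refiner adapts its output
# to the user instruction. Names and signatures are placeholders, not AnyCap code.

@dataclass
class CaptionRequest:
    modality: str          # e.g. "image", "video", or "audio"
    instruction: str       # user control instruction, e.g. "describe only the background"
    features: List[float]  # modality features from the base model (placeholder)

def base_captioner(request: CaptionRequest) -> str:
    """Stand-in for a frozen foundation model that ignores the instruction."""
    return "A dog runs across a grassy park while people watch."

def caption_refiner(base_caption: str, request: CaptionRequest) -> str:
    """Stand-in for the lightweight refiner: it would condition on the base
    caption, the user instruction, and the modality features to decode a new
    caption; here it only tags the draft to show the flow."""
    return f"(rewritten for '{request.instruction}') {base_caption}"

def controllable_caption(
    request: CaptionRequest,
    base: Callable[[CaptionRequest], str] = base_captioner,
    refiner: Callable[[str, CaptionRequest], str] = caption_refiner,
) -> str:
    """End-to-end flow: only the refiner adapts; the base model is not retrained."""
    draft = base(request)
    return refiner(draft, request)

if __name__ == "__main__":
    req = CaptionRequest(
        modality="image",
        instruction="focus on the background, one short sentence",
        features=[0.0] * 8,  # placeholder feature vector
    )
    print(controllable_caption(req))
```

The design point this mirrors is that controllability is added as a separate, swappable stage, so any existing captioner can be wrapped without touching its weights.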