AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
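
The abstract describes a plug-and-play refinement loop: a frozen base model produces a caption, and a lightweight model rewrites it according to a user instruction and the modality context. The sketch below is only an illustration of that interface under assumed names (CaptionRequest, build_refinement_prompt are hypothetical, not the released AnyCap API); the actual ACM conditions on modality features rather than a plain-text prompt.

```python
# Minimal sketch of the plug-and-play caption refinement idea.
# All names here are hypothetical illustrations, not the AnyCap codebase.

from dataclasses import dataclass


@dataclass
class CaptionRequest:
    base_caption: str   # caption produced by the frozen base model
    instruction: str    # user control instruction (style, length, focus, ...)
    modality: str       # "image", "video", or "audio"


def build_refinement_prompt(req: CaptionRequest) -> str:
    """Compose the input for a lightweight refiner that improves the base
    caption to follow the user instruction, without retraining the base model."""
    return (
        f"[{req.modality} caption refinement]\n"
        f"Base caption: {req.base_caption}\n"
        f"Instruction: {req.instruction}\n"
        f"Rewrite the caption so it follows the instruction while staying "
        f"faithful to the {req.modality} content."
    )


if __name__ == "__main__":
    req = CaptionRequest(
        base_caption="A dog runs across a grassy field.",
        instruction="Describe only the background scenery in one sentence.",
        modality="image",
    )
    print(build_refinement_prompt(req))
```

The design point the abstract emphasizes is that the base model stays untouched: only the small refiner consumes the instruction, which is what makes the framework reusable across different foundation models and modalities.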