InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

Zhang, Pan ; Dong, Xiaoyi ; Wang, Bin ; Cao, Yuhang ; Xu, Chao ; Ouyang, Linke ; Zhao, Zhiyuan ; Duan, Haodong ; Zhang, Songyang ; Ding, Shuangrui ; Zhang, Wenwei ; Yan, Hang ; Zhang, Xinyue ; Li, Wei ; Li, Jingwen ; Chen, Kai ; He, Conghui ; Zhang, Xingcheng ; Qiao, Yu ; Lin, Dahua ; Wang, Jiaqi
Abstract

We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a writing instruction, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on an extensive multi-modal multilingual database with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: Our model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench (Chinese Cultural Benchmark), QBench and Tiny LVLM. Owing to the absence of established metrics for quantitatively assessing text-image composition, we have devised a robust evaluation procedure that comprises both human and GPT4-Vision (GPT4-V) to ensure reliability. Notably, our InternLM-XComposer achieves competitive text-image composition scores compared to public solutions, including GPT4-V and GPT3.5. Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer model series are publicly available at https://github.com/InternLM/InternLM-XComposer.