A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang
Release Date: 6/16/2025
Abstract

Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, providing rich object diversity and rigorous automated quality refinement, making it well-suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method leads to substantially higher dataset quality compared to an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.