Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model

The rapid expansion of the mobile internet has resulted in a substantial increase in user-generated content (UGC) images, making the thorough assessment of UGC images both urgent and essential. Recently, multimodal large language models (MLLMs) have shown great potential in image quality assessment (IQA) and image aesthetic assessment (IAA). Despite this progress, effectively scoring the quality and aesthetics of UGC images still faces two main challenges: 1) a single score is inadequate to capture the hierarchical nature of human perception, and 2) how to use MLLMs to output numerical scores, such as mean opinion scores (MOS), remains an open question. To address these challenges, we introduce a novel dataset, named Realistic image Quality and Aesthetic (RealQA), comprising 14,715 UGC images, each annotated with 10 fine-grained attributes. These attributes span three levels: low level (e.g., image clarity), middle level (e.g., subject integrity), and high level (e.g., composition). In addition, we conduct a series of in-depth and comprehensive investigations into how to effectively predict numerical scores using MLLMs. Surprisingly, by predicting just two extra significant digits, the next-token paradigm can achieve SOTA performance. Furthermore, with the help of chain of thought (CoT) combined with the learned fine-grained attributes, the proposed method outperforms SOTA methods on five public IQA and IAA datasets with superior interpretability, and shows strong zero-shot generalization for video quality assessment (VQA). The code and dataset will be released.
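
To make the score-formatting idea concrete, the following is a minimal sketch (not the authors' released code) of how a numerical MOS could be serialized into a plain-text target with two extra significant digits for next-token prediction, and parsed back from the model's generated tokens. The function names and the exact formatting convention are illustrative assumptions, not the paper's specification.

```python
def mos_to_target(mos: float, extra_digits: int = 2) -> str:
    """Format a mean opinion score as a plain-text target string
    with `extra_digits` digits after the decimal point (assumed scheme)."""
    return f"{mos:.{extra_digits}f}"


def target_to_mos(text: str) -> float:
    """Parse the generated token string back into a numeric score."""
    return float(text.strip())


if __name__ == "__main__":
    # e.g., a ground-truth MOS of 3.4567 becomes the text target "3.46",
    # which the MLLM learns to emit token by token.
    target = mos_to_target(3.4567)
    print(target)                 # "3.46"
    print(target_to_mos(target))  # 3.46
```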