Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

The explosion of visual content available online underscores the need for an accurate machine assessor that can robustly score diverse types of visual content. While recent studies have demonstrated the exceptional potential of large multi-modality models (LMMs) across a wide range of related fields, in this work we explore how to teach them to produce visual ratings aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating levels instead of scores. The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), and video quality assessment (VQA) tasks under the original LMM structure. With this syllabus, we further unify the three tasks into one model, termed OneAlign. In our experiments, we demonstrate the advantage of the discrete-level-based syllabus over direct-score-based variants for LMMs. Our code and the pre-trained weights are released at https://github.com/Q-Future/Q-Align.
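To make the level-based syllabus concrete, below is a minimal sketch of the two conversions it implies: binning a continuous mean opinion score (MOS) into a discrete text level for training, and recovering a scalar score at inference as the expectation over the model's probabilities for the level tokens. The helper names, the equal-width binning, and the exact five-level wording are assumptions for illustration, not the paper's verbatim implementation.

```python
import numpy as np

# Assumed ITU-style rating categories; the exact wording and binning are illustrative.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]

def score_to_level(mos, lo=1.0, hi=5.0):
    """Training-time conversion: map a continuous MOS onto one discrete text level
    via equal-width binning over the score range (hypothetical helper)."""
    t = (mos - lo) / (hi - lo)                      # normalize to [0, 1]
    idx = min(int(t * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[idx]

def probs_to_score(level_probs, lo=1.0, hi=5.0):
    """Inference-time conversion: take the LMM's probabilities over the level tokens
    and return their expectation over evenly spaced anchor scores."""
    p = np.asarray(level_probs, dtype=float)
    p = p / p.sum()                                 # renormalize over the level tokens
    anchors = np.linspace(lo, hi, len(LEVELS))      # one anchor score per level
    return float(np.dot(p, anchors))

# Example: an image with MOS 4.3 is taught as "excellent";
# at inference, the level-token probabilities are collapsed back to a scalar score.
print(score_to_level(4.3))                              # -> "excellent"
print(probs_to_score([0.01, 0.04, 0.15, 0.5, 0.3]))     # -> ~4.04
```

The design choice this sketch illustrates is that the model never regresses a number directly: it is trained and prompted with rating words, and the continuous score only reappears as a weighted average over those words' probabilities.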