Audio-Aware Large Language Models as Judges for Speaking Styles

Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
Release Date: 6/9/2025
Abstract

Audio-aware large language models (ALLMs) can understand both the textual and non-textual information in an audio input. In this paper, we explore using ALLMs as automatic judges to assess the speaking styles of speeches. We use ALLM judges to evaluate speeches generated by spoken language models (SLMs) on two tasks: voice style instruction following and role-playing. The speaking styles we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and use both humans and ALLMs to judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results suggest that ALLMs can be used as judges to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling speaking style and generating natural dialogues.