SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. While current methods adopt the paradigm of training a dedicated network end-to-end on labeled image data, they are limited in terms of generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then utilize LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and to bridge their gaps. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, as LLMs can generate language-based explanations for their decisions. However, the manual prompt design process for LLMs at the reasoning phase is tedious, and an automated prompt optimization method is desired. Because we essentially convert a visual classification task into a generative task for LLMs, automatic prompt optimization encounters a unique long-prompt optimization issue. To address this issue, we further propose Greedy Segment Prompt Optimization (GSPO), which performs a greedy search by utilizing gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.
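
The perceive-then-reason split described above can be illustrated with a minimal sketch. The helpers `caption_social_content` and `query_llm` below are hypothetical placeholders standing in for the VFM-based social story generation and the LLM call; they are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the modular VFM -> social story -> LLM pattern.
# `caption_social_content` and `query_llm` are hypothetical stand-ins:
# in practice the story would come from vision foundation models and
# the reasoning from any instruction-following LLM.
from typing import List

RELATIONS = ["friends", "spouses", "colleagues", "family", "strangers"]

def caption_social_content(image_path: str) -> str:
    """Placeholder: a VFM stage would describe people, objects, and scene here."""
    raise NotImplementedError("plug in your vision foundation models")

def query_llm(prompt: str) -> str:
    """Placeholder: send the text prompt to an LLM and return its answer."""
    raise NotImplementedError("plug in your LLM client")

def build_reasoning_prompt(story: str, relations: List[str]) -> str:
    """Compose the text-only reasoning prompt from the generated social story."""
    return (
        "You are given a story describing the people in an image.\n"
        f"Story: {story}\n"
        f"Choose the social relation between the two marked people from: {', '.join(relations)}.\n"
        "Explain your reasoning, then give the final answer."
    )

def classify_relation(image_path: str) -> str:
    story = caption_social_content(image_path)        # perception (VFMs)
    prompt = build_reasoning_prompt(story, RELATIONS)
    return query_llm(prompt)                          # reasoning (LLM)
```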
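
As a rough illustration of gradient-guided greedy search at the segment level, the sketch below replaces the LLM loss with a toy differentiable scorer; the segment boundaries, candidate count, and scoring function are assumptions made for demonstration only and do not reproduce the authors' implementation.

```python
# Hedged sketch of greedy search over prompt segments guided by gradients.
# The toy embedding/loss stand in for an LLM's loss on the desired answer;
# segments and the candidate count k are illustrative assumptions.
import torch

torch.manual_seed(0)
VOCAB_SIZE, EMBED_DIM = 100, 16
embedding = torch.nn.Embedding(VOCAB_SIZE, EMBED_DIM)

def toy_loss(prompt_ids: torch.Tensor) -> torch.Tensor:
    """Stand-in for the LLM loss of the target output given the prompt."""
    emb = embedding(prompt_ids)                       # (L, D)
    return (emb.sum(dim=0) - 1.0).pow(2).mean()

def gradient_topk(prompt_ids: torch.Tensor, positions, k: int = 8):
    """Rank replacement tokens per position by the gradient of the loss w.r.t.
    one-hot token indicators (a first-order guess at which swaps help most)."""
    one_hot = torch.zeros(len(prompt_ids), VOCAB_SIZE)
    one_hot.scatter_(1, prompt_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    emb = one_hot @ embedding.weight.detach()         # (L, D)
    loss = (emb.sum(dim=0) - 1.0).pow(2).mean()
    loss.backward()
    grad = one_hot.grad                               # (L, V); negative grad = promising swap
    return {p: torch.topk(-grad[p], k).indices.tolist() for p in positions}

def gspo_pass(prompt_ids: torch.Tensor, segments, k: int = 8):
    """One greedy pass: within each segment, try the gradient-ranked swaps and
    keep any single-token replacement that lowers the loss."""
    best = prompt_ids.clone()
    best_loss = toy_loss(best).item()
    for start, end in segments:
        positions = range(start, end)
        candidates = gradient_topk(best, positions, k)
        for pos in positions:
            for tok in candidates[pos]:
                trial = best.clone()
                trial[pos] = tok
                loss = toy_loss(trial).item()
                if loss < best_loss:                  # greedy accept
                    best, best_loss = trial, loss
    return best, best_loss

prompt = torch.randint(0, VOCAB_SIZE, (12,))          # toy 12-token prompt
segments = [(0, 4), (4, 8), (8, 12)]                  # e.g. role / story / question segments
optimized, final_loss = gspo_pass(prompt, segments)
print(final_loss)
```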