GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding

Wu, Yiqi ; Hu, Xiaodan ; Fu, Ziming ; Zhou, Siling ; Li, Jiangong
Abstract

Animal ethology is a crucial aspect of animal research, and animal behavior labeling is the foundation for studying animal behavior. This process typically involves labeling video clips with behavioral semantic tags, a task that is complex, subjective, and multimodal. With the rapid development of multimodal large language models (LLMs), new applications have emerged for animal behavior understanding tasks in livestock scenarios. This study evaluates the visual perception capabilities of multimodal LLMs in animal activity recognition. To achieve this, we created piglet test data comprising close-up video clips of individual piglets and annotated full-shot video clips. These data were used to assess the performance of four multimodal LLMs, namely Video-LLaMA, MiniGPT4-Video, Video-Chat2, and GPT-4 omni (GPT-4o), in piglet activity understanding. Through comprehensive evaluation across five dimensions (counting, actor referring, semantic correspondence, time perception, and robustness), we found that while current multimodal LLMs require improvement in semantic correspondence and time perception, they already demonstrate preliminary visual perception capabilities for animal activity recognition. Notably, GPT-4o showed outstanding performance, and both Video-Chat2 and GPT-4o exhibited significantly better semantic correspondence and time perception on close-up video clips than on full-shot clips. The initial evaluation experiments in this study validate the potential of multimodal large language models for video understanding in livestock scenes and provide new directions and references for future research on animal behavior video understanding. Furthermore, by exploring in depth the influence of visual prompts on multimodal large language models, we expect to enhance the accuracy and efficiency of animal behavior recognition in livestock scenarios through human visual processing methods.
