8 months ago

Abstract

Multimodal Large Language Models (MLLMs) have recently gained immense popularity. Powerful commercial models like ChatGPT-4V and Gemini, as well as open-source ones such as LLaVA, are essentially general-purpose models and are applied to a wide variety of tasks, including those in computer vision. These neural networks possess such strong general knowledge and reasoning abilities that they have proven capable of handling even tasks for which they were not specifically trained. We compared the capabilities of the most powerful MLLMs to date (ShareGPT4V, ChatGPT, and LLaVA-Next) on the specialized task of age and gender estimation against our state-of-the-art specialized model, MiVOLO. We also updated MiVOLO and provide details and new metrics in this article. This comparison has yielded some interesting results and insights about the strengths and weaknesses of the participating models. Furthermore, we attempted various ways to fine-tune the ShareGPT4V model for this specific task, aiming to achieve state-of-the-art results in this particular challenge. Although such a model would not be practical in production, as it is incredibly expensive compared to a specialized model like MiVOLO, it could be very useful in some tasks, like data annotation.
