Elizaveta Goncharova, Anton Razzhigaev, Matvey Mikhalchuk, Maxim Kurkin, Irina Abdullaeva, Matvey Skripkin, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov

Abstract
Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose an OmniFusion model based on a pretrained LLM and adapters for visual modality. We evaluated and compared several architecture design principles for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.) and their fusing approach, the image encoding method (whole image or tiles encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks show the top score for the best OmniFusion setup across different VQA tasks in comparison with open-source LLaVA-like solutions: VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU. We also present a variety of situations where OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equation recognition, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
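The adapter-based coupling described above can be illustrated with a minimal sketch. The code below is not the repository's implementation: the class `MLPAdapter`, the helper `split_into_tiles`, and the dimensions (1152 for a SigLIP-like encoder, 4096 for a 7B LLM, 336-pixel tiles) are illustrative assumptions. It only shows the general idea of projecting visual patch embeddings into the LLM token space and the tile-based encoding option mentioned in the abstract.

```python
# Minimal sketch (assumed names and dimensions, not the OmniFusion API):
# features from a frozen ViT-style encoder are projected into the LLM
# embedding space by a small trainable adapter.
import torch
import torch.nn as nn


class MLPAdapter(nn.Module):
    """Two-layer MLP mapping visual patch embeddings to LLM token embeddings."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_dim) from the image encoder
        return self.proj(visual_features)  # (batch, num_patches, llm_dim)


def split_into_tiles(image: torch.Tensor, tile_size: int = 336) -> torch.Tensor:
    """Cut an image tensor (C, H, W) into non-overlapping square tiles.

    Illustrates the 'tiles encoding' option: each tile is encoded separately
    and its projected tokens are concatenated with those of the whole image
    before being fed to the LLM.
    """
    c, h, w = image.shape
    tiles = (
        image.unfold(1, tile_size, tile_size)   # (C, nH, W, tile)
             .unfold(2, tile_size, tile_size)   # (C, nH, nW, tile, tile)
             .permute(1, 2, 0, 3, 4)            # (nH, nW, C, tile, tile)
             .reshape(-1, c, tile_size, tile_size)
    )
    return tiles  # (num_tiles, C, tile_size, tile_size)


if __name__ == "__main__":
    adapter = MLPAdapter()
    fake_patches = torch.randn(1, 729, 1152)    # SigLIP-like patch grid (assumed)
    print(adapter(fake_patches).shape)          # torch.Size([1, 729, 4096])

    fake_image = torch.randn(3, 672, 672)
    print(split_into_tiles(fake_image).shape)   # torch.Size([4, 3, 336, 336])
```

In this sketch only the adapter is trainable; the encoder and (optionally) the LLM stay frozen, which is the usual trade-off for LLaVA-like architectures such as the one compared against in the paper.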
Code Repositories

https://github.com/AIRI-Institute/OmniFusion
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| Visual Question Answering on MM-Vet | OmniFusion (grid split + ruDocVQA) | GPT-4 score: 39.40 |