Elizaveta Goncharova, Anton Razzhigaev, Matvey Mikhalchuk, Maxim Kurkin, Irina Abdullaeva, Matvey Skripkin, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov

Abstract
Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose an OmniFusion model based on a pretrained LLM and adapters for visual modality. We evaluated and compared several architecture design principles for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.) and their fusing approach, the image encoding method (whole image or tiles encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks show the top score for the best OmniFusion setup across different VQA tasks in comparison with open-source LLaVA-like solutions: VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU. We also present a variety of situations where OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equation recognition, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
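The adapter-based coupling described above can be illustrated with a minimal sketch. The code below is not the repository's implementation: the class `MLPAdapter`, the helper `split_into_tiles`, and the dimensions (1152 for a SigLIP-like encoder, 4096 for a 7B LLM, 336-pixel tiles) are illustrative assumptions. It only shows the general idea of projecting visual patch embeddings into the LLM token space and the tile-based encoding option mentioned in the abstract.

```python
# Minimal sketch (assumed names and dimensions, not the OmniFusion API):
# features from a frozen ViT-style encoder are projected into the LLM
# embedding space by a small trainable adapter.
import torch
import torch.nn as nn


class MLPAdapter(nn.Module):
    """Two-layer MLP mapping visual patch embeddings to LLM token embeddings."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_dim) from the image encoder
        return self.proj(visual_features)  # (batch, num_patches, llm_dim)


def split_into_tiles(image: torch.Tensor, tile_size: int = 336) -> torch.Tensor:
    """Cut an image tensor (C, H, W) into non-overlapping square tiles.

    Illustrates the 'tiles encoding' option: each tile is encoded separately
    and its projected tokens are concatenated with those of the whole image
    before being fed to the LLM.
    """
    c, h, w = image.shape
    tiles = (
        image.unfold(1, tile_size, tile_size)   # (C, nH, W, tile)
             .unfold(2, tile_size, tile_size)   # (C, nH, nW, tile, tile)
             .permute(1, 2, 0, 3, 4)            # (nH, nW, C, tile, tile)
             .reshape(-1, c, tile_size, tile_size)
    )
    return tiles  # (num_tiles, C, tile_size, tile_size)


if __name__ == "__main__":
    adapter = MLPAdapter()
    fake_patches = torch.randn(1, 729, 1152)    # SigLIP-like patch grid (assumed)
    print(adapter(fake_patches).shape)          # torch.Size([1, 729, 4096])

    fake_image = torch.randn(3, 672, 672)
    print(split_into_tiles(fake_image).shape)   # torch.Size([4, 3, 336, 336])
```

In this sketch only the adapter is trainable; the encoder and (optionally) the LLM stay frozen, which is the usual trade-off for LLaVA-like architectures such as the one compared against in the paper.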
Code Repositories

https://github.com/AIRI-Institute/OmniFusion
Benchmarks
| Benchmark | Model | Metric |
|---|---|---|
| Visual Question Answering on MM-Vet | OmniFusion (grid split + ruDocVQA) | GPT-4 score: 39.40 |