4 months ago

OmniFusion Technical Report

Elizaveta Goncharova Anton Razzhigaev Matvey Mikhalchuk Maxim Kurkin Irina Abdullaeva Matvey Skripkin Ivan Oseledets Denis Dimitrov Andrey Kuznetsov

Abstract

Last year, multimodal architectures brought about a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose OmniFusion, a model based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design principles for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternViT, etc.) and approaches to fusing them, the image encoding method (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show that the best OmniFusion setup achieves the top score on different VQA tasks in comparison with open-source LLaVA-like solutions. We also present a variety of situations in which OmniFusion provides highly detailed answers in different domains: housekeeping, sightseeing, culture, medicine, handwritten and scanned equation recognition, etc. The Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion.
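To make the "MLP adapter" idea from the abstract concrete, here is a minimal sketch of how visual features might be projected to the LLM embedding width and joined with text embeddings. It is not the OmniFusion implementation. All sizes, the two-layer GELU design, the random stand-in features, and the choice to place visual tokens before text tokens are assumptions made for illustration, and the helper names (`make_matrix`, `linear`, `mlp_adapter`) are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only for illustration (not from the report).
NUM_PATCHES = 6      # visual tokens produced by an image encoder
VISION_DIM = 8       # width of the image encoder features
HIDDEN_DIM = 12      # hidden width of the adapter MLP
LLM_DIM = 10         # width of the LLM token embeddings
NUM_TEXT_TOKENS = 4  # length of the text prompt


def make_matrix(rows, cols, scale=0.02):
    """Random matrix as a list of rows (stand-in for learned weights or features)."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def gelu(v):
    """tanh approximation of the GELU activation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def linear(tokens, weight, bias):
    """Apply y = x @ weight + bias to every token (row) in `tokens`."""
    return [
        [sum(x * weight[i][j] for i, x in enumerate(row)) + bias[j] for j in range(len(bias))]
        for row in tokens
    ]


def mlp_adapter(visual_features, w1, b1, w2, b2):
    """Two-layer MLP mapping encoder features to the LLM embedding width."""
    hidden = [[gelu(v) for v in row] for row in linear(visual_features, w1, b1)]
    return linear(hidden, w2, b2)


# Random stand-ins for the image encoder output and the text embeddings.
visual_features = make_matrix(NUM_PATCHES, VISION_DIM, scale=1.0)
text_embeddings = make_matrix(NUM_TEXT_TOKENS, LLM_DIM, scale=1.0)

w1, b1 = make_matrix(VISION_DIM, HIDDEN_DIM), [0.0] * HIDDEN_DIM
w2, b2 = make_matrix(HIDDEN_DIM, LLM_DIM), [0.0] * LLM_DIM

visual_tokens = mlp_adapter(visual_features, w1, b1, w2, b2)

# Assumed layout: visual tokens first, then text tokens, as one input sequence.
llm_input = visual_tokens + text_embeddings
print(len(llm_input), len(llm_input[0]))
```

The adapter changes only the feature width: the number of visual tokens stays the same, so the combined sequence holds `NUM_PATCHES + NUM_TEXT_TOKENS` rows of width `LLM_DIM`. Under a tile-based encoding, each tile would presumably contribute its own set of visual tokens and the sequence would grow accordingly.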

Benchmarks

Benchmark | Methodology | Metrics
visual-question-answering-on-mm-vet | OmniFusion (grid split + ruDocVQA) | GPT-4 score: 39.40
