6 months ago

Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li

Abstract

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, and 13 more
