HyperAI

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.

Yuanhan Zhang Jinming Wu Wei Li Bo Li Zejun Ma Ziwei Liu Chunyuan Li

Video Instruction Tuning With Synthetic Data
