
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi
Abstract

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA 2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for Unified-IO are available at: https://unified-io.allenai.org.
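The abstract's central mechanism is the homogenization of heterogeneous inputs and outputs into one discrete token vocabulary. The sketch below illustrates that idea only; it is not the authors' implementation. The vocabulary sizes, the 1000-bin coordinate quantization, and the helper names are assumptions made for illustration, and the image tokens are presumed to come from a VQ-style encoder.

```python
# Minimal sketch (not the Unified-IO codebase) of mapping text, bounding
# boxes, and image codes into a single shared discrete id space, so a
# seq2seq transformer can consume and emit all of them as one sequence.
from typing import List, Tuple

TEXT_VOCAB_SIZE = 32_000      # assumed subword vocabulary size
NUM_LOCATION_BINS = 1_000     # assumed number of coordinate bins
IMAGE_CODEBOOK_SIZE = 16_384  # assumed VQ image codebook size

# The unified vocabulary concatenates the three token families.
LOCATION_OFFSET = TEXT_VOCAB_SIZE
IMAGE_OFFSET = TEXT_VOCAB_SIZE + NUM_LOCATION_BINS


def text_to_tokens(subword_ids: List[int]) -> List[int]:
    """Text is already discrete; subword ids pass through unchanged."""
    return list(subword_ids)


def box_to_tokens(box_xyxy: Tuple[float, float, float, float],
                  image_w: int, image_h: int) -> List[int]:
    """Quantize normalized box coordinates into discrete location tokens."""
    x1, y1, x2, y2 = box_xyxy
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return [
        LOCATION_OFFSET + min(int(c * NUM_LOCATION_BINS), NUM_LOCATION_BINS - 1)
        for c in coords
    ]


def image_codes_to_tokens(vq_codes: List[int]) -> List[int]:
    """Shift codebook indices from an image tokenizer into the shared id space."""
    return [IMAGE_OFFSET + c for c in vq_codes]


if __name__ == "__main__":
    # A detection-style target: a caption span plus one bounding box,
    # serialized as a single flat token sequence.
    caption_ids = text_to_tokens([17, 942, 803])            # dummy subword ids
    box_ids = box_to_tokens((48, 120, 300, 360), 640, 480)  # pixel box -> tokens
    print(caption_ids + box_ids)
```

Because every modality reduces to integer ids in the same space, the same transformer decoder and the same cross-entropy objective can, in principle, cover captioning, detection, dense prediction, and text tasks alike, which is the unification the abstract describes.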
