
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang
Abstract

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows instruction-based learning in both the pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks. Our further analysis indicates that OFA also transfers effectively to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
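To make the unified interface concrete, below is a minimal Python sketch of how heterogeneous tasks reduce to instruction-conditioned (source, target) pairs consumed by a single sequence-to-sequence model. This is not the authors' code; the prompt wordings, the <loc_*> location-token format, and the helper names are illustrative assumptions, simplified from what the paper describes (see https://github.com/OFA-Sys/OFA for the actual implementation).

```python
# Minimal sketch of instruction-based seq2seq unification in the spirit of
# OFA. Prompts, the <loc_*> token format, and helper names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Seq2SeqExample:
    source: str                     # instruction plus any text input
    target: str                     # expected output token sequence
    image: Optional[bytes] = None   # raw image, embedded by a visual frontend

def make_example(task: str, image: Optional[bytes] = None,
                 text: str = "", label: str = "") -> Seq2SeqExample:
    """Cast heterogeneous tasks into one (source, target) interface, so a
    single encoder-decoder handles them all without task-specific heads."""
    if task == "caption":
        return Seq2SeqExample("what does the image describe?", label, image)
    if task == "vqa":
        # The question itself serves as the instruction.
        return Seq2SeqExample(text, label, image)
    if task == "grounding":
        # Bounding boxes are serialized into the target as location tokens.
        return Seq2SeqExample(
            f'which region does the text "{text}" describe?', label, image)
    raise ValueError(f"unknown task: {task}")

# Every task yields the same training record, so one seq2seq loss covers all:
batch = [
    make_example("caption", image=b"\x89PNG...",
                 label="two dogs play on a beach"),
    make_example("vqa", image=b"\x89PNG...",
                 text="how many dogs are there?", label="two"),
    make_example("grounding", image=b"\x89PNG...", text="the smaller dog",
                 label="<loc_12> <loc_40> <loc_180> <loc_220>"),
]
for ex in batch:
    print(ex.source, "->", ex.target)
```

Because every task, including finetuning on downstream benchmarks, flows through this same instruction-to-sequence interface, no task-specific output layers are needed, which is the property the abstract refers to as task- and modality-agnostic design.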