2 months ago

PaLI-X: On Scaling up a Multilingual Vision and Language Model

Chen, Xi ; Djolonga, Josip ; Padlewski, Piotr ; Mustafa, Basil ; Changpinyo, Soravit ; Wu, Jialin ; Ruiz, Carlos Riquelme ; Goodman, Sebastian ; Wang, Xiao ; Tay, Yi ; Shakeri, Siamak ; Dehghani, Mostafa ; Salz, Daniel ; Lucic, Mario ; Tschannen, Michael ; Nagrani, Arsha ; Hu, Hexiang ; Joshi, Mandar ; Pang, Bo ; Montgomery, Ceslee ; Pietrzyk, Paulina ; Ritter, Marvin ; Piergiovanni, AJ ; Minderer, Matthias ; Pavetic, Filip ; Waters, Austin ; Li, Gang ; Alabdulmohsin, Ibrahim ; Beyer, Lucas ; Amelot, Julien ; Lee, Kenton ; Steiner, Andreas Peter ; Li, Yang ; Keysers, Daniel ; Arnab, Anurag ; Xu, Yuanzhong ; Rong, Keran ; Kolesnikov, Alexander ; Seyedhosseini, Mojtaba ; Angelova, Anelia ; Zhai, Xiaohua ; Houlsby, Neil ; Soricut, Radu
Abstract

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
