Learning Video Representations from Large Language Models

We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, in both zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on EGTEA classification and 5.9% on Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.
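
To make the contrastive video-text embedding step concrete, the sketch below shows a standard CLIP-style symmetric InfoNCE objective over a batch of clip/narration pairs, where the text side would come from the LLM-generated narrations. This is a minimal illustration under assumed details (function name, embedding dimension, batch size, and temperature are placeholders), not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_video_text_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired clip/narration embeddings."""
    # L2-normalize so the dot product is cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits; matching clip/narration pairs lie on the diagonal.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Contrast in both directions: video-to-text and text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2

# Hypothetical usage: embeddings from a video encoder and a text encoder applied
# to video clips paired with their auto-generated narrations.
video_emb = torch.randn(8, 256)  # batch of 8 clip embeddings (illustrative sizes)
text_emb = torch.randn(8, 256)   # embeddings of the matching narrations
loss = contrastive_video_text_loss(video_emb, text_emb)
```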