
AI Model Learns to Sync Vision and Sound Without Human Labels, Boosting Multimodal Content Applications

Researchers from MIT and partner institutions have developed an AI model that improves machines' ability to learn the connection between sight and sound without human intervention. The advance holds significant potential for industries such as journalism and film production, where the model could help automate the curation of multimodal content, and in the longer term it could improve robots' understanding of real-world environments.

The project builds on the team's earlier work on a machine-learning method called CAV-MAE (Contrastive Audio-Visual Masked Autoencoder), which processes audio and visual data from unlabeled video clips simultaneously and encodes them into separate representations called tokens. The original model learned to map corresponding audio and visual tokens close together in its internal representation space, but it treated each audio and visual sample as a single unit, which limited its precision. For example, a 10-second video clip and the sound of a door slamming that occupies only one second of it were mapped together, so the model could not pinpoint when within the clip the sound occurred.

To address this limitation, the team introduced an improved version called CAV-MAE Sync. The new model splits the audio into smaller windows before computing representations, producing finer-grained audio tokens that correspond to individual video frames. During training, CAV-MAE Sync associates each video frame with the audio that occurs at that moment, yielding a much finer alignment between the two modalities.

The researchers also made architectural adjustments to balance two distinct learning objectives. The first is a contrastive objective, in which the model learns to associate similar audio and visual data; the second is a reconstruction objective, which aims to recover specific audio and visual data in response to user queries. To balance the two, the team introduced two new kinds of data representations: "global tokens" that support contrastive learning and "register tokens" that help the model focus on the details needed for reconstruction.

These changes significantly improved the model's performance in video retrieval and in classifying audio-visual scenes. CAV-MAE Sync can, for instance, match the sound of a door slamming to the exact moment the door closes in a video clip more accurately and efficiently, and it is better at identifying and classifying actions in audio-visual scenes, such as a dog barking or an instrument playing. When the researchers tested CAV-MAE Sync against more complex, state-of-the-art methods that require larger datasets, their approach outperformed them while using less training data. According to Edson Araujo, a graduate student at Goethe University and lead author of the paper, simple ideas and patterns observed in the data can sometimes yield substantial improvements when built into existing models.

Looking ahead, the team aims to further refine CAV-MAE Sync by incorporating more advanced models for generating data representations. They also plan to extend the system to handle text, which could be a crucial step toward a comprehensive audiovisual large language model. Such a model would be highly valuable for tasks that require understanding and processing multiple modalities, including interactive AI systems and content-generation tools.
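To make the frame-level alignment idea concrete, the following is a minimal sketch of how per-frame visual tokens might be contrastively aligned with per-window audio tokens, in the spirit of what the article describes for CAV-MAE Sync. It is illustrative only and not the authors' implementation: the stand-in linear encoders, tensor shapes, embedding dimensions, and the temperature value are all assumptions.

# Illustrative sketch (not the authors' code): fine-grained contrastive
# alignment between per-frame visual tokens and per-window audio tokens.
# Encoders are stand-in linear layers; shapes and temperature are assumed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedAlignment(nn.Module):
    def __init__(self, video_dim=512, audio_dim=128, embed_dim=256, temperature=0.07):
        super().__init__()
        # Stand-ins for the real video and audio encoders.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.temperature = temperature

    def forward(self, video_frames, audio_windows):
        # video_frames:  (batch, n_frames, video_dim) -- one token per frame
        # audio_windows: (batch, n_frames, audio_dim) -- one token per audio
        # window, where window i covers the same moment as frame i.
        v = F.normalize(self.video_proj(video_frames), dim=-1)
        a = F.normalize(self.audio_proj(audio_windows), dim=-1)

        batch, n_frames, _ = v.shape
        # Similarity between every frame and every audio window within a clip.
        logits = torch.einsum("bfd,bwd->bfw", v, a) / self.temperature

        # Positive pairs lie on the diagonal: frame i <-> audio window i.
        targets = torch.arange(n_frames, device=v.device).expand(batch, n_frames)
        loss_v2a = F.cross_entropy(logits.reshape(-1, n_frames), targets.reshape(-1))
        loss_a2v = F.cross_entropy(logits.transpose(1, 2).reshape(-1, n_frames),
                                   targets.reshape(-1))
        return (loss_v2a + loss_a2v) / 2

# Toy usage: 2 clips, each with 10 frames and 10 matching audio windows.
model = FineGrainedAlignment()
frames = torch.randn(2, 10, 512)
windows = torch.randn(2, 10, 128)
print(model(frames, windows))  # scalar contrastive loss

In this sketch, the diagonal of the per-clip similarity matrix supplies the positive pairs, so each video frame is pulled toward the audio window that occurs at the same moment and pushed away from the others, which is the finer-grained alignment the article attributes to CAV-MAE Sync. The reconstruction objective, global tokens, and register tokens mentioned above are not modeled here.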
Industry insiders have praised the development for its potential to transform multimedia content processing. They note that the ability to automatically align audio and visual elements with such precision could streamline workflows in media production, reduce labor costs, and enhance the quality of content. Moreover, integrating audio-visual capabilities into everyday tools such as large language models could pave the way for more intuitive and sophisticated AI applications.

The researchers involved in the project include Andrew Rouditchenko, an MIT graduate student; Edson Araujo, a graduate student at Goethe University; Yuan Gong, a former MIT postdoc; Saurabh Bhati, a current MIT postdoc; and others from IBM Research and the MIT-IBM Watson AI Lab. The senior author, Hilde Kuehne, is a professor of computer science at Goethe University and an affiliated professor at the MIT-IBM Watson AI Lab. The work is funded by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.
