HyperAI

Multimodal Learning

Modality refers to the particular way in which people receive information. Because multimedia data often carries several types of information at once (for example, a video typically transmits textual, visual, and auditory information simultaneously), multimodal learning has gradually become the main approach to multimedia content analysis and understanding.

Multimodal learning mainly includes the following research directions:

  1. Multimodal representation learning: studies how to encode the semantic information contained in data from multiple modalities into real-valued vectors.
  2. Cross-modal mapping: studies how to map information from one modality to another (for example, generating a text description of an image).
  3. Alignment: studies how to identify correspondences between components or elements across different modalities.
  4. Fusion: studies how to integrate models and features from different modalities.
  5. Co-learning: studies how to transfer knowledge learned in information-rich modalities to information-poor modalities, so that the modalities assist each other's learning. Typical methods include multimodal zero-shot learning and domain adaptation.
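Of the directions above, fusion is the easiest to illustrate concretely. The following is a minimal sketch of feature-level fusion: two toy unimodal embedding vectors are concatenated and projected into a shared joint representation. All names, dimensions, and weights here are illustrative placeholders, not any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unimodal embeddings (dimensions are illustrative).
image_emb = rng.standard_normal(512)  # e.g. output of an image encoder
text_emb = rng.standard_normal(300)   # e.g. output of a text encoder

# Feature-level fusion: concatenate the modality vectors,
# then project the result into a shared representation space.
fused = np.concatenate([image_emb, text_emb])          # shape (812,)
W = rng.standard_normal((256, fused.shape[0])) * 0.01  # toy joint projection
joint_repr = np.tanh(W @ fused)                        # shape (256,)

print(joint_repr.shape)
```

Real systems replace the random projection with learned layers (and often use attention-based fusion instead of simple concatenation), but the structure — per-modality encoders feeding a joint representation — is the same.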

References

[1] AI Review Column: Review of Multimodal Learning Research Progress (Zhihu)