Cross-Modal Retrieval
Cross-modal retrieval (CMR) is the task of retrieving relevant items in one modality given a query in another, for example retrieving images from a text query; the modalities involved commonly include images, text, video, and audio. The core challenge is the heterogeneity gap: data from different modalities have distinct representational forms, so they cannot be compared directly. To address this, most CMR methods learn a shared latent embedding space into which items from different modalities are projected, so that their similarity can be measured with a distance or similarity metric. CMR has practical value in areas such as multimedia information retrieval, recommendation systems, and human-computer interaction.
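To make the shared-space idea concrete, the following is a minimal sketch rather than any particular published method. It assumes each modality has its own linear projection into a common space and that similarity is measured by cosine similarity. The function names (`project`, `retrieve`), the feature dimensions, and the random weights standing in for learned projections are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def project(features, weight):
    """Map modality-specific features into the shared space (hypothetical linear encoder)."""
    return l2_normalize(features @ weight)


def retrieve(query_emb, gallery_emb, k=3):
    """Rank gallery items by cosine similarity to the query and return the top k."""
    scores = gallery_emb @ query_emb      # dot product of unit vectors = cosine similarity
    order = np.argsort(-scores)[:k]       # indices of the k highest scores
    return order, scores[order]


rng = np.random.default_rng(0)

# Assumed dimensions: text and image features differ in size (the heterogeneity gap),
# but both are projected into the same 64-dimensional shared space.
text_dim, image_dim, shared_dim = 300, 512, 64

# Random matrices stand in for projections that would normally be learned from paired data.
W_text = rng.normal(size=(text_dim, shared_dim))
W_image = rng.normal(size=(image_dim, shared_dim))

text_query = rng.normal(size=text_dim)            # one text query
image_gallery = rng.normal(size=(10, image_dim))  # ten candidate images

q = project(text_query, W_text)
g = project(image_gallery, W_image)

top_idx, top_scores = retrieve(q, g, k=3)
print(top_idx, top_scores)
```

With untrained random weights the ranking is meaningless; the sketch only shows the mechanics. In practice the projections would be trained so that semantically matching pairs from different modalities lie close together in the shared space.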