Abstract

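In this paper, we consider the task of spotting spoken keywords in silent video sequences, also known as visual keyword spotting. To this end, we investigate Transformer-based models that take two input streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if it is present. Our contributions are as follows: (1) we propose a novel architecture, the Transpotter, which uses full cross-modal attention between the visual and phonetic streams; (2) we show through extensive evaluation that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods by a large margin on the challenging LRW, LRS2, and LRS3 datasets; (3) we demonstrate the ability of our model to spot words under extreme conditions, such as isolated mouthings in sign language videos.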
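As a rough illustration of what attention across the two streams could look like, the sketch below concatenates video-frame features and phoneme embeddings into one sequence and applies a single unparameterised self-attention step, so every frame attends to every phoneme and vice versa. This is only a toy under stated assumptions, not the Transpotter itself: the joint-sequence formulation, the function names, and the 2-dimensional toy features are hypothetical, and learned projections, positional encodings, and the localisation head are all omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(video_feats, phoneme_feats):
    """One scaled dot-product self-attention step over the concatenated
    [video frames; phonemes] sequence (hypothetical toy formulation).

    Returns the attended token vectors and the attention weight matrix.
    """
    tokens = video_feats + phoneme_feats  # joint sequence of both modalities
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Similarity of this token to every token in both streams.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of all token vectors.
        outputs.append(
            [sum(row[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        )
    return outputs, weights


# Toy inputs (made up for illustration): 3 video frames, 2 phonemes.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phonemes = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = joint_attention(video, phonemes)
```

Each row of `weights` spans both modalities, which is the property the term "cross-modal" refers to here; a real model would stack several such layers with learned parameters and add a head that predicts whether and where the keyword occurs.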

K R Prajwal*, Liliane Momeni*, Triantafyllos Afouras, Andrew Zisserman

Visual Keyword Spotting with Attention
