Alleviating the Inequality of Attention Heads for Neural Machine Translation
Zewei Sun, Shujian Huang, Xin-Yu Dai, Jiajun Chen

Abstract
Recent studies show that the attention heads in Transformer are not equal. We relate this phenomenon to the imbalanced training of multi-head attention and the model's dependence on specific heads. To tackle this problem, we propose a simple masking method, HeadMask, in two specific forms. Experiments show that translation improvements are achieved on multiple language pairs. Subsequent empirical analyses also support our assumption and confirm the effectiveness of the method.
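The abstract does not spell out how the masking is applied. The sketch below is a minimal, illustrative take on one plausible variant, randomly zeroing a subset of attention heads during training; the function name `mask_heads` and the tensor layout are assumptions, not details from the paper.

```python
# Minimal sketch: randomly mask attention heads during training.
# Assumes a multi-head attention output shaped (batch, num_heads, seq_len, head_dim).
import torch


def mask_heads(attn_output: torch.Tensor, num_masked: int) -> torch.Tensor:
    """Zero out `num_masked` randomly chosen heads for this forward pass."""
    num_heads = attn_output.size(1)
    # Sample which heads to drop.
    dropped = torch.randperm(num_heads)[:num_masked]
    mask = torch.ones(num_heads, device=attn_output.device)
    mask[dropped] = 0.0
    # Broadcast the per-head mask over batch, sequence, and feature dimensions.
    return attn_output * mask.view(1, num_heads, 1, 1)
```

Applying such a mask only at training time forces the model to distribute useful computation across heads rather than relying on a few dominant ones, which is consistent with the imbalance the abstract describes.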