Graph Convolutions Enrich the Self-Attention in Transformers!

Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA) to learn a general yet effective graph filter, whose complexity is only slightly larger than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph-level tasks, speech recognition, and code classification.
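To make the GSP view concrete, below is a minimal sketch of a self-attention layer whose softmax attention matrix A is treated as a graph filter and replaced by a learnable low-order polynomial filter w0*I + w1*A + w2*A^2. The class name, the order-2 parameterization, and the coefficient initialization are illustrative assumptions, not the paper's exact GFSA formulation.

```python
import torch
import torch.nn as nn


class GraphFilterSelfAttention(nn.Module):
    """Self-attention whose attention matrix is viewed as a graph
    adjacency A and passed through a learnable polynomial filter
    w0*I + w1*A + w2*A^2 (illustrative parameterization)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Per-head polynomial coefficients; starting at (0, 1, 0)
        # reproduces ordinary softmax attention (hypothetical choice).
        self.w = nn.Parameter(torch.tensor([[0.0, 1.0, 0.0]] * num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (B, H, N, d)

        # Original self-attention matrix, interpreted as a simple graph filter.
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        A = scores.softmax(dim=-1)                      # (B, H, N, N)

        # Generalized filter: w0*I + w1*A + w2*A^2.
        I = torch.eye(N, device=x.device).expand(B, self.num_heads, N, N)
        w0, w1, w2 = (self.w[:, i].view(1, -1, 1, 1) for i in range(3))
        H = w0 * I + w1 * A + w2 * (A @ A)

        out = (H @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

The extra cost over standard attention is one additional N-by-N matrix product per head (the A^2 term), which matches the abstract's claim that the complexity is only slightly larger than that of the original self-attention.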