A Dot Product Attention Free Transformer
Joshua M. Susskind Ruixiang Zhang Hanlin Goh Chen Huang Nitish Srivastava Walter Talbott Shuangfei Zhai
Abstract
We introduce the Dot Product Attention Free Transformer (DAFT), an efficient variant of the Transformer (Vaswani et al., 2017) that eliminates the query-key dot product in self-attention. The core idea is to construct a decomposable attention map for each dimension of the query, key, and value. This decomposability enables an implementation in which the attention tensor never needs to be computed or stored explicitly. A DAFT layer has memory complexity linear in both the context size and the feature dimension, making it compatible with large inputs as well as large models. We also introduce DAFT-conv, a model variant that exploits locality and spatial weight sharing while maintaining global connectivity. We conduct experiments on ImageNet-1K classification and on two autoregressive modeling tasks, CIFAR10 and Enwik8. DAFT achieves competitive performance on all benchmarks while providing excellent efficiency.
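To make the linear-memory claim concrete, the following is a minimal sketch of a dot-product-free attention step in the spirit described above: per-dimension softmax weights are computed from the keys alone, values are pooled with those weights, and the result is gated element-wise by the queries. The function name `daft_simple`, the sigmoid gating, and the omission of position biases are assumptions for illustration, not the paper's exact formulation. Note that no T x T attention matrix is ever materialized, so memory stays linear in both the sequence length T and the feature dimension d.

```python
import numpy as np

def daft_simple(Q, K, V):
    """Hypothetical sketch of dot-product-free attention.

    Q, K, V: arrays of shape (T, d). For each feature dimension, a softmax
    over time positions is taken on K, used to pool V into a single (d,)
    context vector, which is then gated element-wise by sigmoid(Q).
    Memory cost is O(T*d): the T x T attention map is never formed.
    """
    # Numerically stable per-dimension softmax weights over time.
    w = np.exp(K - K.max(axis=0, keepdims=True))        # (T, d)
    pooled = (w * V).sum(axis=0) / w.sum(axis=0)        # (d,) global context
    gate = 1.0 / (1.0 + np.exp(-Q))                     # (T, d) sigmoid gate
    return gate * pooled                                # (T, d)
```

Because the pooled context is shared across query positions here, the cost of one layer is O(T*d) in both time and memory; in a full model, learned position biases would reintroduce position-dependent pooling without changing this asymptotic footprint.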