
Universal Transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser
Abstract

Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them slow to train. Feed-forward and convolutional architectures have recently been shown to achieve superior results on some sequence modeling tasks such as machine translation, with the added advantage that they concurrently process all inputs in the sequence, leading to easy parallelization and faster training times. Despite these successes, however, popular feed-forward sequence models like the Transformer fail to generalize in many simple tasks that recurrent models handle with ease, e.g. copying strings or even simple logical inference when the string or formula lengths exceed those observed at training time. We propose the Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses these issues. UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks. In contrast to the standard Transformer, under certain assumptions, UTs can be shown to be Turing-complete. Our experiments show that UTs outperform standard Transformers on a wide range of algorithmic and language understanding tasks, including the challenging LAMBADA language modeling task where UTs achieve a new state of the art, and machine translation where UTs achieve a 0.9 BLEU improvement over Transformers on the WMT14 En-De dataset.
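To make the core idea concrete, the following is a minimal PyTorch sketch of the two mechanisms the abstract describes: a single shared transformer block applied recurrently over depth (recurrence in depth rather than in time), and an ACT-style per-position halting mechanism that lets each token take a different number of refinement steps. This is an illustrative sketch, not the authors' released implementation; the module and parameter names (UniversalTransformerEncoder, max_steps, halt_threshold) and the exact halting bookkeeping are assumptions for demonstration.

```python
# Minimal sketch of the Universal Transformer idea (illustrative, not the
# authors' code): the SAME attention + feed-forward block is applied
# recurrently over depth, and a per-position halting unit decides how many
# steps each token takes.
import torch
import torch.nn as nn


class UniversalTransformerEncoder(nn.Module):
    def __init__(self, d_model=64, n_heads=4, d_ff=256,
                 max_steps=6, halt_threshold=0.99):
        super().__init__()
        # One shared block: the same weights are reused at every depth step.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.halt = nn.Linear(d_model, 1)   # per-position halting probability
        self.max_steps = max_steps
        self.halt_threshold = halt_threshold

    def forward(self, x):                   # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        cum_halt = x.new_zeros(b, n)        # accumulated halting probability
        weighted_state = torch.zeros_like(x)

        state = x
        for _ in range(self.max_steps):
            still_running = (cum_halt < self.halt_threshold).float()
            p = torch.sigmoid(self.halt(state)).squeeze(-1) * still_running

            # Positions that would cross the threshold halt now and contribute
            # their remaining probability mass instead of p.
            newly_halted = (cum_halt + p >= self.halt_threshold).float() * still_running
            p = p * (1 - newly_halted) + (1 - cum_halt) * newly_halted
            cum_halt = cum_halt + p

            # Shared transformer step, applied in parallel to all positions.
            attn_out, _ = self.attn(state, state, state)
            state = self.norm1(state + attn_out)
            state = self.norm2(state + self.ff(state))

            # Output is a halting-probability-weighted mixture of step states.
            weighted_state = weighted_state + p.unsqueeze(-1) * state

            if bool((cum_halt >= self.halt_threshold).all()):
                break
        return weighted_state


if __name__ == "__main__":
    model = UniversalTransformerEncoder()
    tokens = torch.randn(2, 10, 64)         # dummy token embeddings
    print(model(tokens).shape)              # torch.Size([2, 10, 64])
```

Because every depth step processes all positions at once, the model keeps the Transformer's parallelism and global receptive field, while the shared weights and iterative refinement supply the recurrent inductive bias the abstract refers to.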
