
Subformer: A Parameter Reduced Transformer

Machel Reid, Edison Marrese-Taylor, Yutaka Matsuo

Abstract

The advent of the Transformer can arguably be described as a driving force behind many of the recent advances in natural language processing. However, despite its sizeable performance gains, the model has recently been shown to be severely over-parameterized, making it parameter-inefficient and computationally expensive to train. Inspired by the success of parameter-sharing in pre-trained deep contextualized word representation encoders, we explore parameter-sharing methods in Transformers, with a specific focus on encoder-decoder models for sequence-to-sequence tasks such as machine translation. We analyze different parameter sharing/reduction methods and develop the Subformer, a parameter-efficient Transformer-based model which combines the newly proposed sandwich-style parameter-sharing technique with self-attentive embedding factorization (SAFE). Experiments on machine translation, abstractive summarization, and language modeling show that the Subformer can outperform the Transformer even when using significantly fewer parameters. On the WMT'14 English-German test set, we show that we can match, and sometimes outperform (+0.1 BLEU), the Transformer-base model while using 40% fewer parameters. We also match Transformer-big with 40% fewer parameters and outperform it by 0.7 BLEU with 12M fewer parameters. Finally, we outperform the standard Transformer-XL model, achieving perplexity 3.6 points lower with 37% fewer parameters.
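To make the two ideas named in the abstract concrete, below is a minimal PyTorch sketch (not the authors' released code). Sandwich-style sharing is illustrated by reusing a single middle layer's weights while the first and last layers keep their own parameters; SAFE is approximated by an ALBERT-style factorized embedding with a plain linear up-projection standing in for the paper's self-attentive projection. All class names and hyperparameters (d_model=512, 6 layers, a 32k vocabulary) are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of sandwich-style parameter sharing and factorized embeddings.
# This is an illustration of the general ideas, not the Subformer implementation.
import torch
import torch.nn as nn


class SandwichSharedEncoder(nn.Module):
    """First and last layers keep unique weights; all middle layers reuse one
    shared layer (the sandwich-style sharing pattern described in the paper)."""

    def __init__(self, d_model=512, nhead=8, num_layers=6, dim_feedforward=2048):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward, batch_first=True
        )
        self.first = make_layer()           # unique parameters
        self.shared = make_layer()          # single parameter set reused in the middle
        self.last = make_layer()            # unique parameters
        self.num_middle = num_layers - 2    # how many times the shared layer is applied

    def forward(self, x, mask=None):
        x = self.first(x, src_mask=mask)
        for _ in range(self.num_middle):    # weight reuse: same module applied repeatedly
            x = self.shared(x, src_mask=mask)
        return self.last(x, src_mask=mask)


class FactorizedEmbedding(nn.Module):
    """ALBERT-style stand-in for SAFE: a small embedding followed by an
    up-projection (the paper uses a self-attentive projection instead of
    the plain linear layer used here)."""

    def __init__(self, vocab_size=32000, d_small=128, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_small)
        self.project = nn.Linear(d_small, d_model, bias=False)

    def forward(self, token_ids):
        return self.project(self.embed(token_ids))


if __name__ == "__main__":
    count = lambda m: sum(p.numel() for p in m.parameters())
    shared = SandwichSharedEncoder()
    unshared = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(512, 8, 2048, batch_first=True), num_layers=6
    )
    print(f"sandwich-shared encoder: {count(shared):,} params")
    print(f"fully unshared encoder:  {count(unshared):,} params")
    print(f"factorized embedding:    {count(FactorizedEmbedding()):,} params "
          f"vs. full embedding: {32000 * 512:,}")
```

Running the script prints parameter counts side by side, showing where the savings come from: the shared middle layers shrink the encoder stack, and the factorized embedding shrinks the vocabulary projection. The exact reductions reported in the abstract depend on the paper's specific configurations.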

