Language Modelling on WikiText-103
Metrics
Number of params
Test perplexity
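Test perplexity is the exponential of the average negative log-likelihood per token over the held-out test set; on WikiText-103 it is conventionally reported at the word level, and lower is better. A minimal sketch of the computation, assuming per-token log-probabilities are already available from some model (the `token_log_probs` name and the toy numbers below are illustrative, not taken from any entry in this benchmark):

```python
import math

def test_perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    token_log_probs: natural-log probabilities the model assigned to each
    test-set token (one float per token).
    """
    total_nll = -sum(token_log_probs)                  # total negative log-likelihood (nats)
    return math.exp(total_nll / len(token_log_probs))  # average NLL, then exponentiate

# Toy usage with made-up log-probabilities (not real model outputs):
print(test_perplexity([math.log(0.1), math.log(0.05), math.log(0.2)]))  # ~10.0
```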
Results
Performance of various models on this benchmark, reported as parameter count and test-set perplexity (lower is better).
Comparison Table
Model Name | Number of params | Test perplexity |
---|---|---|
improving-neural-language-models-by | 257M | 17.4 |
alleviating-sequence-information-loss-with | - | 32.85 |
fast-parametric-learning-with-activation | - | 34.3 |
reformer-the-efficient-transformer-1 | - | 26.0 |
how-much-complexity-does-an-rnn-architecture | - | - |
language-modeling-with-gated-convolutional | - | 44.9 |
transformer-xl-attentive-language-models | 257M | 18.3 |
subformer-a-parameter-reduced-transformer | 96M | 20.39 |
efficient-content-based-sparse-attention-with-1 | - | 15.8 |
when-attention-meets-fast-recurrence-training | 148M | 18.3 |
when-attention-meets-fast-recurrence-training | 234M | 17.1 |
differentiable-model-compression-via-pseudo | - | 18.0 |
the-information-pathways-hypothesis | - | 17.18 |
language-models-are-unsupervised-multitask | 774M | 22.05 |
primal-attention-self-attention-through | - | 31.0 |
shortformer-better-language-modeling-using | 247M | 17.56 |
hungry-hungry-hippos-towards-language | 355M | 16.9 |
dynamic-evaluation-of-transformer-language | 257M | 16.4 |
hungry-hungry-hippos-towards-language | - | 18.5 |
generalization-through-memorization-nearest | 247M | 15.79 |
fast-parametric-learning-with-activation | - | 29.7 |
190409408 | 395M | 20.4 |
finetuning-pretrained-transformers-into-rnns | - | 19.6 |
improving-neural-language-modeling-via | - | 28.0 |
an-analysis-of-neural-language-modeling-at | 151M | 33.0 |
on-the-adequacy-of-untuned-warmup-for | - | - |
deep-equilibrium-models | 180M | 29.0 |
transformer-xl-attentive-language-models | 151M | 24.0 |
rethinking-attention-with-performers | - | 26.8 |
adaptive-input-representations-for-neural | 247M | 18.70 |
accessing-higher-level-representations-in | 139M | 18.2 |
augmenting-self-attention-with-persistent | 133M | 20.6 |
all-nlp-tasks-are-generation-tasks-a-general | 10000M | 12.22 |
improving-neural-language-models-with-a | - | 44.8 |
infty-former-infinite-memory-transformer | - | 16.64 |
gateloop-fully-data-controlled-linear | 125M | 13.4 |
improving-neural-language-models-with-a | - | 40.8 |
transformers-are-rnns-fast-autoregressive | - | 25.6 |
general-purpose-long-context-autoregressive | - | 18.4 |
hungry-hungry-hippos-towards-language | 1300M | 12.5 |
efficiently-modeling-long-sequences-with-1 | 249M | 21.28 |
infty-former-infinite-memory-transformer | - | 24.22 |
language-models-are-unsupervised-multitask | 124M | 37.50 |
language-models-are-unsupervised-multitask | 1542M | 17.48 |
dynamic-evaluation-of-transformer-language | 257M | 17.0 |
you-can-t-pick-your-neighbors-or-can-you-when | 247M | 15.5 |
random-feature-attention-1 | - | 30.5 |
infty-former-infinite-memory-transformer | - | 16.61 |
the-information-pathways-hypothesis | - | 17.60 |
language-models-are-unsupervised-multitask | 355M | 26.37 |
hyena-hierarchy-towards-larger-convolutional | - | 18.6 |
improving-transformer-models-by-reordering | 247M | 17.96 |
an-empirical-evaluation-of-generic | - | 45.19 |
language-modeling-with-gated-convolutional | - | 37.2 |
advancing-state-of-the-art-in-language | - | 13.29 |
time-aware-large-kernel-convolutions | 240M | 23.3 |
segabert-pre-training-of-segment-aware-bert | 257M | 17.1 |
convolutional-sequence-modeling-revisited | - | 45.2 |
megatron-lm-training-multi-billion-parameter | 8300M | 10.81 |
improving-neural-language-models-with-a | - | 48.7 |
pay-attention-when-required | - | 22.7 |
compressive-transformers-for-long-range-1 | - | 17.1 |
deep-equilibrium-models | 110M | 23.2 |
mega-moving-average-equipped-gated-attention | 252M | 18.07 |
fnetar-mixing-tokens-with-autoregressive | 144.4M | 25.81 |
pay-attention-when-required | - | 18.4 |
generalization-through-memorization-nearest | 247M | 16.12 |
random-feature-attention-1 | - | 23.5 |
fast-parametric-learning-with-activation | - | 36.4 |
accessing-higher-level-representations-in | 44M | 22.4 |
delight-very-deep-and-light-weight | 99M | 24.14 |
hungry-hungry-hippos-towards-language | 2700M | 10.6 |
memory-efficient-stochastic-methods-for | 122M | 22.91 |
relational-recurrent-neural-networks | - | 31.6 |
all-nlp-tasks-are-generation-tasks-a-general | 10000M | 11.33 |
improving-language-models-by-retrieving-from | 7532M | 2.4 |
deep-equilibrium-models | 138M | 32.4 |
hyena-hierarchy-towards-larger-convolutional | - | 18.5 |
fast-parametric-learning-with-activation | - | 29.2 |
shortformer-better-language-modeling-using | 247M | 18.15 |
trellis-networks-for-sequence-modeling | - | 29.19 |
revisiting-simple-neural-probabilistic | 148M | 25.2 |
hungry-hungry-hippos-towards-language | 125M | 23.7 |
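For quick comparisons, the pipe-delimited rows above can be parsed and ranked by test perplexity. A minimal sketch, assuming the raw table text is pasted into a string; the `TABLE` literal below copies only a handful of rows from the comparison table, and entries whose perplexity is not reported ("-") are skipped:

```python
# Minimal sketch: rank benchmark entries by test perplexity (ascending).
# TABLE holds a few rows copied verbatim from the comparison table above.
TABLE = """\
improving-language-models-by-retrieving-from | 7532M | 2.4 |
hungry-hungry-hippos-towards-language | 2700M | 10.6 |
megatron-lm-training-multi-billion-parameter | 8300M | 10.81 |
transformer-xl-attentive-language-models | 257M | 18.3 |
how-much-complexity-does-an-rnn-architecture | - | - |
"""

def parse_rows(text):
    """Yield (model, params, perplexity) tuples, skipping unreported scores."""
    for line in text.strip().splitlines():
        name, params, ppl = [c.strip() for c in line.strip("|").split("|")[:3]]
        if ppl != "-":
            yield name, params, float(ppl)

# Lower perplexity is better, so sort ascending.
for name, params, ppl in sorted(parse_rows(TABLE), key=lambda row: row[2]):
    print(f"{ppl:6.2f}  {params:>6}  {name}")
```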