GPT-3 175B (Few-Shot) | 86.4 | Language Models are Few-Shot Learners | |
Megatron-Turing NLG 530B (Few-Shot) | - | Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model | - |
GPT-3 13B (Zero-Shot) | 72.5 | Language Models are Few-Shot Learners | |
Gated-Attention Reader (+ features) | 49.0 | Broad Context Language Modeling as Reading Comprehension | - |
GPT-3 2.7B (Zero-Shot) | 67.1 | Language Models are Few-Shot Learners | |
LLaMA-30B+CFG (zero-shot) | 83.9 | Stay on topic with Classifier-Free Guidance | - |
Universal Transformer (w/ dynamic halting) | 56.25 | Universal Transformers | |
SparseGPT (175B, 2:4 Sparsity) | 79.47 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | |
SparseGPT (175B, 50% Sparsity) | 76.51 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | |
Residual Shuffle-Exchange network | 54.34 | Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences | |