Language Models are Unsupervised Multitask Learners
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
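As a concrete illustration of the zero-shot conditioning described above, the following is a minimal sketch of answering a question by conditioning a language model on a document plus a question and greedily decoding a continuation. It assumes the HuggingFace transformers library and the publicly released "gpt2" checkpoint; the prompt layout (document, question, then an "A:" cue) is an approximation of the CoQA setup rather than the exact evaluation pipeline.

```python
# Sketch: zero-shot question answering by conditioning a language model
# on a document and a question, then generating an answer.
# Assumes the HuggingFace transformers library and the public "gpt2"
# checkpoint; the prompt format below is an approximation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

document = "The Apollo 11 mission landed the first humans on the Moon in 1969."
question = "Q: When did humans first land on the Moon?"
prompt = f"{document}\n{question}\nA:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,  # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
    )

# Keep only the newly generated tokens, i.e. the model's answer.
answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:])
print(answer.strip())
```

No task-specific fine-tuning is involved: the model is used exactly as trained on WebText, and the task is specified purely through the text it is conditioned on.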