Curated by an Amazon Engineer: A Reading List of More Than 40 LLM Papers

In 2023, large language models remained the talk of the town. OpenAI's boardroom drama, the flood of new models and products from major companies, and the rapid adoption of large models across industries all point to enormous room for growth. After ChatGPT became an overnight sensation, news of heavyweights from every field entering the race kept coming, and startups with varied funding and technical backgrounds sprang up like mushrooms after rain.
This lively scene is unlikely to cool down in 2024. More and more companies and traditional industries are exploring how to apply large language models to their own businesses. Rapidly growing market demand has in turn deepened and accelerated research in related fields, and papers on platforms such as arXiv are being updated ever more frequently.
Which papers are worth reading? What knowledge points are behind the complex paper titles?
To help you find high-value papers faster, Amazon engineer Eugene Yan and others have established a reading list of language model papers and continue to share cutting-edge work. They have currently compiled more than 40 high-quality papers.
Collection link:
https://eugeneyan.com/writing/llm-reading-list/
Follow the official account and reply "LLM Papers" to download the collection of papers.
Transformer pioneering paper
Attention Is All You Need

*author:NEAR co-founder Illia Polosukhin (former Google AI team member) and others
*original:https://arxiv.org/abs/1706.03762
Mainstream sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best-performing models also connect the encoder and decoder through an attention mechanism. This study proposes a new, simple network architecture, the Transformer, which is based entirely on attention and dispenses with recurrence and convolutions altogether. Experiments on two machine translation tasks show that these models are of higher quality, are more parallelizable, and require significantly less training time.
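To make the mechanism concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention, the building block the paper stacks into multi-head attention; the learned projection matrices and the multiple parallel heads of the full architecture are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # similarity of every query to every key
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```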
GPT: Improving Language Understanding Through Generative Pre-training
Improving Language Understanding by Generative Pre-Training

*author:OpenAI
*original:https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Natural language understanding covers a wide range of tasks, such as textual entailment, question answering, and semantic similarity assessment. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, which makes it difficult for discriminatively trained models to perform well. OpenAI researchers proposed that this can be improved by generatively pre-training a language model on a large unlabeled text corpus and then discriminatively fine-tuning it on each specific task. They used task-aware input transformations during fine-tuning, which enables effective transfer learning while requiring only minimal changes to the model architecture.
Comparative experiments on general tasks show that the model achieved improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).
BERT: Pre-training Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

*author:Google AI Language
*original:https://arxiv.org/abs/1810.04805
The researchers proposed a new language representation model, BERT (Bidirectional Encoder Representations from Transformers), which pre-trains deep bidirectional representations by taking context into account in all layers. Therefore, the pre-trained BERT model can be fine-tuned by simply adding an output layer, thereby creating advanced models for multiple tasks such as question answering and language reasoning without making extensive modifications to the architecture of specific tasks.
BERT achieved significant improvements on 11 natural language processing tasks, including pushing the GLUE score to 80.5% (a 7.7% absolute improvement), MultiNLI accuracy to 86.7% (a 4.6% absolute improvement), SQuAD v1.1 question answering test F1 to 93.2 (a 1.5-point absolute improvement), and SQuAD v2.0 test F1 to 83.1 (a 5.1-point absolute improvement).
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

*author:Google Research
*original:https://arxiv.org/abs/1910.10683
The researchers introduced a unified framework that converts all text-based language problems into a text-to-text format, and used it to systematically explore transfer learning techniques for NLP. The study compared pre-training objectives, architectures, unlabeled datasets, transfer methods, and other factors across dozens of language understanding tasks. By combining the insights from this comparison with scale and the team's newly introduced Colossal Clean Crawled Corpus (C4), the study achieved state-of-the-art results on many benchmarks covering summarization, question answering, and text classification.
GPT-2: Language models are unsupervised multitask learners
Language Models are Unsupervised Multitask Learners

*author:OpenAI
*original:https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
The study demonstrates that language models can learn natural language processing tasks without explicit supervision when trained on WebText, a new dataset of millions of web pages. When conditioned on a document plus questions, the answers generated by the language model reach an F1 score of 55 on the CoQA dataset, matching or exceeding 3 of the 4 baseline systems without using the 127,000+ training examples. GPT-2 is a 1.5-billion-parameter Transformer that achieves state-of-the-art results on 7 of the 8 tested language modeling datasets in a zero-shot setting, yet still underfits WebText.
GPT-3: Language models are few-shot learners
Language Models are Few-Shot Learners

*author:Anthropic founder Dario Amodei, OpenAI co-founder Ilya Sutskever, and others
*original:https://arxiv.org/abs/2005.14165
The researchers trained the autoregressive language model GPT-3 and measured its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning; tasks and few-shot demonstrations are specified purely via text interaction with the model. GPT-3 achieves strong performance on most NLP datasets, including translation and question answering, as well as tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. In addition, the researchers found that GPT-3 can generate news articles that people have difficulty distinguishing from human-written ones.
Scaling laws for neural language models: training bigger models on less data
Scaling Laws for Neural Language Models

*author:Anthropic founder Dario Amodei and OpenAI researchers
*original:https://arxiv.org/abs/2001.08361
The researchers studied empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. The dependence of overfitting on model/dataset size and the dependence of training speed on model size are governed by simple equations. Based on this, the researchers concluded that larger models are significantly more sample-efficient, so optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
Chinchilla: Training large language models with optimal computational efficiency
Training Compute-Optimal Large Language Models

*author:Google DeepMind
*original:https://arxiv.org/abs/2203.15556
The researchers proposed that model size and the number of training tokens should be scaled in equal proportion, and tested this hypothesis by training a predicted compute-optimal model, Chinchilla. Chinchilla uses the same compute budget as Gopher but has 70 billion parameters and 4 times more data. Chinchilla significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a wide range of downstream evaluation tasks. This also means Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream applications.
LLaMA: Open and efficient foundation language models
LLaMA: Open and Efficient Foundation Language Models

*author:Guillaume Lample, co-founder of Mistral AI (formerly worked at Meta AI) and others
*original:https://arxiv.org/abs/2302.13971
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. Meta AI researchers trained the models on trillions of tokens using only publicly available datasets, without resorting to proprietary or inaccessible data. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, while LLaMA-65B is competitive with Chinchilla-70B and PaLM-540B.
InstructGPT: Training a language model to follow instructions via human feedback
Training language models to follow instructions with human feedback

*author:OpenAI
*original:https://arxiv.org/abs/2203.02155
The researchers show that fine-tuning with human feedback across a wide range of tasks can align language models with user intent. They call the resulting models InstructGPT. In human evaluations on their prompt distribution, outputs from the 1.3B-parameter InstructGPT model are preferred over outputs from the 175B GPT-3 model. InstructGPT also shows improved truthfulness and reduced toxic output.
LoRA: Low-rank adaptation of large language models
LoRA: Low-Rank Adaptation of Large Language Models

*author:Microsoft
*original:https://arxiv.org/abs/2106.09685
Microsoft researchers proposed LoRA (Low-Rank Adaptation), which can freeze the weights of the pre-trained model and inject a trainable rank decomposition matrix into each layer of the Transformer architecture, thereby greatly reducing the number of trainable parameters for downstream tasks. Compared with GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and GPU memory requirements by 3 times.
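As an illustrative sketch of the idea (not the authors' implementation): the pre-trained weight matrix stays frozen, and only a low-rank update B·A with small rank r is trained, scaled by α/r and added to the layer's output.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)     # freeze the pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B are trainable
```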
QLoRA: Efficient Fine-tuning of Quantized Large Language Models
QLoRA: Efficient Finetuning of Quantized LLMs

*author:Researchers at the University of Washington
*original:https://arxiv.org/abs/2305.14314
QLoRA is an efficient fine-tuning method that reduces memory usage enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. QLoRA back-propagates gradients through a frozen, 4-bit quantized pre-trained language model into LoRA adapters. The researchers named their best QLoRA-based model Guanaco, which outperforms all previously publicly released models on the Vicuna benchmark, reaching 99.3% of ChatGPT's performance level while requiring only 24 hours of fine-tuning on a single GPU.
DPR: Dense Passage Retrieval for Open Domain Question Answering
Dense Passage Retrieval for Open-Domain Question Answering

*author:FAIR at Meta
*original:https://arxiv.org/abs/2004.04906
In this study, researchers show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain question answering datasets, the dense retriever outperforms a strong Lucene-BM25 system by 9%-19% absolute in top-20 passage retrieval accuracy.
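The retrieval step itself reduces to nearest-neighbor search over dense vectors. A minimal sketch, assuming the question and passage embeddings have already been produced by the two trained encoders:

```python
import numpy as np

def rank_passages(q_emb: np.ndarray, passage_embs: np.ndarray, top_k: int = 20):
    """Rank passages by inner-product similarity to the question embedding.

    q_emb: (d,) question vector; passage_embs: (n_passages, d) pre-computed index.
    """
    scores = passage_embs @ q_emb                 # dot-product relevance scores
    order = np.argsort(-scores)[:top_k]           # indices of the top-k passages
    return order, scores[order]
```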
RAG: Retrieval-augmented Generation for Knowledge-Intensive NLP Tasks
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

*author:Researchers from Meta and University College London
*original:https://arxiv.org/abs/2005.11401
The researchers proposed RAG (retrieval-augmented generation), a general-purpose fine-tuning recipe that combines pre-trained parametric and non-parametric memory for language generation. In the RAG models, the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed through a pre-trained neural retriever (DPR). The researchers compared two RAG formulations: one conditions on the same retrieved passages across the whole generated sequence, while the other can use different passages for each token. On language generation tasks, the researchers found that RAG models generate language that is more specific, diverse, and factual than a state-of-the-art parametric-only seq2seq baseline.
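A rough retrieve-then-generate sketch of the idea; `embed` and `generate` below are hypothetical callables standing in for the paper's DPR query encoder and seq2seq generator, and the prompt format is an assumption for illustration only:

```python
from typing import Callable, List
import numpy as np

def rag_answer(question: str,
               passages: List[str],
               passage_embs: np.ndarray,             # pre-built dense index of the passages
               embed: Callable[[str], np.ndarray],   # hypothetical query encoder (DPR-style)
               generate: Callable[[str], str],       # hypothetical seq2seq generator
               k: int = 5) -> str:
    """Sketch of retrieve-then-generate: condition the generator on retrieved passages."""
    q_vec = embed(question)
    top = np.argsort(-(passage_embs @ q_vec))[:k]    # non-parametric memory lookup
    context = "\n\n".join(passages[i] for i in top)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```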
RETRO: Improving Language Model Performance by Retrieving from Trillions of Tokens
Improving language models by retrieving from trillions of tokens

*author:Google DeepMind
*original:https://arxiv.org/abs/2112.04426
Conditioning on a 2-trillion-token retrieval database, the Retrieval-Enhanced Transformer (RETRO) achieves performance on the Pile comparable to GPT-3 and Jurassic-1 despite using 25x fewer parameters. RETRO combines a frozen BERT retriever, a differentiable encoder, and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than is consumed during training.
Internet-augmented language models via few-shot prompting for open-domain question answering
Internet-augmented language models through few-shot prompting for open-domain question answering

*author:Google DeepMind
*original:https://arxiv.org/abs/2203.05115
The study aims to leverage the unique few-shot capabilities of large-scale language models (LSLMs) to overcome the challenges they face with factual grounding and up-to-date information. The researchers found that language models conditioned on retrieved web evidence outperform closed-book models of similar or even larger size on open-domain question answering. In addition, by generating multiple answers from multiple pieces of retrieved evidence and then reranking them using scores produced by the same LMs, inference-time compute can be increased to improve performance and alleviate the otherwise lower performance of few-shot LMs.
HyDE: Zero-Shot Dense Retrieval without Relevance Labels
Precise Zero-Shot Dense Retrieval without Relevance Labels

*author:Researchers from Carnegie Mellon University and the University of Waterloo
*original:https://arxiv.org/abs/2212.10496
In this experiment, HyDE (Hypothetical Document Embeddings) first guides an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document in a zero-shot manner. The document captures the relevance pattern, but is fictitious and may contain spurious details. Then, an unsupervised contrastively learned encoder (e.g., Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space where similar real documents are retrieved based on vector similarity. Experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever across a variety of tasks and languages, and exhibits strong performance comparable to fine-tuned retrievers.
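A minimal sketch of the pipeline, with `generate` and `encode` as hypothetical callables standing in for the instruction-following LLM and the contrastive encoder; the prompt wording is an assumption for illustration:

```python
from typing import Callable
import numpy as np

def hyde_retrieve(question: str,
                  doc_embs: np.ndarray,                 # embeddings of the real corpus
                  generate: Callable[[str], str],       # hypothetical instruction-following LLM
                  encode: Callable[[str], np.ndarray],  # hypothetical contrastive encoder
                  k: int = 10) -> np.ndarray:
    """HyDE sketch: retrieve with the embedding of a *generated* hypothetical document."""
    fake_doc = generate(f"Write a short passage that answers the question: {question}")
    q_vec = encode(fake_doc)                      # embed the hypothetical document, not the question
    return np.argsort(-(doc_embs @ q_vec))[:k]    # nearest real documents by inner product
```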
FlashAttention: Fast, memory-efficient exact attention with IO-awareness
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

*author:Researchers from Stanford University and the State University of New York
*original:https://arxiv.org/abs/2205.14135
FlashAttention is an IO-aware, exact attention algorithm that uses tiling to reduce the number of memory reads and writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention and block-sparse FlashAttention enable longer contexts in Transformers, yielding higher-quality models and entirely new capabilities.
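FlashAttention itself is a fused GPU kernel, but the trick it builds on, processing keys and values block by block with a running max and normalizer (online softmax) so the full attention matrix is never materialized, can be sketched in NumPy for a single query:

```python
import numpy as np

def blockwise_attention(q, K, V, block=128):
    """Attention output for one query vector, computed over K/V in blocks (illustrative)."""
    d = q.shape[-1]
    m = -np.inf                        # running maximum of the scores
    denom = 0.0                        # running softmax normalizer
    acc = np.zeros(V.shape[-1])        # running weighted sum of values
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)                 # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                  # rescale earlier partial results
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / denom

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64)), rng.normal(size=64)
scores = K @ q / np.sqrt(64)
ref = np.exp(scores - scores.max()); ref /= ref.sum()
assert np.allclose(blockwise_attention(q, K, V), ref @ V)   # matches exact attention
```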
ALiBi: Attention with linear biases enables input length extrapolation
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

*author:Research teams from the University of Washington, FAIR, etc.
*original:https://arxiv.org/abs/2108.12409
The researchers proposed a simpler and more efficient position representation method, ALiBi (Attention with Linear Biases). A 1.3-billion-parameter model trained with ALiBi on input sequences of length 1024 extrapolates at inference time to sequences of length 2048, matching a sinusoidal position embedding model trained on length-2048 inputs, while training 11% faster and using 11% less memory.
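ALiBi adds no position embeddings at all; each head simply subtracts a fixed, head-specific slope times the query-key distance from the attention scores. A minimal sketch of the causal bias for one head (the paper's geometric per-head slope schedule is omitted):

```python
import numpy as np

def alibi_bias(seq_len: int, slope: float) -> np.ndarray:
    """Causal ALiBi bias for one head: penalize keys in proportion to how far back they are."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]        # query index minus key index
    bias = -slope * distance.astype(float)        # 0 for the current token, more negative further back
    bias[distance < 0] = -np.inf                  # mask future positions (causal LM)
    return bias                                   # added to Q K^T / sqrt(d) before the softmax

print(alibi_bias(4, slope=0.5))
```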
Codex: Evaluating Large Language Models Trained on Code
Evaluating Large Language Models Trained on Code

*author:OpenAI
*original:https://arxiv.org/abs/2107.03374
The researchers introduced Codex, a GPT language model fine-tuned on publicly available code from GitHub, and studied its Python code-writing capabilities. They also released HumanEval, a new evaluation set for measuring the functional correctness of programs synthesized from docstrings. On this evaluation set, Codex solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%.
Layer Normalization
Layer Normalization

*author:Researchers at the University of Toronto
*original:https://arxiv.org/abs/1607.06450
The researchers transpose batch normalization into layer normalization: the mean and variance used for normalization are computed from all of the summed inputs to the neurons in a layer on a single training case. Unlike batch normalization, layer normalization performs exactly the same computation at training and test time. Empirically, layer normalization can substantially reduce training time compared with previously published techniques.
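A minimal sketch of the computation, assuming a learned per-feature gain and bias as in the paper: the statistics are taken over the features of each individual example, so nothing depends on the batch.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each example over its own features (last axis), then rescale and shift.

    x: (batch, features); gain, bias: (features,)
    """
    mean = x.mean(axis=-1, keepdims=True)         # per-example statistics: no dependence on the batch
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias
```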
Layer Normalization in the Transformer Architecture
On Layer Normalization in the Transformer Architecture

*author:Microsoft
*original:https://arxiv.org/abs/2002.04745
Using mean field theory, the researchers prove that at initialization, for the original Post-LN Transformer design, the expected gradients of the parameters near the output layer are large, so using a large learning rate there makes training unstable. In contrast, if layer normalization is placed inside the residual blocks (the Pre-LN Transformer), the gradients are well behaved at initialization. The study shows that the warm-up stage can be removed for Pre-LN Transformers, achieving results comparable to the baseline in practice while reducing training time and hyperparameter tuning.
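The difference is only where normalization sits relative to the residual connection, as the schematic sketch below shows (`sublayer` stands for attention or the feed-forward network):

```python
# Post-LN (original Transformer): normalize *after* adding the residual.
def post_ln_block(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-LN: normalize only the sub-layer input; the residual path is untouched,
# which keeps gradients well behaved at initialization and lets warm-up be removed.
def pre_ln_block(x, sublayer, norm):
    return x + sublayer(norm(x))
```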
PPO: Proximal Policy Optimization Algorithm
Proximal Policy Optimization Algorithms

*author:OpenAI
*original:https://arxiv.org/abs/1707.06347
The researchers proposed PPO (proximal policy optimization), which has some of the benefits of TRPO (trust region policy optimization) but is simpler, more general, and has better sample complexity. Tested on a collection of benchmark tasks, PPO outperforms other online policy gradient methods and overall strikes a favorable balance between sample complexity, simplicity, and wall-clock time.
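The heart of PPO is the clipped surrogate objective: the probability ratio between the new and old policy is clipped to [1 − ε, 1 + ε] so that a single update cannot push the policy too far. A rough PyTorch sketch of the loss:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate loss (to be minimized); all inputs are per-action tensors."""
    ratio = torch.exp(logp_new - logp_old)                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # maximize the pessimistic bound
```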
WizardCoder: Using Evol-Instruct to enhance the ability of code large language models
WizardCoder: Empowering Code Large Language Models with Evol-Instruct

*author:Researchers from Microsoft and Hong Kong Baptist University
*original:https://arxiv.org/abs/2306.08568
The researchers proposed WizardCoder, which adapts the Evol-Instruct method to the code domain to fine-tune Code LLMs with complex instruction data. Experiments on four code generation benchmarks, HumanEval, HumanEval+, MBPP, and DS-1000, show that WizardCoder substantially surpasses all other open-source Code LLMs. Moreover, on HumanEval and HumanEval+, WizardCoder even surpasses Anthropic's Claude and Google's Bard.
Llama 2: Open foundation and fine-tuned chat models
Llama 2: Open Foundation and Fine-Tuned Chat Models

*author:GenAI, Meta
*original:https://arxiv.org/abs/2307.09288
Llama 2 is a collection of pre-trained and fine-tuned large language models ranging in scale from 7 billion to 70 billion parameters. The researchers' fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. The paper details the researchers' approach to fine-tuning and to improving the safety of Llama 2-Chat.
RWKV: Redefining Recurrent Neural Networks (RNNs) for the Transformer Era
RWKV: Reinventing RNNs for the Transformer Era

*author:EleutherAI, University of Barcelona and other research teams
*original:https://arxiv.org/abs/2305.13048
The researchers proposed a novel model architecture, Receptance Weighted Key Value (RWKV), which combines the efficient parallel training of Transformers with the efficient inference of RNNs. The method uses a linear attention mechanism that lets the model be formulated either as a Transformer or as an RNN, so computation can be parallelized during training while computation and memory remain constant during inference. The researchers scaled the model to 14 billion parameters, the largest dense RNN trained to date.
RLAIF: Harmlessness from AI feedback
Constitutional AI: Harmlessness from AI Feedback

*author:Anthropic
*original:https://arxiv.org/abs/2212.08073
The researchers tried to train an AI assistant through self-improvement and called this method Constitutional AI. The training process includes two stages: supervised learning and reinforcement learning. In the supervised learning stage, the researchers sampled from the initial model, then generated self-criticisms and revisions, and finally fine-tuned the original model on the revised responses.
In the reinforcement learning phase, the researchers sampled from the fine-tuned model, used a model to evaluate which of two samples was better, and then trained a preference model on this dataset of AI preferences. They then used the preference model as the reward signal for RL training, a method they call "RL from AI Feedback" (RLAIF).
Very large-scale neural networks
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

*author:Google Brain (merged with DeepMind)
*original:https://arxiv.org/abs/1701.06538
The researchers introduced a sparsely gated MoE (Mixture-of-Experts) layer consisting of up to thousands of feed-forward sub-networks, applying MoE to language modeling and machine translation tasks. In these tasks, model capacity is critical for absorbing the vast amount of knowledge in the training corpus. The researchers proposed a model architecture in which an MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, the model significantly outperforms the state of the art at a lower computational cost.
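A simplified sketch of a sparsely gated MoE layer (the paper's noisy gating and load-balancing loss are omitted): a gating network scores the experts, only the top-k experts are evaluated per token, and their outputs are combined with renormalized gate weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Simplified top-k gated mixture-of-experts layer (illustrative sketch only)."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                   # x: (n_tokens, d_model)
        topk_val, topk_idx = self.gate(x).topk(self.k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)               # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                          # loop form for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

print(SparseMoE()(torch.randn(16, 512)).shape)              # torch.Size([16, 512])
```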
CLIP: Learning transferable vision models from natural language supervision
Learning Transferable Visual Models From Natural Language Supervision

*author:OpenAI
*original:https://arxiv.org/abs/2103.00020
The researchers proposed a pre-training task of predicting which caption goes with which image, an efficient and scalable way to learn state-of-the-art image representations from scratch. The study used a dataset of 400 million pairs of images and text collected from the Internet. After pre-training, natural language is used to reference the learned visual concepts (or describe new concepts), enabling zero-shot transfer of the model to downstream tasks.
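The pre-training objective can be sketched as a symmetric contrastive loss over a batch of paired image and text embeddings; the fixed temperature below is an assumption for illustration (CLIP actually learns it as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings (sketch)."""
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = image_embs @ text_embs.T / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))                   # the i-th image matches the i-th caption
    return (F.cross_entropy(logits, targets) +            # image -> text direction
            F.cross_entropy(logits.T, targets)) / 2       # text -> image direction
```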
ViT: Transformer for Image Recognition at Scale
An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

*author:Google Research, Brain Team (merged with DeepMind)
*original:https://arxiv.org/abs/2010.11929
Convolutional operations are usually limited in capturing global structure and long-range dependencies, so more parameters and deeper networks are needed to compensate. The researchers proposed an image recognition model based entirely on the Transformer, called ViT (Vision Transformer), which adopts the core ideas of the Transformer and can capture global information.
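Concretely, ViT splits each image into fixed-size patches (16×16 in the title), flattens them, and linearly projects them into a token sequence for a standard Transformer encoder. A minimal sketch of that patch-embedding step (position embeddings and the class token are omitted):

```python
import torch
import torch.nn as nn

def patchify(images: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split (B, C, H, W) images into a sequence of flattened patches (B, N, C*patch*patch)."""
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

images = torch.randn(2, 3, 224, 224)
tokens = nn.Linear(3 * 16 * 16, 768)(patchify(images))           # linear patch embedding
print(tokens.shape)                                              # torch.Size([2, 196, 768])
```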
Generative Agents: Interactive Simulation of Human Behavior
Generative Agents: Interactive Simulacra of Human Behavior

*author:Stanford University, Google DeepMind researchers
*original:https://arxiv.org/abs/2304.03442
To build generative agents, the researchers proposed an architecture that extends a large language model to store a complete record of an agent's experiences in natural language, gradually synthesize these memories into higher-level reflections, and retrieve them dynamically to plan behavior. By fusing large language models with computational, interactive agents, this study introduces architectural and interaction patterns that enable believable simulations of human behavior.
DPO: Direct Preference Optimization Algorithm
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

*author:Stanford University researchers
*original:https://arxiv.org/abs/2305.18290
The Direct Preference Optimization (DPO) algorithm proposed by the researchers is stable, performant, and computationally lightweight. It does not require fitting a reward model, sampling from the LM during fine-tuning, or extensive hyperparameter tuning. Experiments show that DPO can fine-tune LMs to align with human preferences, and that fine-tuning with DPO exceeds PPO-based RLHF (reinforcement learning from human feedback) at controlling the sentiment of generations.
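The algorithm reduces preference optimization to a single classification-style loss on preference pairs, computed against a frozen reference model instead of a learned reward model. A rough sketch:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss on a batch of preference pairs (illustrative sketch).

    Each argument is the summed log-probability of a full response under either the
    policy being trained or the frozen reference model.
    """
    chosen = beta * (policy_chosen_logp - ref_chosen_logp)       # implicit reward of the preferred answer
    rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen - rejected).mean()
```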
Consistency Model
Consistency Models

*author:OpenAI
*original:https://arxiv.org/abs/2303.01469
The consistency models proposed in this study are a new family of models that generate high-quality samples by directly mapping noise to data. They support fast one-step generation while still allowing multi-step sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without explicit training on these tasks.
Latent consistency models
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

*author:Researchers from Tsinghua University
*original:https://arxiv.org/abs/2310.04378
The latent consistency models (LCMs) proposed by the researchers enable fast inference with minimal steps on any pre-trained latent diffusion model (LDM), including Stable Diffusion (Rombach et al.). Experimental results show that, by efficiently distilling from a pre-trained classifier-free guided diffusion model, a high-quality 768×768 2-4-step LCM requires only about 32 A100 GPU hours of training.
LCM-LoRA: Universal Stable Diffusion Acceleration Module
LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

*author:Tsinghua University, Hugging Face
*original:https://arxiv.org/abs/2311.05556
This study further expands the potential of LCMs. First, the researchers extended the scope of LCM to large models with less memory consumption by applying LoRA to Stable-Diffusion models including SD-V1.5, SSD-1B, and SDXL, achieving superior image generation quality. Second, the researchers identified the LoRA parameters obtained by LCM distillation as a general Stable-Diffusion acceleration module and named it LCM-LoRA. LCM-LoRA can be directly plugged into various Stable-Diffusion fine-tuned models or LoRAs without training, and therefore represents a general accelerator suitable for diverse image generation tasks.
Chain-of-Note: Improving the robustness of retrieval-enhanced language models
Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models

*author:Tencent AI Lab
*original:https://arxiv.org/abs/2311.09210
The researchers proposed Chain-of-Noting (CoN), which can improve the robustness of retrieval-augmented language models (RALM) in the face of noisy, irrelevant documents and in handling unknown scenarios. CoN can generate sequential reading annotations for retrieved documents to thoroughly assess their relevance to a given question and integrate this information into the process of formulating the final answer.
Emerging Capabilities of Large Language Models
Emergent Abilities of Large Language Models

*author:Google Research, Stanford University, UNC, DeepMind
*original:https://arxiv.org/abs/2206.07682
The researchers proposed emergent capabilities of large language models, defining them as capabilities that are absent in smaller models but present in large models, measured by the amount of training computation and the number of model parameters.
Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q Functions
Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions

*author:Google DeepMind
*original:https://arxiv.org/abs/2309.10150
Researchers propose a scalable reinforcement learning method, Q-Transformer, for training multi-task policies that can leverage human demonstrations and autonomously collected data from large-scale offline datasets. The method uses Transformer to provide a scalable representation of the Q function, trained with offline temporal difference backup.
Llama Guard
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

*author:Meta GenAI
*original:https://arxiv.org/abs/2312.06674
Llama Guard is an LLM-based input-output safeguard model, fine-tuned from the Llama2-7b model on a dataset collected by Meta. Despite the small amount of data, it performs well on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, matching or outperforming currently available content moderation tools.
ReSTEM: Beyond Human Data
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

*author:Google DeepMind, Mila
*original:https://arxiv.org/abs/2312.06585
The researchers proposed an expectation-maximization-based self-training method, called ReSTEM, which samples from the model, filters the samples using binary feedback, fine-tunes on the filtered samples, and repeats this process several times. Using PaLM-2 models on MATH reasoning and APPS coding benchmarks, the researchers found that ReSTEM scales favorably with model size and significantly outperforms fine-tuning on human data alone.
Mixture of Experts (MoE) Explained

*source:Hugging Face
*original:https://huggingface.co/blog/moe
SPIN: Self-play fine-tuning converts weak language models into strong language models
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

*author:Researchers from UCLA and Tsinghua University
*original:https://arxiv.org/abs/2401.01335
The researchers proposed a new fine-tuning method called Self-Play Fine-Tuning (SPIN), whose core is a self-play mechanism: the language model generates training data from its own previous iteration and refines its policy by distinguishing these self-generated responses from responses obtained from human-annotated data.
Self-Instruct: Aligning language models with self-generated instructions
Self-Instruct: Aligning Language Models with Self-Generated Instructions

*author:University of Washington, etc.
*original:https://arxiv.org/abs/2212.10560
Self-Instruct, proposed by the researchers, uses content generated by a pre-trained language model itself to improve its own instruction-following ability. The researchers generate instruction, input, and output samples from the language model, filter out invalid or near-duplicate samples, and then use the remainder to fine-tune the original model. Applying this method to GPT-3 and evaluating on Super-NaturalInstructions yields a 33% absolute improvement over the original model, comparable to InstructGPT-001, which was trained with private user data and human annotations.
Follow the official account and reply "LLM Papers" to download the collection of papers.