Zephyr: Direct Distillation of LM Alignment

We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.
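To make the dDPO step concrete, below is a minimal sketch of the standard DPO objective that this approach applies to the AI-ranked preference pairs. It is illustrative only: it assumes per-sequence log-probabilities under the trained policy and a frozen reference model have already been computed, and the function and argument names are hypothetical, not taken from the released codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities (summed over
    response tokens) under either the policy being trained or the frozen
    reference (here, the dSFT) model. ``beta`` controls how far the policy
    may drift from the reference.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a positive margin between preferred and dispreferred responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the preference pairs are fixed and ranked offline by the teacher model, this objective needs no reward model and no sampling during fine-tuning, which is what keeps the training cost to a few hours.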