M2Lingual: A Multilingual, Multi-turn Instruction Fine-tuning Dataset
M2Lingual is a multilingual, multi-turn instruction fine-tuning (IFT) dataset designed to improve the instruction-following ability of large language models (LLMs), especially across diverse languages and tasks. The dataset was introduced in 2024 by a research team from ServiceNow and the University of Illinois at Chicago.
The main features of the M2Lingual dataset include:
- Multi-language coverage: M2Lingual spans 70 languages, adding training data for many low-resource languages.
- Multi-turn dialogue: Examples contain multiple turns of instructions and responses, strengthening the model's ability to handle complex dialogue scenarios.
- Task-oriented: M2Lingual covers 17 natural language processing (NLP) tasks, such as summarization and question answering, alongside general instruction-response pairs.
- Large scale: The dataset contains a total of 182,000 instruction fine-tuning pairs, providing rich training samples.
- Synthetic data: M2Lingual is a fully synthetic dataset generated with a specific evolutionary taxonomy, which ensures the diversity and complexity of the data.
- Performance improvements: LLMs fine-tuned on M2Lingual outperform those trained on existing multilingual IFT datasets across multiple evaluation benchmarks.
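To make the multi-turn structure above concrete, the sketch below shows how one such conversation record might be flattened into a single training string. The field names (`turns`, `role`, `content`) and the example content are illustrative assumptions, not M2Lingual's actual schema:

```python
# Illustrative sketch of a multi-turn IFT record. Field names ("turns",
# "role", "content") are assumptions for illustration only, not the
# dataset's real schema.

def flatten_conversation(record):
    """Join alternating user/assistant turns into one prompt string."""
    parts = []
    for turn in record["turns"]:
        prefix = "User: " if turn["role"] == "user" else "Assistant: "
        parts.append(prefix + turn["content"])
    return "\n".join(parts)

example = {
    "language": "hi",          # one of the 70 covered languages
    "task": "summarization",   # one of the 17 NLP tasks
    "turns": [
        {"role": "user", "content": "Summarize the article below ..."},
        {"role": "assistant", "content": "The article argues that ..."},
        {"role": "user", "content": "Now shorten it to one sentence."},
        {"role": "assistant", "content": "In one sentence: ..."},
    ],
}

print(flatten_conversation(example))
```

Flattening conversations this way is a common preprocessing step before tokenization in instruction fine-tuning pipelines, though the exact chat template depends on the model being trained.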
M2Lingual offers a new approach to multilingual, multi-turn instruction alignment, helping to improve the practicality and accuracy of large language models in multilingual settings.