M2Lingual: A Multilingual, Multi-turn Instruction Fine-tuning Dataset
M2Lingual is a multilingual, multi-turn instruction fine-tuning (IFT) dataset designed to improve the instruction-following ability of large language models (LLMs), especially across diverse languages and tasks. The dataset was introduced in 2024 by a research team from ServiceNow and the University of Illinois at Chicago.
The main features of the M2Lingual dataset include:
- Multi-language coverage: M2Lingual spans 70 languages, adding training data for many low-resource languages.
- Multi-turn dialogue: Examples contain multiple turns of instructions and responses, strengthening the model's ability to handle complex dialogue scenarios.
- Task-oriented: M2Lingual covers 17 natural language processing (NLP) tasks, such as summarization and question answering, alongside general instruction-response pairs.
- Large scale: The dataset contains a total of 182,000 instruction fine-tuning pairs, providing rich training samples.
- Synthetic data: M2Lingual is a fully synthetic dataset generated with a specific evolutionary taxonomy, which ensures the diversity and complexity of the data.
- Performance improvements: LLMs fine-tuned on M2Lingual outperform those trained on existing multilingual IFT datasets across multiple evaluation benchmarks.
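To make the multi-turn structure above concrete, the sketch below shows how one such conversation record might be flattened into a single training string. The field names (`turns`, `role`, `content`) and the example content are illustrative assumptions, not M2Lingual's actual schema:

```python
# Illustrative sketch of a multi-turn IFT record. Field names ("turns",
# "role", "content") are assumptions for illustration only, not the
# dataset's real schema.

def flatten_conversation(record):
    """Join alternating user/assistant turns into one prompt string."""
    parts = []
    for turn in record["turns"]:
        prefix = "User: " if turn["role"] == "user" else "Assistant: "
        parts.append(prefix + turn["content"])
    return "\n".join(parts)

example = {
    "language": "hi",          # one of the 70 covered languages
    "task": "summarization",   # one of the 17 NLP tasks
    "turns": [
        {"role": "user", "content": "Summarize the article below ..."},
        {"role": "assistant", "content": "The article argues that ..."},
        {"role": "user", "content": "Now shorten it to one sentence."},
        {"role": "assistant", "content": "In one sentence: ..."},
    ],
}

print(flatten_conversation(example))
```

Flattening conversations this way is a common preprocessing step before tokenization in instruction fine-tuning pipelines, though the exact chat template depends on the model being trained.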
M2Lingual offers a new approach to multilingual, multi-turn instruction alignment, helping to improve the practicality and accuracy of large language models in multilingual settings.