HyperAIHyperAI

Command Palette

Search for a command to run...

SMOL Multilingual Translation Parallel Dataset

Date

an hour ago

Organization

Google

Paper URL

2502.12301

License

CC BY 4.0

SMOL (Set for Maximal Overall Leverage) is a professional translation dataset released by Google in 2025. It aims to train translation models for low-resource languages and provide high-quality parallel data. Related research papers include... SMOL: Professionally translated parallel data for 115 under-represented languages . This dataset includes professionally translated texts in 221 languages, including Amharic, Swahili, and Afar, as well as less commonly annotated languages/regional languages with scarce data. It covers a wide range of language pairs, including texts contributed by professional translators and volunteers, and adds vertical data and factual annotations from the medical field for some languages.

Dataset composition:

  • SmolDoc: Document-level translation, covering 130 language pairs (129 independent languages);
  • SmolSent: Sentence-level translation, covering 114 language pairs (116 independent languages);
  • GATITOS: A word-level translation tool covering 181 language pairs (183 independent languages), primarily used as a multilingual dictionary;
  • SmolDoc-factuality-annotations: Factual annotations and reasoning for 661 documents in SmolDoc.

Citations

@misc{caswell2025smol,
title={{SMOL: Professionally translated parallel data for 115 under-represented languages}},
author={Isaac Caswell and Elizabeth Nielsen and Jiaming Luo and Colin Cherry and Geza Kovacs and Hadar Shemtov and Partha Talukdar and Dinesh Tewari and Baba Mamadi Diane and Koulako Moussa Doumbouya and Djibrila Diane and Solo Farabado Cissé and Edoardo Ferrante and Alessandro Guasoni and Mamadou K. Keita and Sudhamoy DebBarma and Ali Kuzhuget and David Anugraha and Muhammad Ravi Shulthan Habibi and Sina Ahmadi and Mingfei Lau and Jonathan Eng},
year={2025},
eprint={2502.12301},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.12301},
}
@inproceedings{jones-etal-2023-gatitos,
title = {{"GATITOS: Using a New Multilingual Lexicon for Low-resource Machine Translation"}},
author = "Jones, Alexander  and
Caswell, Isaac  and
Firat, Orhan  and
Saxena, Ishank",
editor = "Bouamor, Houda  and
Pino, Juan  and
Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.26/",
doi = "10.18653/v1/2023.emnlp-main.26",
pages = "371--405",
abstract = "Modern machine translation models and language models are able to translate without having been trained on parallel data, greatly expanding the set of languages that they can serve. However, these models still struggle in a variety of predictable ways, a problem that cannot be overcome without at least some trusted bilingual data. This work expands on a cheap and abundant resource to combat this problem: bilingual lexica. We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements, and can be combined for even greater improvements; (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones, especially with larger models; and (4) we compare the efficacy of multilingual lexicon data versus human-translated parallel data. Based on results from (3), we develop and open-source GATITOS, a high-quality, curated dataset in 168 tail languages, one of the first human-translated resources to cover many of these languages."
}

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing

HyperAI Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp