
Date

2 months ago

Dataset composition

Paper URL

2502.12301

License

CC BY 4.0

Tags

Machine learning

Translation

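SMOL (Set for Maximal Overall Leverage) is a professionally translated dataset released by Google in 2025. It is intended for training translation models for low-resource languages and for providing high-quality parallel data. The accompanying research paper is "SMOL: Professionally translated parallel data for 115 under-represented languages". The dataset contains professionally translated text in 221 languages, including Amharic, Swahili, and Afar, and also covers regional languages with little available data and languages that are rarely annotated. It spans a wide range of language pairs, with text contributed by professional translators and volunteers, and for some languages it adds specialized data from the medical domain and factuality annotations.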

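Dataset composition:

SmolDoc: document-level translations covering 130 language pairs (129 unique languages).
SmolSent: sentence-level translations covering 114 language pairs (116 unique languages).
GATITOS: word-level translations covering 181 language pairs (183 unique languages), used mainly as a multilingual lexicon.
SmolDoc-factuality-annotations: factuality annotations and rationales for 661 documents in SmolDoc.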

Citation

@misc{caswell2025smol,
title={{SMOL: Professionally translated parallel data for 115 under-represented languages}},
author={Isaac Caswell and Elizabeth Nielsen and Jiaming Luo and Colin Cherry and Geza Kovacs and Hadar Shemtov and Partha Talukdar and Dinesh Tewari and Baba Mamadi Diane and Koulako Moussa Doumbouya and Djibrila Diane and Solo Farabado Cissé and Edoardo Ferrante and Alessandro Guasoni and Mamadou K. Keita and Sudhamoy DebBarma and Ali Kuzhuget and David Anugraha and Muhammad Ravi Shulthan Habibi and Sina Ahmadi and Mingfei Lau and Jonathan Eng},
year={2025},
eprint={2502.12301},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.12301},
}
@inproceedings{jones-etal-2023-gatitos,
title = "{GATITOS}: Using a New Multilingual Lexicon for Low-resource Machine Translation",
author = "Jones, Alexander  and
Caswell, Isaac  and
Firat, Orhan  and
Saxena, Ishank",
editor = "Bouamor, Houda  and
Pino, Juan  and
Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.26/",
doi = "10.18653/v1/2023.emnlp-main.26",
pages = "371--405",
abstract = "Modern machine translation models and language models are able to translate without having been trained on parallel data, greatly expanding the set of languages that they can serve. However, these models still struggle in a variety of predictable ways, a problem that cannot be overcome without at least some trusted bilingual data. This work expands on a cheap and abundant resource to combat this problem: bilingual lexica. We test the efficacy of bilingual lexica in a real-world set-up, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements, and can be combined for even greater improvements; (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones, especially with larger models; and (4) we compare the efficacy of multilingual lexicon data versus human-translated parallel data. Based on results from (3), we develop and open-source GATITOS, a high-quality, curated dataset in 168 tail languages, one of the first human-translated resources to cover many of these languages."
}
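
The two entries above can be cited from a LaTeX document by their keys. The following is a minimal sketch, assuming the entries are saved in a file named `references.bib` (a hypothetical file name) and that the standard `plain` bibliography style is acceptable; the sentence in the body is only an illustration.

```latex
% Minimal sketch: cite the SMOL and GATITOS entries by their BibTeX keys.
% Assumption: both entries are stored in references.bib next to this file.
\documentclass{article}
\usepackage[utf8]{inputenc} % the author list contains a non-ASCII name

\begin{document}

We use the SMOL parallel data~\cite{caswell2025smol} together with the
GATITOS multilingual lexicon~\cite{jones-etal-2023-gatitos}.

\bibliographystyle{plain}   % assumption: any standard style could be used
\bibliography{references}   % file name without the .bib extension

\end{document}
```

With classic BibTeX, a typical build runs `pdflatex`, then `bibtex`, then `pdflatex` twice more so that the citation labels resolve. The `plain` style ignores the `eprint` and `archivePrefix` fields, so the arXiv identifier appears only if a style that supports those fields is chosen instead.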

