Kuwain 1.5B: An Arabic SLM via Language Injection

Enhancing existing models with new knowledge is a crucial aspect of AI development. This paper introduces a novel method for integrating a new language into a large language model (LLM). Our approach successfully incorporates a previously unseen target language into an existing LLM without compromising its prior knowledge. We trained Kuwain, a tiny model with 1.5 billion parameters, by injecting the Arabic language into a small open-source model pretrained mainly on English. Our method yields significant gains in Arabic performance, with an average improvement of 8% across various benchmarks, while preserving the model's existing knowledge using only a small fraction of the original model's data. This offers a cost-effective alternative to training a comprehensive model in both English and Arabic. The results highlight the potential for efficient, targeted language model expansion without extensive retraining or resource-intensive processes.
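The abstract does not spell out the training recipe, but the core idea of preserving prior knowledge with only a small slice of the original data can be illustrated with a replay-style data mix for continued pretraining. The sketch below is a minimal illustration under that assumption; the function name `mix_corpora` and the `replay_fraction` value are hypothetical and not the paper's actual procedure.

```python
import random

def mix_corpora(new_lang_docs, original_docs, replay_fraction=0.2, seed=0):
    """Build a continued-pretraining mix: mostly new-language data,
    plus a small 'replay' slice of the original corpus to guard
    against catastrophic forgetting of prior knowledge."""
    rng = random.Random(seed)
    # Choose n_replay so that replay docs make up `replay_fraction`
    # of the final mix: r / (n + r) = f  =>  r = n * f / (1 - f).
    n_replay = int(len(new_lang_docs) * replay_fraction / (1 - replay_fraction))
    replay = rng.sample(original_docs, min(n_replay, len(original_docs)))
    mixed = list(new_lang_docs) + replay
    rng.shuffle(mixed)
    return mixed

# Toy usage: roughly 80% Arabic documents, 20% English replay.
arabic = [f"ar_doc_{i}" for i in range(8)]
english = [f"en_doc_{i}" for i in range(100)]
print(mix_corpora(arabic, english, replay_fraction=0.2))
```

A small replay fraction like this is one standard way to retain source-language performance during continued pretraining; the paper's exact mixing ratio and injection mechanism are not stated in the abstract.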