
SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

Iman Barati, Mostafa Amiri, Heshaam Faili
Abstract

Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high-quality instruction datasets for SFT. Our approach begins with a limited set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Subsequently, domain-relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: https://github.com/mostafaamiri/SearchInstruct
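
The pipeline described in the abstract (seed questions, LLM-based expansion, retrieval, answer generation) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: `build_dataset`, `InstructionPair`, and the three pluggable callables are hypothetical names, and the toy stand-ins below replace the real LLM and retriever so the example runs offline. Keeping the seed questions in the output, deduplicating questions, and skipping questions with no retrieved documents are likewise assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InstructionPair:
    instruction: str
    response: str
    sources: List[str]


def build_dataset(
    seed_questions: List[str],
    expand: Callable[[str], List[str]],       # hypothetical: LLM-based question expansion
    retrieve: Callable[[str], List[str]],     # hypothetical: domain resource retrieval
    answer: Callable[[str, List[str]], str],  # hypothetical: grounded answer generation
) -> List[InstructionPair]:
    """Expand seed questions, retrieve context, and generate one answer per question."""
    pairs: List[InstructionPair] = []
    seen = set()
    for seed in seed_questions:
        # Assumption: the seed itself is kept alongside its expansions.
        for question in [seed] + expand(seed):
            key = question.strip().lower()
            if key in seen:
                continue  # drop duplicate questions
            seen.add(key)
            docs = retrieve(question)
            if not docs:
                continue  # assumption: skip questions with no supporting documents
            pairs.append(InstructionPair(question, answer(question, docs), docs))
    return pairs


# Toy stand-ins so the sketch runs without an LLM or a search backend.
CORPUS = ["Saffron is harvested in autumn.", "Rice needs standing water."]


def _words(text: str) -> set:
    return {w.strip("?.,").lower() for w in text.split()}


def toy_expand(question: str) -> List[str]:
    # Returns one variant plus a duplicate of the input to exercise deduplication.
    return [f"{question} Explain briefly.", question]


def toy_retrieve(question: str) -> List[str]:
    # Naive word-overlap retrieval over the toy corpus.
    return [doc for doc in CORPUS if _words(question) & _words(doc)]


def toy_answer(question: str, docs: List[str]) -> str:
    return "Based on: " + docs[0]


if __name__ == "__main__":
    for pair in build_dataset(
        ["When is saffron harvested?"], toy_expand, toy_retrieve, toy_answer
    ):
        print(pair.instruction, "->", pair.response)
```

Passing the three stages as callables keeps the sketch independent of any particular LLM or search API; the repository linked above is the reference for how the authors actually implement each stage.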