DISC-Law-SFT: High-Quality Chinese Legal Supervised Fine-Tuning Dataset
DISC-Law-SFT is a high-quality supervised fine-tuning (SFT) dataset released in 2023 by the Data Intelligence and Social Computing Laboratory (Fudan-DISC) at Fudan University. It contains nearly 300,000 training samples and is used to train large language models (LLMs) for legal applications. The dataset is designed specifically for the Chinese legal domain and aims to improve a model's capabilities in legal text processing, legal reasoning, and knowledge retrieval and compliance in the judicial domain. The accompanying paper is "DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services".
The dataset contains two subsets: DISC-Law-SFT-Pair and DISC-Law-SFT-Triplet. DISC-Law-SFT-Pair teaches legal reasoning through instruction pairs constructed according to the legal syllogism. DISC-Law-SFT-Triplet strengthens the model's ability to use external knowledge through triplets that each contain an input, an output, and reference information.
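The difference between the two subsets can be sketched as two record types. This is a minimal illustration, not the dataset's published schema: the field names `input`, `output`, and `reference`, the JSON-lines layout, and the prompt template are all assumptions, and the helper names are hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class PairRecord:
    """An instruction pair: a prompt and its target answer (assumed field names)."""
    input: str
    output: str


@dataclass
class TripletRecord(PairRecord):
    """A triplet: an instruction pair plus reference passages (assumed field name)."""
    reference: List[str] = field(default_factory=list)


def parse_line(line: str) -> PairRecord:
    """Parse one JSON line; a 'reference' key marks the record as a triplet."""
    obj = json.loads(line)
    if "reference" in obj:
        return TripletRecord(obj["input"], obj["output"], list(obj["reference"]))
    return PairRecord(obj["input"], obj["output"])


def build_prompt(record: PairRecord) -> str:
    """Return the model prompt, prepending reference passages when present.

    The template is hypothetical; it only shows where external knowledge
    would enter the prompt for a triplet record.
    """
    refs = getattr(record, "reference", [])
    if not refs:
        return record.input
    context = "\n".join(f"[{i}] {r}" for i, r in enumerate(refs, start=1))
    return f"References:\n{context}\n\nQuestion:\n{record.input}"
```

A pair record yields its input unchanged as the prompt, while a triplet record yields the same input preceded by its numbered reference passages. In both cases `output` would serve as the training target.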
The data come from three main sources: public NLP datasets for judicial tasks in Chinese law, raw legal texts from real-world practice, and general-purpose open-source datasets. Combining these sources gives the dataset its diversity and breadth.