LawInstruct: The First Large-scale Dataset of Legal Instructions
Date: a year ago
Size: 9.84 GB
LawInstruct is the first large-scale instruction dataset for the legal domain. It was created jointly by Stanford University, Johns Hopkins University, and other institutions, and was released in April 2024. LawInstruct aims to fill the gaps in existing legal task datasets and accelerate the development of models for the legal domain.
- Dataset characteristics:
- Coverage: LawInstruct covers 17 jurisdictions and 24 languages, which gives the dataset broad applicability and diversity.
- Scale and diversity: It contains 12 million training examples covering a variety of legal tasks, such as question answering, entailment, summarization, and information extraction.
- Dataset structure:
- Each example is presented in a purpose-built instruction format, which keeps the data consistent and easy to work with.
- It integrates 58 high-quality annotated datasets from different legal tasks and professional fields.
- Technical Implementation:
- The authors used MultiLegalPile, a 689 GB multilingual legal corpus, as pre-training material for the model.
- Performance Improvements:
- Instruction tuning the Flan-T5 XL model on LawInstruct significantly improves its balanced accuracy on LegalBench, confirming the dataset's positive impact on model performance.
- Research and Papers:
- The related research was published in the paper "FLawN-T5: An Empirical Examination of Effective Instruction Tuning Data Mixtures for Legal Reasoning", which documents in detail the construction process and experimental results of the LawInstruct dataset.
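
The performance claim above is stated in terms of balanced accuracy, which is commonly defined as the mean of per-class recall. The sketch below illustrates that standard definition on made-up labels. It is not the evaluation code used by the LawInstruct authors or by LegalBench, and the function name `balanced_accuracy` and the toy labels are assumptions introduced only for this example.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall over the classes present in y_true.

    Hypothetical helper for illustration; not taken from the LawInstruct
    or LegalBench code bases.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    total = defaultdict(int)    # number of examples per true class
    correct = defaultdict(int)  # correctly predicted examples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall for each class, then an unweighted average across classes.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: a model that always answers "yes".
y_true = ["yes", "yes", "yes", "yes", "no"]
y_pred = ["yes", "yes", "yes", "yes", "yes"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, score)
```

On these toy labels the always-"yes" model reaches a plain accuracy of 0.8 but a balanced accuracy of only 0.5, because its recall on the "no" class is zero. This is why balanced accuracy is a useful metric when the classes in a legal classification task are unevenly distributed.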