LawInstruct: The First Large-scale Dataset of Legal Instructions

Date: a year ago
Size: 9.84 GB
Organization: Stanford University
Publish URL: huggingface.co

LawInstruct is the first large-scale instruction dataset for the legal field. It was created jointly by Stanford University, Johns Hopkins University, and other institutions, and was released in April 2024. LawInstruct aims to fill gaps in existing legal task datasets and to accelerate the development of models for the legal domain.

  1. Dataset characteristics:
    • Coverage: LawInstruct covers 17 jurisdictions and 24 languages, ensuring broad applicability and diversity of the dataset.
    • Scale and diversity: Contains 12 million training examples, covering a variety of legal tasks such as question answering, entailment, summarization, and information extraction.
  2. Dataset structure:
    • Each example is presented in a customized instruction format, so that examples drawn from different sources share a consistent structure.
    • It integrates 58 high-quality annotated datasets from different legal tasks and professional fields.
  3. Technical Implementation:
    • The authors used MultiLegalPile, a 689 GB multilingual legal corpus, as pre-training material for the model.
  4. Performance Improvements:
    • Instruction tuning on LawInstruct significantly improves the balanced accuracy of the Flan-T5 XL model on LegalBench, which supports the dataset's positive effect on model performance.
  5. Research and Papers:
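
The balanced accuracy mentioned under Performance Improvements can be illustrated with a small sketch. It assumes the standard definition (the mean of per-class recall); the exact evaluation code used for LegalBench is not shown on this page, and the labels below are invented for illustration only.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true.

    This is the standard definition; it is an assumption that the
    benchmark's own scoring matches it exactly.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")

    total = defaultdict(int)    # number of examples per true class
    correct = defaultdict(int)  # correctly predicted examples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy labels (hypothetical): four "yes" examples and one "no" example,
# scored against a model that always answers "yes".
y_true = ["yes", "yes", "yes", "yes", "no"]
y_pred = ["yes", "yes", "yes", "yes", "yes"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, score)
```

On imbalanced label sets, a model that always predicts the majority class can reach high plain accuracy while its balanced accuracy stays at chance level, which is why the metric suits classification-style legal tasks.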
LawInstruct.torrent
  • LawInstruct/
    • README.md
      2.09 KB
    • README.txt
      4.18 KB
    • data/
      • lawinstruct.zip
        9.84 GB
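
Because the internal layout of lawinstruct.zip is not described on this page, a cautious first step after downloading is to list the archive members before extracting nearly 10 GB. The sketch below uses only Python's standard zipfile module; the helper name `summarize_archive` and the local path are illustrative assumptions, not part of the dataset's tooling.

```python
import zipfile
from pathlib import Path


def summarize_archive(zip_path, limit=20):
    """Return (member name, uncompressed size in bytes) for up to `limit` members."""
    with zipfile.ZipFile(zip_path) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()[:limit]]


# Path taken from the file listing above; adjust it to wherever the
# torrent was actually saved (assumption).
ARCHIVE = Path("LawInstruct/data/lawinstruct.zip")

if ARCHIVE.exists():
    for name, size in summarize_archive(ARCHIVE):
        print(f"{size:>14,}  {name}")
else:
    print(f"{ARCHIVE} not found; download the dataset first")
```

Reading the member list only touches the archive's central directory, so it is fast even for a large file and shows which files hold the instruction examples before anything is unpacked.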