Firefly Chinese Llama2 Incremental Pre-training Dataset

Date

a year ago

Size

9.02 GB

Publish URL

huggingface.co

This dataset is the incremental pre-training data for the Firefly-LLaMA2-Chinese project. It totals about 22 GB of text, drawn mainly from open-source datasets such as CLUE, ThucNews, CNews, COIG, and Wikipedia, together with ancient poems, prose, and classical Chinese texts collected by the research team. The data distribution is shown in the figure below.

firefly-pretrain-dataset.torrent
Seeding: 1, Downloading: 1, Completed: 79, Total Downloads: 109
  • firefly-pretrain-dataset/
    • README.md (1.04 KB)
    • README.txt (2.09 KB)
    • data/
      • firefly-pretrain-dataset.zip (9.02 GB)
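
Before extracting a 9.02 GB archive, it can help to check what it contains. The following is a minimal sketch using Python's standard zipfile module. It assumes the torrent was saved to the current directory under the path shown in the file listing. The helper name summarize_zip is hypothetical, and the layout inside the archive is not documented here, so the sketch only enumerates whatever members it finds.

```python
import zipfile
from pathlib import Path


def summarize_zip(archive_path):
    """Return (member_name, uncompressed_bytes) for every file in the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        # Skip directory entries; only real files carry data.
        return [
            (info.filename, info.file_size)
            for info in zf.infolist()
            if not info.is_dir()
        ]


if __name__ == "__main__":
    # Path taken from the file listing; adjust it to your download location.
    archive = Path("firefly-pretrain-dataset/data/firefly-pretrain-dataset.zip")
    if archive.exists():
        members = summarize_zip(archive)
        total_gb = sum(size for _, size in members) / 1e9
        print(f"{len(members)} files, {total_gb:.2f} GB uncompressed")
        # Show only the first few members to keep the output short.
        for name, size in members[:10]:
            print(f"{size:>15,d}  {name}")
    else:
        print(f"Archive not found at {archive}")
```

Reading the member list touches only the archive's central directory, so it is fast even for a large file. The description cites about 22 GB of text, so plan for free disk space well beyond the 9.02 GB archive before extracting.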