Firefly Chinese Llama2 Incremental Pre-training Dataset
Date
a year ago
Size
9.02 GB
This dataset is the incremental pre-training data from the Firefly-LLaMA2-Chinese project. It totals about 22 GB of text and consists mainly of open-source datasets such as CLUE, ThucNews, CNews, COIG, and Wikipedia, together with ancient poems, prose, classical Chinese texts, and other material collected by the research team. The data distribution is shown in the figure below.

firefly-pretrain-dataset.torrent
Seeding: 1, Downloading: 1, Completed: 79, Total Downloads: 109