MusicPile Large Music Dataset
MusicPile is a large-scale music-language pre-training dataset released jointly by the Multimodal Art Projection Research Community, Skywork AI, and the Hong Kong University of Science and Technology. It contains 5.17 million samples totaling approximately 4.16 billion tokens, drawn from online corpora, encyclopedias, music books, YouTube music subtitles, works in ABC notation, mathematical content, and code. Each record has three fields: id, text, and src, and each text is limited to 2,048 tokens. MusicPile covers general music knowledge, question-and-answer material, and typical music theory content, and is intended to improve the music understanding and composition capabilities of large language models.
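
The record layout described above can be illustrated with a minimal Python sketch. It assumes records are plain dictionaries with the three stated fields. The sample records, the src labels, and the whitespace-based count_tokens helper are hypothetical stand-ins for illustration, not the dataset's actual values or tokenizer. The last line derives the average sample length implied by the stated totals.

```python
from collections import Counter

REQUIRED_FIELDS = ("id", "text", "src")  # fields named in the description
MAX_TOKENS = 2048                        # per-text limit named in the description


def count_tokens(text):
    # Assumption: whitespace splitting stands in for the real tokenizer,
    # which the description does not specify.
    return len(text.split())


def is_valid(record):
    # A record is valid if it has all three fields and its text fits the limit.
    return (
        all(field in record for field in REQUIRED_FIELDS)
        and count_tokens(record["text"]) <= MAX_TOKENS
    )


def source_counts(records):
    # Tally valid records per source label.
    return Counter(r["src"] for r in records if is_valid(r))


# Hypothetical records; the src labels are invented for illustration.
samples = [
    {"id": 0, "text": "X:1\nT:Example\nK:C\nCDEF GABc|", "src": "abc-notation"},
    {"id": 1, "text": "A major scale has seven distinct pitches.", "src": "music-book"},
    {"id": 2, "text": "word " * 3000, "src": "web"},  # exceeds the token limit
    {"id": 3, "text": "this record is missing the src field"},
]

counts = source_counts(samples)

# Average tokens per sample implied by the stated totals
# (approximately 4.16 billion tokens over 5.17 million samples).
avg_tokens = 4.16e9 / 5.17e6

print(dict(counts))
print(round(avg_tokens))
```

By the stated totals, a sample averages roughly 805 tokens, well under the 2,048-token limit.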