A Collection of Large Model Resources | 30 High-Quality NLP Datasets and Models, Plus 8 One-Click Demos, Worth Bookmarking!

Over the past two years, large models have continued to grow in popularity and are being explored in an ever wider range of fields. As the industry develops rapidly, more and more open-source large models have entered the market, further driving the expansion of downstream applications.
For developers, choosing high-quality large models and datasets is crucial for subsequent research and model fine-tuning. To help everyone find and download models and datasets suited to their development needs, HyperAI has compiled the following large-model resources:
* High-quality public datasets: 15
* High-quality open source models: 15
* High-quality tutorial selection: 8
For more large model resources, please visit the official website: hyper.ai
Dataset Selection
1. Seq-Monkey (Sequence Monkey) Open-Source Dataset 1.0
The Seq-Monkey dataset is used to train the Seq-Monkey model and covers Chinese general-purpose text, ancient-poetry translation, and text-generation corpora.
Direct use: https://hyper.ai/datasets/30139
2. IEPile Large-Scale Information Extraction Corpus
IEPile is a large-scale, high-quality bilingual (Chinese and English) Information Extraction (IE) instruction fine-tuning dataset developed by Zhejiang University, covering multiple fields such as medicine and finance.
Direct use: https://hyper.ai/datasets/30064
3. LongAlign-10K Large Model Long Context Alignment Dataset
LongAlign-10k, proposed by Tsinghua University, is a dataset designed to address the challenges large models face in long-context alignment tasks. It contains 10,000 long-instruction samples ranging from 8k to 64k tokens in length.
Direct use: https://hyper.ai/datasets/30247
4. Dianping Dataset
This dataset contains 4.4 million reviews or ratings from 540,000 users of 240,000 restaurants. It can be used for tasks such as recommender systems and sentiment/opinion/review analysis.
Direct use: https://hyper.ai/datasets/29993
5. Amazon User Review Dataset
The dataset contains 7.2 million reviews or ratings from 1.42 million users of 520,000 products across more than 1,100 categories on Amazon. It can be used for tasks such as recommender systems and sentiment/opinion/review analysis.
Direct use: https://hyper.ai/datasets/30009
6. PD&CFT People’s Daily Chinese Reading Comprehension Dataset
This is the first Chinese reading-comprehension dataset, comprising two parts: People's Daily news articles (PD) and Children's Fairy Tales (CFT).
Direct use: https://hyper.ai/datasets/29260
7. Toutiao Chinese Text Classification Dataset
This dataset is a classification dataset of Toutiao Chinese news (short texts) collected from the Toutiao client. It contains 382,688 texts across 15 categories.
Direct use: https://hyper.ai/datasets/29517
8. FewJoint Benchmark Dataset
This dataset comes from the iFlytek AIUI open platform. It contains corpora from real users and corpora constructed by experts (in a ratio of about 3:7), spanning 59 real domains, which makes it one of the dialogue datasets with the broadest domain coverage available.
Direct use: https://hyper.ai/datasets/29239
9. PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification
The dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six languages: French, Spanish, German, Chinese, Japanese, and Korean. All translation pairs are derived from examples in PAWS-Wiki.
Direct use: https://hyper.ai/datasets/29264
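For programmatic access, PAWS-X is also mirrored on the Hugging Face Hub. Below is a minimal sketch of loading the Chinese configuration with the `datasets` library; the `paws-x` Hub ID and field names are assumptions based on the public mirror, so verify them against the copy you download:

```python
from datasets import load_dataset

# Load the Chinese configuration; other configs include en, de, es, fr, ja, ko.
paws_zh = load_dataset("paws-x", "zh")

# Each example is a sentence pair with a binary paraphrase label (1 = paraphrase).
sample = paws_zh["train"][0]
print(sample["sentence1"], sample["sentence2"], sample["label"])
```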
10. Wikipedia
The dataset is built from a Wikipedia dump and covers 56 languages, with one subset per language, each containing a single training split. Each example contains the full text of one Wikipedia article, cleaned to remove markup and unwanted sections (references, etc.).
Direct use: https://hyper.ai/datasets/28528
11. RJUA-QA: The First Chinese Medical-Specialty Question-Answering and Reasoning Dataset
The RJUA-QA dataset contains 2,132 question-answer pairs. Each pair consists of a question written by a doctor based on clinical experience, an answer provided by an expert, and related reasoning context drawn from the Chinese Guidelines for the Diagnosis and Treatment of Urological and Andrological Diseases.
Direct use: https://hyper.ai/datasets/28970
12. ShareGPT-90k Chinese-English Bilingual Human-Machine Question-Answering Dataset
ShareGPT-Chinese-English-90k is a high-quality parallel Chinese-English human-machine question-answering dataset covering user questions from real, complex scenarios. It can be used to train high-quality dialogue models.
Direct use: https://hyper.ai/datasets/29523
13. SMP-2017 Chinese Conversation Intent Recognition Dataset
This dataset is the SMP2017 Chinese Human-Computer Dialogue Technology Evaluation (ECDT) Task 1 dataset.
Direct use: https://hyper.ai/datasets/29515
14. Chinese-Poetry Chinese Classical Poetry Collection Database
This dataset is the most comprehensive database of Chinese classical literature, including 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci (lyric poems), among other classical works.
Direct use: https://hyper.ai/datasets/29257
15. MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection
This dataset is a multi-source Chinese fake news detection benchmark jointly constructed by Hong Kong Baptist University, The Chinese University of Hong Kong, and other institutions.
Direct use: https://hyper.ai/datasets/30429
For more public datasets, please visit: https://hyper.ai/datasets
Large Model Selection
1. Mixtral-8x7B
Mixtral-8x7B is a sparse mixture-of-experts (MoE) large language model released by Mistral AI and built on the Mistral 7B architecture.
Direct use: https://openbayes.com/console/public/models/f1Ze9ci0tAZ/1/overview
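For reference, here is a minimal sketch of loading Mixtral with the Hugging Face `transformers` library, assuming the public `mistralai/Mixtral-8x7B-Instruct-v0.1` checkpoint; the full model needs substantial GPU memory, so `device_map="auto"` spreads the weights across available devices:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # public Hub checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer(
    "Explain mixture-of-experts in one sentence.", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```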
2. C4AI Command-R
C4AI Command-R is a high-performance generative model with 35 billion parameters jointly developed by Cohere and Cohere For AI. The combination of multilingual generation capabilities and high-performance RAG capabilities gives Command-R a unique advantage in cross-language tasks and knowledge-intensive tasks.
Direct use: https://openbayes.com/console/public/models/moNFtsf3XUe/1/overview
3. Financial Large Model Deepmoney-34B-chat
This model was trained on Yi-34B-200K in two stages: pt (full-parameter training) and sft (LoRA fine-tuning).
Direct use: https://openbayes.com/console/public/models/yiEoQipyFCK/1/overview
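For readers unfamiliar with the sft stage mentioned above, LoRA fine-tuning freezes the base weights and trains only small low-rank adapter matrices. A minimal sketch with the `peft` library is shown below; the rank, alpha, and target modules are illustrative assumptions, not Deepmoney's actual training configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model for illustration; Deepmoney starts from Yi-34B-200K.
# Loading a 34B model requires multiple GPUs; this only shows the LoRA setup.
model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-34B-200K", device_map="auto")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update (assumption)
    lora_alpha=16,                        # scaling factor (assumption)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```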
4. ChatGLM3 Series
ChatGLM3 is a conversation pre-training model jointly released by Zhipu AI and Tsinghua University KEG Laboratory.
ChatGLM3-6B
This model is the open-source conversational model in the ChatGLM3 series. It retains many excellent features of the previous two generations, such as smooth conversation and a low deployment threshold.
Direct use: https://openbayes.com/console/public/models/mHwG5TYJVTU/1/overview
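That low deployment threshold comes down to a few lines of `transformers` code. Here is a minimal sketch following the usage published in the official THUDM repository; the `THUDM/chatglm3-6b` Hub ID is an assumption here, and `trust_remote_code=True` is needed because the model ships custom modeling code:

```python
from transformers import AutoModel, AutoTokenizer

model_id = "THUDM/chatglm3-6b"  # official Hub ID (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).half().cuda().eval()

# ChatGLM3 exposes a convenience chat() method that manages dialogue history.
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```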
ChatGLM3-6B-Base
This model is the base model of ChatGLM3-6B, trained with more diverse data, more training steps, and a more reasonable training strategy.
Direct use: https://openbayes.com/console/public/models/7CzPfTweYvU/1/overview
5. LLaVA-v1.5 Series
LLaVA is a multimodal vision-language model consisting of a visual encoder and a large language model (Vicuna v1.5 13B).
LLaVA-v1.5-7B
This is the 7-billion-parameter model in the LLaVA-v1.5 family.
Direct use: https://openbayes.com/console/public/models/ZRdv9aF1hGF/1/overview
LLaVA-v1.5-13B
This is the 13-billion-parameter model in the LLaVA-v1.5 family.
Direct use: https://openbayes.com/console/public/models/PagJNrY85MC/1/overview
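As a reference for how such a vision-language model is queried, here is a minimal sketch using the LLaVA integration in `transformers`; the `llava-hf/llava-1.5-7b-hf` Hub ID, the image path, and the USER/ASSISTANT prompt format are assumptions based on the publicly converted weights:

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # community conversion on the Hub (assumption)
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("example.jpg")  # placeholder path: any local RGB image
# The <image> token marks where the visual features are inserted into the prompt.
prompt = "USER: <image>\nDescribe this image briefly. ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(outputs[0], skip_special_tokens=True))
```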
6. Yi-34B series
The Yi series models are open-source large language models trained from scratch by 01.AI. The following are the models in its 34B size class.
Yi-34B-chat
This model is from the Yi-34B series and is a chat model suitable for a variety of conversation scenarios.
Direct use: https://openbayes.com/console/public/models/6FUjDvKGZNT/1/overview
Yi-34B-Chat-GGUF
This model is the GGUF-format version of Yi-34B-Chat.
Direct use: https://openbayes.com/console/public/models/1QqoTcU07zG/1/overview
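GGUF is the weight format used by llama.cpp-compatible runtimes, which makes CPU and mixed CPU/GPU inference straightforward. A minimal sketch with the `llama-cpp-python` bindings follows; the quantization filename below is a placeholder for whichever .gguf file you download:

```python
from llama_cpp import Llama

# Path to a downloaded GGUF file; the exact quantization name is a placeholder.
llm = Llama(model_path="yi-34b-chat.Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=-1)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "用一句话介绍你自己。"}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])
```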
Yi-34B-Chat-4bits
This model is a 4-bit quantized version of Yi-34B-Chat and can run directly on consumer-grade graphics cards (such as the RTX 3090).
Direct use: https://openbayes.com/console/public/models/JJCjA8x48ev/1/overview
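Loading the 4-bit checkpoint through `transformers` looks the same as loading the full model, because the quantization configuration ships with the weights. The sketch below assumes the public `01-ai/Yi-34B-Chat-4bits` checkpoint and that the matching quantization backend (e.g. AutoAWQ) is installed:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "01-ai/Yi-34B-Chat-4bits"  # public quantized checkpoint (assumption)
# The 4-bit quantization config is read from the checkpoint itself,
# so no extra quantization arguments are needed here.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# At roughly 4 bits per weight, ~34B parameters take about 20 GB,
# which fits in the 24 GB of an RTX 3090.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```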
7. Qwen Tongyi Qianwen Large Model Series
Qwen (Tongyi Qianwen) is a series of large language models released by Alibaba Cloud, available in a range of parameter sizes. It includes Qwen (the base pre-trained language model) and Qwen-Chat (the chat model, which is fine-tuned with human-alignment techniques).
Qwen1.5-1.8B-Chat
Qwen1.5 is the beta version of Qwen2; this is the smaller chat model in the series, with 1.8 billion parameters.
Direct use: https://openbayes.com/console/public/models/A83bxItlb1M/1/overview
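Qwen1.5 follows the standard `transformers` chat-template workflow with no custom code required. A minimal sketch assuming the public `Qwen/Qwen1.5-1.8B-Chat` checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-1.8B-Chat"  # public Hub checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
# The tokenizer's built-in chat template formats the conversation for the model.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```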
Qwen-14B-Chat-Int4
Qwen-14B-Chat is the 14-billion-parameter chat model in the Tongyi Qianwen large model series; this is its Int4-quantized version.
Direct use: https://openbayes.com/console/public/models/tlA61MKMb7C/1/overview
Qwen-72B-Chat
This model is the 72-billion-parameter chat model in the Tongyi Qianwen large model series.
Direct use: https://openbayes.com/console/public/models/IyhI1wCMCvU/1/overview
Qwen-72B-Chat-Int4
This model is the Int4-quantized version of Qwen-72B-Chat.
Direct use: https://openbayes.com/console/public/models/XVAkUec0H5e/1/overview
Qwen-72B-Chat-Int8
This model is the Int8-quantized version of Qwen-72B-Chat.
Direct use: https://openbayes.com/console/public/models/LEnvRTil8Xe/1/overview
High-Quality Tutorial Selection
1. Run Qwen1.5-MoE Online
Qwen1.5-MoE-A2.7B is the first MoE model in the Qwen series, released by the Tongyi Qianwen team. This tutorial provides its demo container: clone it with one click and experience the large model through the Gradio link.
Run online: https://openbayes.com/console/public/tutorials/1xfftSx42TR
2. Qwen-14B-Chat-Int4 Model Gradio Demo
This tutorial provides a demo container for Qwen-14B-Chat-Int4: clone it with one click and experience the large model through the Gradio link.
Run online: https://openbayes.com/console/public/tutorials/hqe2P86oMDA
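The Gradio demos in these tutorials all follow the same basic pattern: wrap the model's chat function in a web UI and expose a shareable link. Below is a generic sketch of that pattern; it is not the tutorial's actual code, and `my_model_chat` is a hypothetical stand-in for whatever chat function your loaded model exposes:

```python
import gradio as gr

def my_model_chat(message, history):
    # Hypothetical stand-in: call your loaded model here,
    # e.g. response, _ = model.chat(tokenizer, message, history=[]).
    return f"Echo: {message}"

# ChatInterface renders the chat UI and manages history;
# share=True prints a temporary public Gradio link.
gr.ChatInterface(my_model_chat).launch(share=True)
```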
3. Qwen-1.8B-Chat-API-FT Model Demo
This tutorial demonstrates how to run the Qwen-1.8B-Chat model and walks through the main fine-tuning workflow.
Run online: https://openbayes.com/console/public/tutorials/C8OUoAlBR1m
4. Qwen-72B-Chat-Int4 Model Gradio Demo
This tutorial provides a demo container for Qwen-72B-Chat-Int4: clone it with one click and experience the large model through the Gradio link.
Run online: https://openbayes.com/console/public/tutorials/Gr4tiYYq24K
5. Run the Quantized Yi-34B-Chat Model Online
This tutorial demonstrates the main workflow of running the quantized Yi-34B-Chat model with LlamaEdge.
Run online: https://openbayes.com/console/public/tutorials/v6ZVAzejUCM
6. Run the Financial Model Deepmoney-34B-full Online
Deepmoney is a large language model project focused on financial investment. Deepmoney-34B-full was trained on the Yi-34B-200K model in two stages: pt (full-parameter training) and sft (LoRA fine-tuning). It can now be cloned and used on the HyperAI official website.
Run online: https://openbayes.com/console/public/tutorials/uBYYEnxdpce
7. Run the Yi-9B Demo with One Click
Yi-9B is the model with the strongest coding and mathematical capabilities in the Yi series. This tutorial provides its demo container.
Run online: https://openbayes.com/console/public/tutorials/BitjtzfuNLb
8. Quickly Deploy ChatGLM2-6B
This tutorial provides a demo container for ChatGLM2-6B: clone it with one click and experience the large model through the Gradio link.
Run online: https://openbayes.com/console/public/tutorials/KD5azt9z9tn
That's all of this issue's large-model picks from the editor. If you have resources you would like to see included on the hyper.ai official website, feel free to leave a message or submit a contribution!
About HyperAI
HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the data-science infrastructure of China and to providing rich, high-quality public resources for developers. So far, we have:
* Provided accelerated download nodes in China for 1,200+ public datasets
* Published 300+ classic and popular online tutorials
* Interpreted 100+ AI4Science paper cases
* Supported searches for 500+ related terms
* Hosted the first complete Chinese documentation for Apache TVM in China
Visit the official website to start your learning journey: hyper.ai