HyperAI

A Collection of Large Model Resources | 30 High-Quality NLP Datasets and Models, Plus 8 One-Click Demos, Worth Bookmarking!


Over the past two years, the popularity of large models has continued to climb, and they are being explored in an ever wider range of fields. As the industry develops rapidly, more and more open-source large models have entered the market, further fueling the growth of applications built on top of them.

For developers, selecting high-quality large models and datasets is crucial for subsequent research, development, and model fine-tuning. To make it easier to find and download models and datasets suited to your development needs, HyperAI has compiled the following large-model resources:

* High-quality public datasets: 15

* High-quality open source models: 15

* High-quality tutorial selection: 8

For more large model resources, please visit the official website: hyper.ai

Dataset Selection

1. Seq-Monkey Open Source Dataset 1.0

The Seq-Monkey dataset is used to train the Seq-Monkey model and covers areas including a general Chinese text corpus, an ancient-poetry translation corpus, and a text-generation corpus.

Direct use:https://hyper.ai/datasets/30139

2. IEPile Large-Scale Information Extraction Corpus 

IEPile is a large-scale, high-quality bilingual (Chinese and English) Information Extraction (IE) instruction fine-tuning dataset developed by Zhejiang University, covering multiple fields such as medicine and finance.

Direct use:https://hyper.ai/datasets/30064

3. LongAlign-10K Large Model Long Context Alignment Dataset 

LongAlign-10k was proposed by Tsinghua University to address the challenges large models face in long-context alignment tasks. It contains 10,000 long-instruction examples ranging from 8k to 64k tokens in length.

Direct use:https://hyper.ai/datasets/30247

4. Dianping Dataset

This dataset contains 4.4 million reviews or ratings from 540,000 users on 240,000 restaurants. It can be used for tasks such as recommendation systems, sentiment/opinion/review tendency analysis, etc.

Direct use:https://hyper.ai/datasets/29993

5. Amazon User Review Dataset

The dataset contains 7.2 million reviews or ratings from 1.42 million users on 520,000 products in more than 1,100 categories on the Amazon website. It can be used for tasks such as recommendation systems and sentiment/opinion/review tendency analysis.

Direct use:https://hyper.ai/datasets/30009

6. PD&CFT People’s Daily Chinese Reading Comprehension Dataset 

This is the first Chinese reading comprehension dataset, comprising two parts: People's Daily (PD) news articles and Children's Fairy Tale (CFT) stories.

Direct use:https://hyper.ai/datasets/29260

7. Toutiao Chinese Text Classification Dataset

This dataset is a classification dataset of Toutiao Chinese news (short text). The data source is Toutiao client. It contains 15 categories and 382,688 texts.

Direct use:https://hyper.ai/datasets/29517
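Public releases of this dataset are typically distributed as a plain-text file with "_!_"-separated fields. The layout below is an assumption based on common distributions of the dataset, so verify it against the copy you download; a minimal parsing sketch:

```python
# Sketch: parsing one line of the Toutiao short-text classification data.
# The "_!_"-separated field layout is assumed from common public releases
# of this dataset and may differ in other copies.

def parse_line(line: str) -> dict:
    news_id, category_code, category_name, title, keywords = line.strip().split("_!_")
    return {
        "id": news_id,
        "category_code": category_code,
        "category_name": category_name,
        "title": title,
        "keywords": keywords.split(",") if keywords else [],
    }

# Illustrative sample line (not a real record).
sample = "6552368441838272771_!_101_!_news_culture_!_An example headline_!_kw1,kw2"
parsed = parse_line(sample)
print(parsed["category_name"])  # -> news_culture
```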

8. FewJoint Benchmark Dataset 

This dataset comes from the iFlytek AIUI open platform. It contains corpus from real users and corpus constructed by experts (in a ratio of about 3:7), spanning 59 real domains, which makes it one of the most domain-diverse dialogue datasets currently available.

Direct use:https://hyper.ai/datasets/29239

9. PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification 

The dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in 6 different languages: French, Spanish, German, Chinese, Japanese, and Korean. All translation pairs are derived from examples in the PAWS-Wiki.

Direct use:https://hyper.ai/datasets/29264

10. Wikipedia

The dataset is built from a Wikipedia dump and contains 56 languages, one subset per language, and each subset contains one training split. Each example contains the content of a complete Wikipedia article, cleaned to remove markup and unwanted parts (references, etc.).

Direct use:https://hyper.ai/datasets/28528

11. RJUA-QA: The first Chinese medical specialty question answering reasoning dataset 

The RJUA-QA dataset contains a total of 2,132 question-answer pairs. Each question-answer pair consists of a question written by a doctor based on clinical experience, an answer provided by an expert, and related reasoning context. The context information is derived from the Chinese Guidelines for the Diagnosis and Treatment of Urological and Andrological Diseases.

Direct use:https://hyper.ai/datasets/28970

12. ShareGPT 90k Chinese and English bilingual human-machine question answering dataset 

ShareGPT-Chinese-English-90k is a high-quality human-machine question-answering dataset in parallel Chinese and English, covering user questions in real and complex scenarios. It can be used to train high-quality dialogue models.

Direct use:https://hyper.ai/datasets/29523
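Records in ShareGPT-style datasets are usually stored as JSON with a "conversations" list of alternating human and assistant turns. The field names below follow the widely used ShareGPT convention and are an assumption about this particular release; a sketch of reading one record and flattening it into training pairs:

```python
import json

# Sketch: one record in the ShareGPT-style conversation format.
# Field names ("conversations", "from", "value") follow the common
# ShareGPT convention; check them against the actual files.
record = json.loads("""
{
  "id": "example-0",
  "conversations": [
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "The capital of France is Paris."}
  ]
}
""")

# Flatten alternating turns into (prompt, response) pairs for fine-tuning.
pairs = [
    (record["conversations"][i]["value"], record["conversations"][i + 1]["value"])
    for i in range(0, len(record["conversations"]) - 1, 2)
    if record["conversations"][i]["from"] == "human"
]
print(pairs[0][1])  # -> The capital of France is Paris.
```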

13. SMP-2017 Chinese Conversation Intent Recognition Dataset

This is the dataset for Task 1 of the SMP2017 Chinese Human-Computer Dialogue Technology Evaluation (ECDT).

Direct use:https://hyper.ai/datasets/29515

14. Chinese-Poetry Chinese Classical Poetry Collection Database

This dataset is the most comprehensive database of Chinese classical literature, including 55,000 Tang shi poems, 260,000 Song shi poems, 21,000 Song ci lyric poems, and other classical works.

Direct use:https://hyper.ai/datasets/29257

15. MCFEND A multi-source benchmark dataset for Chinese fake news detection

This dataset is a multi-source Chinese fake news detection benchmark dataset jointly constructed by Hong Kong Baptist University, the Chinese University of Hong Kong and other institutions.

Direct use:https://hyper.ai/datasets/30429

For more public datasets, please visit:

https://hyper.ai/datasets

Large Model Selection

1. Mixtral-8x7B

This model is a sparse mixture-of-experts (SMoE) large language model released by Mistral AI, built on the Mistral 7B architecture.

Direct use:https://openbayes.com/console/public/models/f1Ze9ci0tAZ/1/overview

2. C4AI Command-R

C4AI Command-R is a high-performance generative model with 35 billion parameters jointly developed by Cohere and Cohere For AI. The combination of multilingual generation capabilities and high-performance RAG capabilities gives Command-R a unique advantage in cross-language tasks and knowledge-intensive tasks.

Direct use:https://openbayes.com/console/public/models/moNFtsf3XUe/1/overview

3. Financial Large Model Deepmoney-34B-chat

The model is trained on top of Yi-34B-200K in two stages: pt (full-parameter training) and sft (LoRA fine-tuning).

Direct use:https://openbayes.com/console/public/models/yiEoQipyFCK/1/overview
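The sft stage mentioned above uses LoRA, which freezes the base weights W and learns a low-rank update so that the effective weights become W + (alpha / r) * B @ A. A toy pure-Python illustration of that idea (all numbers are made up; real implementations use libraries such as PEFT):

```python
# Sketch: the core idea of LoRA fine-tuning. Instead of updating the full
# m x n weight matrix W, LoRA trains two small matrices A (r x n) and
# B (m x r) and applies W + (alpha / r) * (B @ A). Toy example with
# m = n = 2 and rank r = 1.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[1.0, 0.0], [0.0, 1.0]]      # frozen base weights (2 x 2)
B = [[1.0], [0.0]]                # m x r, trainable
A = [[0.0, 2.0]]                  # r x n, trainable
alpha, r = 1.0, 1

delta = matmul(B, A)              # rank-1 update
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(2)] for i in range(2)]
print(W_eff)  # -> [[1.0, 2.0], [0.0, 1.0]]
```

Because only A and B are trained, the number of trainable parameters drops from m*n to r*(m+n), which is why LoRA fine-tuning fits on much smaller hardware than full-parameter training.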

4. ChatGLM3 Series

ChatGLM3 is a conversational pre-trained model jointly released by Zhipu AI and the Tsinghua University KEG Laboratory.

ChatGLM3-6B

This model is the open-source member of the ChatGLM3 series, retaining many excellent features of the previous two generations, such as smooth conversation and a low deployment threshold.

Direct use:https://openbayes.com/console/public/models/mHwG5TYJVTU/1/overview

ChatGLM3-6B-Base

This model is the base model of ChatGLM3-6B, trained with more diverse data, more training steps, and a more reasonable training strategy.

Direct use:https://openbayes.com/console/public/models/7CzPfTweYvU/1/overview

5. LLaVA-v1.5 Series

LLaVA is a multimodal model that connects vision and language, consisting of a vision encoder and a large language model (Vicuna v1.5).

LLaVA-v1.5-7B

The model is a 7 billion parameter model from the LLaVA-v1.5 family.

Direct use:https://openbayes.com/console/public/models/ZRdv9aF1hGF/1/overview

LLaVA-v1.5-13B

The model is a 13 billion parameter model from the LLaVA-v1.5 family.

Direct use:https://openbayes.com/console/public/models/PagJNrY85MC/1/overview

6. Yi-34B series

The Yi series models are open-source large language models trained from scratch by 01.AI. The models below belong to its 34B size.

Yi-34B-chat

This model is from the Yi-34B series and is a chat model suitable for a variety of conversation scenarios.

Direct use:https://openbayes.com/console/public/models/6FUjDvKGZNT/1/overview

Yi-34B-Chat-GGUF

This model is the GGUF format of the Yi-34B-Chat.

Direct use:https://openbayes.com/console/public/models/1QqoTcU07zG/1/overview
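GGUF is the single-file binary format used by llama.cpp: the header begins with the 4-byte magic "GGUF" and a version number, followed by metadata and tensor data. A small sketch of the magic-byte check (the in-memory header here is fabricated purely for illustration):

```python
import io
import struct

# Sketch: checking the GGUF magic bytes. Real GGUF files continue with
# key-value metadata and tensor data after the header; we only look at
# the first four bytes here.

def looks_like_gguf(stream) -> bool:
    """Return True if the stream starts with the GGUF magic."""
    return stream.read(4) == b"GGUF"

# Fabricated minimal header: magic + little-endian uint32 version.
fake = io.BytesIO(b"GGUF" + struct.pack("<I", 3))
print(looks_like_gguf(fake))  # -> True
```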

Yi-34B-Chat-4bits

This model is a 4-bit quantized version of Yi-34B-Chat and can be run directly on consumer-grade graphics cards (such as the RTX 3090).

Direct use:https://openbayes.com/console/public/models/JJCjA8x48ev/1/overview
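To make the "4-bit quantized" label concrete: weights are mapped to 4-bit integers (a range like -8..7) with a shared scale, cutting memory roughly 4x versus 16-bit weights at some cost in precision. The toy per-tensor scheme below is only a sketch; production schemes such as AWQ or GPTQ quantize per group and are considerably more sophisticated:

```python
# Sketch: toy symmetric 4-bit quantization with one scale per tensor.
# Each weight becomes an integer code in -8..7; dequantizing multiplies
# the code by the scale, recovering the weight approximately.

def quantize_int4(weights):
    scale = max(abs(w) for w in weights) / 7.0  # map the largest weight to +/-7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

w = [0.42, -1.5, 0.07, 2.1]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
print(q)  # -> [1, -5, 0, 7]
print(max(abs(a - b) for a, b in zip(w, w_hat)))  # small reconstruction error
```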

7. Qwen Tongyi Qianwen Large Model Series

Qwen is a series of ultra-large-scale language models launched by Alibaba Cloud, available in a range of parameter sizes. The series includes Qwen (base pre-trained language models) and Qwen-Chat (chat models), with the chat models fine-tuned using human alignment techniques.

Qwen1.5-1.8B-Chat

Qwen1.5 is the beta version of Qwen2; this is the 1.8-billion-parameter chat model in that series.

Direct use:https://openbayes.com/console/public/models/A83bxItlb1M/1/overview

Qwen-14B-Chat-Int4

Qwen-14B-Chat is a chat model with 14 billion parameters in the Tongyi Qianwen large model series. This model is its Int4 quantized model.

Direct use:https://openbayes.com/console/public/models/tlA61MKMb7C/1/overview

Qwen-72B-Chat

This model is a 72 billion parameter model in the Tongyi Qianwen large model series.

Direct use:https://openbayes.com/console/public/models/IyhI1wCMCvU/1/overview

Qwen-72B-Chat-Int4

This model is the Int4 quantized model of Qwen-72B-Chat.

Direct use:https://openbayes.com/console/public/models/XVAkUec0H5e/1/overview

Qwen-72B-Chat-Int8

This model is the Int8 quantized model of Qwen-72B-Chat.

Direct use:https://openbayes.com/console/public/models/LEnvRTil8Xe/1/overview

High-quality tutorial selection

1. Run Qwen1.5-MoE online

Qwen1.5-MoE-A2.7B is the first MoE model in the Qwen series, launched by the Tongyi Qianwen team. This tutorial provides its demo container; clone it with one click and experience the large model through the Gradio link.

Run online:https://openbayes.com/console/public/tutorials/1xfftSx42TR
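In an MoE model such as Qwen1.5-MoE, a router scores a set of experts for each token, only the top-k experts are actually evaluated, and their outputs are combined with softmax weights; this is how the model keeps active compute far below its total parameter count. A toy sketch of that routing step with scalar "experts" (all values illustrative, not the model's real router):

```python
import math

# Sketch: top-k expert routing, the core of a Mixture-of-Experts layer.
# Real MoE layers route per token with learned scores over FFN experts;
# here the experts are simple scalar functions.

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def moe_forward(x, experts, router_scores, k=2):
    # Pick the k highest-scoring experts for this input.
    top = sorted(range(len(experts)), key=lambda i: router_scores[i], reverse=True)[:k]
    weights = softmax([router_scores[i] for i in top])
    # Only the selected experts run: this is the compute saving.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
scores = [0.1, 2.0, -1.0, 1.0]
y = moe_forward(3.0, experts, scores, k=2)
print(round(y, 3))  # weighted mix of the two selected experts
```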

2. Qwen-14B-Chat-Int4 Model Gradio Demo

This tutorial is a demo container for Qwen-14B-Chat-Int4. Clone it with one click and experience the large model through the Gradio link.

Run online:https://openbayes.com/console/public/tutorials/hqe2P86oMDA

3. Qwen-1.8B-Chat-API-FT Model Demo

This tutorial demonstrates how to run the Qwen-1.8B-Chat model and walks through the main fine-tuning workflow.

Run online:https://openbayes.com/console/public/tutorials/C8OUoAlBR1m

4. Qwen-72B-Chat-Int4 Model Gradio Demo

This tutorial is a demo container for Qwen-72B-Chat-Int4. Clone it with one click and experience the large model through the Gradio link.

Run online:https://openbayes.com/console/public/tutorials/Gr4tiYYq24K

5. Run the quantization model of Yi-34B-Chat online

This tutorial demonstrates the main workflow of running the Yi-34B-Chat quantized model with LlamaEdge.

Run online:https://openbayes.com/console/public/tutorials/v6ZVAzejUCM

6. Running the financial model Deepmoney-34B-full online

Deepmoney is a large language model project focused on financial investment. Deepmoney-34B-full is trained on top of the Yi-34B-200K model in two stages: pt (full-parameter training) and sft (LoRA fine-tuning). It can now be cloned and used on the official website.

Run online:https://openbayes.com/console/public/tutorials/uBYYEnxdpce

7. One-click to run Yi-9B Demo

Yi-9B is the model with the strongest code and mathematical capabilities in the Yi series. This tutorial is a demo container of Yi-9B.

Run online:https://openbayes.com/console/public/tutorials/BitjtzfuNLb

8. Quick deployment of ChatGLM2-6B

This tutorial is a demo container for ChatGLM2-6B. Clone it with one click and experience the large model through the Gradio link.

Run online:https://openbayes.com/console/public/tutorials/KD5azt9z9tn

That concludes this roundup of large model resources. If you have resources you would like to see featured on the hyper.ai official website, feel free to leave a message or submit an article to let us know!

About HyperAI

HyperAI (hyper.ai) is a leading artificial intelligence and high-performance computing community in China. We are committed to becoming the data science infrastructure for China and to providing rich, high-quality public resources for developers. So far, we have:

* Provided accelerated download nodes in China for 1,200+ public datasets

* Hosted 300+ classic and popular online tutorials

* Published interpretations of 100+ AI4Science paper cases

* Supported search for 500+ related terms

* Maintained China's first complete Apache TVM Chinese documentation

Visit the official website to start your learning journey:

https://hyper.ai