HyperAI

Weekly Editor's Picks | FewJoint Benchmark Dataset Launched, Ministry of Science and Technology Supervision Department Releases New AI Regulations

a year ago
Information
zhaorui

Few-shot learning refers to the ability to learn and master new tasks from very few samples, much as humans do. The field has become a hot topic in the machine learning community and is considered one of the key directions for bringing machine intelligence closer to human intelligence. Harbin Institute of Technology launched the FewJoint benchmark dataset, which provides a public benchmark for few-shot NLP evaluation. This dataset is now available on hyper.ai, along with more NLP datasets for training Chinese large models. Let's take a look!

Updates on the hyper.ai official website from January 29 to February 2:

* High-quality public datasets: 10

* AI4S paper cases: 3

* Popular encyclopedia entries: 10

Visit the official website: hyper.ai

Selected public datasets

1. FewJoint few-shot benchmark dataset

The FewJoint benchmark dataset combines real user utterances and expert-constructed utterances from the iFlytek AIUI open platform (in a ratio of approximately 3:7). It covers 59 real domains, making it one of the dialogue datasets with the most domains currently available.

Direct use:

https://hyper.ai/datasets/29239

2. 100 PoisonMpts Chinese large model governance dataset

100 PoisonMpts is the industry's first open-source governance dataset for Chinese large language models. Dozens of well-known experts and scholars formed the first group of "100 bottles of poison for AI" annotation engineers. Each annotator poses 100 tricky questions designed to induce biased or discriminatory answers, then annotates the large model's responses, completing an attack-and-defense cycle with AI from "poisoning" to "detoxification".

Direct use:

https://hyper.ai/datasets/29203

3. CLUE Chinese Language Understanding Evaluation Benchmark Dataset

CLUE (A Chinese Language Understanding Evaluation Benchmark) is a dataset used for training, validating, and testing models on Chinese language understanding tasks.

Direct use:

https://hyper.ai/datasets/29094

4. Wikipedia dataset

This dataset is built from Wikipedia dumps, with one subset per language and a single split per subset. Each example contains the content of a complete Wikipedia article, cleaned to remove markup and unwanted sections (such as "references").

Direct use:

https://hyper.ai/datasets/28528

5. CCI Chinese Internet Corpus

The Chinese Corpora Internet (CCI) is drawn from high-quality, trustworthy Internet sources in mainland China. It has undergone rigorous data cleaning and deduplication, along with targeted testing and filtering for content quality.

Direct use:

https://hyper.ai/datasets/29186

6. PKU Simplified Chinese word segmentation dataset

The SIGHAN 2005 dataset comes from the International Chinese Automatic Word Segmentation Evaluation (SIGHAN Evaluation for short) and integrates word segmentation datasets from multiple institutions. It was jointly released by Microsoft Research China, Peking University, City University of Hong Kong, and Academia Sinica in Taiwan, and is used for training and evaluating Chinese word segmentation models. PKU is a Simplified Chinese word segmentation dataset within this collection.

Direct use:

https://hyper.ai/datasets/29168

7. Chinese-Poetry The most comprehensive database of Chinese classical poetry

This dataset is the most complete database of classical Chinese literature, including 55,000 Tang poems, 260,000 Song poems (shi), 21,000 Song lyrics (ci), and other classical works. The authors include nearly 14,000 poets from the Tang and Song dynasties and about 1,500 ci lyricists from the Song Dynasty. The data comes from the Internet.

Direct use:

https://hyper.ai/datasets/29257

8. PD&CFT Chinese reading comprehension dataset

This is the first Chinese reading comprehension dataset. It includes text from People's Daily and Children's Fairy Tale (PD&CFT).

Direct use:

https://hyper.ai/datasets/29260

For more updated datasets this week, please visit:

https://hyper.ai/datasets

ScienceAI Selected Case Studies

1. The accuracy of early diagnosis of Parkinson's disease has been improved to 90.2%. Shenzhen Institute of Advanced Technology and Zhongshan First Hospital jointly proposed the GSP-GCNs model

A research team from the First Affiliated Hospital of Sun Yat-sen University and the Shenzhen Institute of Advanced Technology proposed a deep learning model, Graph Signal Processing-Graph Convolutional Networks (GSP-GCNs), which diagnoses Parkinson's disease using event-related EEG data recorded during specific tasks involving tone regulation. The related paper has been published in the journal Nature.

View the full report:

https://hyper.ai/news/29189

2. The Ministry of Science and Technology has taken action! The AIGC user manual for researchers is here, and the academic community is beginning to guard against AI ghostwriters

On December 21, 2023, the Supervision Department of the Ministry of Science and Technology issued the "Guidelines for Responsible Research Conduct (2023)", which regulate the use of AI and other technologies in scientific research, responding to issues of broad public concern such as artificial intelligence and the publication of major results.

View the full report:

https://hyper.ai/news/29228

3. A paper from the Institute of Semiconductors of the Chinese Academy of Sciences was published in the top journal TNNLS again, offering a new perspective on exploring mathematical expressions

Researchers from the Institute of Semiconductors of the Chinese Academy of Sciences treat solving for the structure of an expression as a classification problem and address it through supervised learning. They proposed a symbolic network called DeepSymNet to represent symbolic expressions. Compared with several popular symbolic regression (SR) algorithms based on supervised learning, DeepSymNet uses shorter labels, reduces the search space for prediction, and improves the robustness of the algorithm. The related paper has been published in IEEE Transactions on Neural Networks and Learning Systems (TNNLS).

View the full report:

https://hyper.ai/news/29243

Popular Encyclopedia Articles

1. Representation learning

2. Long Short-Term Memory (LSTM)

3. Least squares method

4. Grid computing

5. Reciprocal Rank Fusion (RRF)

We have compiled hundreds of AI-related terms to help you understand "artificial intelligence":

https://hyper.ai/wiki

That is all for this week's editor's picks. If you have resources you would like to see included on the hyper.ai official website, you are welcome to leave a message or submit an article to let us know!

See you next week!

About HyperAI

HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure for data science in China and to providing rich, high-quality public resources for domestic developers. So far, we have:

* Provided domestic accelerated download nodes for 1200+ public datasets

* Collected 300+ classic and popular online tutorials

* Published interpretations of 100+ AI4Science paper cases

* Supported search for 500+ related terms

* Hosted the first complete Chinese documentation for Apache TVM in China

Visit the official website to start your learning journey:

https://hyper.ai/