Weekly Editor's Picks | FewJoint Benchmark Dataset Launched, Ministry of Science and Technology Supervision Department Releases New AI Regulations

Few-shot learning refers to the ability to learn and master new tasks from very few samples, much as humans do. The field has become a hot topic in the machine learning community and is considered one of the key directions for bringing machine intelligence closer to human intelligence. Harbin Institute of Technology launched the FewJoint benchmark dataset, which provides a public benchmark for few-shot NLP evaluation. This dataset is now available on hyper.ai, along with more NLP datasets for training Chinese large models. Let’s take a look!

From January 29 to February 2, hyper.ai official website updates:

* High-quality public datasets: 10

* AI4S paper cases: 3

* Popular encyclopedia entries: 10

Visit the official website: hyper.ai

Selected public datasets

1. FewJoint few-shot benchmark dataset

The FewJoint benchmark dataset combines real user corpora and expert-constructed corpora from the iFlytek AIUI open platform (in a ratio of approximately 3:7). It covers 59 real domains, making it one of the dialogue datasets with the most domains to date.

Direct use:

https://hyper.ai/datasets/29239

2. 100 PoisonMpts Chinese large model governance dataset

100 PoisonMpts is the industry's first open-source governance dataset for Chinese large language models. Dozens of well-known experts and scholars formed the first batch of "100 bottles of poison for AI" annotators. Each annotator posed 100 tricky questions designed to induce biased or discriminatory answers and then annotated the large model's responses, completing an attack-and-defense cycle with AI from "poisoning" to "detoxification".

Direct use:

https://hyper.ai/datasets/29203

3. CLUE Chinese Language Understanding Evaluation Benchmark Dataset

CLUE (A Chinese Language Understanding Evaluation Benchmark) is a dataset used for training, validation, and testing on Chinese language understanding tasks.

Direct use:

https://hyper.ai/datasets/29094

4. Wikipedia dataset

This dataset is built from Wikipedia dumps, with one subset per language and a single split per subset. Each example contains the content of a complete Wikipedia article, cleaned to remove markup and unwanted sections (such as "References").

Direct use:

https://hyper.ai/datasets/28528

5. CCI Chinese Internet Corpus

The Chinese Corpora Internet (CCI) corpus is drawn from high-quality, trustworthy Internet websites in mainland China. It has undergone rigorous data cleaning and deduplication, along with targeted checks and filtering for content quality.

Direct use:

https://hyper.ai/datasets/29186

6. PKU Simplified Chinese word segmentation dataset

This dataset comes from SIGHAN 2005, the International Chinese Automatic Word Segmentation Evaluation (SIGHAN Evaluation for short), which brings together word segmentation datasets from multiple institutions. It was jointly released by Microsoft Research China, Peking University, City University of Hong Kong, and Academia Sinica in Taiwan, and is used for training and evaluating Chinese word segmentation models. PKU is the Simplified Chinese word segmentation dataset in this collection.

Direct use:

https://hyper.ai/datasets/29168

7. Chinese-Poetry: the most comprehensive database of Chinese classical poetry

This dataset is the most complete database of Chinese classical literature, including 55,000 Tang poems, 260,000 Song poems, 21,000 Song ci (lyric poems), and other classical works. It covers nearly 14,000 poets from the Tang and Song dynasties and about 1,500 ci writers from the Song Dynasty. The data comes from the Internet.

Direct use:

https://hyper.ai/datasets/29257

8. PD&CFT Chinese reading comprehension dataset

This dataset is the first Chinese reading comprehension dataset, which includes text content from People's Daily and Children's Fairy Tale (PD&CFT).

Direct use:

https://hyper.ai/datasets/29260

For more updated datasets this week, please visit:

https://hyper.ai/datasets

ScienceAI Selected Case Studies

1. The accuracy of early diagnosis of Parkinson's disease has been improved to 90.2%. Shenzhen Institute of Advanced Technology and Zhongshan First Hospital jointly proposed the GSP-GCNs model

A research team from the First Affiliated Hospital of Sun Yat-sen University and the Shenzhen Institute of Advanced Technology proposed a deep learning model, Graph Signal Processing-Graph Convolutional Networks (GSP-GCNs), which diagnoses Parkinson's disease from event-related EEG data recorded during a specific task involving tone regulation. The related paper has been published in the journal Nature.

View the full report:

https://hyper.ai/news/29189

2. The Ministry of Science and Technology has taken action! The AIGC user manual for researchers is here, and the academic community is beginning to guard against AI ghostwriters

On December 21, 2023, the Supervision Department of the Ministry of Science and Technology issued the "Guidelines for Responsible Research Conduct (2023)". Responding to issues of wide public concern, such as artificial intelligence and the release of major research results, the guidelines regulate how AI and other technologies are applied in scientific research.

View the full report:

https://hyper.ai/news/29228

3. A paper from the Institute of Semiconductors of the Chinese Academy of Sciences has again been published in the top journal TNNLS, offering a new perspective on exploring mathematical expressions

Researchers from the Institute of Semiconductors of the Chinese Academy of Sciences treat the task of solving for an expression's structure as a classification problem and tackle it with supervised learning. They proposed a symbolic network called DeepSymNet to represent symbolic expressions. Compared with several popular symbolic regression (SR) algorithms based on supervised learning, DeepSymNet uses shorter labels, reduces the search space for prediction, and improves the robustness of the algorithm. The paper has been published in IEEE Transactions on Neural Networks and Learning Systems (TNNLS).

View the full report:

https://hyper.ai/news/29243

Popular Encyclopedia Articles

1. Representation learning

2. Long Short-Term Memory (LSTM)

3. Least squares method

4. Grid computing

5. Reciprocal Rank Fusion (RRF)

Hundreds of AI-related terms have been compiled here to help you understand "artificial intelligence":

https://hyper.ai/wiki

The above is all the content of this week’s editor’s selection. If you have resources that you would like to include on the hyper.ai official website, you are also welcome to leave a message or submit an article to tell us!

See you next week!

About HyperAI

HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure in the field of data science in China and providing rich and high-quality public resources for domestic developers. So far, we have:

* Provided domestic accelerated download nodes for 1200+ public datasets

* Included 300+ classic and popular online tutorials

* Interpreted 100+ AI4Science paper cases

* Supported search for 500+ related terms

* Hosted the first complete Apache TVM Chinese documentation in China

Visit the official website to start your learning journey:

https://hyper.ai/

