HyperAI

Weekly Editor's Picks | COIG-CQIA Dataset Is Online, and the ComfyUI Text-to-Image Workflow Can Be Run Online

a year ago
Information
zhaorui

To fill the gap in high-quality Chinese datasets, 10 institutions including the Chinese Academy of Sciences, 01.AI, and Peking University jointly developed the COIG-CQIA dataset. Surprisingly, the data drawn from Ruozhiba (a Baidu Tieba forum) in this dataset is of much higher quality than the data from knowledge communities such as Zhihu, Douban, and SegmentFault. The COIG-CQIA dataset is now available on the hyper.ai website. Come and take a look!

Updates on the hyper.ai official website from April 8 to April 12:

* High-quality public datasets: 10

* Selected high-quality tutorials: 2

* Community article selection: 5 articles

* Popular encyclopedia entries: 5

Visit the official website: hyper.ai

Selected public datasets

1. COIG-CQIA High-quality Chinese instruction fine-tuning dataset

COIG-CQIA stands for Chinese Open Instruction Generalist – Quality is All You Need. It is an open-source instruction fine-tuning dataset that aims to provide the Chinese NLP community with high-quality data that conforms to human interaction behavior.

Direct use: https://go.hyper.ai/Pg37L

2. EgoExoLearn Cross-Perspective Skill Learning Dataset

The EgoExoLearn dataset contains 120 hours of video collected in daily-life scenes and professional laboratories. It includes not only demonstration videos, but also egocentric (first-person) videos recorded by participants as they perform the task after watching the demonstration.

Direct use: https://go.hyper.ai/cYsPM

3. S2S-Sim Ship Collaborative Perception Simulation Dataset

S2S-Sim, developed by Shanghai University, is the first simulation dataset for ship collaborative perception. It contains 7,000 frames of data with 96,881 accurately annotated ship bounding boxes. Its purpose is to support effective collaborative perception between ships, with a particular focus on research into autonomous driving systems and ship collaborative perception.

Direct use: https://go.hyper.ai/AVWp2

4. Common Corpus-zh Chinese public domain dataset

Common Corpus was jointly created by Pleias, Hugging Face and other institutions. It is currently the largest public domain dataset dedicated to training large language models (LLMs). The dataset brings together 500 billion words from diverse cultural heritage projects around the world, covering English, French, Chinese, Spanish, German, Italian and other languages, and is described as the most comprehensive language resource library to date.

Direct use: https://go.hyper.ai/hvuV5

5. TriviaQA A large dataset for reading comprehension and question answering

TriviaQA is a reading comprehension dataset containing more than 650,000 question-answer-evidence triples. It includes 95K question-answer pairs, with evidence drawn from 662K documents from Wikipedia and the web.

Direct use: https://go.hyper.ai/aant8

6. HalluQA Chinese Large Model Hallucination Evaluation Dataset

The HalluQA dataset contains 450 adversarial questions spanning multiple fields and involving Chinese history, culture, customs, and social phenomena.

Direct use: https://go.hyper.ai/pWyqe

7. Flood analysis and prediction datasets generated by AI models

This dataset accompanies the paper "Global prediction of extreme floods in ungauged watersheds". It mainly consists of the flood reanalysis (1984-2021) and reforecast (2014-2021) data generated by the AI model, together with the corresponding GloFAS benchmark data.

Direct use: https://go.hyper.ai/bpsG3

8. MASSTAR Multimodal Large Scene Dataset

MASSTAR is a multimodal large-scale scene dataset jointly proposed by Sun Yat-sen University, Hong Kong University of Science and Technology and other institutions. It contains more than 1,000 scene-level 3D mesh models, some of which are from the real world.

Direct use: https://go.hyper.ai/eLZUy

9. VideoBadminton badminton video action recognition dataset

VideoBadminton is a high-quality badminton video dataset created by Auburn University and National Central University. It contains footage of 19 male and female athletes from the National Central University team and covers 18 badminton moves, with 7,822 video clips and a total duration of 145 minutes.

Direct use: https://go.hyper.ai/w5ToD

10. FineFake: A fine-grained multi-domain fake news detection dataset

FineFake is a dataset for fine-grained multi-domain fake news detection, jointly created by Beihang University and Beijing University of Posts and Telecommunications. The dataset contains 16,909 data samples, covering 6 semantic topics and 8 different platforms. Each news sample contains multiple forms of content, including text, pictures, and potential social context information.

Direct use: https://go.hyper.ai/CNWIn

For more public datasets, please visit:

https://hyper.ai/datasets

Selected Public Tutorials

1. The cost can be reduced by up to 16 times. The ComfyUI Stable Cascade tutorial is now online and can be deployed with one click!

This tutorial shows how to use the ComfyUI Stable Cascade workflow for AI painting. The environment is already set up and the default Stable Cascade text-to-image workflow is built in. The nodes are connected in advance to simplify use, and a picture can be produced in 2 seconds.

Run online: https://go.hyper.ai/lJGLF

2. Crop Disease Image Classification Tutorial

This tutorial uses PyTorch for crop disease image classification. It can help with training machine learning models that detect plant diseases and with developing algorithms for automatic plant diagnosis.

Run online: https://go.hyper.ai/

Community Articles

1. A comprehensive collection of large model resources | 30 high-quality NLP datasets and models, 8 demos for one-click use, worth bookmarking!

This article summarizes the resources related to large models, including 15 datasets, 15 models and 8 large model demos, with download and usage links.

View the full article: https://go.hyper.ai/sYC6h

2. Effectively identifying 630,000 three-dimensional spatial configurations: Tsinghua University leads the release of the Uni-MOF model for predicting MOF adsorption capacity

Professor Lu Diannan's team from the Department of Chemical Engineering at Tsinghua University led the development of Uni-MOF, a machine learning model for predicting the adsorption behavior of three-dimensional MOF materials. Through pre-training, the model can identify and restore the three-dimensional structure of nanoporous materials. It also takes into account operating conditions such as temperature, pressure, and different gas molecules, making it suitable for both scientific research and practical applications. The relevant results have been published in the journal "Nature".

View the full report: https://go.hyper.ai/VWFVo

3. Routine blood tests, urine tests and other indicators can identify ovarian cancer! Liu Jihong's team from Sun Yat-sen University led the effort, and four major medical schools jointly built an AI fusion model

Sun Yat-sen University, Southern Medical University, Huazhong University of Science and Technology and Zhejiang University jointly built MCF, an artificial intelligence fusion model for ovarian cancer diagnosis. It calculates the risk of ovarian cancer from routine laboratory test data and age. The model is more accurate than traditional biomarkers such as CA125 and HE4. The related results have been published in The Lancet Digital Health.

View the full report: https://go.hyper.ai/prEbC

4. Insight into Insilico: The leap, dilemma and breakthrough of the AI pharmaceutical star company

Insilico Medicine, which has raised $407.5 million, failed to list on the Hong Kong stock market in January this year and submitted its second listing application on March 27. Amid the Sino-US technology rivalry, its position as an "American startup, Chinese co-CEO, American shareholders, Chinese headquarters..." has doubled the pressure on it. How it will break through, and whether it can win the title of "the first AI pharmaceutical stock", remains unknown. This article takes an in-depth look at this star AI pharmaceutical company in terms of technology development, team composition, and business development.

View the full report: https://go.hyper.ai/llREq

Popular Encyclopedia Articles

1. LangChain

2. Mixture of Experts (MoE)

3. Grouped-Query Attention (GQA)

4. Reciprocal Rank Fusion (RRF)

5. Recall Rate

We have compiled hundreds of AI-related terms to help you understand "artificial intelligence":

https://hyper.ai/wiki

Bilibili live broadcast preview

| Date | Time | Content |
| --- | --- | --- |
| Monday, April 15 | 10:00 | Google I/O conferences over the years |
| Tuesday, April 16 | 10:00 | MIT Deep Learning Course 2020 |
| Wednesday, April 17 | 10:00 | MIT Deep Learning Course 2021 |
| Thursday, April 18 | 10:00 | Comprehensive course for beginners on Python API development |
| Friday, April 19 | 10:00 | Flutter courses for beginners |
| Saturday, April 20 | 10:00 | Harvard CS50 Python Artificial Intelligence Course |
| Sunday, April 21 | 10:00 | Stanford HAI Symposium |

Super Neuro TV broadcasts live 24/7. Click to get the "electronic pickles" in the AI field:

http://live.bilibili.com/26483094

That's all for this week's editor's picks. If you have resources you would like to see included on the hyper.ai official website, you are welcome to leave a message or submit an article to let us know!

See you next week!

About HyperAI

HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure in the field of data science in China and to providing rich, high-quality public resources for domestic developers. So far, we have:

* Provided domestic accelerated download nodes for 1200+ public datasets

* Included 300+ classic and popular online tutorials

* Interpreted 100+ AI4Science paper cases

* Supported search for 500+ related terms

* Hosted the first complete Chinese documentation for Apache TVM in China

Visit the official website to start your learning journey:

https://hyper.ai