Weekly Editor's Picks | COIG-CQIA Dataset Is Online, and the ComfyUI Text-to-Image Workflow Can Be Run Online

To fill the gap in high-quality Chinese datasets, 10 institutions, including the Chinese Academy of Sciences, 01.AI, and Peking University, jointly developed the COIG-CQIA dataset. Surprisingly, the data drawn from Ruozhiba (a Baidu Tieba forum) in this dataset is of much higher quality than the data from knowledge communities such as Zhihu, Douban, and SegmentFault. The COIG-CQIA dataset is now available on the hyper.ai website. Come and take a look!
Updates to the hyper.ai official website from April 8 to April 12:
* High-quality public datasets: 10
* Selected high-quality tutorials: 2
* Selected community articles: 5
* Popular encyclopedia entries: 5
Visit the official website: hyper.ai
Selected public datasets
1. COIG-CQIA High-quality Chinese instruction fine-tuning dataset
COIG-CQIA stands for Chinese Open Instruction Generalist – Quality is All You Need. It is an open source high-quality instruction fine-tuning dataset that aims to provide the Chinese NLP community with high-quality instruction fine-tuning data that conforms to human interaction behavior.
Direct use: https://go.hyper.ai/Pg37L
2. EgoExoLearn Cross-Perspective Skill Learning Dataset
The EgoExoLearn dataset contains 120 hours of video collected from daily life scenes and professional laboratories. It includes not only demonstration videos, but also first-person (egocentric) videos that performers recorded of themselves carrying out the task after watching the demonstration.
Direct use: https://go.hyper.ai/cYsPM
3. S2S-Sim Ship Collaborative Perception Simulation Dataset
S2S-Sim, developed by Shanghai University, is the first simulation dataset for ship collaborative perception. It contains 7,000 frames of data with 96,881 accurately annotated ship bounding boxes. It is intended to support effective collaborative perception between ships, with a particular focus on research into autonomous driving systems and ship collaborative perception.
Direct use: https://go.hyper.ai/AVWp2
4. Common Corpus-zh Chinese public domain dataset
Common Corpus was jointly created by Pleias, Hugging Face, and other institutions. It is currently the largest public domain dataset dedicated to training large language models (LLMs). The dataset brings together 500 billion words from diverse cultural heritage projects around the world, covering English, French, Chinese, Spanish, German, Italian, and other languages. It is the most comprehensive language resource of its kind to date.
Direct use: https://go.hyper.ai/hvuV5
5. TriviaQA A large dataset for reading comprehension and question answering
TriviaQA is a reading comprehension dataset containing more than 650,000 question-answer-evidence triples. It includes 95K question-answer pairs, with evidence drawn from 662K documents from Wikipedia and the web.
Direct use: https://go.hyper.ai/aant8
6. HalluQA Chinese Large Model Hallucination Evaluation Dataset
The HalluQA dataset contains 450 adversarial questions spanning multiple fields and involving Chinese history, culture, customs, and social phenomena.
Direct use: https://go.hyper.ai/pWyqe
7. Flood analysis and prediction datasets generated by AI models
This dataset is the research data accompanying the paper "Global prediction of extreme floods in ungauged watersheds". It mainly consists of the flood reanalysis (1984-2021) and reforecast (2014-2021) data generated by the AI model, together with the corresponding GloFAS benchmark data.
Direct use: https://go.hyper.ai/bpsG3
8. MASSTAR Multimodal Large Scene Dataset
MASSTAR is a multimodal large-scale scene dataset jointly proposed by Sun Yat-sen University, Hong Kong University of Science and Technology and other institutions. It contains more than 1,000 scene-level 3D mesh models, some of which are from the real world.
Direct use: https://go.hyper.ai/eLZUy
9. VideoBadminton badminton video action recognition dataset
VideoBadminton is a high-quality video dataset for badminton created by Auburn University and National Central University. The dataset contains badminton video data of 19 male and female athletes from the National Central University team, covering 18 badminton moves, a total of 7,822 video clips, and a total duration of 145 minutes.
Direct use: https://go.hyper.ai/w5ToD
10. FineFake: A fine-grained multi-domain fake news detection dataset
FineFake is a dataset for fine-grained multi-domain fake news detection, jointly created by Beihang University and Beijing University of Posts and Telecommunications. The dataset contains 16,909 data samples, covering 6 semantic topics and 8 different platforms. Each news sample contains multiple forms of content, including text, pictures, and potential social context information.
Direct use: https://go.hyper.ai/CNWIn
For more public datasets, please visit:
Selected Public Tutorials
This tutorial shows how to use the ComfyUI Stable Cascade workflow for AI painting. The environment is already set up and the default Stable Cascade text-to-image workflow is built in, with the nodes pre-connected to simplify usage. It can produce a picture in 2 seconds.
Run online: https://go.hyper.ai/lJGLF
2. Crop Disease Image Classification Tutorial
This tutorial uses PyTorch for crop disease image classification. It can help with training machine learning models to detect plant diseases or with developing automated plant diagnosis algorithms.
Run online: https://go.hyper.ai/
Community Articles
This article summarizes the resources related to large models, including 15 datasets, 15 models and 8 large model demos, with download and usage links.
View the full article: https://go.hyper.ai/sYC6h
Professor Lu Diannan's team from the Department of Chemical Engineering at Tsinghua University has proposed Uni-MOF, a machine learning model for predicting the adsorption behavior of three-dimensional MOF materials. Through pre-training, the model can identify and reconstruct the three-dimensional structure of nanoporous materials. It also takes into account operating conditions such as temperature, pressure, and different gas molecules, making it suitable for both scientific research and practical applications. The results have been published in the journal Nature.
View the full report: https://go.hyper.ai/VWFVo
Sun Yat-sen University, Southern Medical University, Huazhong University of Science and Technology and Zhejiang University jointly built the MCF artificial intelligence fusion model for ovarian cancer diagnosis. The risk of ovarian cancer can be calculated by inputting routine laboratory test data and age. The model's accuracy is better than traditional biomarkers such as CA125 and HE4. Related results have been published in The Lancet Digital Health.
View the full report: https://go.hyper.ai/prEbC
4. Insight into Insilico: The leap, dilemma and breakthrough of the AI pharmaceutical star company
Insilico Medicine, which has raised $407.5 million, failed in its bid to list on the Hong Kong stock market in January this year and submitted its second listing application on March 27. Amid the US-China technology rivalry, its position as an "American startup with a Chinese co-CEO, American shareholders, and a Chinese headquarters..." has doubled the pressure it faces. How it will break through, and whether it can win the title of "the first AI pharmaceutical stock", remains unknown. This article takes an in-depth look at this star AI pharmaceutical company in terms of its technology development, team composition, and business development.
View the full report: https://go.hyper.ai/llREq
Popular Encyclopedia Articles
1. LangChain
2. Mixture of Experts (MoE)
3. Grouped Query Attention (GQA)
4. Reciprocal Rank Fusion (RRF)
5. Recall Rate
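Of the terms above, Reciprocal Rank Fusion lends itself to a short illustration. The sketch below is a minimal example, not taken from the encyclopedia entry: the function name `reciprocal_rank_fusion` and the sample document IDs are hypothetical, and the smoothing constant `k = 60` is a commonly used default that is assumed here. Each document receives a score of 1 / (k + rank) from every ranked list it appears in, and the scores are summed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered from best to worst.
    k: smoothing constant (60 is a commonly used default; assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top item contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from two retrievers (for illustration only).
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_c", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists (here `doc_b` and `doc_c`) rise above documents that appear in only one list, which is the main appeal of the method when combining keyword and vector search results.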
We have compiled hundreds of AI-related terms to help you understand artificial intelligence:
Bilibili Livestream Preview
Date | Time | Content |
Monday, April 15 | 10:00 | Google I/O conferences over the years |
Tuesday, April 16 | 10:00 | MIT Deep Learning Course 2020 |
Wednesday, April 17 | 10:00 | MIT Deep Learning Course 2021 |
Thursday, April 18 | 10:00 | Comprehensive beginner's course on Python API development |
Friday, April 19 | 10:00 | Flutter course for beginners |
Saturday, April 20 | 10:00 | Harvard CS50 Python Artificial Intelligence Course |
Sunday, April 21 | 10:00 | Stanford HAI Symposium |
Super Neuro TV streams live 24/7. Click through for your "electronic pickles" (something to watch in the background) from the AI field:
http://live.bilibili.com/26483094
That is all for this week's editor's picks. If you have resources you would like to see included on the hyper.ai official website, you are welcome to leave a message or submit an article to let us know!
See you next week!
About HyperAI
HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure for data science in China and to providing rich, high-quality public resources for domestic developers. So far, we have:
* Provided domestic accelerated download nodes for 1200+ public datasets
* Included 300+ classic and popular online tutorials
* Interpreted 100+ AI4Science paper cases
* Supported search for 500+ related terms
* Hosted the first complete Chinese documentation for Apache TVM in China
Visit the official website to start your learning journey: