Weekly Editor's Picks | MathPile Mathematical Reasoning Corpus Is Open-Sourced; Peking Union Medical College Hospital Leads the Use of AI to Assist in Detecting 13 Fundus Diseases

Recently, the Generative Artificial Intelligence Research Laboratory (GAIR) at Shanghai Jiao Tong University open-sourced MathPile, a high-quality and diverse pre-training dataset tailored to the field of mathematics, together with its commercial version, MathPile-Commercial. Both can now be downloaded from the hyper.ai official website, alongside MathVista, Math23K, and other popular mathematical datasets.
Updates on the hyper.ai official website from February 19 to February 23:
* High-quality public datasets: 10
* AI4S paper cases: 4
* Popular encyclopedia entries: 10
Visit the official website: hyper.ai
Selected public datasets
1. MathPile Mathematical Reasoning Pre-training Corpus
The Generative Artificial Intelligence Research Laboratory of Shanghai Jiao Tong University has released MathPile, a high-quality, diverse pre-training corpus built specifically for mathematics. It contains approximately 9.5 billion tokens and is designed to improve the mathematical reasoning capabilities of large models.
Direct use:
https://hyper.ai/datasets/29543
2. MathPile-Commercial Mathematical Reasoning Pre-training Corpus (Commercial Version)
MathPile-Commercial is the commercial-use version of MathPile, obtained by removing from the latest version of MathPile (v0.2) the documents that prohibit commercial use. Specifically, the research team ran non-commercial-use detection on the source data: for arXiv sources it used the license information in the metadata, and for other sources it used keyword matching.
Direct use:
https://hyper.ai/datasets/29545
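The filtering described for MathPile-Commercial can be pictured roughly as follows. This is only a minimal sketch, not the research team's actual pipeline: the document fields (`source`, `license`, `text`), the keyword list, and the function names are all hypothetical and chosen purely for illustration.

```python
# Hypothetical keyword list; the keywords actually used for MathPile are not stated here.
NONCOMMERCIAL_KEYWORDS = ("non-commercial", "noncommercial", "cc by-nc", "cc-by-nc")


def prohibits_commercial_use(doc: dict) -> bool:
    """Return True if the document appears to forbid commercial use."""
    if doc.get("source") == "arxiv":
        # arXiv documents: rely on the license recorded in the metadata.
        license_info = doc.get("license", "").lower()
        return "-nc" in license_info or "noncommercial" in license_info
    # Other sources: fall back to keyword matching on the document text.
    text = doc.get("text", "").lower()
    return any(keyword in text for keyword in NONCOMMERCIAL_KEYWORDS)


def commercial_subset(docs: list) -> list:
    """Keep only the documents that do not appear to forbid commercial use."""
    return [doc for doc in docs if not prohibits_commercial_use(doc)]
```

Keyword matching of this kind is deliberately conservative: it may discard some documents that merely mention a non-commercial license, which is usually an acceptable trade-off for a commercially usable corpus.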
3. AI-generated image dataset
This dataset contains 19 images of boys generated by Copilot, an AI companion that creates imaginative and innovative content. These images are suitable for face and pose detection tasks because they vary in facial expressions, poses, backgrounds, lighting, and occlusions.
Direct use:
https://hyper.ai/datasets/29527
4. A diverse AI-generated portrait dataset
The dataset contains 140 high-quality images carefully crafted by advanced AI algorithms, including 70 female portraits and 70 male portraits. Each image in the dataset demonstrates the extraordinary ability of artificial intelligence in mimicking the complexity of human appearance.
Direct use:
https://hyper.ai/datasets/29529
5. THUCNews Chinese text classification dataset
THUCNews was built by filtering the historical data of the Sina News RSS subscription channels from 2005 to 2011. It contains 740,000 news documents (2.19 GB), all in UTF-8 plain text format. Starting from the original Sina News classification system, the research team reorganized the data into 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, current affairs, sports, horoscopes, games, and entertainment.
Direct use:
https://hyper.ai/datasets/29521
6. ShareGPT 90k Chinese and English bilingual human-machine question answering dataset
ShareGPT-Chinese-English-90k is a high-quality, parallel Chinese-English human-machine question-answering dataset that covers user questions from real and complex scenarios. It can be used to train high-quality dialogue models.
Direct use:
https://hyper.ai/datasets/29523
7. SMP-2017 Chinese Conversation Intent Recognition Dataset
This dataset is the SMP2017 Chinese Human-Computer Dialogue Technology Evaluation (ECDT) Task 1 dataset. This evaluation aims to promote the development of research related to Chinese human-computer dialogue systems.
Direct use:
https://hyper.ai/datasets/29515
8. Toutiao text classification dataset
This is a classification dataset of Chinese news short texts from Toutiao. The data comes from the Toutiao client and was collected in May 2018. It contains 382,688 texts in 15 categories.
Direct use:
https://hyper.ai/datasets/29517
For more updated datasets this week, please visit:
ScienceAI Paper Case Studies
The diagnosis of ophthalmic diseases relies heavily on image recognition, which makes ophthalmology well suited to technologies such as deep learning. To further explore the potential value of deep learning in diagnosing fundus diseases, Chen Youxin, director of the Department of Ophthalmology at Peking Union Medical College Hospital, led 5 ophthalmology centers across the country in developing a deep learning system, in cooperation with Beijing Zhiyuan Huitu Technology Co., Ltd. and Professor Li Xirong of the School of Information at Renmin University of China. The system helped junior ophthalmologists improve their diagnostic consistency by about 12% and provides a new method for the automatic detection of 13 major fundus diseases. The relevant paper has been published in the journal "Nature".
View the full report:
The ecological environment has a subtle impact on human health. Professor Wu Xifeng's research team at the School of Public Health of Zhejiang University used a convolutional neural network model to evaluate visible green exposure based on the green view index of street view images, and then explored whether there is a beneficial association between the level of visible greenery in the workplace and metabolic syndrome in adults. The research team used a logistic regression model to evaluate the level of outdoor visible greenery in the working environment of more than 50,000 adults in Hangzhou, confirming the beneficial association between the two. The relevant results have been published in the journal "Environment International".
View the full report:
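As a rough illustration of the green view index used in the study above: it is commonly computed as the share of pixels in a street view image that a segmentation model labels as vegetation. The sketch below assumes a CNN has already produced a per-pixel label mask; the class id `VEGETATION` and the function name are hypothetical, and this is not the research team's code.

```python
VEGETATION = 1  # hypothetical class id that a segmentation CNN assigns to vegetation pixels


def green_view_index(mask: list) -> float:
    """Fraction of pixels labelled as vegetation in a 2D mask (a list of rows)."""
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    green = sum(1 for row in mask for label in row if label == VEGETATION)
    return green / total
```

In a study like the one described, a per-image value of this kind would typically be averaged around each workplace and then entered as an exposure variable in the logistic regression model.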
Professor Yang Xiaokang and colleagues from the AI for Science team at the Institute of Artificial Intelligence, Shanghai Jiao Tong University, proposed a concept for building intelligent scientific facilities. Such facilities would offer innovative functions such as large models for scientific domains, generative simulation and inversion, autonomous unmanned experiments, and large-scale trusted scientific research collaboration. The relevant research results have been published in the "Journal of the Chinese Academy of Sciences".
View the full report:
4. Selected by Amazon engineers, a collection of over 40 LLM papers
More and more companies and traditional industries are beginning to explore how to apply large language models to their own businesses. The rapidly expanding market demand has also driven the further deepening and innovation of research in related fields, and the papers on platforms such as arXiv are being updated more frequently. In order to help everyone retrieve high-value papers faster, Amazon engineer Eugene Yan and others have established a language model paper reading list to continuously share cutting-edge papers. Currently, more than 40 high-quality papers have been compiled.
View the full paper summary:
Popular Encyclopedia Articles
1. Recall (Recall Rate)
2. Reinforcement Learning from Human Feedback (RLHF)
3. Artificial General Intelligence (AGI)
4. Retrieval-Augmented Generation (RAG)
5. Neural Radiance Field (NeRF)
We have compiled hundreds of AI-related terms to help you understand "artificial intelligence":
The above is all the content of this week’s editor’s selection. If you have resources that you want to include on the hyper.ai official website, you are also welcome to leave a message or submit an article to tell us!
See you next week!
About HyperAI
HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure in the field of data science in China and providing rich and high-quality public resources for domestic developers. So far, we have:
* Provided domestic accelerated download nodes for 1,200+ public datasets
* Included 300+ classic and popular online tutorials
* Interpreted 100+ AI4Science paper cases
* Supported search for 500+ related terms
* Hosted the first complete Chinese documentation for Apache TVM in China
Visit the official website to start your learning journey: