HyperAI

Weekly Editor's Picks | Microsoft Open-sources Orca-Math High-quality Mathematical Dataset, Tsinghua University Research Team Releases Conditional Denoising Diffusion Model SPDiff

a year ago
Information
zhaorui

Orca-Math is a mathematical reasoning model released by Microsoft Research. The model demonstrates the value of smaller, specialized models in specific domains, where they can match or even exceed the performance of larger models. Microsoft recently open-sourced Orca-Math-200K, the math word problem dataset used to train Orca-Math. It is now available for download on the hyper.ai website. Come and try it!

Updates to the hyper.ai website from March 11 to March 15:

* High-quality public datasets: 10

* High-quality public tutorials: 2

* Selected community articles: 3

* Popular encyclopedia entries: 10

Visit the official website: hyper.ai

Selected Public Datasets

1. Orca-Math-200K Microsoft Math Word Problems Dataset

Orca-Math-200K is a high-quality synthetic dataset created by Microsoft that contains approximately 200,000 elementary school math questions. All answers in this dataset are generated using Azure GPT4-Turbo.

Direct use:

https://my5353.com/30060

2. MULTI-Benchmark: A leaderboard for multimodal understanding with text and images

MULTI is a multimodal benchmark released by Shanghai Jiao Tong University to evaluate how well large multimodal models understand complex tables and images and reason over long texts. It provides multimodal inputs and requires either precise or open-ended answers, reflecting the style of real-life exams. MULTI contains more than 18,000 questions, covering a variety of tasks from formula derivation to image analysis and cross-modal reasoning.

Direct use:

https://my5353.com/30062

3. IEPile Large-Scale Information Extraction Corpus 

IEPile is a large-scale, high-quality bilingual (Chinese and English) information extraction (IE) instruction fine-tuning dataset developed by Zhejiang University, covering three core subtasks: named entity recognition (NER), relation extraction (RE), and event extraction (EE). The dataset contains about 2 million instruction samples, totaling about 320 million tokens, covering multiple fields such as general, medical, and financial.

Direct use:

https://my5353.com/30064

4. FFHQ-UV-Intrinsic Facial Attributes Dataset for 3D Face Reconstruction

FFHQ-UV-Intrinsic is an intrinsic facial attribute dataset built by Ubisoft LaForge based on the FFHQ-UV dataset. The dataset contains the facial intrinsic attributes of 10,000 subjects, including diffuse reflection, specular reflection, ambient occlusion, and translucency maps. It is the first public, large-scale, high-resolution facial dataset that provides intrinsic attributes.

Direct use:

https://my5353.com/30113

5. GITQA Multimodal Graph Reasoning Question Answering Dataset

GITQA, built by Hong Kong University of Science and Technology and Southern University of Science and Technology, is the first reasoning question-answering dataset that contains visual graphs. It includes more than 423K question-answering instances, each comprising graph structure, text, and visual information together with the corresponding question-answer pair.

Direct use:

https://my5353.com/30116

6. SMolInstruct Chemical Instruction Fine-tuning Dataset

SMolInstruct is a large-scale, comprehensive, and high-quality chemical instruction fine-tuning dataset proposed by Ohio State University. The dataset contains 14 different chemical tasks, a total of more than 3 million samples, and covers 1.6 million unique molecules.

Direct use:

https://my5353.com/30133

7. MusicPile Large Music Dataset

MusicPile is a large-scale music-language pre-training dataset jointly launched by the Multimodal Art Projection Research Community, Skywork AI, and Hong Kong University of Science and Technology. The dataset contains 5.17 million samples and about 4.16 billion tokens, drawn from sources such as music books, YouTube music subtitles, and ABC notation works. MusicPile covers general music knowledge, knowledge Q&A, and typical music theory content, which plays a key role in improving the music understanding and generation abilities of large models.

Direct use:

https://my5353.com/30136

8. seq-monkey: Sequence Monkey Open-Source Dataset 1.0

Sequence Monkey is a large language model from Mobvoi, and the Sequence Monkey dataset was used to train it. Part of the dataset has been released to the public, covering a Chinese general text corpus, an ancient poetry translation corpus, and a text generation corpus.

Direct use:

https://my5353.com/30139

9. Douban Movie Short Review Dataset V2

This dataset contains more than 2 million short reviews of 28 movies from the Douban Movie website. It can be used for text classification, text clustering, sentiment analysis, semantic network construction, and other web mining or NLP tasks.

Direct use:

https://my5353.com/30011

10. AdaDR - Dataset from the paper "Drug Repositioning Based on Adaptive GCN Method"

This is the dataset used in the paper "Drug Repositioning Based on Adaptive GCN Method". To comprehensively evaluate the proposed model, the research team used four benchmark datasets: Gdataset (Gottlieb et al. 2011), Cdataset (Luo et al. 2016), Ldataset (Yu et al. 2021), and LRSSL (Liang et al. 2017), all of which can be applied to drug repositioning tasks.

Direct use:

https://my5353.com/30057

For more updated datasets this week, please visit:

https://hyper.ai/datasets

Selected Public Tutorials

1. Flower Classification Using Transfer Learning

This tutorial demonstrates how to use transfer learning to perform image classification on a dataset of flower images. It uses a pre-trained convolutional neural network (CNN) as a feature extractor and builds a custom classifier on top of it to predict the species of the flower.

Run the tutorial online:

https://my5353.com/n30069
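
The feature-extractor idea described for the flower classification tutorial can be sketched in a few lines. This is a minimal, dependency-free illustration and not the tutorial's code: the pre-trained CNN is replaced by a tiny fixed linear layer with ReLU (`FROZEN_WEIGHTS`, hypothetical values), the "flowers" are made-up two-number samples, and the custom classifier is a logistic-regression head. The point it shows is that only the head is trained while the backbone stays frozen.

```python
import math
from typing import List, Sequence, Tuple

# Stand-in for a pre-trained backbone (hypothetical values). In the tutorial
# this role is played by a pre-trained CNN; here it is a fixed linear layer
# with ReLU whose weights are never updated.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]


def extract_features(x: Sequence[float]) -> List[float]:
    """Frozen feature extractor: linear projection followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_head(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Train only the classifier head (logistic regression) with SGD."""
    # Features are computed once because the backbone does not change.
    features = [extract_features(x) for x in samples]
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias)
            grad = p - y  # gradient of the log loss w.r.t. the score
            weights = [w - lr * grad * v for w, v in zip(weights, f)]
            bias -= lr * grad
    return weights, bias


def predict(x: Sequence[float], weights: Sequence[float], bias: float) -> int:
    f = extract_features(x)
    score = sum(w * v for w, v in zip(weights, f)) + bias
    return 1 if score >= 0.0 else 0


# Made-up two-class toy data standing in for flower images.
train_x = [[2.0, 0.5], [1.5, 0.2], [0.3, 1.8], [0.4, 2.2]]
train_y = [0, 0, 1, 1]
head_weights, head_bias = train_head(train_x, train_y)
predictions = [predict(x, head_weights, head_bias) for x in train_x]
```

With a real pre-trained network the structure is the same: freeze the backbone, compute its features, and fit a small classifier on top.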

2. Quantizing Vision Transformers (ViT) for Efficient Deployment: Strategies and Best Practices

As the demand for advanced computer vision systems continues to surge across industries, the deployment of Vision Transformers has become a focus for researchers and practitioners. However, to fully realize the potential of these models, a deep understanding of their architecture is required. In addition, developing optimization strategies to effectively deploy these models is equally important.

This tutorial provides a comprehensive exploration of the Vision Transformer architecture, its key components, and the fundamentals that make it unique. At the end of the tutorial, several optimization strategies are discussed with code walkthroughs to make the model more compact and easier to deploy.

Run the tutorial online:

https://my5353.com/n30119
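
To give a rough sense of what quantization does to a model's weights, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration under assumptions, not the tutorial's code: the weight values are made up, and the function names `quantize_int8` and `dequantize` are hypothetical. Real ViT deployments would rely on a framework's quantization tooling.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # The scale maps the largest magnitude onto 127; fall back to 1.0
    # for an all-zero (or empty) tensor to avoid division by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


# Made-up weights standing in for one layer of a model.
weights = [0.82, -1.27, 0.003, 0.4, -0.65]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

# Rounding to the nearest integer bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale per value. Very small weights such as `0.003` collapse to zero, which is one reason practical schemes often use per-channel scales or calibration.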

Community Articles

1. Optimal performance with only 5% of the training samples: the Tsinghua University research team releases the conditional denoising diffusion model SPDiff for long-range crowd simulation

A research team from Tsinghua University proposed a novel conditional denoising diffusion model SPDiff, which can effectively utilize interaction dynamics to simulate crowd behavior through a diffusion process guided by social forces. The related paper has been published in AAAI 2024.

View the full report:

https://my5353.com/n30069

2. The Beijing Normal University research team established the ECA-Net model to predict China's wind energy utilization potential in the next 70 years

Recently, a research team from the School of Environment at Beijing Normal University published a paper evaluating how China's wind energy potential will change under global warming. The study used the output of 22 CMIP6 global climate models to reliably assess the uncertainty between models. The results show that China's overall wind energy density will decline slightly this century. The paper has been published in "ACS Publications".

View the full report:

https://my5353.com/n30119

3. Countdown to Nvidia GTC 2024: will Jensen Huang (Huang Renxun) bring new initiatives for the Chinese market?

The GTC 2024 AI conference is scheduled for March 18-21. Jensen Huang (Huang Renxun) will deliver his annual keynote from 4:00 to 6:00 a.m. Beijing time on March 19, with the theme "Witnessing the Transformation Moment of AI". HyperAI made some bold predictions about the keynote topics based on his recent talks and interviews and on industry trends.

View the full report:

https://my5353.com/n30151

Selected Encyclopedia Entries

1. Mean Average Precision (mAP)

2. Instance Segmentation

3. Intersection over Union (IoU)

4. Polynomial Interpolation

5. Reciprocal Rank Fusion (RRF)

We have compiled hundreds of AI-related terms to help you understand artificial intelligence:

https://hyper.ai/wiki

Bilibili Livestream Preview

| Date | Time | Content |
| --- | --- | --- |
| Monday, March 18 | 10:00 | MIT Deep Learning Course 2020 |
| Monday, March 18 | 17:00 | MIT Deep Learning Course 2021 |
| Tuesday, March 19 | 10:00 | Python API Development - Comprehensive Course for Beginners |
| Wednesday, March 20 | 10:00 | SQL Tutorial - Beginner Course |
| Wednesday, March 20 | 14:00 | Generative AI Full Course |
| Thursday, March 21 | 21:00 | Flutter courses for beginners |
| Friday, March 22 | 10:00 | Flutter courses for beginners |
| Saturday, March 23 | 10:00 | Harvard CS50 - Python Artificial Intelligence Course |
| Sunday, March 24 | 10:00 | Learn PyTorch for Deep Learning in One Day |

Super Neuro TV broadcasts live 24/7, continuously delivering AI industry insights. Let’s learn together:

http://live.bilibili.com/26483094

That is all for this week’s editor’s picks. If you have resources you would like to see included on the hyper.ai website, feel free to leave a message or submit an article to let us know!

See you next week!

About HyperAI

HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure for data science in China and to providing rich, high-quality public resources for domestic developers. So far, we have:

* Provided domestic accelerated download nodes for 1,200+ public datasets

* Collected 300+ classic and popular online tutorials

* Interpreted 100+ AI4Science paper cases

* Supported search for 500+ related terms

* Hosted the first complete Chinese documentation for Apache TVM in China

Visit the official website to start your learning journey:

https://hyper.ai/