NeurIPS 2024 Dataset Summary | Covering Cloud Removal, Chemical Spectroscopy, Singing Audio, Autonomous Driving, Insect Specimens, and More

NeurIPS, short for the Conference on Neural Information Processing Systems, is an annual academic conference on neural information processing systems. It was first held in 1987 under the abbreviation NIPS. As the field of artificial intelligence developed rapidly, the conference's influence grew, and it drew the attention of more and more researchers and companies. To better reflect the wide range of fields the conference covers, NIPS was officially renamed NeurIPS in 2018.
Today, NeurIPS has become one of the most authoritative academic conferences in the field of artificial intelligence in the world, attracting scholars, entrepreneurs, and researchers from all over the world.
NeurIPS 2024 is the 38th edition of the conference, and its academic output remains impressive. The conference reportedly received 15,671 valid submissions this year and accepted about 4,000 papers.
HyperAI has compiled 10 high-quality open source datasets from those presented at the conference, covering cloud removal, chemical spectra, singing audio, autonomous driving, insect specimens, and more. Download them as needed.
Click here to learn more about the conference:
https://go.hyper.ai/vWvAW

NeurIPS 2024 Dataset Summary
1. AllClear Public Cloud Removal Dataset
Publishing Agency: Cornell University, Columbia University
Estimated size: 22.42 GB
Download address: https://go.hyper.ai/iRqtm
Clouds in satellite images pose a significant challenge for downstream applications. A major problem facing current cloud removal research is the lack of comprehensive benchmarks and sufficiently large and diverse training datasets. AllClear is currently the largest public cloud removal dataset, containing 23,742 globally distributed regions of interest (ROIs), covering a variety of land use patterns, and a total of 4 million images.
2. Muharaf Handwritten Arabic Dataset
Publishing Agency: North Carolina State University, Holy Spirit University of Kaslik, Lebanese Historical Society
Estimated size: 9.83 GB
Download address: https://go.hyper.ai/yztH6
The Muharaf dataset is a machine learning dataset for handwritten Arabic recognition. It contains more than 1,600 images of historical handwritten pages, transcribed by experts in archival Arabic. Each document image is accompanied by the spatial polygon coordinates of its text lines and information about basic page elements, with the aim of advancing the state of the art in handwritten text recognition (HTR).
3. Chemical Multimodal Spectroscopic Dataset
Publishing Agency: IBM Research, University of Zurich, EPFL, NCCR Catalysis
Estimated size: 9.7 GB
Download address: https://go.hyper.ai/ZdXk8
The dataset contains simulated 1H-NMR, 13C-NMR, HSQC-NMR, infrared, and mass spectrometry (positive and negative ion modes) spectra for 790,000 molecules extracted from chemical reactions in patent data. Its core value lies in combining information from multiple spectral modalities, mimicking how human experts analyze molecular structures. This is expected to help automate structural analysis and simplify the molecular discovery process from synthesis to structure determination.
4. GTSinger Singing Audio Dataset
Publishing Agency: Zhejiang University
Estimated size: 28.94 GB
Download address: https://go.hyper.ai/7jdi2
The dataset contains 80.59 hours of singing recorded in professional studios by 20 professional singers in 9 different languages, including Chinese, English, Japanese, Korean, etc., providing researchers with a resource library with extremely rich timbres and styles.
5. DrivingDojo Autonomous Driving Dataset
Publishing Agency: Chinese Academy of Sciences; Meituan; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences
Download address: https://go.hyper.ai/W3eDT
The dataset contains about 18,000 video clips recorded in cities such as Beijing, Shenzhen, and Xuzhou under varied weather and lighting conditions. It includes not only longitudinal maneuvers such as acceleration, emergency braking, and stop-and-go driving, but also lateral maneuvers such as U-turns, overtaking, and lane changes. The dataset also deliberately includes videos with a large number of multi-agent interaction trajectories, with the aim of improving the prediction and control capabilities of world models in complex driving environments.
6. BIOSCAN-5M Multimodal Insect Biodiversity Dataset
Publishing Agency: Centre for Biodiversity Genomics, University of Guelph, University of Waterloo, etc.
Estimated size: 37.71 GB
Download address: https://go.hyper.ai/Ljjwp
The BIOSCAN-5M dataset contains detailed information on more than 5 million insect specimens, significantly expanding existing image-based biological datasets. It not only includes classification labels, raw nucleotide barcode sequences, assigned barcode index numbers and geographic information, but also covers multimodal information such as specimen size, aiming to understand and monitor global insect biodiversity.
7. OpenSatMap High-Resolution Satellite Dataset
Publishing Agency: Chinese Academy of Sciences; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences; Tencent Maps; Beijing University of Posts and Telecommunications
Estimated size: 57.7 GB
Download address: https://go.hyper.ai/g54aa
OpenSatMap is a high-resolution satellite dataset designed for large-scale map construction. It features fine-grained instance-level annotations and contains 3,787 high-resolution satellite images, covering not only multiple cities in China but also more than 50 cities and 18 countries around the world.
8. Natural Species Sound Dataset
Publishing Agency: University of Massachusetts Amherst, iNaturalist
Estimated size: 131.26 GB
Download address: https://go.hyper.ai/lyTcc
The dataset contains 230,000 audio files capturing sounds from more than 5,500 species contributed by more than 27,000 recorders worldwide. The dataset contains the sounds of birds, mammals, insects, reptiles, and amphibians, with audio and species labels derived from observations submitted to iNaturalist.
9. MINT-1T Text-Image Pair Multimodal Dataset
Publishing Agency: University of Washington, Stanford University, Salesforce Research, etc.
Download address: https://go.hyper.ai/kROfu
The dataset contains 1 trillion text tokens and 3.4 billion images, making it 10 times larger than the previous largest open source dataset. It includes not only HTML documents but also PDF documents and ArXiv papers, and this diversity significantly improves its coverage of scientific documents.
10. AudioSetCaps Audio Caption Dataset
Publishing Agency: Northwestern Polytechnical University, Xi'an Lianfeng Acoustic Technology Co., Ltd., Nanyang Technological University, Institute of Acoustics, Chinese Academy of Sciences, etc.
Download address: https://go.hyper.ai/rTKdU
AudioSetCaps is an audio-caption dataset built from AudioSet, YouTube-8M, and VGGSound, containing 6,117,099 10-second audio files. Each audio file comes with a descriptive caption and 3 Q&A pairs, which served as metadata for generating the final caption (18,414,789 Q&A pairs in total).
The above are the NeurIPS 2024 datasets compiled by HyperAI. If you have resources you would like to see included on the hyper.ai official website, you are welcome to leave a message or send us a submission!
About HyperAI
HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure for data science in China and to providing rich, high-quality public resources for domestic developers. So far, we have:
* Provided domestic accelerated download nodes for 1300+ public datasets
* Collected 400+ classic and popular online tutorials
* Interpreted 200+ AI4Science paper cases
* Supported search for 500+ related terms
* Hosted the first complete Chinese documentation for Apache TVM in China
Visit the official website to start your learning journey: