HyperAI

Meta Releases Open Source OMat24 Dataset, Including 110 Million DFT Calculation Results


As the global demand for renewable energy grows, energy storage technology, as a solution that can store energy and release it when needed, is gaining more and more attention. However, many renewable energy storage technologies have high initial investment costs and are difficult to operate and maintain, and are currently still in the research and development or demonstration stage.

In view of this, in 2020 the Facebook Artificial Intelligence Research lab (FAIR, as it was then known) and Carnegie Mellon University jointly launched the Open Catalyst Project, with the goal of using AI to discover new catalysts for renewable energy storage. Alongside the project, the research team released the catalyst simulation dataset OC20.

OC20 dataset download address:
https://go.hyper.ai/dYeNS
In 2022, the research team expanded the OC20 dataset and released the Open Catalyst 2022 (OC22) dataset, enabling more accurate model training.
OC22 dataset download address:
https://go.hyper.ai/9FhFL

Recently, Meta made another breakthrough in materials science, releasing the Open Materials 2024 (OMat24) large-scale open source dataset together with a set of supporting pre-trained models. The OMat24 dataset contains more than 110 million density functional theory (DFT) calculations focused on structural and compositional diversity. The pre-trained models are based on the EquiformerV2 (eqV2) architecture; the eqV2-M model achieves state-of-the-art performance on the Matbench Discovery leaderboard, predicting ground-state stability and formation energy and setting a new benchmark for material stability prediction.

Research highlights:
* The OMat24 dataset is built on the basis of open source datasets such as MPtrj, Materials Project, and Alexandria. The elements contained in the dataset cover almost the entire periodic table. 

* The pre-trained models are available in three sizes: eqV2-S, eqV2-M, and eqV2-L. The eqV2-M model has an F1 score of 0.916 on the Matbench Discovery leaderboard, with a mean absolute error of only 20 meV/atom.


Paper address:
https://arxiv.org/pdf/2410.12771

OMat24 dataset download address:
https://go.hyper.ai/gALHP

The open source project "awesome-ai4s" brings together more than 100 AI4S paper interpretations and provides massive data sets and tools:

https://github.com/hyperai/awesome-ai4s

The OMat24 dataset contains more than 110 million DFT calculation results covering different atomic configurations.

The OMat24 dataset is one of the largest open source datasets currently available for training DFT surrogate models for materials. It consists of DFT single-point calculations, structural relaxations, and molecular dynamics trajectories for a range of inorganic bulk materials. In total, the researchers computed about 118 million structures annotated with total energy, force norms, and unit cell stress, using more than 400 million core-hours of computing resources.

These structures were generated by three techniques: Boltzmann sampling of rattled structures, ab initio molecular dynamics (AIMD), and relaxations of rattled structures.

Overview of OMat24 dataset generation, application areas, and sampling strategies
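The rattling-plus-Boltzmann-sampling step above can be sketched in a few lines: perturb atomic positions with Gaussian noise, then accept candidates with a Boltzmann weight on the energy change. This is a minimal illustration; the toy pairwise energy function, the temperature, and the noise scale are all assumptions, and in OMat24 the energies come from DFT rather than any analytic formula.

```python
import numpy as np

rng = np.random.default_rng(0)
K_B = 8.617e-5  # Boltzmann constant in eV/K

def toy_energy(positions):
    # Placeholder energy: pairwise inverse-distance repulsion.
    # Purely illustrative; OMat24 labels come from DFT.
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(positions), k=1)
    return float(np.sum(1.0 / dist[iu]))

def rattle(positions, sigma=0.05):
    """Perturb atomic positions with Gaussian noise (a 'rattled' structure)."""
    return positions + rng.normal(0.0, sigma, positions.shape)

def boltzmann_sample(positions, n_candidates=50, temperature=1000.0, sigma=0.05):
    """Generate rattled candidates and accept them with Boltzmann weight
    exp(-dE / kT) relative to the unperturbed structure."""
    e0 = toy_energy(positions)
    accepted = []
    for _ in range(n_candidates):
        cand = rattle(positions, sigma)
        d_e = toy_energy(cand) - e0
        if rng.random() < np.exp(-max(d_e, 0.0) / (K_B * temperature)):
            accepted.append(cand)
    return accepted

atoms = rng.uniform(0.0, 5.0, size=(8, 3))  # 8 atoms in a 5 Å box
samples = boltzmann_sample(atoms)
```

Candidates that lower the energy are always kept, while higher-energy candidates survive with probability decreasing in ΔE, so the sampled set stays biased toward thermally plausible configurations.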

The OMat24 dataset has a wide range of energy, force, and stress distributions. The following figure shows the distribution of total energy (in eV/atom), force (in eV/Å), and stress (in GPa) labels for the OMat24, MPtrj, and Alexandria datasets.

* The MPtrj dataset (Materials Project Trajectory Dataset) contains DFT calculation results of more than 1.5 million inorganic structures. Due to its large scale and diversity, it has important application value in the fields of materials science and computational materials science. 

* The Alexandria dataset is an open database of DFT calculations for periodic inorganic compounds, providing a large amount of computed property data for force field development and for density functional development and evaluation.

The orange dotted line represents the MPtrj dataset, the blue dotted line represents the Alexandria dataset, and the green solid line represents the OMat24 dataset.

It can be seen that the energy distribution of the OMat24 dataset extends slightly beyond that of the Alexandria dataset (from which its input structures were drawn) and well beyond that of the MPtrj dataset; the force and unit cell stress distributions of OMat24 are much broader than those of both MPtrj and Alexandria.

It is worth mentioning that the elements included in the OMat24 dataset cover almost the entire periodic table, as shown in the following figure:

Distribution of elements in the OMat24 dataset

Although the OMat24 dataset surpasses other datasets in scale and diversity, the researchers also point out its limitations. The dataset is based on DFT calculations at the PBE and PBE+U levels, and it contains only periodic bulk structures, leaving out the important effects of point defects, surfaces, non-stoichiometry, and low-dimensional structures. The calculations therefore carry inherent approximation errors, some of which are mitigated by higher-level functionals.

As shown in the figure below, the researchers compared the calculation results in the WBM dataset with single-point calculations using the OMat24 DFT settings and found a mean absolute error of 52.25 meV/atom between the two.
* The WBM dataset is a large-scale computational materials database that contains the electronic structure and thermodynamic properties of a large number of materials calculated using DFT, such as formation energy, entropy change, specific heat capacity, etc.

Comparison of WBM dataset calculations with single-point calculations under the OMat24 DFT settings

Using EquiformerV2 as the model architecture, model training is performed on three major datasets

The researchers used the OMat24 dataset along with the MPtrj and Alexandria datasets to train the model. Since the Alexandria dataset contains structures similar to those in the WBM dataset used for testing, the researchers subsampled the Alexandria dataset for training to ensure there is no overlap (data leakage) between the training and test data.

First, the researchers removed all structures matching the WBM initial or relaxed structures, creating a new Alexandria subset (sAlexandria). To further reduce the dataset, they removed structures with total energy > 0 eV, force norm > 50 eV/Å, or stress > 80 GPa. Finally, within the remaining trajectories, only structures whose energies differed by more than 10 meV/atom were sampled. The resulting training and validation sets contained 10 million and 500,000 structures, respectively.
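The pruning steps above amount to a threshold filter followed by an energy-gap subsampler along each trajectory. The sketch below illustrates the logic; the record field names (`energy`, `force_norm`, `stress`) are illustrative assumptions, not the actual OMat24 schema.

```python
def filter_structures(structures,
                      max_energy=0.0,    # eV: drop total energy > 0 eV
                      max_force=50.0,    # eV/Å: drop force norm > 50
                      max_stress=80.0):  # GPa: drop stress > 80
    """Apply the pruning thresholds described above to a list of records.
    Each record is assumed to be a dict with 'energy', 'force_norm',
    and 'stress' entries (hypothetical field names)."""
    return [s for s in structures
            if s["energy"] <= max_energy
            and s["force_norm"] <= max_force
            and s["stress"] <= max_stress]

def sample_trajectory(energies, min_gap=0.010):
    """Walk a trajectory's per-frame energies (eV/atom) and keep a frame
    only when it differs from the last kept frame by > 10 meV/atom."""
    kept = [0]
    for i in range(1, len(energies)):
        if abs(energies[i] - energies[kept[-1]]) > min_gap:
            kept.append(i)
    return kept
```

For example, `sample_trajectory([-1.0, -1.005, -1.02, -1.021])` keeps frames 0 and 2: the 5 meV/atom step is discarded as redundant, while the 20 meV/atom step is retained.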

For the model architecture, the researchers chose EquiformerV2, which is currently the best performing model on the OC20, OC22, and ODAC23 leaderboards.

For model training, the researchers explored three strategies:

* EquiformerV2 models are trained only on the OMat24 dataset, with and without denoising augmentation objectives. These models are most physically meaningful as they are only fit to datasets containing significant updates to the underlying pseudopotentials relative to the old Materials Project setup.

* EquiformerV2 models trained only on the MPtrj dataset, with and without the denoising augmentation objective, can be used for direct comparison with the Matbench Discovery leaderboard (marked as compliant models).

* EquiformerV2 models pre-trained on OMat24 or OC20 and then fine-tuned on the MPtrj or sAlexandria datasets, yielding the best performing models on the Matbench Discovery leaderboard (marked as non-compliant models).
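The third strategy follows the standard pretrain-then-fine-tune pattern: load the pre-trained weights, then continue training on the target dataset with a reduced learning rate. The sketch below shows this pattern in generic PyTorch; the tiny stand-in network, the checkpoint filename, and the random toy data are all assumptions — the actual models are EquiformerV2 networks trained with Meta's released code.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained interatomic potential (not EquiformerV2).
model = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 1))

# In practice you would load a released OMat24 checkpoint here, e.g.:
# model.load_state_dict(torch.load("eqV2_omat24.pt"))  # hypothetical filename

# Fine-tune on the target dataset (MPtrj / sAlexandria) with a small
# learning rate so the pre-trained features are not destroyed.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.L1Loss()  # energy MAE is the headline metric

x = torch.randn(32, 16)  # toy descriptors standing in for structures
y = torch.randn(32, 1)   # toy energy labels

for _ in range(5):       # a few fine-tuning steps
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

The low learning rate is the key design choice: fine-tuning nudges the pre-trained representation toward the target labels instead of overwriting it.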

The following table shows the total number of parameters and inference throughput of the model trained based on the EquiformerV2 architecture and the models of different specifications:

Different specifications for model training

The model trained based on EquiformerV2 performs best in the Matbench-Discovery ranking

The researchers evaluated the EquiformerV2 models on the Matbench-Discovery benchmark, and the results showed that both the compliant models (trained only on MPtrj) and the non-compliant models (trained with additional data) performed well. The EquiformerV2 models achieved the best performance on the leaderboard (F1 score is the primary evaluation metric).

The following figure shows the performance of other non-compliant models on the Matbench-Discovery leaderboard.

Image source: Matbench-Discovery official website

The results show that the eqV2-M model has an F1 score of 0.916, a mean absolute error (MAE) of 20 meV/atom, and a root mean square error (RMSE) of 72 meV/atom, setting a new benchmark for the prediction of material stability.
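The F1 and MAE figures above come from treating stability prediction as both a classification and a regression problem: a material counts as "stable" when its predicted energy above the convex hull falls at or below a threshold, and MAE is measured on the raw energies. The sketch below shows the idea under that assumption; it is a simplified illustration, not the official Matbench Discovery scoring code.

```python
import numpy as np

def stability_metrics(e_true, e_pred, threshold=0.0):
    """Score stability prediction: F1 on the stable/unstable labels
    (stable = energy above hull <= threshold, in eV/atom) and MAE on
    the raw energies. A simplified sketch of the benchmark's metrics."""
    e_true, e_pred = np.asarray(e_true), np.asarray(e_pred)
    true_pos = np.sum((e_pred <= threshold) & (e_true <= threshold))
    pred_pos = np.sum(e_pred <= threshold)
    actual_pos = np.sum(e_true <= threshold)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    mae = float(np.mean(np.abs(e_pred - e_true)))
    return f1, mae

# Toy example with four hypothetical materials (energies in eV/atom):
f1, mae = stability_metrics([-0.05, 0.02, 0.10, -0.01],
                            [-0.04, -0.01, 0.12, 0.03])
```

Note that a model can have a small MAE yet a mediocre F1 when many materials sit near the stability threshold, which is why the leaderboard reports both.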

In addition, the EquiformerV2 model trained only on the MPtrj dataset also performs well, thanks to effective data augmentation strategies such as denoising non-equilibrium structures (DeNS). As the results above show, models pre-trained on the OMat24 dataset outperform previous models in accuracy, especially when dealing with non-equilibrium configurations.

Open source becomes an accelerator for the integration of materials science and AI

In today's data-driven era, AI is reshaping the research paradigm of materials science with its unprecedented speed and precision. In particular, open source AI knowledge, tools, and data around materials science give more researchers, developers, and even enthusiasts the opportunity to participate in the innovation process and work together to advance the development of materials science.

Regarding the release of the OMat24 open source dataset and its models, Max Welling, a machine learning expert and chief scientist at Microsoft Research, said on social media, "I am particularly excited about the new OMat24 dataset, which has spawned a new SOTA-level machine learning force field foundation model."

In fact, as early as 2011, Lawrence Berkeley National Laboratory (LBNL) in the United States released the Materials Project. This database contains a large amount of computational data on inorganic materials, such as crystal structures, electronic structures, and thermodynamic properties, and has become an important data resource for materials science research.
Paper address:
https://go.hyper.ai/KExvK

Materials Project dataset download address:

https://go.hyper.ai/BOQS0

Later, in 2013, Northwestern University in the United States released the open source quantum materials dataset OQMD. It contains calculated thermodynamic and structural properties for 1,226,781 materials and is widely used for high-throughput DFT analysis across a variety of material applications.
Paper address:
https://www.nature.com/articles/npjcompumats201510

OQMD dataset download address:
https://go.hyper.ai/X4fE5

In 2018, the Massachusetts Institute of Technology (MIT) released the CGCNN model. Widely used in materials science, this model employs graph neural networks to predict material properties such as the band gap, magnetism, and thermodynamic stability of crystalline materials.
Paper address:
https://arxiv.org/pdf/1710.10324

In 2020, the National Institute of Standards and Technology (NIST) released the JARVIS open source platform, which focuses on predicting material properties and electronic structure. JARVIS-ML, its machine learning module, provides rich datasets and ML-based material screening tools, supports DFT, molecular dynamics simulation, and machine learning, and helps researchers quickly screen and discover new materials.
Paper address:
https://arxiv.org/abs/2007.01831

In 2021, NIST released the ALIGNN model. This model improves the accuracy of material property prediction by introducing line graphs to capture the complex interactions between atoms.
Paper address:
https://www.nature.com/articles/s41524-021-00650-1

It can be seen that from high-throughput screening to automated material design, open source has become an important accelerator for promoting the integration of materials science and AI, and is leading materials science into a new era of greater intelligence and efficiency.

References:

1. https://www.marktechpost.com/2024/10/20/meta-ai-releases-metas-open-materials-2024-omat24-inorganic-materials-dataset-and-models/

2. https://www.notebookcheck.net/Meta-unveils-OMat24-AI-powered-materials-discovery-goes-open-source.904139.0.htm

3. https://www.technologyreview.com/2024/10/18/1105880/the-race-to-find-new-materials-with-ai-needs-more-data-meta-is-giving-massive-amounts-away-for-free/