Stable Material Generation Efficiency Improved by 300%! Meta FAIR Releases the Material Generation Model FlowLLM, Trained on a Dataset of More Than 45,000 Materials

Crystalline materials are a class of materials with regularly arranged atomic, ionic or molecular structures, and play an important role in industry and technology.
However, the generation and design process of crystalline materials is not simple, and usually requires the consideration of a combination of discrete and continuous variables. Discrete variables define the basic framework of the material (such as atomic type and initial lattice structure), while continuous variables allow fine-tuning and optimization within this basic framework to ultimately generate crystalline materials with specific physical and chemical properties.
With the growing interdisciplinary application of AI, how to effectively combine discrete and continuous variables in a single model to generate high-quality crystalline materials has become a core problem in the field of crystal material generation.
Although existing methods, including autoregressive large language models (LLMs) and denoising models such as denoising diffusion models and flow matching models, have achieved some success in this field, they all have their own limitations.
Specifically, LLMs perform well at modeling discrete values, especially discrete elements such as atomic types, but they struggle to describe lattice geometry and atomic positions accurately. Denoising models are better suited to continuous variables and can better preserve the equivariances of the crystal structure, but they face obstacles in modeling discrete elements such as atomic types.
To address this, Meta's FAIR lab and the University of Amsterdam jointly released the material generation model FlowLLM. This new generative model combines large language models (LLMs) with Riemannian flow matching (RFM). It improves the efficiency of generating stable materials by more than 300% and the efficiency of generating SUN materials by about 50%, while retaining the LLM's ability to accept natural language prompts.
* SUN materials refer to stable, unique, and novel materials generated by AI technology in the field of materials science. This concept was proposed by Microsoft when discussing the MatterGen model.
The related research, titled "FlowLLM: Flow Matching for Material Generation with Large Language Models as Base Distributions", has been uploaded to the preprint website arXiv and accepted by NeurIPS 2024.
Research highlights:
* FlowLLM combines LLM and RFM, effectively bridging the gap between discrete and continuous modeling, greatly improving the efficiency of generating stable, unique and novel materials
* FlowLLM significantly outperforms CD-VAE, DiffCSP, FlowMM, CrystalLLM and other models in generating novel and stable materials. Its stability rate is about 300% higher than the previous best model, and its SUN rate is about 50% higher

Paper address:
https://arxiv.org/pdf/2410.23405
The open source project "awesome-ai4s" brings together more than 100 AI4S paper interpretations and provides massive data sets and tools:
https://github.com/hyperai/awesome-ai4s
Dataset: The model is trained on the MP-20 dataset, which contains 45,231 materials
The FlowLLM model is trained on MP-20, a dataset of inorganic crystalline materials. MP-20 contains 45,231 materials and is a subset of the Materials Project consisting of structures with up to 20 atoms that are considered metastable.
The researchers first trained the LLM on its own using the MP-20 dataset, fine-tuning it with the LoRA (Low-Rank Adapters) method in PyTorch and Transformers. Then, with the weights of the fine-tuned LLM frozen, they used the LLM as the base distribution and the MP-20 dataset as the target distribution to train the RFM model.
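To make the LoRA idea concrete, here is a minimal sketch in plain Python of how a low-rank adapter modifies a frozen weight matrix. It is not the authors' training code: the function names, the tiny matrices, and the `alpha` scaling value are illustrative assumptions, and a real fine-tuning run would rely on a deep learning framework rather than nested lists.

```python
def matmul(x, y):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x[i][p] * y[p][j] for p in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, the weight seen by the adapted layer.

    w: frozen pretrained weight, shape (d_out, d_in)
    b: trainable matrix, shape (d_out, r)
    a: trainable matrix, shape (r, d_in)
    alpha: scaling hyperparameter (illustrative value, not from the paper)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def trainable_params(d_out, d_in, rank):
    """Number of adapter parameters, compared with d_out * d_in for full fine-tuning."""
    return rank * (d_out + d_in)


# Toy example: a 2x3 frozen weight with a rank-1 adapter.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0, -1.0]]
print(lora_effective_weight(w, b, a, alpha=2.0))
print(trainable_params(4096, 4096, 8), "adapter parameters vs", 4096 * 4096)
```

Only `a` and `b` would be updated during fine-tuning, which is why the adapter is cheap to train; the layer sizes in the last line are hypothetical and chosen only to show the difference in parameter counts.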
Complementary strengths: Combining LLM and RFM yields the new generative model FlowLLM
FlowLLM is a novel generative model that combines a large language model (LLM) with a Riemannian flow matching (RFM) model. It builds directly on two earlier studies.
The LLM used here comes from "Fine-Tuned Language Models Generate Stable Inorganic Materials as Text", released by Meta FAIR and New York University in February 2024. That study showed that the fine-tuned LLM (LLaMA-2 70B) generates materials predicted to be metastable at about twice the rate of the competing diffusion model CD-VAE.
Paper address:
https://arxiv.org/abs/2402.04379
The RFM component comes from FlowMM, introduced in "FlowMM: Generating Materials with Riemannian Flow Matching", released by Meta FAIR and the University of Amsterdam in June 2024. As a generative model, FlowMM is about three times more efficient than previous open source methods at finding stable materials.
Paper address:
https://arxiv.org/abs/2406.04713
As shown in the figure below, the researchers first used the fine-tuned LLM to generate an initial material representation through an unconditional query. Then, the RFM model iteratively transformed the material, updating its atomic positions and lattice parameters. It should be noted that in RFM, the atomic type remains unchanged.

The researchers point out that the two models complement each other's strengths. On the one hand, the LLM provides a good learned base distribution for RFM: the output distribution of the LLM serves as the learned base distribution of RFM, replacing the commonly used uniform base distribution. Because the LLM has been trained on material data, this learned base distribution is closer to the target distribution, which greatly simplifies integration with RFM.
* In flow models (such as RFM), the base distribution is the starting distribution from which the model generates samples. Learning the base distribution can more accurately capture the real structure and pattern of the data. Especially when dealing with complex data (such as crystal structure in material design), learning the base distribution can effectively improve the quality of generated samples and the performance of the model.
On the other hand, RFM refines the output of the LLM: because of its limited precision when handling continuous values, the LLM produces only an approximate material representation. RFM refines this approximation through iterative denoising, producing a more accurate representation.
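The two-stage idea can be illustrated with a toy sketch: start from an approximate structure (standing in for an LLM sample), then integrate a velocity field over a fixed number of Euler steps to refine the fractional coordinates, which live on a periodic unit cell (a flat torus). Everything here is an assumption made for illustration: the dictionary layout, the function names, and above all the "oracle" velocity that points at a known target. In the actual model the velocity is predicted by a trained network and the lattice parameters are refined as well.

```python
def log_map(x, y):
    """Shortest signed displacement from x to y on a circle of period 1."""
    d = (y - x) % 1.0
    return d - 1.0 if d >= 0.5 else d


def refine(structure, target_coords, n_steps=50):
    """Euler-integrate fractional coordinates from the base sample toward a target.

    The velocity used here is an oracle (it knows the target), standing in for
    a learned velocity field. Atom types are copied through unchanged.
    """
    coords = [list(site) for site in structure["frac_coords"]]
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        for i, site in enumerate(coords):
            for j in range(3):
                velocity = log_map(site[j], target_coords[i][j]) / (1.0 - t)
                # Wrap back into the unit cell after every step.
                site[j] = (site[j] + dt * velocity) % 1.0
    return {"atom_types": list(structure["atom_types"]), "frac_coords": coords}


# Hypothetical approximate sample and the ideal positions it should move to.
base = {
    "atom_types": ["Na", "Cl"],
    "frac_coords": [[0.02, 0.48, 0.51], [0.52, 0.97, 0.03]],
}
target = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.0]]

refined = refine(base, target, n_steps=50)
print(refined["atom_types"])
print(refined["frac_coords"])
```

Note how the coordinate at 0.97 reaches 0.0 by crossing the cell boundary rather than travelling backwards through the cell; respecting this periodicity is what the Riemannian treatment is for. Because the base sample already sits close to the target, each step only has to make a small correction.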
Strong results: Stable material generation efficiency increased by 300%, SUN material generation efficiency increased by 50%
To test the performance of the model, the researchers compared FlowLLM with CD-VAE (a hybrid of a variational autoencoder and a diffusion model), DiffCSP (a diffusion model), FlowMM (a Riemannian flow matching model), and CrystalLLM (a LLaMA-2 model fine-tuned on material sequences), and asked each model to generate 10,000 new structures.
In the performance comparison, the main metrics the researchers focused on were the stability rate and the SUN rate. Specifically, the stability rate is the proportion of thermodynamically stable materials among the generated materials, which is an important indicator of synthesizability; the SUN rate is the proportion of materials that are stable, unique and novel. The results are shown in the figure below:

In terms of stability and SUN rate, thermodynamically stable materials accounted for 17.82% of the materials generated by FlowLLM, and the SUN rate reached 4.92%. According to the paper, compared with the previous best model, FlowLLM improves the stability rate by 300% and the SUN rate by 50%.
The Ehull value (energy above the convex hull) is one of the important parameters for measuring the stability and synthesizability of materials. For a given structure, an Ehull value close to zero means that the material is largely stable and more likely to be obtainable in actual synthesis, whereas a higher Ehull value may indicate that the material is unstable and difficult to synthesize.
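The two metrics can be written down in a few lines. The sketch below assumes each generated structure has already been assigned an Ehull value and two precomputed flags, `unique` and `novel`; these field names are hypothetical, and in practice both the energies and the flags come from separate, expensive steps (energy calculations and structure comparison) that are not shown. Treating `e_hull <= threshold` as stable is also an assumption about how the boundary case is handled.

```python
def generation_rates(records, threshold=0.0):
    """Return (stability_rate, sun_rate) for a list of generated structures.

    Each record is a dict with the hypothetical keys:
      "e_hull": energy above the hull, in eV/atom
      "unique": True if not a duplicate of another generated structure
      "novel":  True if not already present in the training data
    """
    if not records:
        raise ValueError("no generated structures to evaluate")
    stable = [r for r in records if r["e_hull"] <= threshold]
    sun = [r for r in stable if r["unique"] and r["novel"]]
    return len(stable) / len(records), len(sun) / len(records)


# Five made-up structures, purely for illustration.
records = [
    {"e_hull": -0.02, "unique": True, "novel": True},
    {"e_hull": 0.00, "unique": True, "novel": False},
    {"e_hull": 0.08, "unique": True, "novel": True},
    {"e_hull": 0.15, "unique": True, "novel": True},
    {"e_hull": -0.01, "unique": False, "novel": True},
]
stability_rate, sun_rate = generation_rates(records)
print(stability_rate, sun_rate)  # 0.6 0.2
```

If the reported rates are taken over the 10,000 structures generated per model, a stability rate of 17.82% corresponds to 1,782 stable structures and a SUN rate of 4.92% to 492 SUN structures.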
To further test the stability and synthesizability of the materials generated by FlowLLM, the researchers compared their Ehull values with those of materials generated by existing models, as shown in the figure below. The dotted line represents the thermodynamic stability threshold (Ehull = 0), red represents FlowLLM, and blue represents CD-VAE, DiffCSP and FlowMM.
It can be seen that FlowLLM generates more materials with low Ehull values than the other models, which indicates that its materials are more stable and more synthesizable.

In addition, the researchers evaluated the N-ary value of the generated materials, which refers to the number of different element types in a material. The higher the N-ary value, the more complex the material and the more difficult it is to synthesize. As shown in the figure below, the researchers compared the N-ary distributions produced by the different models. The results show that FlowMM and FlowLLM match the data distribution more closely than the diffusion models do. This means that FlowMM and FlowLLM fit material data better and can better capture the intrinsic structure and distribution characteristics of the materials.
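As a small illustration of the definition, the N-ary value of a material is simply the number of distinct elements it contains, and the distribution over a set of materials is a normalized histogram of those values. The function names and the example compositions below are made up for illustration and are not taken from the paper.

```python
from collections import Counter


def n_ary(atom_types):
    """Number of distinct element types in one material."""
    return len(set(atom_types))


def n_ary_distribution(materials):
    """Fraction of materials at each N-ary value, keyed by N."""
    counts = Counter(n_ary(atoms) for atoms in materials)
    total = len(materials)
    return {n: counts[n] / total for n in sorted(counts)}


# Hypothetical generated materials, each given as its list of atom types.
generated = [
    ["Si", "Si"],                                # unary
    ["Na", "Cl"],                                # binary
    ["Ba", "Ti", "O", "O", "O"],                 # ternary
    ["Li", "Fe", "P", "O", "O", "O", "O"],       # quaternary
]
print(n_ary_distribution(generated))
```

Comparing such a histogram for generated materials against the one for the training data is one simple way to see how closely a model follows the data distribution.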

Finally, the researchers also compared the number of RFM integration steps the model requires. As shown in the figure below, whereas diffusion and flow matching models need hundreds or thousands of integration steps, FlowLLM is able to converge in as few as 50 steps.

A Hundred Schools of Thought in the Field of Crystal Material Generation
In the field of materials science research, Meta's FAIR laboratory has recently entered a period of high productivity. Just a few weeks ago, the OMat24 dataset was released, which contains more than 110 million DFT calculation results focusing on structural and compositional diversity, providing new high-quality "raw materials" for model training.
In fact, in the field of crystal material generation, in addition to the LLM and denoising models mentioned in this article, there are several other methods, such as material generation based on generative adversarial networks (GANs), material generation based on variational autoencoders (VAEs), material generation based on graph neural networks (GNNs), and so on.
In 2018, Université Paris-Est and Sorbonne University combined two cross-domain GAN modules to propose CrystalGAN. Notably, CrystalGAN has been applied to the discovery of hydrogen storage materials, demonstrating its effectiveness on real challenges in chemistry and materials science.
The related research was published in ICLR 2019 under the title “CrystalGAN: Learning to Discover Crystallographic Structures with Generative Adversarial Networks”.
Paper address:
https://openreview.net/pdf?id=SyEGUi05Km
In 2021, MIT's Computer Science and Artificial Intelligence Laboratory proposed CD-VAE. It captures the physical inductive bias of material stability by learning the data distribution of stable materials. The related research was published at ICLR 2022 under the title "Crystal Diffusion Variational Autoencoder for Periodic Material Generation".
Paper address:
https://openreview.net/forum?id=03RLpj-tc_
In 2023, Chulalongkorn University in Thailand and the Thailand Center of Excellence in Physics released DP-CDVAE based on the research of CD-VAE. While maintaining comparable performance to CD-VAE, DP-CDVAE demonstrates significant advantages in terms of energy accuracy, generation performance, and lattice generation quality.
The related research was published in Scientific Reports under the title "Diffusion probabilistic models enhance variational autoencoder for crystal structure generative modeling".
Paper address:
https://www.nature.com/articles/s41598-024-51400-4
In 2023, Google DeepMind's materials team released GNoME, a graph neural network model for materials exploration. In a short period of time, it discovered 2.2 million new crystals (equivalent to nearly 800 years of knowledge accumulated by human scientists), of which 380,000 have stable structures, making them the candidates most likely to be synthesized experimentally and put into use.
This year, researchers from Tohoku University and MIT also proposed the GNNOpt model based on the GNN method. The model successfully identified 246 materials with a solar energy conversion efficiency exceeding 32% and 296 quantum materials with high quantum weight, greatly accelerating the discovery of energy and quantum materials.
The research in this area extends well beyond these examples. In the field of crystal material generation, we are witnessing a flourishing scene of "a hundred schools of thought contending". As the research deepens, we have reason to believe that these innovative methods and theories will provide key solutions to global challenges in areas such as energy, environment and health.
