MIT Team Uses Large-scale Model to Screen 25 Types of Cement Clinker Alternative Materials, Equivalent to Reducing Greenhouse Gas Emissions by 1.2 Billion Tons

Cement production is one of the world's major sources of greenhouse gas (GHG) emissions, accounting for more than 6% of global anthropogenic GHG emissions. This environmental burden comes mainly from the production of cement clinker, which involves calcining limestone (CaCO₃ → CaO + CO₂) at high temperatures (>950°C) and consumes large amounts of energy. According to the MIT team's paper, as global infrastructure needs grow and populations expand, cement production is expected to increase by another 20% by 2050, further exacerbating environmental pressures.
Traditional cement clinker replacement strategies rely primarily on fly ash (a byproduct of coal combustion) and granulated blast furnace slag (a byproduct of steel production), which can replace up to 50% of clinker mass while maintaining mechanical properties, theoretically reducing GHG intensity by 50%. But over the past two decades, their supply as a share of total cement production has fallen from 25% to 17%, owing to reduced coal-fired power generation and increased steel recycling. Emerging alternative materials such as biomass ash, waste glass powder, and municipal solid waste incineration ash have potential, but they suffer from unstable reactivity and seasonal fluctuations in supply. More sustainable and stable alternative materials are therefore urgently needed.
To systematically identify more feasible alternative materials, Soroush Mahjoubi, Elsa A. Olivetti and colleagues at the Massachusetts Institute of Technology (MIT) proposed a multi-source data integration method. The method uses a large language model (LLM) to extract the chemical composition of 14,000 materials from 88,000 papers, then uses a multi-head neural network to predict each material's reactivity (heat release, calcium hydroxide (Ca(OH)₂) consumption, and bound water), building a unified reactivity evaluation framework. For the first time, the reactivity of more than 50,000 natural and industrial by-product materials worldwide was identified and quantified, and 25 natural rock types with the potential to replace cement clinker were selected. The study found that secondary and natural materials such as construction and demolition waste, incineration ash, and volcanic rock are highly reactive and could replace approximately 50% of global clinker usage, equivalent to reducing greenhouse gas emissions by 1.2 billion tons.
The related research was published in Communications Materials under the title “Data-driven material screening of secondary and natural cementitious precursors”.
Research highlights
* Proposes a multi-scale reactivity modeling framework that integrates an LLM and a neural network to evaluate the cement reactivity of alternative materials on a uniform basis
* Builds the world's largest database of cement substitute materials, covering 14,000 materials and more than 1,200 rock types, overcoming the limitations of traditional experimental screening
* Finds 25 natural rock types to be highly reactive, supporting regional clinker substitution strategies that could significantly reduce carbon emissions in the global cement industry

More AI frontier papers:
https://go.hyper.ai/owxf6
The open source project "awesome-ai4s" brings together more than 100 AI4S paper interpretations and provides massive data sets and tools:
https://github.com/hyperai/awesome-ai4s
Dataset: Extraction of chemical composition and type information of 14,000 materials
Building a comprehensive database covering materials from multiple sources was the key to the research. First, the research team used keywords to select 4,312 core documents from 88,000 academic papers related to cement and concrete. From these, the chemical composition and type information of 14,434 materials were extracted, covering 19 predefined categories such as fly ash, slag, and natural volcanic rock. The set includes 2,028 fly ash samples and 1,346 slag samples, a significant expansion over the 725 fly ash samples and 828 slag samples in the earlier study.
To train the model, the researchers also integrated experimental data from the R³ standard test method. It includes heat release data for 1,330 samples, Ca(OH)₂ consumption data for 208 samples, and bound water data for 292 samples, covering 318 materials. It is one of the largest experimental datasets on cement substitutes currently available.
* R³ Standard Test: A standard chemical reactivity test based on chemical composition, median particle size, specific gravity, mixing ratio and amorphous/crystalline phase content
The researchers applied the trained model to the world's largest database of rock chemical compositions, which contains more than 1 million rock samples in total. They then scored and classified the reactivity of all records. Using about 160 rock samples with measured amorphous content reported in the literature, they imputed missing key properties such as amorphous content, and finally built a unified database of the reactivity of natural and secondary cementitious materials.
In terms of feature construction, the study extracted the contents of major oxides such as CaO, Al₂O₃, and SiO₂; materials in which these three oxides total more than 80% make up a relatively large share. These were combined with physical parameters such as median particle size, specific gravity, and amorphous phase content, and with process conditions such as curing temperature and age, to build a training set containing 318 materials and 1,850 data points.
Model architecture: Multi-task neural network prediction of cementitious reactivity
The paper uses a multi-head neural network architecture to predict the reactivity of materials in cement systems. The architecture is designed to predict multiple reactivity metrics simultaneously, including heat release, Ca(OH)₂ consumption, and bound water. Its advantage is that it can leverage cross-task transfer learning, improving the prediction accuracy of individual tasks by sharing underlying features.
The inputs to the model include key descriptors such as the chemical composition of the material (e.g. CaO, Al₂O₃, SiO₂, Fe₂O₃, MgO), particle size, amorphous content, and specific gravity. These descriptors are checked with SHAP (SHapley Additive exPlanations) analysis to ensure that their contribution to the reactivity prediction is reasonable. The SHAP results show that major oxides (e.g. CaO, Al₂O₃, SiO₂) are the top descriptors for reactivity prediction, while amorphous content and specific gravity also have significant effects on reactivity.
To meet the challenges of predicting multiple reactivity indicators simultaneously and handling missing values, the researchers designed an imputation-aware multi-task neural network that manages missing values in two ways. At the output end, a custom loss function computes the loss only from non-missing values. At the input end, a dual approach imputes missing values while creating a mask that marks the imputed entries, so that the network can distinguish original values from imputed ones. The architecture concatenates the input descriptors with their mask to handle imputed values. The optimized network contains 4 dense layers with ReLU activation, interspersed with dropout and batch normalization layers to alleviate overfitting. The loss weights of the different outputs are inversely proportional to the number of data points available for each indicator, balancing their contributions. Finally, Keras Tuner is used to optimize hyperparameters (such as the optimizer, learning rate, and number of layers), and early stopping is applied during training: the validation loss is monitored and the best model weights are restored to avoid overfitting.
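The two missing-value mechanisms described above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the use of column-mean imputation, and the exact way the inverse-count weights are normalized are all assumptions made for illustration, and the numbers are invented.

```python
import numpy as np

def impute_with_mask(X):
    """Mean-impute NaNs column-wise and return [filled values | mask] (assumed scheme)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)                      # True where a descriptor was measured
    col_means = np.nanmean(X, axis=0)            # mean of the observed values per column
    X_filled = np.where(observed, X, col_means)  # replace gaps with the column mean
    mask = (~observed).astype(float)             # 1.0 marks an imputed entry
    return np.concatenate([X_filled, mask], axis=1)

def masked_weighted_mse(y_true, y_pred):
    """MSE over non-missing targets; each output is weighted by 1 / (its label count)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    available = ~np.isnan(y_true)                # targets that were actually measured
    counts = np.maximum(available.sum(axis=0), 1)
    sq_err = np.where(available, (y_true - y_pred) ** 2, 0.0)
    per_output_mse = sq_err.sum(axis=0) / counts
    weights = 1.0 / counts
    weights = weights / weights.sum()            # normalize so the weights sum to 1
    return float((weights * per_output_mse).sum())

# Hypothetical descriptors: [CaO wt%, specific gravity], one gravity value missing.
X = [[40.0, np.nan], [20.0, 2.5], [30.0, 2.7]]
features = impute_with_mask(X)

# Hypothetical targets: [heat release, Ca(OH)2 consumption], the latter mostly missing.
y_true = [[300.0, np.nan], [100.0, 40.0], [200.0, np.nan]]
y_pred = [[310.0, 50.0], [90.0, 42.0], [200.0, 10.0]]
loss = masked_weighted_mse(y_true, y_pred)
```

In this sketch the sparsely labeled output receives the larger weight, which is one way to keep the densely labeled heat-release task from dominating training.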
LLM-based material mining and reactivity research and evaluation
The model can accurately predict the reactivity of materials in cement systems without physical laboratory testing, greatly accelerating material discovery and screening and offering a new way to reduce greenhouse gas emissions from cement production. The study also confirmed the potential of alternative materials to reduce clinker use. Its finding that raising the amorphous content increases a material's reactivity provides important guidance for future material design.
Literature mining and precursor analysis based on LLM
Using the chemical compositions extracted by the fine-tuned LLM, the researchers drew a CaO–Al₂O₃–SiO₂ ternary diagram. As shown in the figure below, among the samples whose CaO, Al₂O₃ and SiO₂ contents total more than 80 wt%, most samples, apart from tailings and a small amount of cement, are characterized by low Al₂O₃, high CaO and low SiO₂. Among them, 56% contain 15–70 wt% CaO, 73% contain 15–70 wt% SiO₂, and 70.5% contain less than 15 wt% Al₂O₃. Almost 94.5% of the samples contain 0–15 wt% Fe₂O₃, and 95% contain less than 10 wt% MgO. Compared with previous studies, the researchers added 2,028 fly ash samples and 1,346 slag samples, and included new material types such as natural volcanic ash, biomass ash and tailings. The previous study divided 7,490 materials into 11 categories, while this study expands that to 12,898 materials and 19 categories.
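As a small illustration of how a sample lands on such a ternary diagram, the sketch below applies the 80 wt% filter and normalizes the three oxides so they sum to 1. The function name and the sample compositions are hypothetical, and the paper's actual plotting procedure may differ.

```python
def ternary_coordinates(cao, al2o3, sio2, min_total=80.0):
    """Return normalized (CaO, Al2O3, SiO2) fractions, or None if the
    three oxides total no more than min_total wt%."""
    total = cao + al2o3 + sio2
    if total <= min_total:
        return None                      # excluded from the ternary diagram
    return (cao / total, al2o3 / total, sio2 / total)

# Hypothetical compositions in wt%.
slag_like = ternary_coordinates(40.0, 12.0, 35.0)   # totals 87 wt%, kept
tailing_like = ternary_coordinates(5.0, 10.0, 45.0) # totals 60 wt%, dropped
```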

Another LLM identified material types and subtypes (such as copper tailings within tailings) from journal data and classified the materials into 19 predefined types and their subtypes for more refined analysis. Although chemical composition helps identify material types, it cannot directly reveal material reactivity. To explore how the composition of cementitious precursors varies, the researchers ran a t-SNE dimensionality reduction on samples whose CaO, Al₂O₃ and SiO₂ contents total more than 80 wt%, as shown in the figure below. The results show that, except for tailings, biomass ash and glass, most materials appear to form separate clusters, yet cement is not clearly separated from inert lime. This indicates that predicting reactivity from chemical composition alone has limitations.

Machine learning model building and reactive prediction
To predict material reactivity with machine learning, the researchers trained on three reactivity indicators obtained from the R³ test: heat release, Ca(OH)₂ consumption, and bound water content. The study found that heat release is linearly related to bound water, so bound water can be used to estimate heat release, enabling reactivity to be evaluated from multiple angles. In addition, compared with a support vector machine, random forest, XGBoost and a single-head neural network, the model performed better on all three indicators: the RMSE was 28.20 J/g for heat release (confidence interval 3.88 J/g), 12.17 g/100g for Ca(OH)₂ consumption (±4.25), and 1.47 g/100g for bound water (±0.45), and the prediction R² exceeded 0.85.
Through permutation feature importance analysis and SHAP interpretation, the model reveals the key determinants of reactivity, as shown in the figure below. The main oxides (CaO, Al₂O₃, SiO₂, Fe₂O₃, MgO), amorphous content and specific gravity all significantly affect reactivity. Al₂O₃ and CaCO₃ are the most critical for heat release and bound water, suggesting that they promote heat release and the formation of aluminates/ettringite and also enhance early strength. An increase in CaO reduces Ca(OH)₂ consumption because it provides a direct calcium source. Materials with low specific gravity have more hydration reaction sites. SHAP analysis also shows that, when the amorphous fraction is high, reactivity increases as the hydration age of the material increases. These results are consistent with the known laws of mineral activity and provide an interpretable, three-indicator technical basis for using machine learning to screen high-performance cementitious materials.
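To make the linear relationship and the reported metrics concrete, here is a minimal sketch that fits heat release against bound water by least squares and computes RMSE and R². The four data points are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical R3 measurements: bound water (g/100g) and heat release (J/g).
bound_water = np.array([2.0, 4.0, 6.0, 8.0])
heat_release = np.array([110.0, 205.0, 310.0, 395.0])

# Least-squares line: heat_release ~ slope * bound_water + intercept.
slope, intercept = np.polyfit(bound_water, heat_release, deg=1)
predicted = slope * bound_water + intercept

# Root-mean-square error and coefficient of determination of the fit.
residuals = heat_release - predicted
rmse = float(np.sqrt(np.mean(residuals ** 2)))
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((heat_release - heat_release.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
```

The same `rmse` and `r2` definitions apply to any of the three indicators once model predictions are available.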



(Gray bars represent chemical properties, yellow bars represent environmental descriptors, light blue bars represent physical properties, and red bars represent the mixing ratio of added materials in the paste mixture)
Reactivity assessment and utilization potential of secondary materials
The model framework quantitatively assesses the reactivity of a variety of secondary materials from their chemical composition, using imputation to estimate descriptors such as amorphous content, specific gravity and median particle size.
The researchers mapped the reactivity of the materials in terms of heat release and Ca(OH)₂ consumption, clearly distinguishing between pozzolanic materials (Ca(OH)₂ consumption > 50 g/100g), inert materials (heat release < 100 J/g), and slags that exhibit hydraulic behavior. In general, fly ash, natural pozzolans, silica fume, certain clays, glass, and tailings all exhibit pozzolanic properties, while calcium-containing wastes are almost unreactive. Slag-based materials, although less reactive, typically behave hydraulically, whereas biomass ash, construction waste, and bottom ash also show potential as pozzolanic cementitious materials, confirming that the model agrees with previous studies.
To assess the clinker replacement potential of each material accurately, the study further subdivided the materials into subtypes based on their sources and processing methods and analyzed their individual reaction characteristics. As shown in the figure below, the pozzolanic activity of Class F fly ash is stronger than that of Class C; the reactivity of slag and biomass ash varies significantly because of their diverse sources; recycled ceramics, bricks, and concrete in construction and demolition waste all show considerable pozzolanic characteristics, with the heat release of waste ceramics reaching as high as 450 J/g; and the heat release of copper and zinc tailings can reach 400 J/g, showing that mixed minerals also have potential.
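The thresholds quoted above can be turned into a simple classification rule. The sketch below is one possible reading: checking the inert threshold first, and treating everything that is neither inert nor pozzolanic as hydraulic, are assumptions of this example rather than rules stated in the paper.

```python
def classify_reactivity(heat_release_j_per_g, ch_consumption_g_per_100g):
    """Classify a material from two R3 indicators (assumed decision order)."""
    if heat_release_j_per_g < 100.0:
        return "inert"          # heat release below 100 J/g
    if ch_consumption_g_per_100g > 50.0:
        return "pozzolanic"     # consumes more than 50 g Ca(OH)2 per 100 g
    return "hydraulic"          # reactive with low Ca(OH)2 consumption (assumption)

# Hypothetical indicator values for three materials.
labels = [
    classify_reactivity(350.0, 90.0),   # fly-ash-like
    classify_reactivity(60.0, 10.0),    # limestone-like
    classify_reactivity(300.0, 20.0),   # slag-like
]
```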

Supply analysis shows that fly ash, slag and biomass together could replace 53% of global cement production (19%, 12% and 22% respectively). The study further pointed out that construction and demolition waste and municipal solid waste could also replace clinker to a large extent in most countries, at about 55% and 13% respectively. Their substitution potential is thus even greater: together the two could replace 68% of global cement production. Although some of these materials are not naturally reactive, construction and municipal solid waste still show significant clinker substitution potential through scalable, material-specific activation processes. For example, electric arc furnace processing reactivates the cement paste in recycled concrete, and co-pyrolysis of wood waste with other wastes can convert biochar into moderately reactive pozzolanic material.
Global Discovery of Natural Cementitious Precursors
The researchers used the imputation model, built on chemical composition and amorphous content data from the R³ dataset, to fill in missing descriptors, significantly improving the accuracy of reactivity prediction. The results showed that the model's average imputation error for amorphous content was only 3.0%, and the corresponding reactivity prediction error was 5.0%.
Using the predictive model to evaluate rock reactivity, the researchers screened more than 1,200 rock types and identified 50,569 natural precursors with heat release greater than 200 J/g. For 25 rock types, reactive precursors make up more than 5% of the samples. Anorthite and ignimbrite have the highest share of reactive samples, about 25%, followed by porphyry, clastic rock, and siliceous tuff. Although the share of reactive samples among extrusive volcanic rocks such as rhyolite is below 12%, these rocks contribute more reactive samples in absolute terms because they are so widely distributed. Most of the reactive samples identified fall in the pozzolanic range: about 46,700 samples are pozzolanic and about 3,800 are hydraulic. The potential for high reactivity differs among rock types. The identified natural precursors are distributed all over the world and are concentrated in areas such as seismic zones. Medium- and high-activity precursors can be used as substitutes for clinker raw materials. Although current data shows the precursors mostly in Canada, the United States and a few other countries, they are in fact found worldwide. Volcanic precursors are concentrated in Northern Europe, Asia and other regions; in North America, they are mainly located in the Appalachian Mountains and similar areas, as shown in the figure below.
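The screening logic described above (count samples per rock type, flag those predicted above 200 J/g, keep rock types where the reactive share exceeds 5%) can be sketched in a few lines. The function names and the sample records are hypothetical; the real workflow runs the trained model over more than 1 million rock records.

```python
from collections import defaultdict

def reactive_share_by_rock_type(samples, heat_threshold=200.0):
    """Share of samples per rock type whose predicted heat release exceeds the threshold."""
    totals = defaultdict(int)
    reactive = defaultdict(int)
    for rock_type, predicted_heat in samples:
        totals[rock_type] += 1
        if predicted_heat > heat_threshold:
            reactive[rock_type] += 1
    return {rock: reactive[rock] / totals[rock] for rock in totals}

def select_rock_types(shares, min_share=0.05):
    """Rock types whose reactive share exceeds min_share, in alphabetical order."""
    return sorted(rock for rock, share in shares.items() if share > min_share)

# Hypothetical (rock type, predicted heat release in J/g) records.
samples = (
    [("ignimbrite", 260.0)] + [("ignimbrite", 120.0)] * 3
    + [("rhyolite", 230.0)] + [("rhyolite", 90.0)] * 9
    + [("granite", 210.0)] + [("granite", 50.0)] * 24
)
shares = reactive_share_by_rock_type(samples)
selected = select_rock_types(shares)
```

Note that a rock type with a low reactive share can still supply many reactive samples if it is abundant, which is the point made about rhyolite above.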


Data-driven low-carbon intelligent cement era
In fact, in the field of academic research, AI technology is penetrating all links of the cement and concrete industry chain in a disruptive manner, and has achieved multi-dimensional breakthroughs in performance prediction and production optimization.
For example, Professor Wei Xiaoyong of the Department of Electronic Computing at the Hong Kong Polytechnic University and his team proposed a machine learning approach for efficiently sequestering carbon dioxide in cementitious materials. Three machine learning techniques, decision trees, random forests, and extreme gradient boosting (XGBoost), were applied to existing datasets combined with data collected from the literature. The XGBoost model was shown to perform significantly better than traditional linear regression. With the help of SHAP, cement type was studied alongside the widely recognized factors and shown to play a key role in carbonation depth; CEM II/B-LL and CEM II/B-M are two types with higher carbonation potential. The results make it possible to identify the key factors affecting CO₂ sequestration in cement and provide insights for optimizing experimental design. The work was published in a Nature partner journal under the title "Machine learning for efficient CO2 sequestration in cementitious materials: a data-driven method".
Paper address:
https://www.nature.com/articles/s44296-025-00053-z
Faced with the high cost of ultra-high performance concrete (UHPC), a research team from the Department of Materials Science and Engineering at Missouri University of Science and Technology used machine learning to optimize and predict the performance of UHPC mixtures, significantly improving efficiency and shortening development time. The results show that the random forest (RF) model outperforms the artificial neural network (ANN) model in predicting compressive strength. SHAP value analysis shows that age, fiber content and supplementary cementitious material (SCM) content have significant effects, while the chemical composition of the SCM is less important. After the chemical composition was removed, predictions from the selected input variables alone were as good as those from the full set of inputs. Only basic mixture design information is therefore needed to predict UHPC performance accurately, which reduces the amount of data to be collected as well as memory usage and processing time.
Paper address:
https://www.nature.com/articles/s41598-025-94484-2
Looking into the future, AI integrating high-throughput models and neural networks in the field of cement materials may become the core driving force for the cement industry to move towards its carbon neutrality goal by 2050. Standing at the critical point of the new materials revolution, it will open up a new intelligent and green path for infrastructure construction under the "dual carbon" goals.