From 9,874 Papers to 15,000 Crystal Structures: MOF-ChemUnity Maps the Full MOF Knowledge Landscape, Moving Materials Discovery Into the Era of "Interpretable AI"

2 months ago

In the field of materials science, metal-organic frameworks (MOFs) are considered scientists' "Swiss Army knife": they possess high specific surface area, chemical tunability, and structural diversity, and have wide applications in gas separation and storage, catalysis, and sensing. However, for researchers, the world of MOFs is extremely vast and complex—more than 125,000 MOF frameworks have been synthesized to date, and millions of possible structures have been calculated and predicted.

Although artificial intelligence (AI) has profoundly changed MOF research, most existing methods remain limited in scope: they focus on extracting a single property or rely on static datasets that do not scale easily. Even large-scale text-mining datasets emphasize extracting properties from the literature rather than establishing robust links to crystal structures. A major obstacle to such unification is the lack of standardized naming conventions. For example, the same compound might be called "HKUST-1" in the literature, labeled "Compound 1" in one article, and registered as "FIQCEN" in the Cambridge Structural Database (CSD). This inconsistency is not unique to MOFs; it is pervasive in materials science, and it makes matching data across sources difficult for humans and large language models (LLMs) alike.

Against this backdrop, a research team from the University of Toronto and the Clean Energy Innovation Research Centre of the National Research Council of Canada proposed MOF-ChemUnity: a structured, scalable, and extensible knowledge graph. The method uses LLMs to establish a reliable one-to-one mapping between the MOF names and synonyms used in the literature and the crystal structures registered in the CSD, thereby disambiguating names, synonyms, and structures. In its current version, MOF-ChemUnity integrates approximately 10,000 scientific articles and more than 15,000 CSD crystal structures together with their computed chemical properties, all in a machine-actionable format. When used as a knowledge source to augment an LLM, MOF-ChemUnity lets AI assistants reason over comprehensive literature knowledge. Expert evaluations show that its accuracy, interpretability, and reliability exceed those of a standard LLM in tasks such as retrieval, structure-property relationship inference, and material recommendation.

The study, titled "MOF-ChemUnity: Literature-Informed Large Language Models for Metal–Organic Framework Research," has been published in the Journal of the American Chemical Society (ACS Publications).

Research highlights:

* MOF-ChemUnity enables cross-publication information integration and analysis by identifying and linking all designations and names to a single material entity.

* This structure allows researchers to ask high-level scientific questions and enables AI models to reason about the MOF chemical space on a factual and interpretable basis, opening up ways of interacting with the literature that go beyond reading a single article or collecting data by hand.

* For domains that face problems similar to those of MOFs, such as missing naming standards and heterogeneous data, MOF-ChemUnity provides a powerful blueprint for unifying information.

Paper address:

https://pubs.acs.org/doi/10.1021/jacs.5c11789

Datasets: Providing a comprehensive data perspective

MOF-ChemUnity's data foundation comes from two main databases, CoRE MOF 2019 and QMOF, which together contain more than 31,000 unique crystal structures. To ensure data reliability, the research team retained only entries that carry gas adsorption or band structure information and that have a CSD (Cambridge Structural Database) reference code, so that each structure can be traced back to its original publication.

Using text and data mining (TDM), the researchers obtained full-text articles from multiple publishers, including ACS, Elsevier, and RSC. Whether a document arrived as XML or PDF, it was converted into a standardized Markdown file so that the downstream AI models could process it efficiently.

After applying the matching workflow, the team successfully resolved 15,143 MOF crystal structures (93%) and linked them to the names and synonyms used in 9,874 publications. More importantly, the research team not only matched MOF names with crystal structures but also resolved coreferences in the literature (such as "Compound 1" referring to a specific MOF). This ensures that each MOF entity has exactly one corresponding entry in the knowledge graph and lays a solid foundation for subsequent model training and information extraction.

Building on this, the research team also extracted the experimental properties, synthesis routes, and recommended applications of MOFs, producing a structured resource with more than 70,000 property data points and more than 2,500 application recommendations that gives scientists a comprehensive view of the data.

MOF-ChemUnity: A structured, scalable, and extensible knowledge graph

At its core, MOF-ChemUnity is a framework that combines LLM-based matching and extraction agents with a knowledge graph:

The first part of the workflow addresses named entity recognition, coreference resolution, and unique entity linking for MOFs. The researchers' solution is to give the LLM information derived from crystal structures so that it can match the MOF names in a paper to their CSD reference codes. This information includes the CSD reference code, lattice parameters, metal nodes, space group, molecular formula, chemical name, and known synonyms, all obtained through the CSD Python API. The LLM is instructed to find which unique MOF name in the paper corresponds to each given CSD reference code, ensuring a one-to-one correspondence between reference codes and MOF names within each paper. The LLM also has to find every coreference to each MOF, that is, every other label the paper uses for it. Separating name matching from coreference resolution allows the accuracy of each step to be assessed on its own and provides a reliable foundation for subsequent information extraction. (See figure below.)

LLM agent for matching and extracting MOF data
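
The matching step described above can be pictured with a minimal Python sketch. It is not the authors' code: the `StructureRecord` class, its field names, the prompt wording, and both helper functions are hypothetical, and the structure facts are typed in by hand rather than fetched through the CSD Python API. The sketch only shows the two ideas from the text: structure-derived facts go into the prompt, and the returned mapping is checked for one-to-one correspondence.

```python
from dataclasses import dataclass, field


@dataclass
class StructureRecord:
    """Crystal-structure facts handed to the LLM (field names are illustrative)."""
    refcode: str
    space_group: str
    formula: str
    metal_nodes: list[str]
    synonyms: list[str] = field(default_factory=list)


def build_matching_prompt(records: list[StructureRecord], paper_markdown: str) -> str:
    """Assemble a prompt asking which MOF name in the paper maps to each refcode."""
    lines = ["For each CSD reference code, give the single MOF name used in the paper."]
    for r in records:
        lines.append(
            f"- {r.refcode}: space group {r.space_group}, formula {r.formula}, "
            f"metals {', '.join(r.metal_nodes)}, "
            f"synonyms {', '.join(r.synonyms) or 'none'}"
        )
    lines.append("Paper text:")
    lines.append(paper_markdown)
    return "\n".join(lines)


def is_one_to_one(matches: dict[str, str]) -> bool:
    """True when no MOF name is assigned to more than one reference code."""
    names = list(matches.values())
    return len(names) == len(set(names))


# Illustrative record; values entered by hand for the example.
record = StructureRecord("FIQCEN", "Fm-3m", "Cu3(BTC)2", ["Cu"], synonyms=["HKUST-1"])
prompt = build_matching_prompt([record], "Compound 1 was synthesized ...")
```

A mapping such as `{"FIQCEN": "Compound 1"}` passes `is_one_to_one`, while a response that assigns the same name to two reference codes would be rejected and could be sent back for another attempt.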

Information extraction workflow

General workflow: The MOF names produced by the matching workflow feed a set of extraction workflows. Each workflow receives the MOF names and extracts a different kind of associated information, such as properties, recommended applications, and synthesis information.

Dedicated workflow: For complex properties (such as water stability), the Chain of Verification (CoV) method is used to improve the reliability of the extracted results and to reduce AI "hallucinations".
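
To make the Chain of Verification idea concrete, here is a minimal sketch of the general pattern: draft an answer, generate verification questions, answer them independently, then revise. The prompts, the function name, and the `LLM` callable type are assumptions made for illustration and are not taken from the paper; any function that maps a prompt string to a text response can be plugged in.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: any function mapping a prompt to text


def chain_of_verification(llm: LLM, passage: str, mof_name: str) -> str:
    """Draft a label, pose verification questions, then revise the label."""
    draft = llm(
        f"Passage: {passage}\n"
        f"Is {mof_name} water stable? Answer stable/unstable/unknown."
    )
    questions = llm(
        f"Draft answer: {draft}\n"
        "List questions that would verify this answer, one per line."
    ).splitlines()
    # Each question is answered against the passage alone, without the draft,
    # so that an error in the draft cannot leak into the verification answers.
    checks = [(q, llm(f"Passage: {passage}\nQuestion: {q}")) for q in questions if q.strip()]
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)
    return llm(
        f"Draft answer: {draft}\nVerification:\n{evidence}\nGive the final label."
    ).strip()
```

The point of the design is the third step: because the verification questions are answered without sight of the draft, a hallucinated draft label has a chance of being contradicted by the passage before the final label is issued.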

Constructing the MOF-ChemUnity Knowledge Graph

In designing MOF-ChemUnity, researchers focused on three key objectives: scalability, associativity, and queryability.

First, the knowledge graph must be scalable and extensible, able to absorb new data seamlessly as the literature and computational databases grow. Second, it must support cross-document entity resolution, so that multiple mentions of the same compound are linked correctly even when they come from different papers, nomenclatures, or databases. Third, it should support both local and global queries, enabling fine-grained lookups (such as the synthesis conditions of a single MOF) as well as broader analyses (such as identifying structure-property trends across application domains).

To achieve these goals, the research team designed a schema with distinct node and relationship types. Each MOF is represented as a MOF node, while publications, synthesis steps, properties, and application mentions are modeled as independent nodes connected to it by semantic relationships. The resulting knowledge graph contains over 40,000 nodes and 3,200,000 relationships. The complete schema, the complete knowledge graph, and individual MOF subgraphs are shown in the following figure:

Constructing heterogeneous MOF data using knowledge graphs
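
The schema idea, typed nodes joined by typed relationships, and the difference between local and global queries can be sketched in a few lines of plain Python. This is a toy in-memory structure, not the graph database or schema used by the authors; the class name, the relationship names (`RECOMMENDED_FOR`, `REPORTED_IN`), and all node contents are assumptions chosen for illustration.

```python
from collections import defaultdict


class TinyGraph:
    """Minimal labelled graph: typed nodes plus typed, directed edges."""

    def __init__(self) -> None:
        self.nodes: dict[str, dict] = {}
        self.out: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add_node(self, node_id: str, label: str, **props) -> None:
        self.nodes[node_id] = {"id": node_id, "label": label, **props}

    def add_edge(self, src: str, rel: str, dst: str) -> None:
        self.out[src].append((rel, dst))

    def local(self, mof_id: str, rel: str) -> list[dict]:
        """Local query: everything one MOF points to through a given relationship."""
        return [self.nodes[dst] for r, dst in self.out[mof_id] if r == rel]

    def mofs_with(self, rel: str, **props) -> list[str]:
        """Global query: every MOF linked via `rel` to a node matching `props`."""
        def matches(node: dict) -> bool:
            return all(node.get(k) == v for k, v in props.items())

        return sorted(
            src for src, edges in self.out.items()
            if self.nodes[src]["label"] == "MOF"
            and any(r == rel and matches(self.nodes[d]) for r, d in edges)
        )


# Illustrative content only; identifiers other than FIQCEN are placeholders.
g = TinyGraph()
g.add_node("FIQCEN", "MOF", names=["HKUST-1", "Compound 1"])
g.add_node("paper:1", "Publication", doi="placeholder-doi")
g.add_node("app:ch4", "Application", name="methane storage")
g.add_edge("FIQCEN", "REPORTED_IN", "paper:1")
g.add_edge("FIQCEN", "RECOMMENDED_FOR", "app:ch4")
```

Here `g.local("FIQCEN", "RECOMMENDED_FOR")` answers a question about a single material, while `g.mofs_with("RECOMMENDED_FOR", name="methane storage")` scans the whole graph. Because all names for a material hang off one MOF node, both queries work regardless of which name a paper used.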

Graph-Enhanced Retrieval-Augmented Generation (Graph-Enhanced RAG)

The graph-enhanced RAG system retrieves relevant information and uses it as few-shot context for general question answering. The framework also incorporates machine learning-based embeddings to identify structurally or chemically similar MOFs, thus enabling more informative question answering. The core components, the Query tool and the Neighbor Finder tool, are modular and can be invoked as needed by the AI agent.
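
A rough sketch of the retrieval half of such a system follows. It is a simplification under stated assumptions: the real Query tool works against the knowledge graph, whereas this version uses a flat list of facts; the function names, the prompt wording, and the `entity-1` identifier are hypothetical. The quoted sentence and its attribution come from the ULMOF-5 example discussed later in this article.

```python
def query_tool(facts: list[dict], aliases: dict[str, str], name: str) -> list[dict]:
    """Resolve a name or synonym to its entity id, then return that entity's facts."""
    entity = aliases.get(name.strip().lower())
    return [f for f in facts if f["entity"] == entity]


def build_prompt(question: str, hits: list[dict]) -> str:
    """Format retrieved facts as numbered, citable context for the LLM."""
    context = "\n".join(
        f'[{i}] {h["text"]} (source: {h["source"]})' for i, h in enumerate(hits, 1)
    )
    return (
        "Answer using only the numbered facts and cite them by number.\n"
        f"Facts:\n{context or '(none found)'}\n"
        f"Question: {question}"
    )


# Hypothetical entity id; aliases would in practice be scoped to the source paper.
ALIASES = {"ulmof-5": "entity-1", "compound 1": "entity-1"}
FACTS = [
    {"entity": "entity-1", "text": "compound 1 is soluble in water",
     "source": "Banerjee et al."},
]

hits = query_tool(FACTS, ALIASES, "ULMOF-5")
prompt = build_prompt("Is ULMOF-5 water stable?", hits)
```

Because lookup goes through the alias table rather than string similarity, a query for "MOF-5" returns nothing from this toy store instead of silently borrowing facts recorded for ULMOF-5.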

MOF Recommendations and Embedding Space

Using chemical and geometric descriptors (revised autocorrelations (RACs), pore volume, pore size, and so on), MOFs are projected into a low-dimensional embedding space, and similar materials are recommended with a nearest-neighbor search. This can be applied to gas adsorption, carbon capture, and other scenarios, turning human experience into a form that machine learning can use.
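
The nearest-neighbor recommendation can be illustrated with a short sketch. It deliberately skips the low-dimensional projection and searches directly in standardized descriptor space, which is a simplification of what the text describes; the function names and the descriptor values are made up for the example.

```python
import math
from statistics import mean, pstdev


def standardize(rows: list[list[float]]) -> list[list[float]]:
    """Scale each descriptor column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def nearest_neighbors(names: list[str], rows: list[list[float]],
                      query: str, k: int = 2) -> list[str]:
    """Return the k MOFs closest to `query` by Euclidean distance."""
    z = dict(zip(names, standardize(rows)))
    q = z[query]
    ranked = sorted((n for n in names if n != query), key=lambda n: math.dist(q, z[n]))
    return ranked[:k]


# Made-up descriptors: [pore volume, pore size]; names are placeholders.
names = ["A", "B", "C", "D"]
rows = [[1.0, 10.0], [1.1, 10.5], [5.0, 50.0], [5.2, 49.0]]
print(nearest_neighbors(names, rows, "A", k=1))
```

Standardizing first matters because descriptors live on very different scales; without it, the column with the largest numeric range would dominate the distance.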

Results Showcase: Scientists and AI systems can fully utilize the complete knowledge of MOFs

Using the above framework, the research team validated the system and demonstrated several applications:

Water stability prediction

Using the water stability dataset from MOF-ChemUnity, researchers trained a classifier that predicts water stability with an accuracy of 80% and an F1 score of 86% (see figure below). More importantly, since MOF-ChemUnity also includes CO₂ adsorption data from molecular simulations, researchers can screen on both properties at once to identify materials that meet both criteria.

Predicting the water stability of MOFs using machine learning
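
For readers less familiar with the two metrics, the sketch below shows how accuracy and F1 are computed from confusion-matrix counts and how a joint screen on two properties might look. The counts, the threshold, and the material records are invented for illustration and do not reproduce the paper's results; the function names are likewise hypothetical.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    return 2 * tp / (2 * tp + fp + fn)


def joint_screen(mofs: list[dict], min_co2: float) -> list[str]:
    """Keep MOFs predicted water stable whose simulated CO2 uptake meets a threshold."""
    return [m["name"] for m in mofs if m["water_stable"] and m["co2_uptake"] >= min_co2]


# Invented counts: 6 true positives, 2 false positives, 1 false negative, 1 true negative.
print(accuracy(6, 2, 1, 1), f1_score(6, 2, 1))

# Invented records; uptake units are arbitrary here.
candidates = [
    {"name": "mof-a", "water_stable": True, "co2_uptake": 4.2},
    {"name": "mof-b", "water_stable": False, "co2_uptake": 6.0},
    {"name": "mof-c", "water_stable": True, "co2_uptake": 1.1},
]
print(joint_screen(candidates, min_co2=3.0))
```

One detail worth noting: when a classifier makes at least one error, F1 exceeds accuracy exactly when true positives outnumber true negatives, so an F1 above the accuracy is what one would expect if the positive class dominates the correct predictions.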

Expert Recommendation and Verification

Experts often recommend MOFs for specific applications based on intuition, experience, or domain knowledge. This information is valuable, but it is hard to formalize or use systematically. To address this, the researchers used the links between expert recommendations and crystal structures within MOF-ChemUnity to embed MOFs into a structure-aware chemical space.

Researchers evaluated the effectiveness of this method in two applications for which computational data are available: methane storage and carbon dioxide capture. As shown in the figure below, in both applications the MOFs neighboring the expert-recommended ones (labeled as model-recommended) performed similarly to the expert-recommended materials. This indicates that once expert intuition is mapped onto the structural space, machine learning models can learn from that intuition and combine it with experimental data to make predictions.

Methane and carbon dioxide adsorption distributions for all materials in the CoRE MOF 2019 database

Assessing the strength and specificity of expert recommendations is also insightful. To this end, researchers compared the performance distribution of expert-recommended MOFs with that of their neighboring materials and that of materials randomly sampled from the entire database. For methane storage, the average CH₄ adsorption capacity of expert-recommended MOFs and of their neighbors was significantly higher than the average of the entire dataset, indicating that experts effectively selected materials with excellent methane storage performance. This is consistent with previous research suggesting that methane storage is governed mainly by intuitive geometric properties such as porosity and effective capacity under pressure swing conditions.

In contrast, for carbon dioxide capture, the performance distribution of expert-recommended MOFs is similar to that of random samples, indicating that expert intuition is less reliable for this application.
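
The three-way comparison described above can be sketched as follows. The function name, the seeded random baseline, and the uptake values are assumptions for illustration; the study compares full distributions, whereas this sketch reduces each group to its mean.

```python
import random
from statistics import mean


def group_means(uptake: dict[str, float], expert: list[str],
                neighbors: list[str], seed: int = 0) -> dict[str, float]:
    """Mean uptake of expert picks, their neighbors, and an equal-size random sample."""
    rng = random.Random(seed)  # fixed seed keeps the baseline reproducible
    baseline = rng.sample(sorted(uptake), k=len(expert))
    return {
        "expert": mean(uptake[m] for m in expert),
        "neighbors": mean(uptake[m] for m in neighbors),
        "random": mean(uptake[m] for m in baseline),
    }


# Invented uptake values for ten placeholder materials m0..m9.
uptake = {f"m{i}": float(i) for i in range(10)}
result = group_means(uptake, expert=["m8", "m9"], neighbors=["m6", "m7"])
```

If expert and neighbor means sit well above the random baseline, as reported for methane storage, the recommendations carry real signal; if all three are close, as reported for carbon dioxide capture, they do not separate good materials from average ones.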

Literature AI Assistant Application

Banerjee et al. synthesized a lithium-based MOF called Ultralight MOF (ULMOF-5), which they referred to as "Compound 1" in their paper. When a standard LLM is asked about the water stability of ULMOF-5, it gives a hallucinated answer, confusing the material with the similarly named but unrelated Zn-based MOF-5. In contrast, MOF-ChemUnity associates all coreferences with the correct crystal structure and captures the water stability label ("unstable") implied by the sentence "compound 1 is soluble in water" in the paper. The system proposed in this study can retrieve this information and provide a well-founded answer with citations and explanations, thus improving accuracy and transparency.

To further evaluate the system, researchers compared the responses of the graph-enhanced RAG and the baseline LLM (GPT-4o) on three tasks: fact retrieval, structure-property inference, and material recommendation. Nine MOF experts evaluated the quality and credibility of the responses in a blinded survey. Figure c below shows that the graph-enhanced assistant scored higher across all tasks. Experts particularly valued cited literature, specific examples, and verifiable assertions, while the baseline model's responses were often general, unsubstantiated, or unverifiable. This suggests that integrating structured scientific knowledge into LLMs can improve factual reliability and user trust.

Knowledge-graph-based RAG as a literature-informed AI assistant

MOF-ChemUnity can be extended to other material categories

The significance of MOF-ChemUnity extends far beyond integrating existing MOF data; it offers materials science a cross-disciplinary, scalable paradigm for data management and analysis. In recent years, with the rapid development of research on covalent organic frameworks, zeolites, polymers, and porous materials, materials data have become highly heterogeneous and inconsistently named, making cross-document and cross-database integration a bottleneck for scientific discovery. Against this backdrop, the knowledge graph framework established by MOF-ChemUnity provides a blueprint for these material categories: with unified entity resolution, core relationship annotation, and property extraction, data from different sources can be linked and managed systematically even in fields that lack standardized naming or whose data formats differ widely.

Many other teams are working on similar projects. A wealth of scientific findings has accumulated in the vast body of materials science literature, but the knowledge scattered through these documents as text is typically collected and analyzed by hand, a process that is time-consuming and struggles to guarantee completeness. If the information in these documents is represented as structured knowledge and then combined with knowledge linking, fusion, and reasoning to construct a materials knowledge graph, researchers can acquire information accurately and efficiently.

Professor Pan Feng's research group at the School of New Materials, Peking University Shenzhen Graduate School, has been dedicated to constructing materials knowledge graphs and solving key scientific and technical challenges in recent years. They have developed a high-precision and efficient framework for name-based disambiguation and information search, constructing a materials knowledge graph called MatKG. Building on this foundation, in 2022, the group proposed a semantic representation framework that enables the embedding of materials science knowledge. This framework improves the representation quality of materials entities through multi-source information fusion, allowing for accurate mining of lithium-ion battery cathode material entities from materials science literature and the construction of a cathode material knowledge graph to predict high-performance lithium battery materials.
Paper title: Automating Materials Exploration with a Semantic Knowledge Graph for Li-ion Battery Cathodes
Paper address: https://advanced.onlinelibrary.wiley.com/doi/abs/10.1002/adfm.202201437

In addition, as standardized formats such as the IUPAC Adsorption Information File (AIF) are introduced, MOF-ChemUnity's design allows new standards to be integrated seamlessly, keeping the data unified, traceable, and interpretable. Both new literature reports and computational simulation data can therefore be incorporated easily, so the dataset can be expanded and updated iteratively. This capacity for sustained updating provides a solid foundation for high-throughput, multi-target material screening, aligns with current trends in materials genome initiatives and the FAIR data principles, and gives researchers a reproducible and verifiable analytical framework.

Looking ahead, MOF-ChemUnity also has potential as a scientific assistant. Through natural language interaction and graph query tools, researchers can ask complex questions, such as "Which MOFs suitable for pollutant removal in aquatic environments possess both high stability and specific metal nodes?", and the system can provide verifiable answers based on literature, experimental, and computational data. This approach, which integrates knowledge graphs and LLMs, sets a new benchmark for AI applications in materials science research.

References:
1. https://pubs.acs.org/doi/10.1021/jacs.5c11789

2. https://advanced.onlinelibrary.wiley.com/doi/abs/10.1002/adfm.202201437

3. https://news.pku.edu.cn/jxky/64f28e5b50074113bfaec41af68c1971.htm