MOFormer: Self-Supervised Transformer model for Metal-Organic Framework Property Prediction
Zhonglin Cao,†,§ Rishikesh Magar,†,§ Yuyang Wang,† and Amir Barati Farimani†,‡,¶

†Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
‡Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
¶Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
§Joint First Authorship

E-mail: barati@cmu.edu
Abstract
Metal-Organic Frameworks (MOFs) are materials with a high degree of porosity
that can be used for applications in energy storage, water desalination, gas storage, and
gas separation. However, the chemical space of MOFs is nearly infinite due to the large variety of possible combinations of building blocks and topologies. Discovering the optimal MOFs for specific applications requires an efficient and accurate search over an enormous number of potential candidates. Previous high-throughput screening methods using computational simulations such as DFT can be time-consuming. Such methods also require optimizing the 3D atomic structure of MOFs, which adds an extra
step when evaluating hypothetical MOFs. In this work, we propose a structure-agnostic deep learning method based on the Transformer model, named MOFormer, for property prediction of MOFs. The MOFormer takes a text string representation of a MOF (MOFid) as input, thus circumventing the need to obtain the 3D structure of a hypothetical MOF and accelerating the screening process. Furthermore, we introduce a self-supervised learning framework that pretrains the MOFormer by maximizing the cross-correlation between its structure-agnostic representations and the structure-based representations of a crystal graph convolutional neural network (CGCNN) on >400k publicly available MOF structures. Self-supervised learning allows the MOFormer to intrinsically learn 3D structural information even though it is not included in the input. Experiments show that pretraining improves the prediction accuracy of both models on various downstream prediction tasks. Furthermore, we reveal that MOFormer can be more data-efficient than the structure-based CGCNN on quantum-chemical property prediction when training data is limited. Overall, MOFormer provides a novel perspective on efficient MOF design using deep learning.
Introduction
Metal-organic frameworks (MOFs) are a type of porous crystalline material1,2 that has been extensively researched during the past several decades. Research interest has been driven by the porous structure and versatile nature of MOFs, with potential applications such as gas adsorption,3–5 water harvesting and desalination,6–8 and energy storage.9–11 MOFs typically consist of several building blocks, including metal nodes and organic linkers.4,12,13 The assembly of those building blocks following certain topologies generates the 2-dimensional or 3-dimensional porous structures of MOFs. Because of the countless possible combinations of metal nodes, organic linkers, and topologies,13,14 there is a vast number of MOFs with different mechanical properties and surface chemistries. Given the enormous variety of possible MOF structures, rapidly and inexpensively selecting the potential
top performers for each specific task can be challenging. High-throughput screening with computational tools such as molecular simulation5,15 or density functional theory (DFT)16,17 has been widely used to evaluate the properties of MOFs. Without the need to experimentally synthesize MOF structures, those computational tools accelerate the screening process and allow researchers to screen hundreds of thousands of hypothetical MOF structures for their performance in different applications.

Recently, machine learning (ML) models have become increasingly popular in the field of MOF property prediction.18–24 The advantage of the ML models over the simulation methods is their instantaneous inference of the properties of MOFs. In contrast, the simulation
methods require a computationally expensive rerun for every new MOF. In the last decade, multiple large-scale MOF datasets have been released, including CoRE MOF 2019,25 hypothetical MOFs,5 and QMOF.26 These datasets contain the atomic structures of MOFs and their properties, such as CO2 adsorption and band gap, and are large enough to train accurate data-driven ML models for the prediction of MOF properties. Handcrafted geometrical features such as the largest cavity diameter and the pore limiting diameter have been used as input to a multilayer perceptron (MLP) to predict MOF properties.19,23 Although training an MLP with a few layers can be fast, this method suffers from underwhelming accuracy due to the simplicity of the network architecture. Moreover, selecting features requires extensive domain knowledge from the researchers as well as optimized 3D structures of MOFs, making this method less generic. Given the aforementioned drawbacks, a novel method that can achieve high accuracy with a more generic MOF representation as input should be pursued.
Wang et al.27 utilized the crystal graph convolutional neural network (CGCNN)28 to predict the methane adsorption of MOFs. CGCNN is a prevalent model with an architecture designed specifically for crystalline materials. It takes the types and the 3D coordinates of the atoms in a crystalline material as input and constructs a crystal graph. CGCNN can extract features that encode rich chemical information through convolution operations on the crystal graph. However, one drawback of using CGCNN for MOF property prediction
is that it requires the optimized 3D atomic structures of MOFs, which are computationally expensive to obtain. In addition, some large MOF structures consist of hundreds or even thousands of atoms, so constructing crystal graphs for them can be memory-inefficient.
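To make the graph-construction step and its memory cost concrete, below is a minimal sketch of cutoff-based neighbor finding from atom types and coordinates. The function name and the 4 Å cutoff are illustrative assumptions; the CGCNN reference implementation additionally handles periodic images of the unit cell and expands distances in a Gaussian basis to form edge features.

```python
import numpy as np

def build_crystal_graph(atom_types, coords, cutoff=4.0):
    """Connect every pair of atoms closer than `cutoff` angstroms.

    Illustrative only: real CGCNN inputs also account for periodic
    boundary conditions and use Gaussian-expanded distances as
    edge features.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(atom_types)
    edges, distances = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = float(np.linalg.norm(coords[i] - coords[j]))
            if d < cutoff:
                edges.append((i, j))  # directed edge i -> j
                distances.append(d)
    # Node features come from atom_types (e.g., one-hot element
    # embeddings). In dense MOFs the edge list grows roughly with
    # n^2, which is the memory cost noted above.
    return edges, distances
```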
Motivated by the fact that all MOFs are combinations of metal nodes, organic linkers, and topologies, Bucior et al.29 proposed a text string representation of MOFs called MOFid. A typical MOFid has two core sections: the chemical information of the building blocks, written in the extensively used SMILES30 string representation of molecules, and the topology and catenation of the MOF structure, each represented by a code adopted from the Reticular Chemistry Structure Resource (RCSR) database.31 MOFid is therefore a concise text string representation of MOFs that preserves the chemical information and, through the topology encoding, the majority of the structural information. This text-based representation enables the application of language ML models that take text strings as input for MOF property prediction.
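For a rough illustration of the format, a MOFid concatenates the building-block SMILES with a topology/catenation suffix.29 The example string below (loosely modeled on IRMOF-1) and the exact field layout are assumptions used only to show how the two sections separate:

```python
# Hypothetical MOFid in the style of Bucior et al.; the exact string
# is an assumption used only for illustration.
mofid = ("[Zn][O]([Zn])([Zn])[Zn]."
         "[O-]C(=O)c1ccc(cc1)C(=O)[O-] MOFid-v1.pcu.cat0;IRMOF-1")

smiles_part, rest = mofid.split(" MOFid-v1.")
meta, _, name = rest.partition(";")
topology, catenation = meta.split(".")  # e.g. "pcu", "cat0"

print(smiles_part)  # SMILES of the metal node and organic linker
print(topology)     # RCSR topology code
print(catenation)   # degree of catenation (cat0 = none)
```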
In this work, we propose and develop a Transformer-based language model for MOF property prediction. The Transformer and its variants have become the top choice for natural language processing tasks since their introduction in 2017 by Vaswani et al.32 The multi-head attention mechanism allows the Transformer model to learn contextual information in a sequence without suffering from long-range dependency issues.33,34 With its success in processing long sequential data, the Transformer and its variants have also been adopted for chemistry and bioinformatics applications such as molecular35–37 and protein38 property prediction. The Transformer model in our work, named MOFormer, takes a modified MOFid as input to predict various MOF properties. The advantage of this method is that it does not require the 3D atomic structure of the MOF (it is structure-agnostic), thus enabling a much faster and more flexible exploration of the hypothetical MOF space. In practice, pretraining a Transformer model in a self-supervised manner38–42 leverages large quantities of unlabeled data to help the MOFormer learn a more robust representation of the sequence
and further improve its performance in downstream tasks. To take advantage of these pretraining benefits, we also develop a self-supervised learning framework in which the MOFormer and the CGCNN model are jointly pretrained on >400k MOF structures. Dimensionality reduction tools are used to visualize the latent representations learned by both models and provide insight into their performance characteristics. Visualization of the attention weights in MOFormer demonstrates that MOFormer learns MOF representations based on certain key atoms and the topology. Lastly, we compare the data efficiency of the two models to show which one is a better choice when training data is limited.
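To make the cross-correlation pretraining objective concrete, here is a minimal sketch in the style of Barlow Twins, applied to embeddings from the two encoders. The function name, the weight `lam`, and the exact normalization are illustrative assumptions rather than the training code used in this work:

```python
import torch

def cross_correlation_loss(z_tf, z_gnn, lam=5e-3):
    """Barlow Twins-style objective between MOFormer embeddings
    (z_tf) and CGCNN embeddings (z_gnn), both of shape (batch, dim).

    The loss pushes the cross-correlation matrix toward the
    identity: matching dimensions should correlate (diagonal -> 1)
    while distinct dimensions decorrelate (off-diagonal -> 0).
    """
    n = z_tf.shape[0]
    # Standardize each embedding dimension over the batch.
    z1 = (z_tf - z_tf.mean(0)) / (z_tf.std(0) + 1e-9)
    z2 = (z_gnn - z_gnn.mean(0)) / (z_gnn.std(0) + 1e-9)
    c = (z1.T @ z2) / n  # (dim, dim) cross-correlation matrix
    diag = torch.diagonal(c)
    on_diag = (diag - 1).pow(2).sum()
    off_diag = (c - torch.diag_embed(diag)).pow(2).sum()
    return on_diag + lam * off_diag
```

Driving the cross-correlation matrix toward the identity aligns each structure-agnostic embedding dimension with the corresponding structure-based one while decorrelating the rest, which is one way the MOFormer can absorb structural information absent from its input.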
Methods
MOFid tokenization and Transformer
The MOFormer is built upon the encoder part of the Transformer model and takes a tokenized MOFid as input (Figure 1a). The MOFid tokenizer is a customized version of the SMILES tokenizer.43 The SMILES strings of all secondary building units (SBUs) in the MOFid are tokenized by the SMILES tokenizer, while the topology and catenation section of the MOFid is tokenized separately based on the topology encoding adopted from RCSR.31 The tokens from the two sections are then connected by a separator token "&&". Following BERT,39 the tokenization process adds a [CLS] token at the beginning and a [SEP] token at the end of the sequence to symbolize its start and end, respectively. The tokenized sequences conform to a fixed length of 512: longer sequences are truncated, and shorter sequences are padded with the special token [PAD].
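A minimal sketch of this pipeline is given below, assuming a regex-based SMILES tokenizer; the regex pattern, the helper name, and the handling of the topology tokens are illustrative assumptions, not the exact MOFormer tokenizer:

```python
import re

# A common regex for SMILES tokenization (bracketed atoms, two-letter
# elements, bonds, ring digits); the MOFormer pattern may differ.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|[BCNOPSFIbcnops]|[=#\-\+\(\)\\/\.%:~@]|\d)"
)

def tokenize_mofid(mofid, max_len=512):
    """Split a modified MOFid at '&&' into SMILES and topology parts,
    tokenize each, and pad/truncate to a fixed length (illustrative)."""
    smiles_part, _, topo_part = mofid.partition("&&")
    tokens = ["[CLS]"]
    tokens += SMILES_PATTERN.findall(smiles_part)
    tokens += ["&&"]
    tokens += topo_part.split(".")  # e.g. ["pcu", "cat0"]
    tokens += ["[SEP]"]
    tokens = tokens[:max_len]                      # truncate
    tokens += ["[PAD]"] * (max_len - len(tokens))  # pad
    return tokens
```

Splitting the topology section on "." treats each RCSR code and the catenation flag as single tokens, mirroring the separate topology tokenizer described above.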
The tokenized sequence is embedded and combined with a positional encoding (Figure 1a) to include information about the relative and absolute position of each token.
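For reference, the sinusoidal positional encoding introduced by Vaswani et al.32 can be sketched as follows; whether MOFormer adopts this exact formulation rather than a learned embedding is an assumption here:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len=512, d_model=512):
    """Sinusoidal positional encoding from Vaswani et al.:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe  # added to the token embeddings before the encoder
```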