MOFormer: Self-Supervised Transformer model for Metal-Organic Framework Property Prediction
Zhonglin Cao,†,§ Rishikesh Magar,†,§ Yuyang Wang,† and Amir Barati Farimani†,‡,¶

†Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
‡Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
¶Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
§Joint First Authorship

E-mail: barati@cmu.edu
Abstract
Metal-Organic Frameworks (MOFs) are materials with a high degree of porosity
that can be used for applications in energy storage, water desalination, gas storage, and
gas separation. However, the chemical space of MOFs is nearly infinite due to the large variety of possible combinations of building blocks and topologies. Discovering the optimal MOFs for specific applications requires an efficient and accurate search over an enormous number of potential candidates. Previous high-throughput screening methods using computational simulations such as DFT can be time-consuming. Such methods also require optimizing the 3D atomic structure of MOFs, which adds an extra
step when evaluating hypothetical MOFs. In this work, we propose a structure-agnostic deep learning method based on the Transformer model, named MOFormer, for property prediction of MOFs. The MOFormer takes a text string representation of a MOF (MOFid) as input, thus circumventing the need to obtain the 3D structure of a hypothetical MOF and accelerating the screening process. Furthermore, we introduce a self-supervised learning framework that pretrains the MOFormer by maximizing the cross-correlation between its structure-agnostic representations and the structure-based representations of a crystal graph convolutional neural network (CGCNN) on >400k publicly available MOF structures. Self-supervised learning allows the MOFormer to intrinsically learn 3D structural information even though it is not included in the input. Experiments show that pretraining improves the prediction accuracy of both models on various downstream prediction tasks. Furthermore, we reveal that MOFormer can be more data-efficient than the structure-based CGCNN on quantum-chemical property prediction when training data is limited. Overall, MOFormer provides a novel perspective on efficient MOF design using deep learning.
Introduction
Metal-organic frameworks (MOFs) are a type of porous crystalline material1,2 that has been extensively researched during the past several decades. Research interest has been driven by the porous structure and versatile nature of MOFs, with potential applications such as gas adsorption,3–5 water harvesting and desalination,6–8 and energy storage.9–11 MOFs typically consist of several building blocks, including metal nodes and organic linkers.4,12,13 The assembly of those building blocks following certain topologies generates the 2-dimensional or 3-dimensional porous structures of MOFs. Because of the countless possible combinations of metal nodes, organic linkers, and topologies,13,14 there is a vast number of MOFs with different mechanical properties and surface chemistries. Given the enormous variety of possible MOF structures, rapidly and inexpensively selecting the potential
top performers for each specific task can be challenging. High-throughput screening with computational tools such as molecular simulation5,15 or density functional theory (DFT)16,17 has been widely used to evaluate the properties of MOFs. Without the need to experimentally synthesize MOF structures, those computational tools accelerate the screening process and allow researchers to screen hundreds of thousands of hypothetical MOF structures for their performance in different applications.

Recently, machine learning (ML) models have become increasingly popular in the field of MOF property prediction.18–24 The advantage of the ML models over the simulation methods is their instantaneous inference of the properties of MOFs. In contrast, the simulation
methods require a computationally expensive rerun for every new MOF. In the last decade, multiple large-scale MOF datasets have been released, including CoRE MOF 2019,25 hypothetical MOFs,5 and QMOF.26 These datasets contain the atomic structures of MOFs and their properties, such as CO2 adsorption and band gap, and are large enough to train accurate data-driven ML models for the prediction of MOF properties. Handcrafted geometrical features such as the largest cavity diameter and the pore limiting diameter have been used as input to a multilayer perceptron (MLP) to predict MOF properties.19,23 Although training an MLP with a few layers can be fast, this method suffers from underwhelming accuracy due to the simplicity of the network architecture. Moreover, selecting features requires extensive domain knowledge from the researchers as well as optimized 3D structures of MOFs, making this method less generic. Given the aforementioned drawbacks, a novel method that can achieve high accuracy with a more generic MOF representation as input should be pursued.
Wang et al.27 utilized the crystal graph convolutional neural network (CGCNN)28 to predict the methane adsorption of MOFs. CGCNN is a prevalent model with an architecture designed specifically for crystalline materials. It takes the types and the 3D coordinates of the atoms in a crystalline material as input and constructs a crystal graph. CGCNN can extract features that encode rich chemical information through convolution operations on the crystal graph. However, one drawback of using CGCNN for MOF property prediction
is that it requires the optimized 3D atomic structures of MOFs, which are computationally expensive to obtain. In addition, some large MOF structures consist of hundreds or even thousands of atoms, so constructing crystal graphs for them can be memory-inefficient.
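To make the graph-construction step and its memory cost concrete, below is a minimal sketch of cutoff-based neighbor finding from atom types and coordinates. The function name and the 4 Å cutoff are illustrative assumptions; the CGCNN reference implementation additionally handles periodic images of the unit cell and expands distances in a Gaussian basis to form edge features.

```python
import numpy as np

def build_crystal_graph(atom_types, coords, cutoff=4.0):
    """Connect every pair of atoms closer than `cutoff` angstroms.

    Illustrative only: real CGCNN inputs also account for periodic
    boundary conditions and use Gaussian-expanded distances as
    edge features.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(atom_types)
    edges, distances = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = float(np.linalg.norm(coords[i] - coords[j]))
            if d < cutoff:
                edges.append((i, j))  # directed edge i -> j
                distances.append(d)
    # Node features come from atom_types (e.g., one-hot element
    # embeddings). In dense MOFs the edge list grows roughly with
    # n^2, which is the memory cost noted above.
    return edges, distances
```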
Motivated by the fact that all MOFs are combinations of metal nodes, organic linkers, and topologies, Bucior et al.29 proposed a text string representation of MOFs called MOFid. A typical MOFid has two core sections: the chemical information of the building blocks, written in the extensively used SMILES30 string representation of molecules, and the topology and catenation of the MOF structure, each represented by a code adopted from the Reticular Chemistry Structure Resource (RCSR) database.31 MOFid is therefore a concise text string representation of MOFs that preserves the chemical information and, through the topology encoding, the majority of the structural information. This text-based representation enables the application of language ML models that take text strings as input for MOF property prediction.
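For a rough illustration of the format, a MOFid concatenates the building-block SMILES with a topology/catenation suffix.29 The example string below (loosely modeled on IRMOF-1) and the exact field layout are assumptions used only to show how the two sections separate:

```python
# Hypothetical MOFid in the style of Bucior et al.; the exact string
# is an assumption used only for illustration.
mofid = ("[Zn][O]([Zn])([Zn])[Zn]."
         "[O-]C(=O)c1ccc(cc1)C(=O)[O-] MOFid-v1.pcu.cat0;IRMOF-1")

smiles_part, rest = mofid.split(" MOFid-v1.")
meta, _, name = rest.partition(";")
topology, catenation = meta.split(".")  # e.g. "pcu", "cat0"

print(smiles_part)  # SMILES of the metal node and organic linker
print(topology)     # RCSR topology code
print(catenation)   # degree of catenation (cat0 = none)
```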
In this work, we propose and develop a Transformer-based language model for MOF property prediction. The Transformer and its variants have become the top choice for natural language processing tasks since their introduction in 2017 by Vaswani et al.32 The multi-head attention mechanism allows the Transformer model to learn contextual information in a sequence without suffering from long-range dependency issues.33,34 With its success in processing long sequential data, the Transformer and its variants have also been adopted for chemistry and bioinformatics applications such as molecular35–37 and protein38 property prediction. The Transformer model in our work, named MOFormer, takes a modified MOFid as input to predict various MOF properties. The advantage of this method is that it does not require the 3D atomic structure of the MOF (it is structure-agnostic), thus enabling a much faster and more flexible exploration of the hypothetical MOF space. In practice, pretraining a Transformer model in a self-supervised manner38–42 leverages large quantities of unlabeled data to help the MOFormer learn a more robust representation of the sequence
and further improve its performance in downstream tasks. To take advantage of these pretraining benefits, we also develop a self-supervised learning framework in which the MOFormer and the CGCNN model are jointly pretrained on >400k MOF structures. Dimensionality reduction tools are used to visualize the latent representations learned by both models and provide insight into their performance characteristics. Visualization of the attention weights in MOFormer demonstrates that MOFormer learns MOF representations based on certain key atoms and the topology. Lastly, we compare the data efficiency of the two models to show which one is a better choice when training data is limited.
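To make the cross-correlation pretraining objective concrete, here is a minimal sketch in the style of Barlow Twins, applied to embeddings from the two encoders. The function name, the weight `lam`, and the exact normalization are illustrative assumptions rather than the training code used in this work:

```python
import torch

def cross_correlation_loss(z_tf, z_gnn, lam=5e-3):
    """Barlow Twins-style objective between MOFormer embeddings
    (z_tf) and CGCNN embeddings (z_gnn), both of shape (batch, dim).

    The loss pushes the cross-correlation matrix toward the
    identity: matching dimensions should correlate (diagonal -> 1)
    while distinct dimensions decorrelate (off-diagonal -> 0).
    """
    n = z_tf.shape[0]
    # Standardize each embedding dimension over the batch.
    z1 = (z_tf - z_tf.mean(0)) / (z_tf.std(0) + 1e-9)
    z2 = (z_gnn - z_gnn.mean(0)) / (z_gnn.std(0) + 1e-9)
    c = (z1.T @ z2) / n  # (dim, dim) cross-correlation matrix
    diag = torch.diagonal(c)
    on_diag = (diag - 1).pow(2).sum()
    off_diag = (c - torch.diag_embed(diag)).pow(2).sum()
    return on_diag + lam * off_diag
```

Driving the cross-correlation matrix toward the identity aligns each structure-agnostic embedding dimension with the corresponding structure-based one while decorrelating the rest, which is one way the MOFormer can absorb structural information absent from its input.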
Methods
MOFid tokenization and Transformer
The MOFormer is built upon the encoder part of the Transformer model and takes a tokenized MOFid as input (Figure 1a). The MOFid tokenizer is a customized version of the SMILES tokenizer.43 The SMILES strings of all secondary building units (SBUs) in the MOFid are tokenized by the SMILES tokenizer, while the topology and catenation section of the MOFid is tokenized separately based on the topology encoding adopted from RCSR.31 The tokens from the two sections are then connected by a separator token "&&". Following BERT,39 the tokenization process adds a [CLS] token at the beginning and a [SEP] token at the end of the sequence to symbolize its start and end, respectively. The tokenized sequences conform to a fixed length of 512: longer sequences are truncated, and shorter sequences are padded with the special token [PAD].
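A minimal sketch of this pipeline is given below, assuming a regex-based SMILES tokenizer; the regex pattern, the helper name, and the handling of the topology tokens are illustrative assumptions, not the exact MOFormer tokenizer:

```python
import re

# A common regex for SMILES tokenization (bracketed atoms, two-letter
# elements, bonds, ring digits); the MOFormer pattern may differ.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|[BCNOPSFIbcnops]|[=#\-\+\(\)\\/\.%:~@]|\d)"
)

def tokenize_mofid(mofid, max_len=512):
    """Split a modified MOFid at '&&' into SMILES and topology parts,
    tokenize each, and pad/truncate to a fixed length (illustrative)."""
    smiles_part, _, topo_part = mofid.partition("&&")
    tokens = ["[CLS]"]
    tokens += SMILES_PATTERN.findall(smiles_part)
    tokens += ["&&"]
    tokens += topo_part.split(".")  # e.g. ["pcu", "cat0"]
    tokens += ["[SEP]"]
    tokens = tokens[:max_len]                      # truncate
    tokens += ["[PAD]"] * (max_len - len(tokens))  # pad
    return tokens
```

Splitting the topology section on "." treats each RCSR code and the catenation flag as single tokens, mirroring the separate topology tokenizer described above.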
The tokenized sequence is embedded and combined with a positional encoding (Figure 1a) to include information about the relative and absolute position of each token.
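For reference, the sinusoidal positional encoding introduced by Vaswani et al.32 can be sketched as follows; whether MOFormer adopts this exact formulation rather than a learned embedding is an assumption here:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len=512, d_model=512):
    """Sinusoidal positional encoding from Vaswani et al.:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe  # added to the token embeddings before the encoder
```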