When does deep learning fail and how to tackle it? A critical analysis on
polymer sequence-property surrogate models
Himanshu and Tarak K Patra*
Department of Chemical Engineering, Center for Atomistic Modeling and Materials Design
and Center for Carbon Capture Utilization and Storage, Indian Institute of Technology
Madras, Chennai TN 600036, India
Abstract:
Deep learning models are gaining popularity and potency in predicting polymer properties.
These models can be built using pre-existing data and are useful for the rapid prediction of
polymer properties. However, the performance of a deep learning model is intricately
connected to its topology and the volume of training data. No simple protocol exists
for selecting a deep learning architecture, and large volumes of homogeneous
sequence-property data for polymers are lacking. These two factors are the primary bottlenecks for the
efficient development of deep learning models. Here we assess the severity of these factors and
propose new algorithms to address them. We show that a linear layer-by-layer expansion of a
neural network can help in identifying the best neural network topology for a given problem.
Moreover, we map the discrete sequence space of a polymer to a continuous one-dimensional
latent space using a machine learning pipeline to identify minimal data points for building a
universal deep learning model. We implement these approaches for three representative cases
of building sequence-property surrogate models, viz., the single-molecule radius of gyration
of a copolymer, adhesive free energy of a copolymer, and copolymer compatibilizer,
demonstrating the generality of the proposed strategies. This work establishes efficient
methods for building universal deep learning models with minimal data and hyperparameters
for predicting sequence-defined properties of polymers.
Keywords: Deep Learning, Structure-Property Correlations, Polymer Genome, Materials
Design
*Corresponding author, E-mail: tpatra@iitm.ac.in
I. Introduction
Traditional experiments and computer simulations cannot measure polymer properties
rapidly and are therefore inadequate for screening the astronomically large chemical
and conformational space of a polymer. Recent advances in machine learning (ML),
together with the growing availability of data and software, can address this problem and accelerate
polymer design.1-5 Significant progress has been made in creating data-driven models that predict
polymer properties. These models are built by collecting candidate polymers and labeling them
by their properties, which are calculated using physics-based methods. A large variety of
machine-readable fingerprints and chemical descriptors have been developed to represent polymers
for ML models.6-9 The fingerprint-property data are used to train an ML
model. Such an ML model serves as a cheaper, albeit lower-fidelity, surrogate for expensive
high-fidelity first-principles simulations and experiments. Numerous
numerical frameworks, such as support vector regression, random forests, and
deep neural networks (DNNs), are available to build these ML models. Among these, DNNs appear to be more
versatile and transferable and provide a flexible mathematical framework to model
structure-property correlations. DNNs have been increasingly used to build structure-property models
of a wide range of materials, including polymers.10-19 They consist of a large number of nodes
arranged in several intermediate layers between the input and output layers. Some of the
important factors that impact the performance of a DNN are weight initialization, activation
function of its nodes, learning rate, network topology, stopping criteria, and loss optimization
algorithm. Among these, the number of nodes and their arrangement in the intermediate layers
play a key role in determining the accuracy and efficiency of the model. However, there is no
systematic guideline to build DNNs that are computationally efficient yet make good-quality
predictions. The connection between a DNN topology and the quality of its predictions is not
well-established. Moreover, there is no comprehensive understanding of the amount of training
data required for building a DNN model that can predict a wide variation in a material's
property.
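To make "the number of nodes and their arrangement" concrete, the sketch below enumerates a family of fully connected topologies that grows by one hidden layer at a time, in the spirit of the linear layer-by-layer expansion mentioned in the abstract. It is only an illustration under stated assumptions: the equal layer width, the function names, and the user-supplied `validation_error` callable are hypothetical and are not taken from the paper.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Topology = Tuple[int, ...]  # hidden-layer widths, ordered from input to output


def count_parameters(n_inputs: int, hidden: Sequence[int], n_outputs: int = 1) -> int:
    """Number of weights plus biases in a fully connected feed-forward network."""
    sizes = [n_inputs, *hidden, n_outputs]
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


def linear_expansion(width: int, max_layers: int) -> Iterator[Topology]:
    """Yield topologies with 1, 2, ..., max_layers hidden layers.

    Assumption: every hidden layer has the same width.
    """
    for depth in range(1, max_layers + 1):
        yield (width,) * depth


def select_topology(
    candidates: Iterable[Topology],
    validation_error: Callable[[Topology], float],
) -> Topology:
    """Return the candidate with the lowest validation error.

    validation_error is a hypothetical callable that trains a network with the
    given topology and returns its error on held-out data.
    """
    return min(candidates, key=validation_error)


if __name__ == "__main__":
    for topology in linear_expansion(width=64, max_layers=3):
        print(topology, count_parameters(100, topology))
```

The search cost of such a scheme grows linearly with the maximum depth, because each step adds a single candidate rather than a combinatorial set of layer-width choices.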
We address the above problems of DNN model development for a representative case
of materials property prediction, viz., the sequence-property surrogate model of polymers. The
sequence of a polymer appreciably impacts its bulk and single-molecule properties. Glass
transition, ion transport, thermal conductivity, the single-molecule radius of gyration, and
multimolecular aggregation are all affected by the monomer-to-monomer sequence details of a
polymer.20-26 This sequence-property correlation of a polymer is poorly understood because of its
enormous sequence and composition space, and DNNs have recently been used to address this
problem and predict sequence-defined properties of polymers.8,27-29 However, no agreed-upon
strategy has emerged to decide the minimum sequence-property data required to build these
models. Also, it is not clear what the most efficient neural network topology is for the
sequence-property metamodel of a polymer. The primary bottleneck in building a universal
model is the astronomically large number of sequences that are possible for a copolymer, and
the sequence-specificity is so profound that a subtle change in the copolymer sequence results
in a significant change in the properties of interest.30-32 Oftentimes, the optimal property is
present in a non-intuitive, seemingly arbitrary polymer sequence, the sequence-specificity of
which is unknown.20,21,23 Learning and predicting these variations in structure-property
relations of a polymer are challenging. There are no analytical methods that can estimate the
extremum of a property and the corresponding sequences. It is also challenging to establish the
sequence space as a function of a few coordinates. Therefore, building a transferable model
remains a substantially complex task.
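The size of this sequence space can be checked with a short calculation. The sketch below counts the sequences of a linear binary copolymer, both without any constraint and at a fixed composition. The function names are illustrative, and the count deliberately ignores head-to-tail symmetry (a sequence and its reverse are counted separately), which is a simplifying assumption.

```python
import math


def n_binary_sequences(chain_length: int) -> int:
    """Number of sequences when each monomer is one of two types."""
    return 2 ** chain_length


def n_fixed_composition(chain_length: int, n_type_a: int) -> int:
    """Number of sequences containing exactly n_type_a monomers of the first type."""
    return math.comb(chain_length, n_type_a)


if __name__ == "__main__":
    # Chain length 100 is the case considered in this work.
    print(f"{n_binary_sequences(100):.3e}")
    print(f"{n_fixed_composition(100, 50):.3e}")
```

For a chain length of 100 the unconstrained count is 2^100, about 1.27 x 10^30, and restricting to equal composition still leaves roughly 10^29 sequences, so exhaustive enumeration is out of reach in either case.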
While ML predictive models such as DNNs hold great promise, they are
interpolative, and it is therefore not always clear how one should train a neural
network to fit the entire configurational space of a given system exhaustively. Currently, DNNs
are trained by generating a large quantity of training data in hopes that they have adequately
sampled the configurational space of a molecular system. This can, however, be an increasingly
prohibitive task when it comes to generating data using computationally expensive physics-
based methods. As such, it is desirable to train a model using the absolute minimal data set
possible, especially when the costs of high-fidelity calculations are high. In the recent past, we
have proposed active learning methods to sample configurational space for collecting DNN
training data in the context of neural network potential development.33-35 Several other active
learning strategies, such as QBC (query by committee)36, DP-GEN (deep potential generator)37,
and adaptive Bayesian inference38, have been proposed for data selection and for building transferable neural network
models. Moreover, there have been other attempts, such as transfer learning, to build models
with minimal training data. In transfer learning, a model trained on a different property, for which
abundant data are available, is reused to build another model for a target task with
considerably less data.39,40 All of these require physics-based property calculation while
selecting the training data. Therefore, selecting the minimal amount of candidate structures
without knowing their properties a priori remains an elusive and attractive goal of ML model
development.
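As a generic illustration of the query-by-committee idea named above, the sketch below ranks unlabeled candidates by the disagreement among a committee of models and returns the most contested ones for labeling. This is a textbook-style sketch, not the specific procedure of the cited works; the function name, the use of variance as the disagreement measure, and the toy committee are assumptions. Note that the committee members must themselves be trained on labeled data first, which is why such schemes still depend on physics-based property calculations.

```python
from statistics import pvariance
from typing import Callable, List, Sequence


def query_by_committee(
    candidates: Sequence[float],
    committee: Sequence[Callable[[float], float]],
    n_select: int,
) -> List[int]:
    """Return indices of the n_select candidates with the largest committee variance."""
    scores = [pvariance([model(x) for model in committee]) for x in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_select]


if __name__ == "__main__":
    # Toy committee: three "models" that agree at x = 0 and diverge as |x| grows.
    toy_committee = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
    print(query_by_committee([0.0, 1.0, 5.0, 2.0], toy_committee, n_select=2))
```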
The objectives of this work are to build an algorithm to identify the hyperparameters of
a DNN, estimate the limitation of a DNN, and, finally, establish a framework to build DNN
models that are transferable across the sequence space without the need to generate a large
volume of sequence-property data. To accomplish these objectives, we consider three
representative problems: the radius of gyration of a copolymer in an infinitely dilute solution,
a copolymer compatibilizer, and copolymer adsorption on a surface. We propose a systematic
linear expansion of DNN architecture to identify the best surrogate models for all three cases.
This approach does not require any special optimization algorithm to explore enormously large
possibilities of a DNN topology. We use this protocol to develop DNN models that predict
sequence-defined properties of polymers with more than 95% accuracy. Secondly, we build a
DNN model using training data that represent a specific range of property and test this model's
ability to predict the property that is outside the training data. We show that the performance
of a DNN declines when the target property is outside the known range of property. We propose
a new framework to tackle the transferability problem of ML by leveraging the power of
a convolutional DNN autoencoder that automatically extracts features of a molecular system. We
construct a one-dimensional sequence space and sample the sequences uniformly covering the
entire space. This collection of points serves as the training data for our DNN model. We show
that a model based on ~500 data points, which are selected intelligently, can predict the
properties of ~40000 sequences very accurately. We expect this model to predict the properties
of all possible sequences of a copolymer, of which there are ~10^30 for a binary copolymer of chain length
100. Although the current study focuses on sequence-property ML models, these methods are
extensible to other classes of properties and materials. We expect that these new approaches
to data and hyperparameter selections will accelerate the progress of ML model development.
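The idea of sampling sequences uniformly along a one-dimensional latent coordinate can be sketched as follows. The sketch assumes that each sequence has already been assigned a scalar latent value by some trained encoder; that step is not shown. The binning rule (equal-width bins, keeping the point closest to each bin centre) and the function name are assumptions made for illustration and may differ from the procedure used in this work.

```python
from typing import Dict, List, Sequence, Tuple


def sample_uniform_in_latent(latent: Sequence[float], n_bins: int) -> List[int]:
    """Pick at most one sequence index per equal-width bin of the latent axis.

    Within each non-empty bin, the point closest to the bin centre is kept, so
    the selection covers the latent range evenly even when the data are clustered.
    """
    lo, hi = min(latent), max(latent)
    if hi == lo:
        return [0]  # all points coincide; one representative is enough
    width = (hi - lo) / n_bins
    best: Dict[int, Tuple[float, int]] = {}  # bin -> (distance to centre, index)
    for idx, z in enumerate(latent):
        b = min(int((z - lo) / width), n_bins - 1)  # clamp z == hi into last bin
        dist = abs(z - (lo + (b + 0.5) * width))
        if b not in best or dist < best[b][0]:
            best[b] = (dist, idx)
    return [best[b][1] for b in sorted(best)]


if __name__ == "__main__":
    # Clustered toy data: many points near 0, a single point at 1.
    toy_latent = [i / 1000 for i in range(100)] + [1.0]
    print(sample_uniform_in_latent(toy_latent, n_bins=10))
```

With about 500 bins, such a rule would return at most about 500 training sequences regardless of how many candidates are available, and no property values are needed to make the selection.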
II. Polymer Sequence-Property Data
In this study, we focus on three sequence-defined properties of a binary copolymer, viz., the
radius of gyration in an infinitely dilute solution, compatibilization of a polymer blend, and the
adsorption free energy on a patterned surface. The data are collected from recent molecular
simulation studies that use the Kremer-Grest bead-spring phenomenological model41,42 to
investigate sequence-property correlations. In this phenomenological model, two chemical
moieties are linearly connected to form a copolymer. The interaction parameters of the moieties
are adjusted to represent their chemical affinity in a given system. It is a standard and popular
model for studying generic polymer properties in molecular simulations without considering
specific polymer chemistry and condition. This simple model is computationally very efficient
and can be mapped to real polymers by tuning its parameters.43 The schematic representations
of the systems and the distribution of data are shown in Figure 1. The radius of gyration of a
copolymer in an implicit solvent is taken from our recent study.22 In this study, a polymer of
chain length N=100 with an equal composition of both moieties is simulated in an implicit
solvent condition, as schematically shown in Figure 1A. A large number of sequences are
sampled using a molecular dynamics simulation-based evolutionary algorithm. The data set
consists of ~40000 sequences and their radius of gyration. The second data set (cf. Figure 1B
and E) corresponds to a copolymer compatibilizer.24 Copolymer compatibilizers are surfactant
molecules designed to improve the stability of an interface. They are deployed to enhance
material properties in settings ranging from emulsions to polymer blends. A major
compatibilization strategy employs block or random copolymers composed of distinct repeat
units with preferential affinity for each of the two phases forming the interface, as shown in
Figure 1B. In recent studies, we have shown that the surface tension of the interface is very
Figure 1: Sequence-property polymer data. Schematic representations of the three systems (folding of a polymer chain in
an implicit solvent, a copolymer compatibilizer at the interface between two immiscible homopolymers, and adsorption of a
copolymer on a substrate) are shown in A, B, and C, respectively. The corresponding histograms of the available
data for the three cases are shown in D, E, and F, respectively. Reduced units are used in all three studies, wherein σ
and ε are the units of length and energy, respectively. Also, k_B and T are the Boltzmann constant and the temperature of a system,
respectively.