I. Introduction
Traditional experiments and computer simulations are limited by their inability to rapidly
measure polymer properties and are thus inadequate for screening the astronomically large
chemical and conformational space of polymers. Recent advances in machine learning (ML),
together with the increasing availability of data and software, can address this problem and
accelerate polymer design.1–5 Significant progress has been made in creating data-driven models
that predict polymer properties. These models are built by collecting candidate polymers and
labeling them with their properties, which are calculated using physics-based methods. A large
variety of machine-readable fingerprints and chemical descriptors have been developed to
represent polymers for ML models.6–9 The fingerprint-property data are then used to train and
build an ML model, which serves as a cheaper, albeit lower-fidelity, surrogate for expensive
high-fidelity first-principles simulations and experiments. A large number of numerical
frameworks, such as support vector regression, random forests, and deep neural networks
(DNNs), exist for building these ML models. Among these, DNNs appear to be the most
versatile and transferable, providing a flexible mathematical framework to model structure-
property correlations. DNNs have been increasingly used to build structure-property models
of a wide range of materials including polymers.10–19 They consist of a large number of nodes
arranged in several intermediate layers between the input and output layers. Some of the
important factors that impact the performance of a DNN are weight initialization, the activation
function of its nodes, the learning rate, the network topology, the stopping criteria, and the loss
optimization algorithm. Among these, the number of nodes and their arrangement in the
intermediate layers play a key role in determining the accuracy and efficiency of the model.
However, there are no systematic guidelines for building DNNs that are computationally
efficient yet make good-quality predictions. The connection between the topology of a DNN
and the quality of its predictions is not well established. Moreover, there is no comprehensive
understanding of the amount of training data required to build a DNN model that can predict
wide variations in a material's property.
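To make the role of these choices concrete, the sketch below shows a minimal, hypothetical implementation of a fingerprint-to-property DNN; it is not the model developed in this work. The fingerprint length, hidden-layer widths, learning rate, and patience-based stopping criterion are illustrative assumptions, included only to show how the intermediate-layer topology and training settings enter such a surrogate model.

```python
# Minimal, hypothetical sketch of a fingerprint-to-property DNN surrogate.
# All sizes and hyperparameters below are illustrative assumptions.
import torch
from torch import nn


def build_dnn(n_features, hidden_layers=(64, 64), activation=nn.ReLU):
    """Assemble an MLP whose intermediate-layer topology is a tunable choice."""
    layers, width = [], n_features
    for h in hidden_layers:
        layers += [nn.Linear(width, h), activation()]
        width = h
    layers.append(nn.Linear(width, 1))  # scalar property output
    return nn.Sequential(*layers)


def train(model, X, y, lr=1e-3, max_epochs=500, patience=20):
    """Fit the surrogate with a simple patience-based stopping criterion."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # loss-optimization algorithm
    loss_fn = nn.MSELoss()
    best, stall = float("inf"), 0
    for _ in range(max_epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        if loss.item() < best - 1e-6:
            best, stall = loss.item(), 0
        else:
            stall += 1
            if stall >= patience:  # stopping criterion
                break
    return model


# Example usage on random placeholder data: 100 fingerprints of length 32.
X = torch.randn(100, 32)
y = torch.randn(100, 1)
model = train(build_dnn(32, hidden_layers=(64, 32)), X, y)
```

In this sketch, changing the hidden_layers tuple changes the number of nodes and their arrangement in the intermediate layers, which is precisely the topology choice whose effect on accuracy and efficiency is discussed above.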
We address the above problems of DNN model development for a representative case
of materials property prediction, viz., a sequence-property surrogate model of polymers. The
sequence of a polymer appreciably impacts its bulk and single-molecule properties. Glass
transition, ion transport, thermal conductivity, single-molecule radius of gyration, and
multimolecular aggregation are all impacted by the monomer-to-monomer sequence details of a
polymer.20–26 This sequence-property correlation of a polymer is poorly understood because of
its enormous sequence and composition space, and DNNs have recently been used to address
this problem and predict sequence-defined properties of polymers.8,27–29 However, no agreed-
upon strategy has emerged for deciding the minimum amount of sequence-property data required
to build these models. It is also unclear which neural network topology is most efficient for a
sequence-property metamodel of a polymer. The primary bottleneck in building a universal
model is the astronomically large number of sequences that are possible for a copolymer, and
the sequence-specificity is so profound that a subtle change in the copolymer sequence results
in a significant change in the properties of interest.30–32 Oftentimes, the optimal property is
exhibited by a non-intuitive, seemingly arbitrary polymer sequence whose sequence-specificity
is unknown.20,21,23 Learning and predicting these variations in the structure-property
relations of a polymer is challenging. There are no analytical methods that can estimate the
extremum of a property and the corresponding sequences. It is also challenging to establish the