Materials discovery offers unique challenges as an ML application area. For example, materials of interest often exhibit properties with extreme values, requiring ML models to extrapolate to new regions of property space. This challenge, and methods to address it, have been discussed previously [18, 19]. Another challenge is representing a material suitably for input to an ML algorithm, either by incorporating domain knowledge or by learning representations from data. Chemical composition-based features [20, 21] have become widely used in materials discovery, but further headroom likely exists for optimizing materials representations. Finally, many materials informatics applications suffer from a lack of data. While there have been many large-scale data collection efforts [22, 23, 24], researchers often extract data by hand to build informative training sets [25, 26, 27, 28, 29, 30, 31], a highly time-consuming process. These unique challenges motivate the need for greater insight into the potential for success in a given SL-driven materials discovery effort.
Typically, the performance of ML is measured by the improvement in predictive capability of a model, using accuracy metrics such as root-mean-square error (RMSE) and the coefficient of determination $r^2$. While these metrics provide robust estimates of predictive capability against a defined test set, their connection to the ability of SL to identify new, high-performing materials is unclear. In recent studies, the number of experiments necessary to identify a high-performing material has been used as a metric for monitoring SL performance [8, 16, 32]. Modeling benchmark datasets and tools, such as Olympus [33] and MatBench [34], have started to standardize assessment of model and dataset performance. Notably, a recent study by Rohr et al. [35] considers additional metrics that quantify SL performance relative to a benchmark case (typically random search). Rohr et al. focus their study on identifying top-1% materials from high-throughput electrochemical data, and subsequent research expands on this work to compare performance across a variety of SL frameworks and datasets [36, 37]. Here, we build upon these works by investigating SL performance metrics for different design problems and specific targets within those design problems. We compare our approach to that of Rohr et al. in more detail in Section 2.3.
In this work, we explore the topic of SL performance metrics in more detail, generalizing conclusions across multiple datasets and design target ranges. In particular, we identify a decoupling between traditional model error metrics and a model's ability to select high-performance materials. We examine three SL performance metrics: the Discovery Acceleration Factor ($\mathrm{DAF}_n$), the average number of SL iterations required to identify $n$ materials in a target range; the Discovery Yield ($DY(i)$), the number of materials in a target range identified after $i$ SL iterations, normalized by the total number of materials in the target range; and the Discovery Probability ($DP(i)$), the average number of targets found at a given SL iteration $i$. Each metric focuses on the ability of an SL strategy to identify high-performance materials, rather than on the error associated with model predictions. We then demonstrate the use of these metrics with a simulated SL pipeline using a commonly available band gap database [34]. Next, we focus on the challenge of ML-driven design of thermoelectric (TE) materials. Fundamental TE properties (Seebeck coefficient, electrical conductivity, and thermal conductivity) were extracted from the Starrydata platform [38] and used to compute TE figures of merit ($ZT$ and $\sigma_{E_0}$) [38]. The same simulated SL pipeline was used for these new datasets, and the performance metrics were compared to identify the optimal design strategies for TE materials discovery from existing data. We then compare the SL performance metrics and traditional model error metrics to identify general trends across multiple materials domains and compare these results to prior work.
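To make the three definitions above concrete, the following is a minimal sketch (with hypothetical function and variable names, not taken from this work's implementation) of how $\mathrm{DAF}_n$, $DY(i)$, and $DP(i)$ can be computed from simulated discovery traces, assuming each trace records whether the material selected at each iteration lies in the target range:

```python
import numpy as np

def discovery_metrics(hits, n_targets_total, n=10):
    """Compute SL discovery metrics from binary hit traces.

    hits: (n_trials, n_iterations) boolean array; hits[t, i] is True if the
        material selected at iteration i of trial t lies in the target range.
    n_targets_total: total number of target-range materials in the design space.
    n: number of target-range materials used for the DAF_n calculation.
    """
    hits = np.asarray(hits, dtype=float)
    cumulative = hits.cumsum(axis=1)  # targets found up to each iteration

    # DAF_n: iterations needed to find n targets, averaged over trials.
    # argmax returns the first iteration at which cumulative >= n
    # (this sketch assumes every trial eventually finds n targets).
    daf_n = (np.argmax(cumulative >= n, axis=1) + 1).mean()

    # DY(i): targets found after i iterations, normalized by the total
    # number of target-range materials (here averaged over trials).
    dy = cumulative.mean(axis=0) / n_targets_total

    # DP(i): average number of targets found at iteration i, i.e. the
    # mean hit rate across trials at each iteration.
    dp = hits.mean(axis=0)

    return daf_n, dy, dp
```

Averaging over repeated trials, as done here, smooths out the run-to-run variability inherent in any single SL trajectory.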
2 Methods
A typical SL process for materials discovery begins with an initial training set of material compositions and their properties. An ML model is trained on the initial training set and used to make predictions on a set of compositions not in the training set (known as the candidate pool or design space). An acquisition function (detailed in Section 2.2) is used to select the optimal next experiment to perform from materials in the design space. Those experiments are then performed (either physically or by some physics-based simulation) and the results are added to the training dataset. The improved training set is then used in place of the initial training set, and the process is repeated until the design goals have been realized.
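A minimal Python sketch of this loop is given below. The model choice and acquisition score here are stand-ins chosen to keep the example short and self-contained (this work uses random forests paired with the acquisition functions detailed in Section 2.2); in the simulated setting, "performing an experiment" simply reveals the held-out property value of the chosen candidate:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def run_sl(X, y, init_idx, n_iter, kappa=1.0):
    """Simulated SL loop: train, predict with uncertainty, acquire, repeat.

    X, y: features and (held-out) property values for the full design space.
    init_idx: indices of the initial training set.
    kappa: exploration weight in an upper-confidence-bound-style acquisition.
    """
    train = list(init_idx)
    for _ in range(n_iter):
        pool = np.setdiff1d(np.arange(len(y)), train)
        model = GaussianProcessRegressor().fit(X[train], y[train])
        mean, std = model.predict(X[pool], return_std=True)
        # Acquisition: score each candidate, then "run" the best experiment
        # by revealing its label and adding it to the training set.
        train.append(int(pool[np.argmax(mean + kappa * std)]))
    return train
```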
To initialize simulated sequential learning, the SL pipeline must be supplied with a set of user-defined SL configuration parameters and a dataset that contains a set of inputs (typically chemical compositions) and a property to be optimized. For a given dataset, chemical compositions were featurized with the Magpie elemental feature set [20], implemented via the element-property featurizer in Matminer [39]. For all discovery tasks, random forests were employed with uncertainty estimates (obtained by calculating the standard deviation of predicted values across estimators) and paired with an acquisition function to enable the selection of a candidate from the design space, as detailed below.
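The featurization and uncertainty-estimation steps can be reproduced with standard Matminer and scikit-learn calls, as in the sketch below (the formulas and band gap values are toy examples for illustration only, not data from this work):

```python
import numpy as np
import pandas as pd
from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.conversions import StrToComposition
from sklearn.ensemble import RandomForestRegressor

# Toy dataset: formula strings and a property to optimize.
df = pd.DataFrame({"formula": ["SiO2", "GaAs", "Bi2Te3"],
                   "band_gap": [8.9, 1.4, 0.15]})

# Featurize compositions with the Magpie elemental feature set via Matminer.
df = StrToComposition().featurize_dataframe(df, "formula")
featurizer = ElementProperty.from_preset("magpie")
df = featurizer.featurize_dataframe(df, "composition")

X = df[featurizer.feature_labels()].values
y = df["band_gap"].values

# Random forest with uncertainty taken as the standard deviation of
# per-tree (per-estimator) predictions across the forest.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
per_tree = np.stack([tree.predict(X) for tree in rf.estimators_])
mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)
```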