QUANTIFYING THE PERFORMANCE OF MACHINE LEARNING
MODELS IN MATERIALS DISCOVERY
A PREPRINT
Christopher K. H. Borg1, Eric S. Muckley1, Clara Nyby1, James E. Saal1, Logan Ward2, Apurva Mehta3, and Bryce Meredig1
1Citrine Informatics, Redwood City, CA, United States
2Argonne National Laboratory, Lemont, IL, USA
3Stanford Synchrotron Radiation Lightsource, SLAC National Accelerator Laboratory, Menlo Park, California 94025, USA
*corresponding author(s): James E. Saal (jsaal@citrine.io)
October 26, 2022
ABSTRACT
The predictive capabilities of machine learning (ML) models used in materials discovery are typically measured using simple statistics such as the root-mean-square error (RMSE) or the coefficient of determination ($r^2$) between ML-predicted materials property values and their known values. A tempting assumption is that models with low error should be effective at guiding materials discovery, and conversely, models with high error should give poor discovery performance. However, we observe that no clear connection exists between a "static" quantity averaged across an entire training set, such as RMSE, and an ML property model's ability to dynamically guide the iterative (and often extrapolative) discovery of novel materials with targeted properties. In this work, we simulate a sequential learning (SL)-guided materials discovery process and demonstrate a decoupling between traditional model error metrics and model performance in guiding materials discoveries. We show that model performance in materials discovery depends strongly on (1) the target range within the property distribution (e.g., whether a 1st or 10th decile material is desired); (2) the incorporation of uncertainty estimates in the SL acquisition function; (3) whether the scientist is interested in one discovery or many targets; and (4) how many SL iterations are allowed. To overcome the limitations of static metrics and robustly capture SL performance, we recommend metrics such as Discovery Yield ($DY$), a measure of how many high-performing materials were discovered during SL, and Discovery Probability ($DP$), a measure of the likelihood of discovering high-performing materials at any point in the SL process.
1 Introduction
As machine learning (ML) tools become more accessible to materials researchers, utilizing ML models to design experiments is becoming commonplace. Many of the recent successes in applying ML for materials discovery have been captured in a review by Saal et al. [1]. One approach to materials discovery via ML is to train an ML model (or an ensemble of models) on a property of interest, make predictions on unknown materials, and conduct a validation test, usually by experiment [2, 3, 4, 5, 6, 7]. These methods rely on having large enough training sets that new predictions represent interpolations within explored regions. In contrast, when training data is scarce or extrapolation is necessary, a sequential learning (SL) approach, sometimes referred to as active learning, can be employed [8, 9, 10, 11, 12, 13]. Sequential learning involves training an initial ML model, selecting optimum candidates based on an acquisition function, verifying those predictions with simulations or experiments, and then updating the model with new data. This iterative sequential learning loop offers an efficient means of exploring large design spaces, reducing the number of experiments necessary to realize a performance goal [14, 15, 16, 17].
Materials discovery offers unique challenges as an ML application area. For example, materials of interest often exhibit properties with extreme values, requiring ML models to extrapolate to new regions of property space. This challenge, and methods to address it, have been discussed previously [18, 19]. Another challenge is representing a material suitably for input to an ML algorithm, either by incorporating domain knowledge or by learning representations from data. Chemical composition-based features [20, 21] have become widely used in materials discovery, but it is likely that further headroom exists for optimization of materials representations. Finally, many materials informatics applications suffer from a lack of data. While there have been many large-scale data collection efforts [22, 23, 24], researchers often extract data by hand to build informative training sets [25, 26, 27, 28, 29, 30, 31], which is a highly time-consuming process. These unique challenges motivate the need for greater insight into the potential for success in a given SL-driven materials discovery effort.
Typically, the performance of ML is measured by the improvement in predictive capability of a model, using accuracy metrics such as root-mean-square error (RMSE) and the coefficient of determination ($r^2$). While these metrics provide robust estimates of predictive capability against a defined test set, their connection to the ability of SL to identify new, high-performing materials is unclear. In recent studies, the number of experiments necessary to identify a high-performing material has been used as a metric for monitoring SL performance [8, 16, 32]. Benchmark datasets and modeling tools, such as Olympus [33] and MatBench [34], have started to standardize the assessment of model and dataset performance. Notably, a recent study by Rohr et al. [35] considers additional metrics that quantify SL performance relative to a benchmark case (typically random search). Rohr et al. focus their study on identifying top 1% materials from high-throughput electrochemical data, and subsequent research expands on this work to compare performance across a variety of SL frameworks and datasets [36, 37]. Here, we build upon these works by investigating SL performance metrics for different design problems and specific targets within those design problems. We compare our approach to Rohr et al. in more detail in Section 2.3.
In this work, we explore in more detail the topic of SL performance metrics, generalizing conclusions across multiple datasets and design target ranges. In particular, we identify a decoupling between traditional model error metrics and a model's ability to select high-performance materials. We examine three SL performance metrics: Discovery Acceleration Factor ($DAF_n$), the average number of SL iterations required to identify $n$ materials in a target range; Discovery Yield ($DY(i)$), the number of materials in a target range identified after $i$ SL iterations (normalized by the total number of materials in the target range); and Discovery Probability ($DP(i)$), the average number of targets found at a given SL iteration $i$. Each metric is focused on the ability of an SL strategy to identify high-performance materials, rather than the error associated with model predictions. We then demonstrate the use of these metrics with a simulated SL pipeline using a commonly available band gap database [34]. Next, we focus on the challenge of ML-driven design of thermoelectric (TE) materials. Fundamental TE properties (Seebeck coefficient, electrical conductivity, and thermal conductivity) were extracted from the Starrydata platform [38] and used to compute TE figures of merit ($ZT$ and $\sigma_{E_0}$) [38]. The same simulated SL pipeline was used for these new datasets, and performance metrics were compared to identify the optimal design strategies for TE materials discovery from existing data. We then compare the SL performance metrics and traditional model error metrics to identify general trends across multiple materials domains and compare these results to prior work.
2 Methods
A typical SL process for materials discovery begins with an initial training set of material compositions and their
properties. An ML model is trained on the initial training set and used to make predictions on a set of compositions not
in the training set (known as the candidate pool or design space). An acquisition function (detailed in Section 2.2) is
used to select the optimum next experiment to perform from materials in the design space. Those experiments are then
performed (either physically or by some physics-based simulation) and the results are added to the training dataset. The
improved training set is then used in place of the initial training set and the process is repeated until the design goals
have been realized.
To initialize simulated sequential learning, the SL pipeline must be supplied with a set of user-defined SL configuration parameters and a dataset that contains a set of inputs (typically chemical composition) and a property to be optimized. For a given dataset, chemical compositions were featurized with the Magpie elemental feature set [20], implemented via the element-property featurizer in Matminer [39]. For all discovery tasks, random forests were employed with uncertainty estimates (estimated by calculating the standard deviation in predicted values across estimators) and paired with an acquisition function to enable the selection of a candidate from the design space, as detailed below.
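As a concrete illustration, a minimal sketch of this featurization and uncertainty-aware model setup is shown below. It assumes matminer, pymatgen, scikit-learn, and numpy are available; the helper names (`featurize`, `predict_with_uncertainty`) are ours, not those of the actual pipeline.

```python
import numpy as np
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty
from sklearn.ensemble import RandomForestRegressor

# Magpie elemental feature set via Matminer's element-property featurizer
magpie = ElementProperty.from_preset("magpie")

def featurize(formulas):
    """Convert chemical formulas into Magpie composition-based feature vectors."""
    return np.array([magpie.featurize(Composition(f)) for f in formulas])

# Random forest whose spread across trees serves as an uncertainty estimate
model = RandomForestRegressor(n_estimators=100, random_state=0)

def predict_with_uncertainty(model, X):
    """Mean prediction and standard deviation across the forest's estimators."""
    per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)

# Toy usage:
# model.fit(featurize(["Fe2O3", "TiO2", "ZnO"]), [2.2, 3.0, 3.3])
# mean, sigma = predict_with_uncertainty(model, featurize(["GaN"]))
```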
2.1 Simulated sequential learning pipeline
In this work, we developed a standard processing pipeline to simulate sequential learning; the workflow is summarized in Figure 1. First, 10% of the dataset ($n_{test}$) is randomly sampled and held out of the SL process entirely. This "holdout set" is used to calculate the RMSE of the model trained on the current training set at each iteration in the SL process, ensuring that the model is tested on the same test set at every iteration. A target range is then selected comprising one of the 10 deciles of the remaining dataset (e.g., "1st decile" indicates that the SL design goal is to find materials between the 0th and 10th percentiles of the entire dataset). Then, an initial training set of size $n_0$ is sampled such that it does not contain any material in the target range. For example, when the target range was defined as the 10th decile, compounds with the highest 10% of values were excluded from the initial training set, ensuring that they remain in the candidate pool. This was done to simulate a real-world materials discovery problem, as a typical goal is to find materials with performance superior to what is currently known. Finally, sequential learning is performed for $n_{iter}$ iterations using the acquisition functions (detailed in Section 2.2) to find materials as close as possible to the mean of the target range. At each iteration, $n_{batch}$ compound(s) are added to the training set, SL metrics are calculated (defined in Section 2.3), and the entire process is repeated $n_{trials}$ times to determine the trial-to-trial stochastic uncertainty of the SL pipeline.
In this work, the following sequential learning configuration was used: $n_{test}$ = 10% of the dataset, $n_0$ = 50, $n_{iter}$ = 100, $n_{batch}$ = 1, $n_{trials}$ = 100. This configuration was thought to be well-aligned with traditional materials discovery problems. For example, initial training sets were limited to 50 points, as it is expected that many experimental studies wishing to employ ML would have at least this number of datapoints. In our SL workflow we set $n_{batch}$ = 1, as experiments are often performed one at a time; however, this parameter is adjustable to values greater than 1. By design, the first step in the SL process is the selection of the training set (as opposed to selecting training set points and then conducting a round of SL); therefore, an SL workflow where $n_{iter}$ = 100 has a single step to select training set points followed by 99 rounds of SL.
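A minimal sketch of one trial of this simulated SL loop, under the configuration above, might look as follows. It reuses the hypothetical `featurize`, `model`, and `predict_with_uncertainty` helpers from the earlier snippet and an `acquire` function sketched in Section 2.2; the structure is illustrative, not the exact pipeline code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sl(X, y, n0=50, n_iter=100, decile=(0.9, 1.0)):
    """One SL trial; returns per-iteration target hits and holdout RMSEs."""
    # Hold out 10% of the data (n_test) as a fixed test set for RMSE tracking
    idx = rng.permutation(len(y))
    test, rest = idx[: len(y) // 10], idx[len(y) // 10 :]

    # Target range: one decile of the remaining dataset
    lo, hi = np.quantile(y[rest], decile)
    is_target = (y >= lo) & (y <= hi)

    # Initial training set (n0 points) excludes all target-range materials
    train = list(rng.choice([i for i in rest if not is_target[i]],
                            size=n0, replace=False))
    pool = [i for i in rest if i not in train]

    hits, rmses = [], []
    # n_iter counts training-set selection as step 1, leaving 99 SL rounds
    for _ in range(n_iter - 1):
        model.fit(X[train], y[train])
        rmses.append(np.sqrt(np.mean((model.predict(X[test]) - y[test]) ** 2)))
        pick = acquire(model, X, pool, window=(lo, hi))  # see Section 2.2
        hits.append(bool(is_target[pick]))  # record feeds metrics in Sec. 2.3
        train.append(pick)                  # "perform the experiment"
        pool.remove(pick)                   # n_batch = 1
    return np.array(hits), np.array(rmses)
```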
2.2 Acquisition functions
Four acquisition functions are considered in this work: expected value (EV), which selects candidates whose predicted property value lies closest to the mean of the target window; expected improvement (EI), which selects candidates that are likeliest to fall within the target window based on predicted property values and the uncertainties in those predictions; maximum uncertainty (MU), which selects the candidate with the largest prediction uncertainty; and random search (RS) [40], where a candidate is selected at random. EV is an exploitative acquisition function that tends to locally improve upon known good materials. MU, in contrast, is meant to be more exploratory, focusing on model improvement by selecting candidates with the most uncertain predictions. EI is a hybrid approach that attempts to improve materials performance while also taking uncertainty into account. RS is included here for comparison, with the intuition that a directed acquisition function should outperform random selection in SL. The comparison of these functions seeks to demonstrate the tradeoffs made when considering the uncertainty associated with an ML prediction and balancing exploration against exploitation.
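One way these four selection rules might be scored is sketched below; note that the EI implementation here, which takes the probability mass a Gaussian predictive distribution places inside the target window, is our reading of the description above rather than the authors' exact formulation.

```python
import numpy as np
from scipy.stats import norm

def acquire(model, X, pool, window, strategy="EI"):
    """Return the candidate index in `pool` chosen by the given strategy."""
    if strategy == "RS":                    # random search baseline
        return pool[np.random.default_rng().integers(len(pool))]

    lo, hi = window
    mean, sigma = predict_with_uncertainty(model, X[pool])
    if strategy == "EV":                    # exploit: closest to window mean
        score = -np.abs(mean - (lo + hi) / 2)
    elif strategy == "MU":                  # explore: largest uncertainty
        score = sigma
    else:                                   # "EI": P(prediction in window)
        sigma = np.clip(sigma, 1e-9, None)  # guard against zero spread
        score = norm.cdf(hi, mean, sigma) - norm.cdf(lo, mean, sigma)
    return pool[int(np.argmax(score))]
```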
2.3 SL figures of merit
To measure the performance of the simulated SL efforts, we used the following metrics (a sketch of how they might be computed follows the list):
1. Discovery acceleration factor ($DAF_n$): The average number of SL iterations required to identify $n$ compounds in the target range, relative to random search. For example, if RS takes 10 iterations to identify a target compound and EV takes 5 iterations, then $DAF_1$(EV) = 2, indicating that EV identifies a target compound twice as fast as RS. The numbers of iterations for RS for $DAF_1$, $DAF_3$, and $DAF_5$ were estimated to be 10, 30, and 50, respectively, as each target range comprises a decile of the dataset. Rohr et al. [35] proposed "acceleration factors" to refer to the number of iterations needed to achieve a particular SL performance goal (e.g., a desired DY or DP value), normalized by the number of iterations needed for RS to accomplish the same goal. Acceleration Factor (AF) is also more broadly defined as the "reduction of required budget (e.g. in time, iterations, or some other consumed resource) between an agent (model + acquisition function) and a benchmark case (e.g. random selection) to reach a particular fraction of ideal candidates" [36]. In our case, $DAF_n$ is specific to the number of iterations to find $n$ target materials relative to random search.

The equation for $DAF_n$ is given by:

$$DAF_n = \frac{i_n(\mathrm{random})}{i_n(EV, EI, MU)} \quad (1)$$

where $i_n(\mathrm{random})$ = the number of SL iterations it takes to find $n$ target compounds via random search, and $i_n(EV, EI, MU)$ = the number of SL iterations it takes to find $n$ target compounds via EV, EI, or MU.
Figure 1: The sequential learning workflow used to calculate parameters of interest. A holdout dataset ($n_{test}$) is defined prior to initializing the SL process. This holdout set (denoted in yellow) is used to calculate RMSE against predicted values from a model trained on the updated training set (denoted in purple) at each SL iteration. The actual training set is denoted in green and untested candidates (i.e., the candidate pool) are denoted in orange.
2. Discovery yield ($DY(i)$): The number of compounds in the target range identified after a set number of iterations, divided by the total number of compounds in the target range. For example, the band gap 10th decile range comprises 193 compounds (after removing the holdout set). On average, after 20 iterations, $DY_{i=20}$(EI) = 0.07 ± 0.02, indicating that 7% of targets were discovered. This is meant to represent, for a given number of experiments, how many high-performing compounds a researcher could expect to find. Rohr et al. [35] proposed ${}^{all}ALM_i$ as the "fraction of the top percentile catalysts that have been measured by cycle $i$". We interpret ${}^{all}ALM_i$ to be an equivalent figure of merit to $DY(i)$, but applied to identifying top 1% materials (rather than a decile window).

The equation for $DY(i)$ is given by:

$$DY(i) = \frac{1}{t_{total}} t_i \quad (2)$$

where $t_i$ = the number of targets found by iteration $i$, and $t_{total}$ = the total number of targets in the dataset.
3. Discovery probability ($DP(i)$): The likelihood of finding a target at a given SL iteration. For example, as shown in Figure 4, after one iteration of using EI to identify 10th decile band gap materials, $DP(1)$(EI) = 0.4 (i.e., 40 out of 100 trials identified a target compound after 1 SL iteration). In contrast, after 99 iterations, $DP(99)$(EI) = 0.6 (i.e., 60 out of 100 trials identified a target compound after 99 SL iterations). This is meant to estimate the likelihood of identifying a target compound at every point in the SL process, independent of previous SL cycles. In contrast to $DY(i)$, $DP(i)$ may increase or decrease with increasing SL iterations.

The equation for $DP(i)$ is given by:

$$DP(i) = \frac{1}{n_{trials}} \sum_{n=1}^{n_{trials}} TF_i, \quad TF_i \in \{0, 1\} \quad (3)$$

where $TF_i$ = a target-found boolean (1 if a target was found at iteration $i$, 0 if not) and $n_{trials}$ = the number of trials.

Consequently, $DP(i)$ is also the derivative of $DY(i)$ with respect to iterations when correcting for the total number of targets in the dataset:

$$DP(i) = t_{total} \frac{dDY(i)}{di} \quad (4)$$

where $t_{total}$ = the total number of targets in the dataset.
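As referenced before the list, the following sketch shows one way these three figures of merit might be computed from the simulated trials. Here `hits` stacks the boolean per-iteration records returned by the hypothetical `simulate_sl` across trials, and all names are illustrative rather than the authors' code.

```python
import numpy as np

def sl_metrics(hits, t_total, rs_iters={1: 10, 3: 30, 5: 50}):
    """DAF_n, DY(i), and DP(i) from an (n_trials, n_iters) boolean hit array."""
    cum = np.cumsum(hits, axis=1)    # targets found by iteration i, per trial

    dp = hits.mean(axis=0)           # Eq. 3: fraction of trials with a hit at i
    dy = cum.mean(axis=0) / t_total  # Eq. 2: mean cumulative discovery yield

    daf = {}
    for n, i_rs in rs_iters.items():  # Eq. 1, with the estimated RS baselines
        reached = cum >= n
        # First (1-indexed) iteration at which each trial has found n targets
        i_n = np.where(reached.any(axis=1), reached.argmax(axis=1) + 1, np.nan)
        daf[n] = i_rs / np.nanmean(i_n)
    return daf, dy, dp

# Toy usage:
# hits = np.stack([simulate_sl(X, y)[0] for _ in range(100)])  # n_trials = 100
# daf, dy, dp = sl_metrics(hits, t_total=193)
```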