QUANTIFYING THE PERFORMANCE OF MACHINE LEARNING
MODELS IN MATERIALS DISCOVERY
A PREPRINT
Christopher K. H. Borg1, Eric S. Muckley1, Clara Nyby1, James E. Saal1, Logan Ward2, Apurva Mehta3, and Bryce Meredig1
1Citrine Informatics, Redwood City, CA, United States
2Argonne National Laboratory, Lemont, IL, USA
3Stanford Synchrotron Radiation Lightsource, SLAC National Accelerator Laboratory, Menlo Park, California 94025, USA
*corresponding author(s): James E. Saal (jsaal@citrine.io)
October 26, 2022
ABSTRACT
The predictive capabilities of machine learning (ML) models used in materials discovery are typically measured using simple statistics such as the root-mean-square error (RMSE) or the coefficient of determination ($r^2$) between ML-predicted materials property values and their known values. A tempting assumption is that models with low error should be effective at guiding materials discovery, and conversely, models with high error should give poor discovery performance. However, we observe that no clear connection exists between a "static" quantity averaged across an entire training set, such as RMSE, and an ML property model's ability to dynamically guide the iterative (and often extrapolative) discovery of novel materials with targeted properties. In this work, we simulate a sequential learning (SL)-guided materials discovery process and demonstrate a decoupling between traditional model error metrics and model performance in guiding materials discoveries. We show that model performance in materials discovery depends strongly on (1) the target range within the property distribution (e.g., whether a 1st or 10th decile material is desired); (2) the incorporation of uncertainty estimates in the SL acquisition function; (3) whether the scientist is interested in one discovery or many targets; and (4) how many SL iterations are allowed. To overcome the limitations of static metrics and robustly capture SL performance, we recommend metrics such as Discovery Yield ($DY$), a measure of how many high-performing materials were discovered during SL, and Discovery Probability ($DP$), a measure of the likelihood of discovering high-performing materials at any point in the SL process.
1 Introduction
As machine learning (ML) tools become more accessible to materials researchers, utilizing ML models to design experiments is becoming commonplace. Many of the recent successes in applying ML for materials discovery have been captured in a review by Saal et al. [1]. One approach to materials discovery via ML is to train an ML model (or an ensemble of models) on a property of interest, make predictions on unknown materials, and conduct a validation test, usually by experiment [2, 3, 4, 5, 6, 7]. These methods rely on having large enough training sets that new predictions represent interpolations within explored regions. In contrast, when training data is scarce or extrapolation is necessary, a sequential learning (SL) approach, sometimes referred to as active learning, can be employed [8, 9, 10, 11, 12, 13]. Sequential learning involves training an initial ML model, selecting optimum candidates based on an acquisition function, verifying those predictions with simulations or experiments, and then updating the model with new data. This iterative sequential learning loop offers an efficient means of exploring large design spaces, reducing the number of experiments necessary to realize a performance goal [14, 15, 16, 17].
Materials discovery offers unique challenges as an ML application area. For example, materials of interest often exhibit properties with extreme values, requiring ML models to extrapolate to new regions of property space. This challenge, and methods to address it, have been discussed previously [18, 19]. Another challenge is representing a material suitably for input to an ML algorithm, either by incorporating domain knowledge or by learning representations from data. Chemical composition-based features [20, 21] have become widely used in materials discovery, but it is likely that further headroom exists for optimization of materials representations. Finally, many materials informatics applications suffer from a lack of data. While there have been many large-scale data collection efforts [22, 23, 24], researchers often extract data by hand to build informative training sets [25, 26, 27, 28, 29, 30, 31], which is a highly time-consuming process. These unique challenges motivate the need for greater insight into the potential for success in a given SL-driven materials discovery effort.
Typically, the performance of ML is measured by the improvement in predictive capability of a model, using accuracy metrics such as root-mean-square error (RMSE) and the coefficient of determination ($r^2$). While these metrics provide robust estimates of predictive capability against a defined test set, their connection to the ability of SL to identify new, high-performing materials is unclear. In recent studies, the number of experiments necessary to identify a high-performing material has been used as a metric for monitoring SL performance [8, 16, 32]. Benchmark datasets and modeling tools, such as Olympus [33] and MatBench [34], have started to standardize the assessment of model and dataset performance. Notably, a recent study by Rohr et al. [35] considers additional metrics that quantify SL performance relative to a benchmark case (typically random search). Rohr et al. focus their study on identifying top 1% materials from high-throughput electrochemical data, and subsequent research expands on this work to compare performance across a variety of SL frameworks and datasets [36, 37]. Here, we build upon these works by investigating SL performance metrics for different design problems and specific targets within those design problems. We compare our approach to Rohr et al. in more detail in Section 2.3.
In this work, we explore in more detail the topic of SL performance metrics, generalizing conclusions across multiple datasets and design target ranges. In particular, we identify a decoupling between traditional model error metrics and a model's ability to select high-performance materials. We examine three SL performance metrics: Discovery Acceleration Factor ($DAF_n$), the average number of SL iterations required to identify $n$ materials in a target range; Discovery Yield ($DY(i)$), the number of materials in a target range identified after $i$ SL iterations (normalized by the total number of materials in the target range); and Discovery Probability ($DP(i)$), the average number of targets found at a given SL iteration $i$. Each metric is focused on the ability of an SL strategy to identify high-performance materials, rather than the error associated with model predictions. We then demonstrate the use of these metrics with a simulated SL pipeline using a commonly available band gap database [34]. Next, we focus on the challenge of ML-driven design of thermoelectric (TE) materials. Fundamental TE properties (Seebeck coefficient, electrical conductivity, and thermal conductivity) were extracted from the Starrydata platform [38] and used to compute TE figures of merit ($ZT$ and $\sigma_{E_0}$) [38]. The same simulated SL pipeline was used for these new datasets, and performance metrics were compared to identify the optimal design strategies for TE materials discovery from existing data. We then compare the SL performance metrics and traditional model error metrics to identify general trends across multiple materials domains and compare these results to prior work.
2 Methods
A typical SL process for materials discovery begins with an initial training set of material compositions and their
properties. An ML model is trained on the initial training set and used to make predictions on a set of compositions not
in the training set (known as the candidate pool or design space). An acquisition function (detailed in Section 2.2) is
used to select the optimum next experiment to perform from materials in the design space. Those experiments are then
performed (either physically or by some physics-based simulation) and the results are added to the training dataset. The
improved training set is then used in place of the initial training set and the process is repeated until the design goals
have been realized.
To initialize simulated sequential learning, the SL pipeline must be supplied with a set of user-defined SL configuration parameters and a dataset that contains a set of inputs (typically chemical composition) and a property to be optimized. For a given dataset, chemical compositions were featurized with the Magpie elemental feature set [20], implemented via the element-property featurizer in Matminer [39]. For all discovery tasks, random forests were employed with uncertainty estimates (estimated by calculating the standard deviation in predicted values across estimators) and paired with an acquisition function to enable the selection of a candidate from the design space, as detailed below.
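As a concrete illustration, a minimal sketch of this featurization and uncertainty-aware model setup is shown below. It assumes matminer, pymatgen, scikit-learn, and numpy are available; the helper names (`featurize`, `predict_with_uncertainty`) are ours, not those of the actual pipeline.

```python
import numpy as np
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty
from sklearn.ensemble import RandomForestRegressor

# Magpie elemental feature set via Matminer's element-property featurizer
magpie = ElementProperty.from_preset("magpie")

def featurize(formulas):
    """Convert chemical formulas into Magpie composition-based feature vectors."""
    return np.array([magpie.featurize(Composition(f)) for f in formulas])

# Random forest whose spread across trees serves as an uncertainty estimate
model = RandomForestRegressor(n_estimators=100, random_state=0)

def predict_with_uncertainty(model, X):
    """Mean prediction and standard deviation across the forest's estimators."""
    per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)

# Toy usage:
# model.fit(featurize(["Fe2O3", "TiO2", "ZnO"]), [2.2, 3.0, 3.3])
# mean, sigma = predict_with_uncertainty(model, featurize(["GaN"]))
```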
2.1 Simulated sequential learning pipeline
In this work, we developed a standard processing pipeline to simulate sequential learning; the workflow is summarized in Figure 1. First, 10% of the dataset ($n_{test}$) is randomly sampled and held out of the SL process entirely. This "holdout set" is used to calculate the RMSE of the model trained on the current training set at each iteration in the SL process, ensuring that the model is tested on the same test set at every iteration. A target range is then selected comprising one of the 10 deciles of the remaining dataset (e.g., "1st decile" indicates that the SL design goal is to find materials between the 0th and 10th percentiles of the entire dataset). Then, an initial training set of size $n_0$ is sampled such that it does not contain any material in the target range. For example, when the target range was defined as the 10th decile, compounds with the highest 10% of values were excluded from the initial training set, ensuring that they remain in the candidate pool. This was done to simulate a real-world materials discovery problem, as a typical goal is to find materials with performance superior to what is currently known. Finally, sequential learning is performed for $n_{iter}$ iterations using the acquisition functions (detailed in Section 2.2) to find materials as close as possible to the mean of the target range. At each iteration, $n_{batch}$ compound(s) are added to the training set, SL metrics are calculated (defined in Section 2.3), and the entire process is repeated $n_{trials}$ times to determine the trial-to-trial stochastic uncertainty of the SL pipeline.
In this work, the following sequential learning configuration was used: $n_{test}$ = 10% of the dataset, $n_0$ = 50, $n_{iter}$ = 100, $n_{batch}$ = 1, $n_{trials}$ = 100. This configuration was thought to be well-aligned with traditional materials discovery problems. For example, initial training sets were limited to 50 points, as it is expected that many experimental studies wishing to employ ML would have at least this number of datapoints. In our SL workflow we set $n_{batch}$ = 1, as experiments are often performed one at a time; however, this parameter is adjustable to values greater than 1. By design, the first step in the SL process is the selection of the training set (as opposed to selecting training set points and then conducting a round of SL); therefore, an SL workflow where $n_{iter}$ = 100 has a single step to select training set points followed by 99 rounds of SL.
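A minimal sketch of one trial of this simulated SL loop, under the configuration above, might look as follows. It reuses the hypothetical `featurize`, `model`, and `predict_with_uncertainty` helpers from the earlier snippet and an `acquire` function sketched in Section 2.2; the structure is illustrative, not the exact pipeline code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sl(X, y, n0=50, n_iter=100, decile=(0.9, 1.0)):
    """One SL trial; returns per-iteration target hits and holdout RMSEs."""
    # Hold out 10% of the data (n_test) as a fixed test set for RMSE tracking
    idx = rng.permutation(len(y))
    test, rest = idx[: len(y) // 10], idx[len(y) // 10 :]

    # Target range: one decile of the remaining dataset
    lo, hi = np.quantile(y[rest], decile)
    is_target = (y >= lo) & (y <= hi)

    # Initial training set (n0 points) excludes all target-range materials
    train = list(rng.choice([i for i in rest if not is_target[i]],
                            size=n0, replace=False))
    pool = [i for i in rest if i not in train]

    hits, rmses = [], []
    # n_iter counts training-set selection as step 1, leaving 99 SL rounds
    for _ in range(n_iter - 1):
        model.fit(X[train], y[train])
        rmses.append(np.sqrt(np.mean((model.predict(X[test]) - y[test]) ** 2)))
        pick = acquire(model, X, pool, window=(lo, hi))  # see Section 2.2
        hits.append(bool(is_target[pick]))  # record feeds metrics in Sec. 2.3
        train.append(pick)                  # "perform the experiment"
        pool.remove(pick)                   # n_batch = 1
    return np.array(hits), np.array(rmses)
```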
2.2 Acquisition functions
Four acquisition functions are considered in this work: expected value (EV), which selects candidates whose predicted property value lies closest to the mean of the target window; expected improvement (EI), which selects candidates that are likeliest to fall within the target window based on predicted property values and the uncertainties in those predictions; maximum uncertainty (MU), which selects the candidate with the largest prediction uncertainty; and random search (RS) [40], where a candidate is selected at random. EV is an exploitative acquisition function that tends to locally improve upon known good materials. MU, in contrast, is meant to be more exploratory, focusing on model improvement by selecting candidates with the most uncertain predictions. EI is a hybrid approach that attempts to improve materials performance while also taking uncertainty into account. RS is included here for comparison, with the intuition that a directed acquisition function should outperform random selection in SL. The comparison of these functions seeks to demonstrate the tradeoffs made when considering the uncertainty associated with an ML prediction and balancing exploration against exploitation.
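One way these four selection rules might be scored is sketched below; note that the EI implementation here, which takes the probability mass a Gaussian predictive distribution places inside the target window, is our reading of the description above rather than the authors' exact formulation.

```python
import numpy as np
from scipy.stats import norm

def acquire(model, X, pool, window, strategy="EI"):
    """Return the candidate index in `pool` chosen by the given strategy."""
    if strategy == "RS":                    # random search baseline
        return pool[np.random.default_rng().integers(len(pool))]

    lo, hi = window
    mean, sigma = predict_with_uncertainty(model, X[pool])
    if strategy == "EV":                    # exploit: closest to window mean
        score = -np.abs(mean - (lo + hi) / 2)
    elif strategy == "MU":                  # explore: largest uncertainty
        score = sigma
    else:                                   # "EI": P(prediction in window)
        sigma = np.clip(sigma, 1e-9, None)  # guard against zero spread
        score = norm.cdf(hi, mean, sigma) - norm.cdf(lo, mean, sigma)
    return pool[int(np.argmax(score))]
```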
2.3 SL figures of merit
To measure the performance of the simulated SL efforts, we used the following metrics (a sketch of how they might be computed follows the list):
1. Discovery acceleration factor ($DAF_n$): The average number of SL iterations required to identify $n$ compounds in the target range, relative to random search. For example, if RS takes 10 iterations to identify a target compound and EV takes 5 iterations, then $DAF_1$(EV) = 2, indicating that EV identifies a target compound twice as fast as RS. The numbers of iterations for RS for $DAF_1$, $DAF_3$, and $DAF_5$ were estimated to be 10, 30, and 50, respectively, as each target range comprises a decile of the dataset. Rohr et al. [35] proposed "acceleration factors" to refer to the number of iterations needed to achieve a particular SL performance goal (e.g., a desired DY or DP value), normalized by the number of iterations needed for RS to accomplish the same goal. Acceleration Factor (AF) is also more broadly defined as the "reduction of required budget (e.g. in time, iterations, or some other consumed resource) between an agent (model + acquisition function) and a benchmark case (e.g. random selection) to reach a particular fraction of ideal candidates" [36]. In our case, $DAF_n$ is specific to the number of iterations to find $n$ target materials relative to random search.

The equation for $DAF_n$ is given by:

$$DAF_n = \frac{i_n(\mathrm{random})}{i_n(EV, EI, MU)} \quad (1)$$

where $i_n(\mathrm{random})$ = the number of SL iterations it takes to find $n$ target compounds via random search, and $i_n(EV, EI, MU)$ = the number of SL iterations it takes to find $n$ target compounds via EV, EI, or MU.
Figure 1: The sequential learning workflow used to calculate parameters of interest. A holdout dataset ($n_{test}$) is defined prior to initializing the SL process. This holdout set (denoted in yellow) is used to calculate RMSE against predicted values from a model trained on the updated training set (denoted in purple) at each SL iteration. The actual training set is denoted in green and untested candidates (i.e., the candidate pool) are denoted in orange.
2. Discovery yield ($DY(i)$): The number of compounds in the target range identified after a set number of iterations, divided by the total number of compounds in the target range. For example, the band gap 10th decile range comprises 193 compounds (after removing the holdout set). On average, after 20 iterations, $DY_{i=20}$(EI) = 0.07 ± 0.02, indicating that 7% of targets were discovered. This is meant to represent, for a given number of experiments, how many high-performing compounds a researcher could expect to find. Rohr et al. [35] proposed ${}^{all}ALM_i$ as the "fraction of the top percentile catalysts that have been measured by cycle $i$". We interpret ${}^{all}ALM_i$ to be an equivalent figure of merit to $DY(i)$, but applied to identifying top 1% materials (rather than a decile window).

The equation for $DY(i)$ is given by:

$$DY(i) = \frac{1}{t_{total}} t_i \quad (2)$$

where $t_i$ = the number of targets found by iteration $i$, and $t_{total}$ = the total number of targets in the dataset.
3. Discovery probability ($DP(i)$): The likelihood of finding a target at a given SL iteration. For example, as shown in Figure 4, after one iteration of using EI to identify 10th decile band gap materials, $DP(1)$(EI) = 0.4 (i.e., 40 out of 100 trials identified a target compound after 1 SL iteration). In contrast, after 99 iterations, $DP(99)$(EI) = 0.6 (i.e., 60 out of 100 trials identified a target compound after 99 SL iterations). This is meant to estimate the likelihood of identifying a target compound at every point in the SL process, independent of previous SL cycles. In contrast to $DY(i)$, $DP(i)$ may increase or decrease with increasing SL iterations.

The equation for $DP(i)$ is given by:

$$DP(i) = \frac{1}{n_{trials}} \sum_{n=1}^{n_{trials}} TF_i, \quad TF_i \in \{0, 1\} \quad (3)$$

where $TF_i$ = a target-found boolean (1 if a target was found at iteration $i$, 0 if not) and $n_{trials}$ = the number of trials.

Consequently, $DP(i)$ is also the derivative of $DY(i)$ with respect to iterations when correcting for the total number of targets in the dataset:

$$DP(i) = t_{total} \frac{dDY(i)}{di} \quad (4)$$

where $t_{total}$ = the total number of targets in the dataset.
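As referenced before the list, the following sketch shows one way these three figures of merit might be computed from the simulated trials. Here `hits` stacks the boolean per-iteration records returned by the hypothetical `simulate_sl` across trials, and all names are illustrative rather than the authors' code.

```python
import numpy as np

def sl_metrics(hits, t_total, rs_iters={1: 10, 3: 30, 5: 50}):
    """DAF_n, DY(i), and DP(i) from an (n_trials, n_iters) boolean hit array."""
    cum = np.cumsum(hits, axis=1)    # targets found by iteration i, per trial

    dp = hits.mean(axis=0)           # Eq. 3: fraction of trials with a hit at i
    dy = cum.mean(axis=0) / t_total  # Eq. 2: mean cumulative discovery yield

    daf = {}
    for n, i_rs in rs_iters.items():  # Eq. 1, with the estimated RS baselines
        reached = cum >= n
        # First (1-indexed) iteration at which each trial has found n targets
        i_n = np.where(reached.any(axis=1), reached.argmax(axis=1) + 1, np.nan)
        daf[n] = i_rs / np.nanmean(i_n)
    return daf, dy, dp

# Toy usage:
# hits = np.stack([simulate_sl(X, y)[0] for _ in range(100)])  # n_trials = 100
# daf, dy, dp = sl_metrics(hits, t_total=193)
```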