ferentiate between polymorphs with the same chemical
composition. Message passing networks, on the other
hand, enabled unprecedented performance for the pre-
diction of properties with ab-initio accuracy [5, 14, 15]
from crystal structures.
Until recently, all message passing networks for crystals used atomic positions in some form as input. However, this information only becomes available after calculations, e.g. DFT structure optimizations, are performed. Consequently, such models are impractical for high-throughput searches. Recently, Schmidt et al. [16] and Goodall et al. [17] developed coarse-grained message passing networks that circumvent this problem, as they do not require atomic positions as input. In this work we apply the former approach to explore a significantly enlarged space of crystalline compounds.
Currently, the largest issue concerning the accuracy of message passing networks is no longer the topology or complexity of the networks, but rather the limitations of existing materials datasets. Reference 16 identified large biases stemming from the lack of structural and chemical diversity in the available data. These biases, ultimately of anthropogenic nature [18, 19], unfortunately lead to poor generalization. In fact, even if the error on test sets is of the order of 20–30 meV/atom, the actual error during high-throughput searches can easily be one order of magnitude larger if the available training data is not representative of the actual material space [16].
In this work we tackle this challenging problem us-
ing a stepwise approach. First, we perform a series of
high-throughput searches with an extended set of chem-
ical elements (including lanthanides and some actinide
elements), applying the transfer learning approach pre-
sented in Ref. [16]. Thanks to the additional data gen-
erated by these calculations, we expect to reduce the
bias due to the representation of the chemical elements
in the dataset. In a subsequent step, we retrain the
crystal-graph network and employ it to scan a mate-
rial space of almost 1 billion compounds that comprises
more than 2000 crystal-structure prototypes. We obtain
in this way a dataset of DFT calculations with a con-
siderably larger structural diversity, that we then use to
retrain a network. This crystal graph network is then
shown to possess a massively improved generalization er-
ror and a strongly reduced chemical and structural bias.
Finally, we offer a demonstration of the usefulness of our
approach, and inspect this dataset to search for materials
with extreme values of some interesting physical proper-
ties.
II. CONSTRUCTION OF DATASETS AND NETWORKS
A. Enlarging the chemical space
Our starting point is the dataset used by some of us
for training in Ref. 16. We will refer to this dataset as
“DCGAT-1” and to the crystal-graph network of Ref. 16
as “CGAT-1”, respectively.
As discussed previously, the training data in DCGAT-
1 is biased with respect to the distribution of chemi-
cal elements and crystal symmetries. To circumvent the
first problem we performed a series of high-throughput
calculations for specific structure prototypes. We used
a larger chemical space than previous works, consider-
ing 84 chemical elements, including all elements up to
Pu (with the exceptions of Po and At, for which we do not have pseudopotentials; Yb, whose pseudopotential exhibits numerical problems; and the rare gases). This results
in 6 972 possible permutations per binary, 571 704 permu-
tations per ternary, and 46 308 024 permutations per qua-
ternary system. For all these compositions we considered
a (largely arbitrary) selection of prototypes, including
ternary garnets, Ruddlesden–Popper layered perovskites,
cubic Laves phases, ternary and quaternary Heuslers, auricuprides, etc. In total we included 11 binary, 8 ternary, and 1 quaternary prototypes (a complete list and more details can be found in the Supporting Information).
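Assuming the chemical elements are assigned to the inequivalent sites of a prototype in an ordered fashion and without repetition, the composition counts quoted above follow directly from the number of k-permutations of 84 elements; a quick sketch (the function name is ours, purely illustrative):

```python
import math

# Number of ways to assign 84 distinct chemical elements to the
# 2, 3, or 4 inequivalent element sites of a prototype (ordered,
# without repetition): P(84, k) = 84! / (84 - k)!
ELEMENTS = 84

def n_permutations(sites: int) -> int:
    return math.perm(ELEMENTS, sites)

print(n_permutations(2))  # binaries:     6 972
print(n_permutations(3))  # ternaries:    571 704
print(n_permutations(4))  # quaternaries: 46 308 024
```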
For each structural prototype included in the selection, we performed a high-throughput study using the transfer learning approach proposed in Ref. 16: (i) The machine-learning model is used to predict the distance to the convex hull of stability for all possible chemical compositions; at the start we use the pre-trained CGAT-1 model. (ii) We perform DFT geometry optimizations to validate all compounds predicted to lie less than 200 meV/atom above the hull. (iii) We add these calculations to a dataset containing all DFT calculations for the corresponding structural prototype. (iv) We use transfer learning to train a new model on this dataset, with a training/validation/testing split of 80%/10%/10%. (v) The cycle is restarted one to three times, until the MAE of the model falls below 30 meV/atom.
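The steps above form an active-learning loop; a minimal sketch, with toy stand-ins for the network prediction, the DFT validation, and the retraining (all function names and thresholds in code are our illustrative labels, not the authors' implementation):

```python
import random

E_HULL_CUTOFF = 0.200   # eV/atom: validate compounds predicted below this
TARGET_MAE = 0.030      # eV/atom: stop once the model reaches this accuracy

def split_80_10_10(data, seed=0):
    """Shuffle and split into train/validation/test (80%/10%/10%)."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    return data[: int(0.8 * n)], data[int(0.8 * n): int(0.9 * n)], data[int(0.9 * n):]

def active_learning(candidates, predict, dft_validate, retrain, max_cycles=3):
    dataset = []
    mae = float("inf")
    for _ in range(max_cycles):
        # (i)-(ii): predict the distance to the hull, then validate the
        # promising candidates with DFT geometry optimizations
        selected = [c for c in candidates if predict(c) < E_HULL_CUTOFF]
        # (iii): grow the per-prototype dataset with the new DFT results
        dataset.extend((c, dft_validate(c)) for c in selected)
        # (iv): retrain via transfer learning on an 80/10/10 split
        train, val, test = split_80_10_10(dataset)
        predict, mae = retrain(train, val, test)
        # (v): restart the cycle until the MAE is good enough
        if mae < TARGET_MAE:
            break
    return dataset, mae
```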
This procedure resulted in 397 438 additional DFT cal-
culations, yielding 4382 compounds below the hull of
DCGAT-1 (and therefore already increasing the known
convex hull by approximately ten percent). Moreover, we
added a large dataset of mixed perovskites [16] plus data
concerning oxynitride, oxyfluoride and nitrofluoride per-
ovskites from Ref. 20, amounting to around 381 000 DFT
calculations. Finally, we recalculated and added 1343 compounds that were possibly unconverged outliers from AFLOW [1], according to the criteria of Ref. 21. The final dataset resulting from all these changes and additions contains ∼780 000 more compounds than DCGAT-1 and will be denoted as DCGAT-2.
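The quoted growth of the dataset is consistent with the individual contributions listed above; a quick consistency check (the perovskite figure is itself only approximate, "around 381 000"):

```python
# Contributions to DCGAT-2 beyond DCGAT-1, as quoted in the text
prototype_searches = 397_438    # additional high-throughput DFT calculations
perovskite_data = 381_000       # mixed and mixed-anion perovskite datasets
aflow_recalculations = 1_343    # recalculated AFLOW outliers

total = prototype_searches + perovskite_data + aflow_recalculations
print(total)  # 779 781, i.e. ~780 000 additional compounds
```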
In Fig. 1 we plot the element distribution in both datasets, DCGAT-1 and DCGAT-2. As expected, the original dataset is quite biased, with a drastic undersampling of most lanthanides and actinides. Despite its
smaller size, the new dataset includes between three and
twenty times more compounds containing undersampled
elements, and it therefore counteracts the unbalanced
distribution of chemical elements of DCGAT-1. Note
that, in particular, metallic elements appear in very sim-