Large-scale machine-learning-assisted exploration of the whole materials space
Jonathan Schmidt,1 Noah Hoffmann,1 Hai-Chen Wang,1 Pedro Borlido,2 Pedro J. M.
A. Carriço,2 Tiago F. T. Cerqueira,2 Silvana Botti,3,* and Miguel A. L. Marques1,†
1Institut für Physik, Martin-Luther-Universität Halle-Wittenberg, D-06099 Halle, Germany
2CFisUC, Department of Physics, University of Coimbra, Rua Larga, 3004-516 Coimbra, Portugal
3Institut für Festkörpertheorie und -optik, Friedrich-Schiller-Universität Jena, Max-Wien-Platz 1, 07743 Jena, Germany
(Dated: October 4, 2022)
arXiv:2210.00579v1 [cond-mat.mtrl-sci]
* silvana.botti@uni-jena.de
† miguel.marques@physik.uni-halle.de
Crystal-graph attention networks have emerged recently as remarkable tools for the prediction
of thermodynamic stability and materials properties from unrelaxed crystal structures. Previous
networks trained on two million materials exhibited, however, strong biases originating from under-
represented chemical elements and structural prototypes in the available data. We tackled this issue
by computing additional data to provide better balance across both chemical and crystal-symmetry
space. Crystal-graph networks trained with this new data show unprecedented generalization ac-
curacy, and allow for reliable, accelerated exploration of the whole space of inorganic compounds.
We applied this universal network to perform machine-learning assisted high-throughput materi-
als searches including 2500 binary and ternary structure prototypes and spanning about 1 billion
compounds. After validation using density-functional theory, we uncover in total 19 512 additional
materials on the convex hull of thermodynamic stability and 150 000 compounds with a distance of
less than 50 meV/atom from the hull. Combining again machine learning and ab-initio methods, we
finally evaluate the discovered materials for applications as superconductors and superhard materials,
and we search for candidates with large gap deformation potentials, finding several compounds with
extreme values of these properties.
I. INTRODUCTION
One of the most tantalizing possibilities of modern
computational materials science is the prediction and
characterization of experimentally unknown compounds.
In fact, developments in theory and algorithms in the
past decades allowed for the systematic exploration of
a chemical space spanning millions of materials, search-
ing for compounds with tailored properties for specific
technological applications. Currently, the most efficient
approach consists in scanning the composition space for
a given crystal structure prototype. In such approaches,
the key material property that is used to estimate if a
material can be experimentally synthesized is the total
energy, or more specifically the energy distance to the
convex hull of thermodynamic stability. Typically, for
each combination of chemical composition and crystal-
structure prototype one performs a geometry optimiza-
tion, e.g., using some flavor of density functional the-
ory (DFT), and compares the resulting DFT energy with
all possible decomposition channels. Compounds on the
convex hull (or close to it) are then selected for char-
acterization and, if they possess interesting physical or
chemical properties, proposed for experimental synthe-
sis.
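For a binary system, the hull construction described above reduces to a one-dimensional lower convex hull of formation energies versus composition. The following self-contained sketch (illustrative only; not the authors' DFT workflow) shows how the energy distance to the hull is obtained from a set of computed formation energies:

```python
def cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    """Lower convex hull (Andrew's monotone chain) of (x, E_form) points,
    with x the fraction of element B and E_form in eV/atom."""
    pts = sorted(points)
    hull = []
    for p in pts:
        # pop points that would lie above the new hull segment
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def e_above_hull(x, e_form, hull):
    """Energy distance (eV/atom) of a phase at composition x to the hull."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            return e_form - (y1 + (y2 - y1) * (x - x1) / (x2 - x1))
    raise ValueError("composition outside the hull range")

# elemental references at E_form = 0 and one stable binary at x = 0.5
hull = lower_hull([(0.0, 0.0), (0.5, -0.5), (1.0, 0.0)])
# a hypothetical phase at x = 0.25 with E_form = -0.2 eV/atom
delta = e_above_hull(0.25, -0.2, hull)  # ≈ 0.05 eV/atom above the hull
```

A phase with a small positive distance (here about 50 meV/atom) would be kept for characterization under the selection criterion discussed later in the text.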
For binary prototypes this approach is relatively
straightforward, and therefore the binary phase space
has been comprehensively explored [1]. For a ternary
prototype the different combinations of chemical elements
generate roughly 500 000 compositions, a number still
within reach of DFT calculations, at least for prototypes
with a high symmetry and relatively few atoms in the
unit cell [2]. However, there are thousands of known
ternary prototypes, making a brute force approach to the
problem unrealistic. Despite the resulting huge number
of candidate ternary compounds, it is worth observing
that the largest computational databases only contain
overall about 4 × 10^6 materials [1, 3, 4].
Machine learning methods have made it possible to
accelerate material searches considerably. These meth-
ods are some of the most useful instruments added to
the toolbox of material science and solid-state physics
in the last decade. Thanks to them, a wide variety
of material properties can now be efficiently predicted
with close to ab-initio accuracy [5–7]. Early works in
this direction achieved speedups by factors of about
5-30 [2, 8, 9]. These works were generally based on
relatively simple machine learning models, e.g., deci-
sion trees or kernel ridge regression, that used hand-
built features of the composition as input and had to
be retrained for different crystal prototypes. Signifi-
cant progress toward more general models was made by
Ward et al. [9] who included structural descriptors ap-
plicable to high-throughput searches in terms of Voronoi
tessellations. This allowed Ward et al. to use training
data from all prototypes, resulting in improved perfor-
mance for high-throughput searches. Other important
steps forward were achieved by two other classes of mod-
els that were developed simultaneously: message pass-
ing networks for crystal and molecular graphs, as well as
deeper composition-based models [10–13]. We note that
compositional models can be completely independent of
the crystal structure. However, they are inadequate for
large-scale high-throughput searches, as they cannot differentiate
between polymorphs with the same chemical
composition. Message passing networks, on the other
hand, enabled unprecedented performance for the pre-
diction of properties with ab-initio accuracy [5, 14, 15]
from crystal structures.
Until recently all message passing networks for crys-
tals used atomic positions in some form as input. How-
ever, this information is not available until calcula-
tions, e.g. using DFT structure optimization, are per-
formed. Consequently, such models are impractical for
high-throughput searches. Recently, Schmidt et al. [16]
and Goodall et al. [17] developed coarse-grained message
passing networks that circumvent the problem, as they
do not require atomic positions as an input. In this work
we apply the former approach to explore a significantly
enlarged space of crystalline compounds.
Currently, the largest issue concerning the accuracy of
message passing networks is no longer the topology of the
networks, nor their complexity, but is related to the limita-
tions of existing materials datasets. Reference 16 identi-
fied large biases stemming from the lack of structural and
chemical diversity in the available data. These biases, ul-
timately of anthropogenic nature [18, 19], lead unfortu-
nately to a poor generalization error. In fact, even if the
error in test sets is of the order of 20–30 meV/atom, the
actual error during high-throughput searches can easily be
one order of magnitude larger if the available training
data is not representative of the actual material space [16].
In this work we tackle this challenging problem us-
ing a stepwise approach. First, we perform a series of
high-throughput searches with an extended set of chem-
ical elements (including lanthanides and some actinide
elements), applying the transfer learning approach pre-
sented in Ref. [16]. Thanks to the additional data gen-
erated by these calculations, we expect to reduce the
bias due to the representation of the chemical elements
in the dataset. In a subsequent step, we retrain the
crystal-graph network and employ it to scan a mate-
rial space of almost 1 billion compounds that comprises
more than 2000 crystal-structure prototypes. We obtain
in this way a dataset of DFT calculations with a con-
siderably larger structural diversity, that we then use to
retrain a network. This crystal graph network is then
shown to possess a massively improved generalization er-
ror and a strongly reduced chemical and structural bias.
Finally, we offer a demonstration of the usefulness of our
approach, and inspect this dataset to search for materials
with extreme values of some interesting physical proper-
ties.
II. CONSTRUCTION OF DATASETS AND
NETWORKS
A. Enlarging the chemical space
Our starting point is the dataset used by some of us
for training in Ref. 16. We will refer to this dataset as
“DCGAT-1” and to the crystal-graph network of Ref. 16
as “CGAT-1”.
As discussed previously, the training data in DCGAT-
1 is biased with respect to the distribution of chemi-
cal elements and crystal symmetries. To circumvent the
first problem we performed a series of high-throughput
calculations for specific structure prototypes. We used
a larger chemical space than previous works, consider-
ing 84 chemical elements, including all elements up to
Pu (with the exception of Po and At, for which we do
not have pseudopotentials, Yb, whose pseudopotential ex-
hibits numerical problems, and the rare gases). This results
in 6 972 possible permutations per binary, 571 704 permu-
tations per ternary, and 46 308 024 permutations per qua-
ternary system. For all these compositions we considered
a (largely arbitrary) selection of prototypes, including
ternary garnets, Ruddlesden–Popper layered perovskites,
cubic Laves phases, ternary and quaternary Heuslers, au-
ricuprides, etc. In total we included 11 binary, 8 ternary,
and 1 quaternary prototypes (a complete list and more
details can be found in the Supporting Information).
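The permutation counts quoted above follow directly from ordered selections of the 84 elements (each inequivalent site decorated with a distinct element); a quick consistency check:

```python
from math import perm  # ordered selections without repetition (Python >= 3.8)

n = 84  # chemical elements retained after the exclusions above
assert perm(n, 2) == 6_972       # binary systems
assert perm(n, 3) == 571_704     # ternary systems
assert perm(n, 4) == 46_308_024  # quaternary systems
```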
For each structural prototype included in the selection,
we performed a high-throughput study using the transfer
learning approach proposed in Ref. 16: (i) The machine-
learning model is used to predict the distance to the con-
vex hull of stability for possible chemical compositions.
At the start we use the pre-trained CGAT-1 machine;
(ii) We perform DFT geometry optimizations to validate
all compounds predicted below 200 meV/atom from the
hull; (iii) We add these calculations to a dataset con-
taining all DFT calculations for the corresponding struc-
tural prototype. (iv) We use transfer learning to train
a new model on the basis of this dataset with a train-
ing/validation/testing split of 80%/10%/10%. (v) The
cycle is restarted one to three times, until the MAE of
the model is below 30 meV/atom.
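The five steps above amount to an active-learning loop. A minimal sketch, with `model`, `run_dft`, and `fit` as hypothetical stand-ins (not the actual CGAT code) for the machine-learning predictor, the DFT validation, and the transfer-learning step:

```python
def active_learning_loop(candidates, model, run_dft, fit,
                         threshold=0.200, target_mae=0.030, max_cycles=3):
    """Steps (i)-(v): predict hull distances with `model`, DFT-validate
    every candidate predicted below `threshold` (eV/atom), retrain with
    `fit`, and repeat until the model MAE drops below `target_mae`."""
    dataset = []
    mae = float("inf")
    for _ in range(max_cycles):
        picked = [c for c in candidates if model(c) < threshold]  # (i)
        dataset += [(c, run_dft(c)) for c in picked]              # (ii)+(iii)
        model, mae = fit(dataset)                                 # (iv)
        if mae < target_mae:                                      # (v)
            break
    return dataset, mae
```

Because each cycle replaces the predictor, later cycles validate compounds that the earlier, less accurate model missed.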
This procedure resulted in 397 438 additional DFT cal-
culations, yielding 4382 compounds below the hull of
DCGAT-1 (and therefore already increasing the known
convex hull by approximately ten percent). Moreover, we
added a large dataset of mixed perovskites [16] plus data
concerning oxynitride, oxyfluoride and nitrofluoride per-
ovskites from Ref. 20, amounting to around 381 000 DFT
calculations. Finally, we recalculated and added 1343
compounds that were possibly unconverged outliers from
AFLOW [1] according to the criteria in Ref. 21. The final
dataset resulting from all these changes and additions
contains 780 000 more compounds than DCGAT-1 and
will be denoted as DCGAT-2.
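The individual additions quoted above are consistent with the stated size difference between the two datasets (the text rounds the total to 780 000):

```python
new_hts = 397_438      # new high-throughput DFT calculations
perovskites = 381_000  # mixed + oxynitride/oxyfluoride/nitrofluoride perovskites (approx.)
aflow_redone = 1_343   # recomputed AFLOW outliers

total = new_hts + perovskites + aflow_redone
assert total == 779_781  # ≈ 780 000 compounds added on top of DCGAT-1
```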
In fig. 1 we plot the element distribution in both
datasets DCGAT-1 and DCGAT-2. As expected, the
original dataset is quite biased with a drastic under-
sampling of most lanthanides and actinides. Despite its
smaller size, the new dataset includes between three and
twenty times more compounds containing undersampled
elements, and it therefore counteracts the unbalanced
distribution of chemical elements of DCGAT-1. Note
that, in particular, metallic elements appear in very similar
quantities in the revised dataset, with the exception of
the heavier actinides, which are still somewhat underrepresented.

FIG. 1. Number of materials in (a) DCGAT-1 and
(b) DCGAT-2 containing one specific chemical element of the
periodic table.
We used DCGAT-2 to retrain a CGAT with the same
hyperparameters used in Ref. 16 (the resulting network
will be denoted as CGAT-2). The CGAT-2 network has
a mean absolute test error of 21 meV/atom for the dis-
tance to the convex hull using a training/validation/test
split of 80%/10%/10%. Although the test error is of the
same order of magnitude as for CGAT-1, we will see that
the generalization error is drastically reduced. We also
trained a network to predict the volume per atom of the
crystals, obtaining a test error of 0.25 Å3/atom.
B. Enlarging the structural space
After having successfully removed the bias in our
dataset in the distribution of chemical elements, we now
tackle the lack of structural variety. Our strategy con-
sists in adding calculations of underrepresented struc-
tural types, keeping in mind that we are mainly inter-
ested in phases that are thermodynamically stable, or
close to stability. We start by querying our database us-
ing the pymatgen [22] structure matcher to identify all
distinct prototypes present in DCGAT-1.

FIG. 2. Ternary phase diagram showing the stoichiometries
covered by the prototypes studied in this work.

Note that our
definition of a crystal structure prototype is relatively
loose and some of them can be transformed into others
through minor distortions. It is nevertheless important
to keep track of all these related structures in order to
increase the precision of the crystal-graph network pre-
dictions [16]. We found a total of 58 000 prototypes,
the large majority of them appearing only once or twice
in the dataset. We then selected all prototypes with fewer
than 21 atoms in the unit cell, a space-group number larger
than 9, and at least 10 occurrences in our dataset.
The first two criteria are chosen to limit the run-time of
the DFT calculations. Following these criteria, we end up
with 639 binary and 1829 ternary crystal-structure pro-
totypes, spanning a space of 1 050 101 724 possible com-
pounds. These prototypes also densely cover the compo-
sition space, as depicted in the generic phase diagram of
fig. 2.
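The three selection criteria can be expressed as a simple filter over the prototype list. The record layout below (label, number of atoms in the unit cell, space-group number) is a hypothetical stand-in for the actual database schema:

```python
from collections import Counter

def select_prototypes(records, max_atoms=20, min_spg=10, min_count=10):
    """Keep each distinct prototype with at most `max_atoms` atoms in the
    unit cell, space-group number >= `min_spg`, and at least `min_count`
    occurrences in the dataset."""
    counts = Counter(label for label, _, _ in records)
    selected, seen = [], set()
    for label, n_atoms, spg in records:
        if (label not in seen and n_atoms <= max_atoms
                and spg >= min_spg and counts[label] >= min_count):
            seen.add(label)
            selected.append(label)
    return selected
```

Note that "fewer than 21 atoms" and "space group larger than 9" translate into the defaults `max_atoms=20` and `min_spg=10`.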
In fig. 3 we plot the distributions of number of atoms
in the unit cell (panel a) and of crystal systems (panel b)
in the set of selected prototypes. The distribution of the
number of prototypes displays a maximum at 6 atoms
per unit cell, then decreases slowly for larger numbers of
atoms. It is also clear that prototypes with an even num-
ber of atoms are far more common than those with an odd
number of atoms. The most represented crystal system
is orthorhombic, followed by monoclinic and tetragonal,
while cubic prototypes are rare. Note that the number of
monoclinic structures is reduced by the imposed restric-
tion on the space group number, as monoclinic structures
have space groups between 3 and 15. Also due to this re-
striction, no triclinic structures are present in the dataset.
All these conclusions apply to both binary and ternary
prototypes.
We use our CGAT-2 network to predict the distance to
the convex hull for these prototypes, after grouping them
according to their general composition AxByCz. For ev-
ery composition, we occupy the lattice sites of each pro-
totype with all permutations of the A, B, and C chemical
elements, and let the machine predict the ones that are
close to the convex hull of stability.
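Decorating a prototype's A, B, and C sites with all orderings of a chosen element triple can be done with `itertools.permutations`; the element names below are purely illustrative:

```python
from itertools import permutations

def candidate_occupations(elements, n_sites=3):
    """All ordered assignments of distinct elements to the A, B, C sites
    of a ternary prototype (3! = 6 decorations per element triple)."""
    return list(permutations(elements, n_sites))

# each decoration, e.g. ("Sr", "Ti", "O"), defines one candidate compound
decorations = candidate_occupations(["Sr", "Ti", "O"])
```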