ferentiate between polymorphs with the same chemical
composition. Message passing networks, on the other
hand, enabled unprecedented performance for the pre-
diction of properties with ab-initio accuracy [5, 14, 15]
from crystal structures.
Until recently, all message passing networks for crystals used atomic positions in some form as input. However, this information only becomes available after calculations, e.g. DFT structure optimizations, are performed. Consequently, such models are impractical for high-throughput searches. Recently, Schmidt et al. [16] and Goodall et al. [17] developed coarse-grained message passing networks that circumvent this problem, as they do not require atomic positions as input. In this work we apply the former approach to explore a significantly enlarged space of crystalline compounds.
Currently, the largest issue concerning the accuracy of message passing networks is no longer the topology or complexity of the networks, but rather the limitations of existing materials datasets. Reference 16 identified large biases stemming from the lack of structural and chemical diversity in the available data. These biases, ultimately of anthropogenic nature [18, 19], unfortunately lead to poor generalization. In fact, even if the error on test sets is of the order of 20–30 meV/atom, the actual error during high-throughput searches can easily be one order of magnitude larger if the available training data is not representative of the actual material space [16].
In this work we tackle this challenging problem us-
ing a stepwise approach. First, we perform a series of
high-throughput searches with an extended set of chem-
ical elements (including lanthanides and some actinide
elements), applying the transfer learning approach pre-
sented in Ref. [16]. Thanks to the additional data gen-
erated by these calculations, we expect to reduce the
bias due to the representation of the chemical elements
in the dataset. In a subsequent step, we retrain the
crystal-graph network and employ it to scan a mate-
rial space of almost 1 billion compounds that comprises
more than 2000 crystal-structure prototypes. We obtain
in this way a dataset of DFT calculations with a con-
siderably larger structural diversity, that we then use to
retrain a network. This crystal graph network is then
shown to possess a massively improved generalization er-
ror and a strongly reduced chemical and structural bias.
Finally, we offer a demonstration of the usefulness of our
approach, and inspect this dataset to search for materials
with extreme values of some interesting physical proper-
ties.
II. CONSTRUCTION OF DATASETS AND NETWORKS
A. Enlarging the chemical space
Our starting point is the dataset used by some of us
for training in Ref. 16. We will refer to this dataset as
“DCGAT-1” and to the crystal-graph network of Ref. 16
as “CGAT-1”, respectively.
As discussed previously, the training data in DCGAT-
1 is biased with respect to the distribution of chemi-
cal elements and crystal symmetries. To circumvent the
first problem we performed a series of high-throughput
calculations for specific structure prototypes. We used
a larger chemical space than previous works, consider-
ing 84 chemical elements, including all elements up to
Pu (with the exceptions of Po and At, for which we do not have pseudopotentials; Yb, whose pseudopotential exhibits numerical problems; and the rare gases). This results
in 6 972 possible permutations per binary, 571 704 permu-
tations per ternary, and 46 308 024 permutations per qua-
ternary system. For all these compositions we considered
a (largely arbitrary) selection of prototypes, including
ternary garnets, Ruddlesden–Popper layered perovskites,
cubic Laves phases, ternary and quaternary Heuslers, auricuprides, etc. In total we included 11 binary, 8 ternary, and 1 quaternary prototypes (a complete list and more details can be found in the Supporting Information).
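Assuming the chemical elements are assigned to the inequivalent sites of a prototype in an ordered fashion and without repetition, the composition counts quoted above follow directly from the number of k-permutations of 84 elements; a quick sketch (the function name is ours, purely illustrative):

```python
import math

# Number of ways to assign 84 distinct chemical elements to the
# 2, 3, or 4 inequivalent element sites of a prototype (ordered,
# without repetition): P(84, k) = 84! / (84 - k)!
ELEMENTS = 84

def n_permutations(sites: int) -> int:
    return math.perm(ELEMENTS, sites)

print(n_permutations(2))  # binaries:     6 972
print(n_permutations(3))  # ternaries:    571 704
print(n_permutations(4))  # quaternaries: 46 308 024
```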
For each structural prototype included in the selection, we performed a high-throughput study using the transfer learning approach proposed in Ref. 16: (i) The machine-learning model is used to predict the distance to the convex hull of stability for all possible chemical compositions; at the start we use the pre-trained CGAT-1 model. (ii) We perform DFT geometry optimizations to validate all compounds predicted to lie less than 200 meV/atom above the hull. (iii) We add these calculations to a dataset containing all DFT calculations for the corresponding structural prototype. (iv) We use transfer learning to train a new model on this dataset, with a training/validation/testing split of 80%/10%/10%. (v) The cycle is restarted one to three times, until the MAE of the model falls below 30 meV/atom.
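The steps above form an active-learning loop; a minimal sketch, with toy stand-ins for the network prediction, the DFT validation, and the retraining (all function names and thresholds in code are our illustrative labels, not the authors' implementation):

```python
import random

E_HULL_CUTOFF = 0.200   # eV/atom: validate compounds predicted below this
TARGET_MAE = 0.030      # eV/atom: stop once the model reaches this accuracy

def split_80_10_10(data, seed=0):
    """Shuffle and split into train/validation/test (80%/10%/10%)."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    return data[: int(0.8 * n)], data[int(0.8 * n): int(0.9 * n)], data[int(0.9 * n):]

def active_learning(candidates, predict, dft_validate, retrain, max_cycles=3):
    dataset = []
    mae = float("inf")
    for _ in range(max_cycles):
        # (i)-(ii): predict the distance to the hull, then validate the
        # promising candidates with DFT geometry optimizations
        selected = [c for c in candidates if predict(c) < E_HULL_CUTOFF]
        # (iii): grow the per-prototype dataset with the new DFT results
        dataset.extend((c, dft_validate(c)) for c in selected)
        # (iv): retrain via transfer learning on an 80/10/10 split
        train, val, test = split_80_10_10(dataset)
        predict, mae = retrain(train, val, test)
        # (v): restart the cycle until the MAE is good enough
        if mae < TARGET_MAE:
            break
    return dataset, mae
```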
This procedure resulted in 397 438 additional DFT cal-
culations, yielding 4382 compounds below the hull of
DCGAT-1 (and therefore already increasing the known
convex hull by approximately ten percent). Moreover, we
added a large dataset of mixed perovskites [16] plus data
concerning oxynitride, oxyfluoride and nitrofluoride per-
ovskites from Ref. 20, amounting to around 381 000 DFT
calculations. Finally, we recalculated and added 1343 compounds that were possibly unconverged outliers from AFLOW [1], according to the criteria of Ref. 21. The final dataset resulting from all these changes and additions contains ∼780 000 more compounds than DCGAT-1 and will be denoted as DCGAT-2.
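The quoted growth of the dataset is consistent with the individual contributions listed above; a quick consistency check (the perovskite figure is itself only approximate, "around 381 000"):

```python
# Contributions to DCGAT-2 beyond DCGAT-1, as quoted in the text
prototype_searches = 397_438    # additional high-throughput DFT calculations
perovskite_data = 381_000       # mixed and mixed-anion perovskite datasets
aflow_recalculations = 1_343    # recalculated AFLOW outliers

total = prototype_searches + perovskite_data + aflow_recalculations
print(total)  # 779 781, i.e. ~780 000 additional compounds
```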
In Fig. 1 we plot the element distribution in both datasets, DCGAT-1 and DCGAT-2. As expected, the original dataset is quite biased, with a drastic undersampling of most lanthanides and actinides. Despite its
smaller size, the new dataset includes between three and
twenty times more compounds containing undersampled
elements, and it therefore counteracts the unbalanced
distribution of chemical elements of DCGAT-1. Note
that, in particular, metallic elements appear in very sim-