$$\lim_{\tau\to\infty} \Psi(\mathbf{R}, \tau) = c_0\, e^{-\epsilon_0 \tau}\, \phi_0(\mathbf{R}) \tag{1}$$
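where $|\psi\rangle$ was expanded in the eigenstates $|\phi_i\rangle$ of the Hamiltonian as
$$|\psi\rangle = \sum_{i=0}^{\infty} c_i\, |\phi_i\rangle, \qquad \hat{H}|\phi_i\rangle = \epsilon_i\, |\phi_i\rangle \tag{2}$$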
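The projection in Eqs. (1) and (2) can be illustrated numerically. The following is a minimal sketch, not a DMC implementation: it assumes a hypothetical three-level spectrum and arbitrary starting coefficients (both invented for illustration), damps each coefficient by its imaginary-time factor, and renormalizes. The function name `project` is likewise an assumption.

```python
import math

def project(coeffs, energies, tau):
    """Return normalized coefficients c_i * exp(-eps_i * tau)."""
    # Shift by the lowest energy so the exponentials stay bounded;
    # the shift cancels on normalization.
    e0 = min(energies)
    damped = [c * math.exp(-(e - e0) * tau) for c, e in zip(coeffs, energies)]
    norm = math.sqrt(sum(d * d for d in damped))
    return [d / norm for d in damped]

# Hypothetical spectrum (arbitrary units) and starting coefficients.
energies = [-1.0, -0.5, 0.3]
coeffs = [0.2, 0.7, 0.68]

for tau in (0.0, 5.0, 50.0):
    weights = [w * w for w in project(coeffs, energies, tau)]
    print(tau, [round(w, 6) for w in weights])
```

As long as the starting state has a nonzero overlap with the ground state (c0 ≠ 0), the weight of the lowest eigenstate approaches one as τ grows, which is the content of Eq. (1).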
As mentioned, the sampling in DMC is not constrained
to a specific distribution and takes into account all electronic
correlations, which makes DMC a rigorously exact
method. While this is true when solving for bosonic
particles, solving for fermionic particles requires some
approximations to remain computationally feasible and
maintain the anti-symmetric nature of the wavefunction.
Some of these approximations are controlled and can be
rigorously extrapolated out (such as the time step, the use of
electron-core potentials or pseudopotentials, etc.). The
only uncontrolled source of error8 is the fixed-node (FN)
approximation introduced to suppress the fluctuations
of the sign of the wavefunction (fermion sign problem).
This approximation means that any proposed configura-
tion changing the sign of the wavefunction is rejected,
while any configuration lowering the local energy is
promoted. Because DMC is variational, if the positions
of the nodes of the trial wavefunction are exact, the
averaged local energy is rigorously the exact ground
state energy. FN-DMC energies are an upper bound to
the exact ground state energy9. This implies that from
a FN-DMC perspective, trial wavefunctions differ only
by their nodal surfaces, and a better nodal surface leads
to a lower energy and variance.
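The rejection rule of the fixed-node approximation can be sketched with a toy model. This is an illustration only: it assumes a one-dimensional trial wavefunction with a single node at x = 0 and plain Gaussian diffusion moves, and it omits the drift, branching, and local-energy weighting of a real FN-DMC code. The names `psi_trial` and `fixed_node_step` are hypothetical.

```python
import math
import random

def psi_trial(x):
    """Toy trial wavefunction with a single node at x = 0 (an assumption)."""
    return x * math.exp(-0.5 * x * x)

def fixed_node_step(x, dt, rng):
    """Propose a diffusion move; reject it if psi_trial changes sign."""
    x_new = x + rng.gauss(0.0, math.sqrt(dt))
    if psi_trial(x) * psi_trial(x_new) <= 0.0:
        return x, False      # node crossed: keep the old position
    return x_new, True

rng = random.Random(0)
n_walkers, n_steps = 200, 100
walkers = [0.5] * n_walkers  # all walkers start in the positive nodal pocket
accepted = 0
for _ in range(n_steps):
    for i, x in enumerate(walkers):
        walkers[i], ok = fixed_node_step(x, 0.05, rng)
        accepted += ok

print("acceptance ratio:", accepted / (n_walkers * n_steps))
print("all walkers still at x > 0:", all(x > 0 for x in walkers))
```

Because every sign-changing move is rejected, the walkers never leave the nodal pocket in which they started, so the sampled distribution inherits the nodes of the trial wavefunction.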
Despite the FN approximation, DMC has been shown to
reach errors below the chemical accuracy threshold
of 1 kcal/mol for chemical systems10–14 and of
a few tens of meV per unit cell for solids within periodic
boundary conditions15–18. Recently, multiple calcula-
tions using a selected Configuration Interaction (sCI)
trial wavefunction have demonstrated how to system-
atically reduce the error from the fixed nodes19–21.
Nevertheless, these errors have proven to be significant
only for open-shell or multi-reference molecules12,22.
From the above discussion, we see that on the one hand,
stochastic numerical sampling permits independent evaluations,
making the method embarrassingly parallel and
highly efficient for high performance computing (HPC).
On the other hand, accuracy is a direct consequence of
the quality of the fixed nodes in the trial wavefunction: If
the nodes are exact, the method is rigorously exact. Us-
ing DMC energies as reference for QML models will then
boost efficiency by several orders of magnitude, since the
property of any new out-of-sample query compound can
be predicted after training, solely based on inference from
the DMC information stored in the training data. In or-
der to retain the predictive accuracy of reference data,
however, a significant amount of training data can be
necessary. This issue is essentially caused by the random
selection of training instances, which should be
representative of the query compounds. To meet this challenge,
some of us (BH, OAvL) recently introduced the
amon (A) based QML method which enables a dramatic
reduction in training set size as well as size of training
molecules23.
Amons correspond to systematically fragmented enti-
ties of query target molecules, containing an increasing
number of heavy atoms (typically no more than 7). They
can be seen as effective building blocks of target com-
pounds with atomic states being perturbed according to
their chemical environment. With amons used as train-
ing set, QML models trained on the fly (AQML) repre-
sent a scalable approach which can be applied through-
out chemical space to predict quantum properties of large
molecules.
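The idea of fragments with an increasing number of heavy atoms can be sketched as an enumeration of connected subgraphs. This is a simplified illustration and not the amon-selection algorithm of Ref. 23: it only enumerates connected heavy-atom subsets of a hypothetical molecular graph and ignores valence saturation, geometry, and any filtering of the resulting fragments. The function name and the example graph are assumptions.

```python
def connected_subgraphs(adjacency, max_size):
    """Enumerate connected atom subsets with 1..max_size heavy atoms."""
    found = set()
    frontier = {frozenset([atom]) for atom in adjacency}
    while frontier:
        found |= frontier
        grown = set()
        for sub in frontier:
            if len(sub) == max_size:
                continue  # do not grow beyond the size limit
            for atom in sub:
                for nbr in adjacency[atom]:
                    if nbr not in sub:
                        grown.add(sub | {nbr})
        frontier = grown - found
    return sorted(found, key=lambda s: (len(s), sorted(s)))

# Hypothetical heavy-atom graph of a four-atom chain C0-C1-C2-O3
# (hydrogens omitted).
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
for sub in connected_subgraphs(adjacency, max_size=3):
    print(sorted(sub))
```

Fragments are returned ordered by size, mirroring the notion of building blocks that grow from single heavy atoms toward larger pieces of the target.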
For this study, we combine DMC reference calculations
with AQML and ∆-AQML models, and we numerically
demonstrate that these approaches, ∆-AQML in particular,
can achieve chemical accuracy. To this end, we use
a dictionary of 1175 small amons of QM924 with up to
only 5 heavy atoms (not counting hydrogens), together
with reference energies calculated at much cheaper levels
of theory, including mean-field theory (Hartree-Fock (HF)
and density functional theory (DFT), using various levels
of approximation according to Jacob’s ladder) and
second-order Møller–Plesset perturbation theory (MP2).
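The generic ∆-learning construction that such models build on can be written compactly. The notation below is an illustrative assumption and does not appear in the text: $a$ labels an amon in the training set, $q$ a query molecule, and "base" any of the cheaper levels of theory (HF, DFT, or MP2).

```latex
% Training target on each amon a: DMC energy minus the cheaper baseline energy
\Delta_a = E^{\mathrm{DMC}}_a - E^{\mathrm{base}}_a
% Prediction for a query molecule q: baseline energy plus the learned correction
E^{\mathrm{DMC}}_q \approx E^{\mathrm{base}}_q + \Delta^{\mathrm{ML}}_q
```

In this form the model only has to learn the difference between the two levels of theory, rather than the full DMC energy.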
II. DATA-SETS
1175 unique amon graphs (i.e., molecular graphs) with
up to 5 heavy atoms were first identified by applying
the amon-selection algorithm23 (also briefly summarized
below in the methodology section IV A) to QM924,25
molecules, with SMILES strings as the only input. Then,
for each amon, one hundred conformers were sampled using
RDKit26 and optimized with the MMFF94 force field, and only
the global minimum configuration (i.e., the conformer with the
lowest force-field energy) was chosen. The geometries of the
thus-selected amon conformers were further optimized
at the B3LYP/def2-TZVP level of theory. Based on these
geometries, single point energies were calculated at mul-
tiple levels of theory, including HF, PBE, PBE0, B3LYP,
MP2 (with basis cc-pVTZ) and DMC (see the next sec-
tion for details). The resulting dataset can be seen as
a compact dictionary of small molecules, which is to be
looked up later in AQML for any query molecule of larger
size.
To test the QML models, 50 molecules, each made
up of 9 heavy atoms, were randomly drawn from the QM9
dataset. Geometries were optimized at the same level
of theory as for the amons, followed by single-point energy
calculations at all levels of theory mentioned above,
including DMC. For a depiction of all test molecules, see
Fig. 5.