
Towards DMC accuracy across chemical space with scalable ∆-QML
Bing Huang,1, a) O. Anatole von Lilienfeld,2, 3, 4, b) Jaron T. Krogel,5, c) and Anouar Benali6, d)
1)University of Vienna, Faculty of Physics, Kolingasse 14-16, 1090 Vienna, Austria
2)Departments of Chemistry, Materials Science and Engineering, and Physics, University of Toronto,
St. George Campus, Toronto, ON, Canada
3)Vector Institute for Artificial Intelligence, Toronto, ON, M5S 1M1, Canada
4)Machine Learning Group, Technische Universität Berlin and Institute for the Foundations of Learning and Data,
10587 Berlin, Germany
5)Materials Science and Technology Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831,
United States
6)Computational Sciences Division, Argonne National Laboratory, Argonne, IL 60439,
United States
In the past decade, quantum diffusion Monte Carlo (DMC) has been demonstrated to successfully predict
the energetics and properties of a wide range of molecules and solids by numerically solving the electronic
many-body Schrödinger equation. We show that when coupled with quantum machine learning (QML) based
surrogate methods the computational burden can be alleviated such that QMC shows clear potential to undergird
the formation of high quality descriptions across chemical space. We discuss three crucial approximations
necessary to accomplish this: the fixed node approximation, universal and accurate references for chemical
bond dissociation energies, and scalable minimal amons set based QML (AQML) models. Numerical evidence
presented includes converged DMC results for over one thousand small organic molecules with up to 5 heavy
atoms used as amons, and 50 medium sized organic molecules with 9 heavy atoms to validate the AQML
predictions. Numerical evidence collected for ∆-AQML models suggests that already modestly sized QMC
training data sets of amons suffice to predict total energies with near chemical accuracy throughout chemical
space.
I. INTRODUCTION
The predictive accuracy of quantum machine learn-
ing (QML) models trained on quantum chemistry data
and used for the navigation of chemical compound space
(CCS) is inherently limited by the predictive accuracy
of the approximations used within the underlying quan-
tum theory1. Consequently, in order for QML models to
achieve the coveted threshold of chemical accuracy (1
kcal/mol average deviation of calculated from experimen-
tal measurements of atomization energies), it is necessary
to rely on training data generated at least at the post-
Hartree-Fock level, e.g. CCSD(T)/CBS. Unfortunately,
the ‘gold-standard’ in the field, CCSD(T)/CBS, generally
imposes considerable computational cost due to steep
prefactors and O(N^7) scaling (N corresponding to system
size)2. As such, the routine generation of large high
quality quantum data sets has remained elusive, even for
relatively small organic molecules with only four or five
‘heavy’ (second-row) atoms. Here, we demonstrate for
an exemplary sub-set of CCS (namely organic molecules)
the usefulness of recently implemented and numerically
more efficient Quantum Monte Carlo (QMC) methods for
computing QML training data. Our numerical evidence
indicates the possibility to routinely train QML models
that achieve predictive power similar to QMC but at
much reduced computational cost.
a)Electronic mail: bing.huang@univie.ac.at
b)Electronic mail: anatole.vonlilienfeld@utoronto.ca
c)Electronic mail: krogeljt@ornl.gov
d)Electronic mail: benali@anl.gov
QMC approaches solve the many-body electronic
Schrödinger equation stochastically. QMC is general
and applicable to a wide range of physical and chemical
systems in any dimension and with any boundary conditions.
Amongst the most widely used flavors for electronic
structure are variational Monte Carlo (VMC)3,4 and
diffusion Monte Carlo (DMC)5. Both VMC and DMC
are variational methods and allow one to estimate the
energy and properties of a given trial wavefunction without
having to compute matrix elements, which poses no
restriction on its functional form. In the VMC algorithm,
a stochastic numerical integration scheme,
the expectation value of the energy for any form of the
trial wavefunction is estimated by averaging the local
energy over an ensemble of configurations distributed
as ψ², sampled during a random walk in configuration
space using Metropolis6 or Langevin algorithms7.
The fluctuations of the local energy depend on the quality
of the trial wavefunction, and they vanish if the exact
wavefunction is used (zero-variance principle). The DMC
algorithm is very similar, but the sampling goes beyond
the ψ² distribution by solving the Schrödinger
equation in imaginary time τ = it using a projector
or Green’s function based method. Any initial state
|ψ⟩ that is not orthogonal to the ground state |φ0⟩ will
evolve to the ground state in the long time limit, since any
excited state component decays exponentially fast, leaving the
true ground state of the system.
arXiv:2210.06430v1 [physics.chem-ph] 12 Oct 2022
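\[
\lim_{\tau\to\infty} \Psi(R,\tau) = c_0 e^{-\epsilon_0 \tau} \phi_0(R) \tag{1}
\]
where |ψ⟩ was expanded in eigenstates |φi⟩ of the Hamiltonian, with eigenvalues εi, as
\[
|\psi\rangle = \sum_{i=0} c_i |\phi_i\rangle, \qquad \hat{H}|\phi_i\rangle = \epsilon_i |\phi_i\rangle \tag{2}
\]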
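The projection expressed by Eqs. (1) and (2) can be illustrated on a small matrix Hamiltonian. The following is a minimal sketch, not the DMC algorithm itself: DMC carries out this projection stochastically with walkers, whereas here exp(-τH) is applied deterministically to a vector. The 3-level matrix `H`, the starting vector `psi0` and the helper name `project_ground_state` are hypothetical and chosen only for illustration.

```python
import numpy as np

def project_ground_state(H, psi0, tau, n_steps):
    """Repeatedly apply exp(-tau * H) to psi0, renormalizing after each step."""
    # Build the exact short-time propagator from the eigendecomposition of H.
    evals, evecs = np.linalg.eigh(H)
    propagator = evecs @ np.diag(np.exp(-tau * evals)) @ evecs.T
    psi = psi0 / np.linalg.norm(psi0)
    for _ in range(n_steps):
        psi = propagator @ psi
        psi /= np.linalg.norm(psi)  # removes the overall decay factor exp(-eps_0 * tau)
    return psi

# Toy symmetric 3-level Hamiltonian (made-up numbers, for illustration only).
H = np.array([[0.0, 0.2, 0.0],
              [0.2, 1.0, 0.3],
              [0.0, 0.3, 2.5]])
# Any start vector works as long as it is not orthogonal to the ground state.
psi0 = np.array([1.0, 1.0, 1.0])

psi = project_ground_state(H, psi0, tau=0.5, n_steps=200)
projected_energy = psi @ H @ psi
exact_ground_energy = np.linalg.eigh(H)[0][0]
```

After enough imaginary time the excited-state components have decayed and `projected_energy` agrees with the lowest eigenvalue of `H`. The rate of convergence is set by the gap between the two lowest eigenvalues.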
As mentioned, the sampling in DMC is not constrained
to a specific distribution and takes into account all electronic
correlations, which makes DMC a rigorously exact
method. While this is true when solving for bosonic
particles, solving for fermionic particles requires some
approximations to remain computationally feasible and
to maintain the antisymmetric nature of the wavefunction.
Some of these approximations are controlled and can be
rigorously extrapolated out (such as the time step, the use of
electron-core potentials or pseudopotentials, etc.). The
only uncontrolled source of error8 is the fixed-node (FN)
approximation, introduced to suppress the fluctuations
of the sign of the wavefunction (fermion sign problem).
This approximation means that any proposed configuration
that changes the sign of the wavefunction is rejected,
while any configuration lowering the local energy is
promoted. Since DMC is variational, if the positions
of the nodes of the trial wavefunction are exact, the
averaged local energy is rigorously the exact ground
state energy. Otherwise, FN-DMC energies are an upper bound to
the exact ground state energy9. This implies that, from
a FN-DMC perspective, trial wavefunctions differ only
by their nodal surface, and a better nodal surface leads
to a lower energy and variance.
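The variational bound and the zero-variance principle invoked above can be checked on a textbook case. The sketch below is an illustration only and is unrelated to the molecular calculations of this work: it runs Metropolis VMC for a 1D harmonic oscillator (H = -½ d²/dx² + ½ x², exact ground state energy 0.5) with the assumed trial function ψ_a(x) = exp(-a x²), whose local energy is E_L(x) = a + x²(½ - 2a²). The function names, step size and number of steps are hypothetical choices.

```python
import math
import random

def local_energy(x, a):
    """Local energy (H psi)/psi for psi_a(x) = exp(-a x^2)."""
    return a + x * x * (0.5 - 2.0 * a * a)

def vmc(a, n_steps=50000, step=1.0, seed=0):
    """Metropolis sampling of psi_a^2; returns mean and variance of the local energy."""
    rng = random.Random(seed)
    x = 0.0
    energies = []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        # Acceptance ratio psi_a(x_new)^2 / psi_a(x)^2.
        ratio = math.exp(-2.0 * a * (x_new * x_new - x * x))
        if rng.random() < ratio:
            x = x_new
        energies.append(local_energy(x, a))
    mean = sum(energies) / len(energies)
    var = sum((e - mean) ** 2 for e in energies) / len(energies)
    return mean, var

exact_mean, exact_var = vmc(a=0.5)   # a = 0.5 is the exact ground state
trial_mean, trial_var = vmc(a=0.3)   # an imperfect trial wavefunction
```

For a = 0.5 the local energy is constant, so the mean is 0.5 with zero variance. For a = 0.3 the mean fluctuates around the analytic value a/2 + 1/(8a) ≈ 0.567, which lies above the exact energy, and the variance is finite.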
Despite the FN approximation, DMC has been shown to
successfully reach accuracies below the chemical accuracy
threshold of 1 kcal/mol for chemical systems10–14 and
of a few tens of meV per unit cell for solids within periodic
boundary conditions15–18. Recently, multiple calculations
using a selected Configuration Interaction (sCI)
trial wavefunction have demonstrated how to systematically
reduce the error from the fixed nodes19–21.
Nevertheless, these errors have proven to be significant
only for open shell or multi-reference molecules12,22.
From the above discussion, we see that on the one hand,
stochastic numerical sampling permits independent evaluations,
making the method embarrassingly parallel and
highly efficient for high performance computing (HPC).
On the other hand, accuracy is a direct consequence of
the quality of the fixed nodes in the trial wavefunction: if
the nodes are exact, the method is rigorously exact. Using
DMC energies as reference for QML models will then
boost efficiency by several orders of magnitude, since the
property of any new out-of-sample query compound can
be predicted after training, solely based on inference from
the DMC information stored in the training data. In order
to retain the predictive accuracy of the reference data,
however, a significant amount of training data can be
necessary. This issue is essentially caused by the
random selection of training instances, which should be
representative of the query compounds. To rise to this
challenge, some of us (BH, OAvL) recently introduced the
amon (A) based QML method, which enables a dramatic
reduction in training set size as well as in the size of the
training molecules23.
Amons correspond to systematically fragmented enti-
ties of query target molecules, containing an increasing
number of heavy atoms (typically no more than 7). They
can be seen as effective building blocks of target com-
pounds with atomic states being perturbed according to
their chemical environment. With amons used as train-
ing set, QML models trained on the fly (AQML) repre-
sent a scalable approach which can be applied through-
out chemical space to predict quantum properties of large
molecules.
For this study, we combine DMC reference calculations
with AQML and ∆-AQML models, and we numerically
demonstrate the feasibility of these approaches, ∆-AQML
in particular, to achieve chemical accuracy by making use
of a dictionary of 1175 small amons of QM924 with up to
only 5 heavy atoms (not counting hydrogens), together
with reference energies calculated at much cheaper levels
of theory, including mean-field theory (Hartree-Fock
(HF) and density functional theory (DFT), using
various levels of approximation according to Jacob’s ladder)
and Møller–Plesset perturbation theory to second
order (MP2).
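The idea behind combining an expensive reference with a cheaper level of theory can be sketched with a toy model. The example below assumes the usual ∆-learning construction, in which the model is trained on the difference between the expensive and the cheap energy and the cheap energy is added back at prediction time. Everything in it is hypothetical: the 1D "descriptor", the analytic stand-ins `e_low` and `e_high` for the cheap and expensive levels of theory, and the kernel hyperparameters. It is not the aSLATM/AQML implementation.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between row-vector sets A (n, d) and B (m, d)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def krr_fit(X, y, sigma, reg):
    """Kernel ridge regression: solve (K + reg * I) alpha = y."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + reg * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_query, sigma):
    return gaussian_kernel(X_query, X_train, sigma) @ alpha

def e_low(X):
    """Toy stand-in for a cheap baseline energy."""
    return np.sin(3.0 * X[:, 0])

def e_high(X):
    """Toy stand-in for the expensive reference energy."""
    return np.sin(3.0 * X[:, 0]) + 0.1 * X[:, 0] ** 2

X_train = np.linspace(-2.0, 2.0, 5)[:, None]   # deliberately tiny training set
X_test = np.linspace(-1.8, 1.8, 50)[:, None]
SIGMA, REG = 1.0, 1e-8                          # hypothetical hyperparameters

# Direct learning: fit the expensive energy itself.
alpha_direct = krr_fit(X_train, e_high(X_train), SIGMA, REG)
pred_direct = krr_predict(X_train, alpha_direct, X_test, SIGMA)

# Delta learning: fit only the correction, then add the baseline back.
alpha_delta = krr_fit(X_train, e_high(X_train) - e_low(X_train), SIGMA, REG)
pred_delta = e_low(X_test) + krr_predict(X_train, alpha_delta, X_test, SIGMA)

mae_direct = np.mean(np.abs(pred_direct - e_high(X_test)))
mae_delta = np.mean(np.abs(pred_delta - e_high(X_test)))
```

In this toy setting the correction is much smoother than the total energy, so the same five training points yield a smaller error for the ∆ model than for the direct model. Whether this holds for real data depends on how well the baseline captures the difficult part of the target.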
II. DATA-SETS
1175 unique amon graphs (i.e., molecular graphs) with
up to 5 heavy atoms were first identified by applying
the amon-selection algorithm23 (also briefly summarized
below in the methodology section IV A) to QM924,25
molecules, with SMILES strings as the only input. Then,
for each amon, one hundred conformers were sampled using
RDKit26 and optimized with the MMFF94 force field, and only
the global minimum configuration (i.e., the lowest force field
energy conformer) was chosen. The geometries of the
thus-selected amon conformers were further optimized
at the B3LYP/Def2TZVP level of theory. Based on these
geometries, single point energies were calculated at multiple
levels of theory, including HF, PBE, PBE0, B3LYP,
MP2 (with the cc-pVTZ basis) and DMC (see the next section
for details). The resulting dataset can be seen as
a compact dictionary of small molecules, which is to be
looked up later in AQML for any query molecule of larger
size.
To test the QML models, 50 molecules, each made
up of 9 heavy atoms, were randomly drawn from the QM9
dataset. Geometries were optimized at the same level
of theory as for the amons, followed by single point energy
calculations at all levels of theory mentioned above, including
DMC. For a depiction of all test molecules, see
Fig. 5.
III. COMPUTATIONAL DETAILS
We used B3LYP/Def2TZVP for geometry optimization
as implemented in the Gaussian 0927 code. For
HF, DFT and post-HF (MP2) single point energies,
we switched to the cc-pVTZ basis and instead used
Molpro201828 with the cc-pVQZ-jkfit density-fitting basis to
speed up the computation of both Coulomb and exchange
integrals.
For DMC calculations, we used a trial wavefunction with
a Slater-Jastrow form29
\[
\Psi_T(\vec{R}) = \exp\left[\sum_i J_i(\vec{R})\right] \sum_k^M C_k D_k^{\uparrow}(\varphi) D_k^{\downarrow}(\varphi) \tag{3}
\]
where each $D_k^{\uparrow,\downarrow}(\varphi)$ is a Slater determinant expressed in terms
of single particle orbitals (SPO) $\varphi_i = \sum_l^{N_b} C_l^i \Phi_l$. We
use DMC to evaluate the total and formation energies
of all molecules in the data-sets, as implemented in the
QMCPACK code30,31. Our trial many-body wavefunctions
are constructed as the product of the Jastrow
functions and a single Slater determinant from Hartree-Fock,
PBE32, PBE032,33 and B3LYP34–37 Kohn-Sham
(KS) orbitals obtained from the PYSCF38 package.
Using a variant of the linear method of Umrigar and
co-workers39, up to 40 variational parameters, including
one-body, two-body and three-body Jastrow factors,
are optimized within VMC, after which we obtain
the ground-state energies with diffusion Monte Carlo
(DMC) under the fixed node approximation5. For all
molecules, we used nodal surfaces coming from HF and the
aforementioned DFT functionals as a way of assessing
the quality of the trial wavefunction. Jastrow parameters
were optimized independently for each molecule
and each trial wavefunction. All calculations used a
time step of 0.001, for which the time-step error lies within the
error bars of the extrapolation to zero time step. This was verified by
randomly selecting 10 molecules of different sizes and
running the time-step extrapolation. With such small
time steps, we increased the decorrelation time to
avoid auto-correlation. Each molecule used 4096 walkers
and 2000 blocks to ensure convergence. Using the resources
of the ALCF-Theta supercomputer (Cray XC40,
with Intel Xeon Phi KNL processors), each molecule
was run on 32 nodes and 128 threads (2 hyperthreads)
for 1 hour of compute time, for a total of 9.6M core-hours.
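The time-step check described above amounts to fitting the DMC energy as a function of the time step and comparing the production value with the zero-time-step intercept. The sketch below assumes a linear dependence on the time step and uses made-up placeholder numbers (they are not results from this work). The function name `linear_fit`, the list of time steps and the assumed error bar are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Placeholder data, constructed as E(tau) = -10.0 + 0.5 * tau for illustration.
time_steps = [0.001, 0.002, 0.004, 0.008]
energies = [-9.9995, -9.9990, -9.9980, -9.9960]
error_bar = 0.001  # assumed statistical error bar of each energy

e_zero, slope = linear_fit(time_steps, energies)

# The production time step is acceptable if its bias is below the error bar.
bias_at_production = abs(energies[0] - e_zero)
production_step_ok = bias_at_production < error_bar
```

With these placeholder values the intercept is -10.0 and the bias at the smallest time step is 0.0005, which is below the assumed error bar. In practice each energy carries its own statistical uncertainty, and a weighted fit would be the more careful choice.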
For all QML models, we rely on a local representation
called atomic Spectrum of London and Axilrod-Teller-Muto
potential (aSLATM)23, with weights of 1
and 1/3 for the 2-body London potential and the 3-body
Axilrod-Teller-Muto potential, respectively (note that 1-body
terms are not necessary), to describe atoms in
molecules (i.e., atomic environments). Default 1D grids
were used, as implemented in the original aqml code
(available at https://github.com/binghuang2018/aqml),
with grid spacing 0.05 Å (rad) for the 2-body (3-body)
potential, ranging from 0.2 to a maximal atomic cutoff
of 4.8 Å for the 2-body part and from 0 to π for the 3-body
part, respectively. A smearing width of 0.05 Å (rad)
was used for the normalized 1D Gaussian distribution
centered on each bond distance (bond angle) within the
atomic cutoff. The L2 norm was used to compute the distances
between two atomic environments, and a Gaussian
kernel was used to measure their similarity, with an atom
type (characterized by nuclear charge) dependent kernel
width set to the maximal value of the aSLATM distance
between all pairs of atomic environments (of the same kind)
divided by 2 ln 2. A universal regularization parameter
of 10⁻⁴ was used to reduce the complexity of all QML
models.
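The grid, smearing and kernel-width settings quoted above can be made concrete with a strongly simplified sketch. It is not the aSLATM representation: the London weighting, the 3-body term and the separation into element-pair channels are all omitted, and only Gaussian-smeared neighbour distances on the 2-body grid are kept. The kernel form exp(-d²/(2σ²)) is an assumption, the divisor 2 ln 2 is taken literally from the text, and the neighbour distance lists are hypothetical.

```python
import math

# Values quoted in the text for the 2-body part (in Angstrom).
R_MIN, R_MAX, GRID_SPACING, SMEARING = 0.2, 4.8, 0.05, 0.05

def radial_grid():
    n_points = int(round((R_MAX - R_MIN) / GRID_SPACING)) + 1
    return [R_MIN + i * GRID_SPACING for i in range(n_points)]

def smeared_two_body(distances, grid):
    """Sum of normalized 1D Gaussians centered on each neighbour distance."""
    norm = 1.0 / (SMEARING * math.sqrt(2.0 * math.pi))
    vector = []
    for r in grid:
        total = 0.0
        for d in distances:
            if d <= R_MAX:  # neighbours beyond the atomic cutoff are ignored
                total += norm * math.exp(-0.5 * ((r - d) / SMEARING) ** 2)
        vector.append(total)
    return vector

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def gaussian_kernel(u, v, sigma):
    return math.exp(-l2_distance(u, v) ** 2 / (2.0 * sigma ** 2))

grid = radial_grid()
# Hypothetical neighbour distances of three atomic environments of the same element.
neighbour_lists = [[1.09, 1.09, 1.09, 1.09],
                   [1.09, 1.09, 1.09, 1.54],
                   [0.96, 0.96]]
envs = [smeared_two_body(ds, grid) for ds in neighbour_lists]

# Kernel width: largest pairwise distance divided by 2 ln 2, as stated in the text.
d_max = max(l2_distance(a, b) for a in envs for b in envs)
sigma = d_max / (2.0 * math.log(2.0))
K = [[gaussian_kernel(a, b, sigma) for b in envs] for a in envs]
```

With this width rule the two most dissimilar environments always receive the same kernel value, exp(-2 (ln 2)²) under the assumed kernel form, and all other pairs lie between that value and 1.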
IV. METHODOLOGY
A. From amons to ∆-AQML
To provide sufficient context, we now briefly summa-
rize the key ideas underlying amons and their use within
∆-AQML models. The interested reader is referred to
the original papers23,40 for further details.
The amons approach attempts to mitigate the curse of
dimensionality in CCS through selection of the smallest
possible, yet “optimal”, training set on-the-fly after having
been provided a specific query test molecule (or
a set thereof). The amons selection procedure23 can be roughly
divided into three major steps: a) Perceive the connectivity
graph G of any query based on its 3D geometry.
b) Collect all isomorphic subgraphs {G_i} of G with no
more than N_I heavy atoms (N_I is set to 5 in this study)
based on an efficient tree enumeration algorithm. Subgraph
isomorphism helps retain all hybridization states
of atoms during fragmentation (hydrogens are added to
heavy atoms when necessary). c) Perform geometry relaxation
for each fragment with some force field and subsequently
with a quantum chemical approach, with dihedral angles
constrained to match those in the query molecule so
as to avoid too much change in conformational degrees of
freedom. The resulting fragments, if unique and intact
(i.e., no dissociation or graph change after geometry relaxation),
are then selected for the amon database.
If amon conformers are neglected, i.e., precise geometry
information of the query is disregarded by providing
only the molecular graph of the query (which allows
a further reduction in training set size), the last step (step
c) is replaced by a global minimum search based on force
field methods, followed by geometry relaxation at some
quantum chemical level of theory. Given the complete
set of amons, any query molecule, even if substantially
larger than the amons in molecular size, can then be
predicted to high accuracy by QML models, when used
in conjunction with atomic representations. Note that
the overall number of amons for any given query can be
rather small, in particular for highly regular systems
exhibiting repeating patterns or periodicity, e.g. polymers,
peptides, or crystals.
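Step b) of the selection procedure, the enumeration of connected fragments with at most N_I heavy atoms, can be sketched in a few lines. This is a simplified illustration and not the amon-selection algorithm of Ref. 23: it works on a bare heavy-atom adjacency list, does not add hydrogens, does not track hybridization or bond orders, does not remove duplicate fragments by graph isomorphism, and skips the geometry relaxation of step c). The function name `connected_subgraphs` and the example molecule (the heavy-atom chain of 1-pentanol) are hypothetical choices.

```python
def connected_subgraphs(adjacency, max_size):
    """Return all connected node subsets with at most max_size nodes."""
    found = set()
    frontier = {frozenset([node]) for node in adjacency}
    while frontier:
        found |= frontier
        next_frontier = set()
        for subset in frontier:
            if len(subset) == max_size:
                continue  # fragment has reached the heavy-atom limit
            # Grow the fragment by one neighbouring atom at a time.
            neighbours = set().union(*(adjacency[n] for n in subset)) - subset
            for nb in neighbours:
                next_frontier.add(subset | {nb})
        frontier = next_frontier - found
    return found

# Heavy-atom graph of 1-pentanol: C0-C1-C2-C3-C4-O5 (hydrogens omitted).
elements = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "O"}
adjacency = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}

N_I = 5  # maximal number of heavy atoms per fragment, as in this study
fragments = connected_subgraphs(adjacency, N_I)

# Element composition of each fragment, e.g. "CCO" for the atoms {3, 4, 5}.
compositions = sorted("".join(elements[i] for i in sorted(f)) for f in fragments)
```

For this linear chain of six heavy atoms the enumeration yields 20 connected fragments. Many of them are chemically equivalent (for example all C-C pairs), which is why a uniqueness filter, such as a canonical graph or SMILES comparison, is needed before fragments enter the amon database.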