
Towards DMC accuracy across chemical space with scalable ∆-QML

Bing Huang,1, a) O. Anatole von Lilienfeld,2, 3, 4, b) Jaron T. Krogel,5, c) and Anouar Benali6, d)

1) University of Vienna, Faculty of Physics, Kolingasse 14-16, 1090 Vienna, Austria

2) Departments of Chemistry, Materials Science and Engineering, and Physics, University of Toronto, St. George Campus, Toronto, ON, Canada

3) Vector Institute for Artificial Intelligence, Toronto, ON, M5S 1M1, Canada

4) Machine Learning Group, Technische Universität Berlin and Institute for the Foundations of Learning and Data, 10587 Berlin, Germany

5) Materials Science and Technology Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, United States

6) Computational Sciences Division, Argonne National Laboratory, Argonne, IL 60439, United States

In the past decade, quantum diffusion Monte Carlo (DMC) has been demonstrated to successfully predict the energetics and properties of a wide range of molecules and solids by numerically solving the electronic many-body Schrödinger equation. We show that when coupled with quantum machine learning (QML) based surrogate methods the computational burden can be alleviated such that QMC shows clear potential to undergird the formation of high quality descriptions across chemical space. We discuss three crucial approximations necessary to accomplish this: the fixed node approximation, universal and accurate references for chemical bond dissociation energies, and scalable minimal amons set based QML (AQML) models. Numerical evidence presented includes converged DMC results for over one thousand small organic molecules with up to 5 heavy atoms used as amons, and 50 medium sized organic molecules with 9 heavy atoms to validate the AQML predictions. Numerical evidence collected for ∆-AQML models suggests that already modestly sized QMC training data sets of amons suffice to predict total energies with near chemical accuracy throughout chemical space.

I. INTRODUCTION

The predictive accuracy of quantum machine learning (QML) models trained on quantum chemistry data and used for the navigation of chemical compound space (CCS) is inherently limited by the predictive accuracy of the approximations used within the underlying quantum theory¹. Consequently, in order for QML models to achieve the coveted threshold of chemical accuracy (∼1 kcal/mol average deviation of calculated atomization energies from experimental measurements), it is necessary to rely on training data generated at least at the post-Hartree-Fock level, e.g. CCSD(T)/CBS. Unfortunately, the 'gold standard' in the field, CCSD(T)/CBS, generally imposes considerable computational cost due to steep prefactors and scaling as $O(N^7)$ ($N$ corresponding to system size)². As such, the routine generation of large high quality quantum data sets has remained elusive, even for relatively small organic molecules with only four or five 'heavy' (second-row) atoms. Here, we demonstrate for an exemplary subset of CCS (namely organic molecules) the usefulness of recently implemented and numerically more efficient Quantum Monte Carlo (QMC) methods for computing QML training data. Our numerical evidence indicates that it is possible to routinely train QML models that achieve predictive power similar to QMC but at much reduced computational cost.

a) Electronic mail: bing.huang@univie.ac.at

b) Electronic mail: anatole.vonlilienfeld@utoronto.ca

c) Electronic mail: krogeljt@ornl.gov

d) Electronic mail: benali@anl.gov

QMC approaches solve the many-body electronic Schrödinger equation stochastically. QMC is general and applicable to a wide range of physical and chemical systems in any dimension and with any boundary condition. Amongst the most widely used flavors for electronic structure are variational Monte Carlo (VMC)³,⁴ and diffusion Monte Carlo (DMC)⁵. Both VMC and DMC are variational methods and allow one to estimate the energy and properties of a given trial wavefunction without computing matrix elements, which places no restriction on its functional form. In the VMC algorithm, a stochastic numerical integration scheme estimates the expectation value of the energy for any form of the trial wavefunction by averaging the local energy over an ensemble of configurations distributed as $\psi^2$, sampled during a random walk in configuration space using Metropolis⁶ or Langevin algorithms⁷. The fluctuations of the local energy depend on the quality of the trial wavefunction, and they are zero if the exact wavefunction is used (zero-variance principle). The DMC algorithm is very similar, but the sampling goes beyond the $\psi^2$ distribution function by solving the Schrödinger equation in imaginary time $\tau = it$ using a projector or a Green's function based method. Any initial state $|\psi\rangle$ that is not orthogonal to the ground state $|\phi_0\rangle$ will evolve to the ground state in the long time limit, since every excited-state component decays exponentially fast relative to it, leaving the true ground state of the system.

arXiv:2210.06430v1 [physics.chem-ph] 12 Oct 2022

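$$\lim_{\tau \to \infty} \Psi(R, \tau) = c_0 \, e^{-\epsilon_0 \tau} \, \phi_0(R) \tag{1}$$

where $|\psi\rangle$ was expanded in eigenstates $|\phi_i\rangle$ of the Hamiltonian as

$$|\psi\rangle = \sum_{i=0}^{\infty} c_i \, |\phi_i\rangle, \qquad \hat{H} |\phi_i\rangle = \epsilon_i |\phi_i\rangle \tag{2}$$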
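The projection in Eqs. (1) and (2) can be checked numerically on a small matrix. The following is a minimal sketch, not the DMC algorithm itself: it assumes a toy 6-site tight-binding chain as the "Hamiltonian" (our choice, purely for illustration), applies $e^{-\tau \hat{H}}$ through the eigen-decomposition, and tracks how the overlap with $\phi_0$ approaches one. The function name `imaginary_time_projection` is hypothetical.

```python
import numpy as np

def imaginary_time_projection(hamiltonian, psi0, tau):
    """Apply exp(-tau * H) to psi0 via the eigen-decomposition of H, then normalize."""
    energies, states = np.linalg.eigh(hamiltonian)  # eps_i (ascending) and phi_i (columns)
    coeffs = states.T @ psi0                        # c_i = <phi_i | psi>
    # Shift by eps_0 so the exponentials stay bounded; normalization removes the shift.
    damped = coeffs * np.exp(-(energies - energies[0]) * tau)
    psi_tau = states @ damped
    return psi_tau / np.linalg.norm(psi_tau), energies, states

# Toy Hamiltonian (assumption): 6-site chain with hopping -1 between neighbours.
n_sites = 6
h = -(np.eye(n_sites, k=1) + np.eye(n_sites, k=-1))
psi0 = np.zeros(n_sites)
psi0[0] = 1.0  # localized start, not orthogonal to the ground state

results = {}
for tau in (0.0, 1.0, 10.0, 50.0):
    psi, energies, states = imaginary_time_projection(h, psi0, tau)
    overlap = abs(states[:, 0] @ psi)         # |<phi_0 | psi(tau)>|
    excess = psi @ h @ psi - energies[0]      # energy above eps_0
    results[tau] = (overlap, excess)
    print(f"tau={tau:5.1f}  overlap={overlap:.8f}  E - eps0 = {excess:.3e}")
```

The rate of convergence is set by the gap $\epsilon_1 - \epsilon_0$: each excited component is suppressed by $e^{-(\epsilon_i - \epsilon_0)\tau}$ relative to the ground state, which is the content of Eq. (1).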

As mentioned, the sampling in DMC is not constrained to a specific distribution and takes into account all electronic correlations, which in principle makes DMC a rigorously exact method. While this is true when solving for bosonic particles, solving for fermionic particles requires some approximations to remain computationally feasible and maintain the anti-symmetric nature of the wavefunction. Some of these approximations are controlled and can be rigorously extrapolated out (such as time steps, use of effective core potentials or pseudopotentials, etc.). The only uncontrolled source of error⁸ is the fixed-node (FN) approximation introduced to suppress the fluctuations of the sign of the wavefunction (fermion sign problem). This approximation means that any proposed configuration changing the sign of the wavefunction is rejected, while any configuration lowering the local energy is promoted. Since DMC is variational, if the positions of the nodes of the trial wavefunction are exact, the averaged local energy is rigorously the exact ground state energy. FN-DMC energies are an upper bound to the exact ground state energy⁹. This implies that, from a FN-DMC perspective, trial wavefunctions differ only by their nodal surface, and a better nodal surface leads to a lower energy and variance.
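The variational bound and the zero-variance principle discussed above can be illustrated with a deliberately simple VMC run. The sketch below is an assumption-laden toy, not the production setup of this work: a 1D harmonic oscillator ($\hat{H} = -\tfrac{1}{2}\,d^2/dx^2 + \tfrac{1}{2}x^2$, in atomic units) with trial wavefunction $\psi_\alpha(x) = e^{-\alpha x^2}$, for which the local energy is $E_L(x) = \alpha + x^2(\tfrac{1}{2} - 2\alpha^2)$. The names `local_energy` and `vmc_energy` are hypothetical.

```python
import math
import random

def local_energy(x, alpha):
    """E_L = (H psi) / psi for psi = exp(-alpha x^2) and H = -1/2 d2/dx2 + 1/2 x^2."""
    return alpha + x * x * (0.5 - 2.0 * alpha * alpha)

def vmc_energy(alpha, n_steps=50000, step=1.0, seed=1):
    """Metropolis random walk sampling psi^2; returns mean and variance of E_L."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)
        # Acceptance ratio psi(x_new)^2 / psi(x)^2 for the Gaussian trial function.
        if rng.random() < math.exp(-2.0 * alpha * (x_new * x_new - x * x)):
            x = x_new
        samples.append(local_energy(x, alpha))
    mean = sum(samples) / len(samples)
    variance = sum((e - mean) ** 2 for e in samples) / len(samples)
    return mean, variance

# alpha = 0.5 is the exact ground state: E_L is constant, so the variance vanishes.
e_exact, var_exact = vmc_energy(0.5)
# alpha = 0.3 is an imperfect trial function: higher energy, finite variance.
e_trial, var_trial = vmc_energy(0.3)
print(f"alpha=0.5: E = {e_exact:.4f}, var = {var_exact:.2e}")
print(f"alpha=0.3: E = {e_trial:.4f}, var = {var_trial:.2e}")
```

For this toy model the analytic energy is $E(\alpha) = \alpha/2 + 1/(8\alpha)$, so the imperfect trial function should give about 0.567, above the exact value of 0.5, in line with the variational principle.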

Despite the FN approximation, DMC has been shown to reach errors below the chemical accuracy threshold of 1 kcal/mol for chemical systems¹⁰–¹⁴ and of a few tens of meV per unit cell for solids within periodic boundary conditions¹⁵–¹⁸. Recently, multiple calculations using a selected Configuration Interaction (sCI) trial wavefunction have demonstrated how to systematically reduce the error from the fixed nodes¹⁹–²¹. Nevertheless, these errors have proven to be significant only for open shell or multi-reference molecules¹²,²².

From the above discussion, we see that on the one hand, stochastic numerical sampling permits independent evaluations, making the method embarrassingly parallel and highly efficient for high performance computing (HPC). On the other hand, accuracy is a direct consequence of the quality of the fixed nodes in the trial wavefunction: if the nodes are exact, the method is rigorously exact. Using DMC energies as reference for QML models will then boost efficiency by several orders of magnitude, since the property of any new out-of-sample query compound can be predicted after training, solely based on inference from the DMC information stored in the training data. In order to retain the predictive accuracy of the reference data, however, a significant amount of training data can be necessary. This issue is essentially caused by the random selection of training instances, which should be representative of the query compounds. To rise to this challenge, some of us (BH, OAvL) recently introduced the amon (A) based QML method, which enables a dramatic reduction in training set size as well as in the size of the training molecules²³.

Amons correspond to systematically fragmented entities of query target molecules, containing an increasing number of heavy atoms (typically no more than 7). They can be seen as effective building blocks of target compounds with atomic states being perturbed according to their chemical environment. With amons used as training set, QML models trained on the fly (AQML) represent a scalable approach which can be applied throughout chemical space to predict quantum properties of large molecules.

For this study, we combine DMC reference calculations with AQML and ∆-AQML models, and we numerically demonstrate the feasibility of these approaches, ∆-AQML in particular, to achieve chemical accuracy by making use of a dictionary of 1175 small amons of QM9²⁴ with up to only 5 heavy atoms (not counting hydrogens), together with reference energies calculated at much cheaper levels of theory, including mean-field theory (Hartree-Fock (HF) and density functional theory (DFT), the latter using various levels of approximation according to Jacob's ladder) and Møller–Plesset perturbation theory to second order (MP2).

II. DATA-SETS

1175 unique amon graphs (i.e., molecular graphs) with up to 5 heavy atoms were first identified by application of the amon-selection algorithm²³ (also briefly summarized below in the methodology section IV A) to QM9²⁴,²⁵ molecules, with SMILES strings as the only input. Then, for each amon, one hundred conformers were sampled using RDKit²⁶ and optimized with the MMFF94 force field, and only the global minimum configuration (i.e., the conformer with the lowest force field energy) was chosen. The geometries of the thus-selected amon conformers were further optimized at the B3LYP/Def2TZVP level of theory. Based on these geometries, single point energies were calculated at multiple levels of theory, including HF, PBE, PBE0, B3LYP, MP2 (with the cc-pVTZ basis) and DMC (see the next section for details). The resulting dataset can be seen as a compact dictionary of small molecules, which is looked up later in AQML for any query molecule of larger size.

To test the QML models, 50 molecules, each made up of 9 heavy atoms, were randomly drawn from the QM9 dataset. Geometries were optimized at the same level of theory as for the amons, followed by single point energy calculations at all levels of theory mentioned above, including DMC. For a depiction of all test molecules, see Fig. 5.

III. COMPUTATIONAL DETAILS

We used B3LYP/Def2TZVP for geometry optimization as implemented in the Gaussian 09²⁷ code. For HF, DFT and post-HF (MP2) single point energies, we switched to the cc-pVTZ basis and instead used Molpro2018²⁸ with the cc-pVQZ-jkfit density-fitting basis to speed up the computation of both Coulomb and exchange integrals.

For DMC calculations, we used a trial wavefunction with a Slater-Jastrow form²⁹

$$\Psi_T(\vec{R}) = \exp\left[\sum_i J_i(\vec{R})\right] \sum_k^{M} C_k \, D_k^{\uparrow}(\varphi) \, D_k^{\downarrow}(\varphi) \tag{3}$$

where $D_k^{\downarrow}(\varphi)$ is a Slater determinant expressed in terms of single particle orbitals (SPO) $\varphi_i = \sum_l^{N_b} C_l^i \Phi_l$. We use DMC to evaluate the total and formation energies of all molecules in the data-sets, as implemented in the QMCPACK code³⁰,³¹. Our trial many-body wavefunctions are constructed as the product of the Jastrow functions and a single Slater determinant from Hartree-Fock, PBE³², PBE0³²,³³ and B3LYP³⁴–³⁷ Kohn-Sham (KS) orbitals obtained from the PYSCF³⁸ package. Using a variant of the linear method of Umrigar and co-workers³⁹, up to 40 variational parameters including one-body, two-body and three-body Jastrow factors are optimized within VMC, after which we obtain the ground-state energies with diffusion Monte Carlo (DMC) under the fixed node approximation⁵. For all molecules, we used nodal surfaces coming from HF and the aforementioned DFT functionals as a way of assessing the quality of the trial wavefunction. Jastrow parameters were optimized independently for each molecule and each trial wavefunction. All calculations used a time-step of 0.001, for which the time-step error lies within the error bars of the value extrapolated to zero time-step. This was verified by randomly selecting 10 molecules of different size and running the time-step extrapolation. With such small time-steps, we increased the decorrelation time to avoid auto-correlation. Each molecule used 4096 walkers and 2000 blocks to ensure convergence. Using the resources of the ALCF-Theta supercomputer (Cray XC40, with Intel Xeon Phi KNL processors), each molecule was run on 32 nodes and 128 threads (2 hyperthreads) for 1 hour of compute time, for a total of 9.6M core-hours.
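The time-step check described above amounts to fitting DMC energies obtained at several time-steps and reading off the intercept at zero. The following is a minimal sketch under stated assumptions: the energies, their error bar, the choice of time-steps and the linear form of the bias are all synthetic placeholders invented for illustration, not results from this work, and the names `extrapolate_to_zero_timestep`, `timesteps` and `energies` are hypothetical.

```python
import numpy as np

def extrapolate_to_zero_timestep(timesteps, energies, errors):
    """Weighted linear fit E(tau) = e0 + slope * tau; returns (e0, slope)."""
    weights = 1.0 / np.asarray(errors)  # np.polyfit expects 1/sigma weights
    slope, e0 = np.polyfit(timesteps, energies, deg=1, w=weights)
    return e0, slope

# Synthetic data (assumption): a linear time-step bias plus small fixed offsets
# standing in for statistical noise. Energies are nominally in Hartree.
timesteps = np.array([0.001, 0.0025, 0.005, 0.01, 0.02])
true_e0, true_slope = -40.5, 0.1
noise = np.array([2e-4, -1e-4, 1e-4, -2e-4, 0.0])
energies = true_e0 + true_slope * timesteps + noise
errors = np.full_like(timesteps, 3e-4)

e0_fit, slope_fit = extrapolate_to_zero_timestep(timesteps, energies, errors)
bias_at_smallest = slope_fit * timesteps[0]  # estimated bias at tau = 0.001
print(f"extrapolated E(tau -> 0) = {e0_fit:.5f}")
print(f"estimated bias at tau = 0.001: {bias_at_smallest:.2e} (error bar {errors[0]:.1e})")
```

If the estimated bias at the production time-step is smaller than the statistical error bar, as in this synthetic case, running every molecule at that single time-step is a reasonable economy.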

For all QML models, we rely on a local representation called atomic Spectrum of London and Axilrod-Teller-Muto potential (aSLATM)²³ to describe atoms in molecules (i.e., atomic environments), with weights of 1 and 1/3 for the 2-body London potential and the 3-body Axilrod-Teller-Muto potential, respectively (note that 1-body terms are not necessary). Default 1D grids were used, as implemented in the original aqml code (available at https://github.com/binghuang2018/aqml), with a grid spacing of 0.05 Å (rad) for the 2-body (3-body) potential, ranging from 0.2 Å to a maximal atomic cutoff of 4.8 Å for the 2-body part and from 0 to $\pi$ for the 3-body part. A smearing width of 0.05 Å (rad) was used for the normalized 1D Gaussian distribution centered on each bond distance (bond angle) within the atomic cutoff. The $L_2$ norm was used to compute the distance between two atomic environments and a Gaussian kernel was used to measure their similarity, with an atom type (characterized by nuclear charge) dependent kernel width set to the maximal aSLATM distance between all pairs of atomic environments (of the same kind) divided by $\sqrt{2 \ln 2}$. A universal regularization parameter of $10^{-4}$ was used to reduce the complexity of all QML models.
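The kernel settings above can be sketched with a small kernel ridge regression example. This is a simplified illustration under several assumptions, not the aqml implementation: it uses one global feature vector per sample instead of sums over local atomic kernels with per-element widths, it assumes the Gaussian form $k = \exp(-d^2 / 2\sigma^2)$, and the `baseline` and `target` functions are synthetic stand-ins for a cheap and an expensive level of theory. The ∆ model learns only the difference between the two and adds the baseline back at prediction time.

```python
import numpy as np

def kernel_width(x):
    """Maximal pairwise L2 distance divided by sqrt(2 ln 2)."""
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return dist.max() / np.sqrt(2.0 * np.log(2.0))

def gaussian_kernel(xa, xb, sigma):
    """k(a, b) = exp(-|a - b|^2 / (2 sigma^2)) (assumed functional form)."""
    d2 = ((xa[:, None, :] - xb[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train(x, y, sigma, lam=1e-4):
    """Solve (K + lam * I) alpha = y for the regression coefficients."""
    k = gaussian_kernel(x, x, sigma)
    return np.linalg.solve(k + lam * np.eye(len(x)), y)

def predict(x_train, alpha, x_query, sigma):
    return gaussian_kernel(x_query, x_train, sigma) @ alpha

# Synthetic stand-ins (assumption): 1D "representation", cheap baseline, costly target.
def baseline(x):
    return 10.0 * np.sin(3.0 * x[:, 0])

def target(x):
    return baseline(x) + 0.3 * x[:, 0] ** 2

x_train = np.linspace(0.0, 1.0, 20)[:, None]
x_query = np.linspace(0.05, 0.95, 10)[:, None]

sigma = kernel_width(x_train)
k_train = gaussian_kernel(x_train, x_train, sigma)

# Delta learning: fit target - baseline, then add the baseline of the query back.
alpha = train(x_train, target(x_train) - baseline(x_train), sigma)
delta_pred = baseline(x_query) + predict(x_train, alpha, x_query, sigma)
max_err = np.abs(delta_pred - target(x_query)).max()
print(f"sigma = {sigma:.4f}, smallest kernel value = {k_train.min():.4f}")
print(f"max abs error of the delta model on queries = {max_err:.2e}")
```

One consequence of this width choice, under the assumed Gaussian form, is that the two most dissimilar training environments still have a kernel value of exactly 1/2, so no pair of environments is treated as fully unrelated.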

IV. METHODOLOGY

A. From amons to ∆-AQML

To provide sufficient context, we now briefly summarize the key ideas underlying amons and their use within ∆-AQML models. The interested reader is referred to the original papers²³,⁴⁰ for further details.

The amons approach attempts to mitigate the curse of dimensionality in CCS through selection of the smallest possible, yet "optimal", training set on-the-fly after having been provided a specific query test molecule (or a set). The amons selection procedure²³ can be roughly divided into three major steps: a) Perceive the connectivity graph $G$ of any query based on its 3D geometry. b) Collect all isomorphic subgraphs $\{G_i\}$ of $G$ with no more than $N_I$ heavy atoms ($N_I$ is set to 5 in this study) based on an efficient tree enumeration algorithm. Subgraph isomorphism helps retain all hybridization states of atoms during fragmentation (hydrogens are added to heavy atoms when necessary). c) Perform geometry relaxation for each fragment with some force field and subsequently a quantum chemical approach, with dihedral angles constrained to match those in the query molecule so as to avoid too much change in conformational degrees of freedom. The resulting fragments, if they are unique and survive relaxation (i.e., no dissociation or graph change after geometry relaxation), are then selected for the amon database.
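Step b) above can be pictured as enumerating connected subgraphs of the heavy-atom graph up to a size limit. The sketch below is a simplified illustration, not the amon-selection algorithm of Ref. 23: it works on a plain adjacency dictionary, ignores bond orders, hybridization, hydrogen saturation, fragment uniqueness and the subsequent relaxation in step c), and the name `connected_subgraphs` and the example graph are our own assumptions.

```python
def connected_subgraphs(adjacency, max_size):
    """Return all connected induced subgraphs (as frozensets of nodes)
    containing at most max_size nodes."""
    found = set()
    frontier = {frozenset([node]) for node in adjacency}
    for _ in range(max_size):
        found |= frontier
        grown = set()
        # Grow every subgraph of the current size by one neighbouring node.
        for sub in frontier:
            for node in sub:
                for neighbour in adjacency[node]:
                    if neighbour not in sub:
                        grown.add(sub | {neighbour})
        frontier = grown - found
    return found

# Heavy-atom skeleton of 2-aminoethanol, N-C-C-O, as a toy query (assumption).
elements = {0: "N", 1: "C", 2: "C", 3: "O"}
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

fragments = connected_subgraphs(adjacency, max_size=5)
small_fragments = connected_subgraphs(adjacency, max_size=2)
for frag in sorted(fragments, key=lambda f: (len(f), sorted(f))):
    print(len(frag), "".join(elements[i] for i in sorted(frag)))
```

For this four-atom chain the enumeration yields 4 + 3 + 2 + 1 = 10 connected fragments, of which 7 have at most two heavy atoms. In the full procedure, fragments that map to the same molecular graph would then be merged before relaxation.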

If amon conformers are neglected, i.e., precise geometry information of the query is disregarded by providing only the molecular graph of the query (which allows further reduction in training set size), the last step (step c) is replaced by a global minimum search based on force field methods, followed by geometry relaxation at some quantum chemical level of theory. Given the complete set of amons, any query molecule, even if substantially larger than the amons in molecular size, can then be predicted to high accuracy by QML models when used in conjunction with atomic representations. Note that the overall number of amons for any given query can be rather small, in particular for highly regular systems exhibiting repeating patterns or periodicity, e.g. polymers, peptides, or crystals.
