Semiempirical Hamiltonians learned from data can have accuracy comparable to Density Functional Theory

Frank Hu,∗ Francis He, and David J. Yaron∗

Department of Chemistry, Carnegie Mellon University, Pittsburgh, PA

E-mail: frankhu@stanford.edu; yaron@cmu.edu

Phone: (412)-951-4516

Abstract

Quantum chemistry provides chemists with invaluable information, but the high computational cost limits the size and type of systems that can be studied. Machine learning (ML) has emerged as a means to dramatically lower cost while maintaining high accuracy. However, ML models often sacrifice interpretability by using components, such as the artificial neural networks of deep learning, that function as black boxes. These components impart the flexibility needed to learn from large volumes of data but make it difficult to gain insight into the physical or chemical basis for the predictions. Here, we demonstrate that semiempirical quantum chemical (SEQC) models can learn from large volumes of data without sacrificing interpretability. The SEQC model is that of Density Functional based Tight Binding (DFTB) with fixed atomic orbital energies and interactions that are one-dimensional functions of interatomic distance. This model is trained to ab initio data in a manner that is analogous to that used to train deep learning models. Using benchmarks that reflect the accuracy of the training data, we show that the resulting model maintains a physically reasonable functional form while achieving an accuracy, relative to coupled cluster energies with a complete basis set extrapolation (CCSD(T)*/CBS), that is comparable to that of density functional theory (DFT). This suggests that trained SEQC models can achieve low computational cost and high accuracy without sacrificing interpretability. Use of a physically-motivated model form also substantially reduces the amount of ab initio data needed to train the model compared to that required for deep learning models.

arXiv:2210.11682v2 [physics.chem-ph] 10 Jan 2023

1 Introduction

A substantial challenge for quantum chemistry is lowering the computational cost [1–6] to enable accurate predictions on large systems such as those of interest in biological and material applications. Molecular systems have two properties that provide the basis for approximations that lower computational costs: nearsightedness and molecular similarity. Nearsightedness provides the chemical basis for methods that have, over the past few decades, substantially reduced computational cost without large sacrifices in accuracy. In particular, large reductions in cost can be achieved by replacing detailed Coulomb interactions, required at short range, with increasingly coarse-grained multi-polar interactions at long range [7–10]. Methods have also been developed that use molecular similarity to achieve dramatic reductions in computational cost, including molecular mechanics [11,12] and semiempirical quantum chemistry (SEQC) [13]. Unfortunately, these cost reductions have typically come with a substantial decrease in accuracy. More recently, machine learning (ML) has emerged as a means to leverage molecular similarity to develop models that are both low-cost and accurate [14–18]. However, current applications of ML in chemistry often incorporate little physics and function as black boxes that are difficult to interpret. Here, we combine ML with SEQC to create physics-based models that achieve high accuracy and computational efficiency without sacrificing interpretability.

The ability of ML to leverage molecular similarity stems from the use of highly flexible model forms such as the artificial neural networks (NNs) [19–26] of deep learning. This flexibility enables ML models to learn from large volumes of training data. For example, the accuracy of the ANI-1 neural network potential [20] improves as it is shown more training data, approaching chemical accuracy [27–31] of 1 kcal/mol when trained to ab initio results on millions of molecular configurations. However, this flexibility of ML models is a double-edged sword. It leads to high accuracy, but it also makes it difficult to gain insight into the physical or chemical basis for the predictions.

SEQC provides alternative model forms that are capable of learning from data. Traditional SEQC model forms such as PM3 [32] have only a handful of parameters, and this limits their ability to take advantage of large volumes of data [33]. Replacing these single parameters with NNs imparts the flexibility to learn from large volumes of data [34]; however, the NNs function as black boxes and so decrease interpretability. Here, we increase the flexibility of SEQC models so that they can take advantage of larger volumes of data while retaining a purely physics-based form. This is operationalized using the Density Functional based Tight Binding (DFTB) [35,36] Hamiltonian with model parameters that can be expressed in the Slater-Koster File (SKF) format [37]. DFTB includes only valence electrons and uses a minimal atomic orbital basis. The atomic orbital energies are constants that can be adjusted during training, and the interactions and overlaps between atomic orbitals are one-dimensional functions of interatomic distance. We will refer to this as the SKF-DFTB model form and to our resulting trained models as DFTBML.

The flexibility of DFTBML lies primarily in the one-dimensional functions. Over the distances present in typical molecules, the interactions described by these functions vary by hundreds of kcal/mol. Because the molecular energy arises from many such interactions, changes of a few tenths of a kcal/mol can have significant effects on the total energy. For the model to learn effectively from data, we need a functional form with the sensitivity to fine-tune these interactions while preventing oscillations and other non-physical behaviors. Here, the flexibility and sensitivity are provided through splines, i.e. piecewise polynomials, with a high polynomial order (five) and a large number of knots (100). To prevent oscillations and other non-physical behaviors, a strong regularization scheme is developed and implemented in our training of DFTBML.
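To make the size of this function space concrete, the following is a minimal sketch of a degree-five B-spline built with the standard Cox-de Boor recursion. It is not the DFTBML implementation: the function names, the distance range of 1.0 to 5.0, and the reading of "100 knots" as 100 uniformly spaced breakpoints with clamped ends are all assumptions made for illustration.

```python
def bspline_basis(i, k, knots, x):
    """Value at x of the i-th B-spline basis function of degree k (Cox-de Boor)."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    value = 0.0
    left_span = knots[i + k] - knots[i]
    if left_span > 0.0:
        value += (x - knots[i]) / left_span * bspline_basis(i, k - 1, knots, x)
    right_span = knots[i + k + 1] - knots[i + 1]
    if right_span > 0.0:
        value += (knots[i + k + 1] - x) / right_span * bspline_basis(i + 1, k - 1, knots, x)
    return value


def clamped_knots(r_min, r_max, n_breakpoints, degree):
    """Uniform breakpoints on [r_min, r_max] with the end knots repeated (clamped)."""
    step = (r_max - r_min) / (n_breakpoints - 1)
    breakpoints = [r_min + j * step for j in range(n_breakpoints)]
    breakpoints[-1] = r_max  # guard against floating-point drift at the end point
    return [r_min] * degree + breakpoints + [r_max] * degree


def evaluate_spline(coeffs, degree, knots, x):
    """Spline value at x as a linear combination of basis functions."""
    return sum(c * bspline_basis(i, degree, knots, x) for i, c in enumerate(coeffs))


# Hypothetical setup: degree 5, 100 breakpoints, arbitrary distance range.
DEGREE = 5
KNOTS = clamped_knots(1.0, 5.0, 100, DEGREE)
N_COEFFS = len(KNOTS) - DEGREE - 1  # number of trainable coefficients per function
```

Under these assumptions each one-dimensional function carries on the order of a hundred trainable coefficients, which is why the text pairs this flexibility with strong regularization.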

The DFTBML models explored here are trained to the ANI-1CCX dataset [23], which includes results from a number of different ab initio methods on organic molecules composed of C, N, O and H. The DFTBML models can reproduce the predictions of CCSD(T)*/CBS to about 3 kcal/mol, which is comparable to the accuracy of DFT (see Figure 1). We also show that 20000 molecular configurations are sufficient to train the model. This saturation of performance with increasing data suggests that the accuracy is limited by the SKF-DFTB model form itself, not by the amount of training data. The data requirements of DFTBML are considerably below the ∼1M data points typically used to train deep learning models, which is significant given that the generation of ab initio training data is a primary computational bottleneck in model development. This opens the possibility of using trained SEQC models as replacements for DFT, substantially reducing computational cost without, as in traditional SEQC models, sacrificing accuracy or, as in many ML models, sacrificing interpretability.

2 Results and discussion

2.1 Experimental design

Figure 1: Comparison of different quantum chemistry methods on atomization energies (see Equation 1 in Section 4). The heatmap is generated from the ∼230k molecular configurations in the ANI-1CCX dataset with up to eight heavy atoms, after removing configurations with incomplete entries. The DFTBML-CC/DFT parameterizations were trained to CCSD(T)*/CBS or wB97x/def2-TZVPP energies, respectively, on 20000 molecules with up to eight heavy atoms. DFTBML improves substantially on currently published DFTB parameters (MIO [35] and Auorg [38]), with the agreement between DFTBML-CC and CCSD(T)*/CBS being somewhat better than that between DFT (wB97x/def2-TZVPP) and CCSD(T)*/CBS.

To explore the performance of DFTBML, we train the model under various conditions. To aid comparisons, it is useful to introduce a standard notation for the resulting parameter sets. To evaluate the generalization of the DFTBML models, we consider both near- and far-transfer, with the difference being the degree to which the model is being transferred to larger systems. For near-transfer, where the training and testing data contain systems with 1–8 heavy atoms, we use “DFTBML” followed by the energy target (DFT for wB97x/def2-TZVPP; CC for CCSD(T)*/CBS) and the number of configurations in the training set, e.g. “DFTBML CC 20000”. For far-transfer, where the training data has molecules with 1–5 heavy atoms while the test data has molecules with 6–8 heavy atoms, we use “Transfer” as the prefix, e.g. “Transfer CC 20000”. We also consider results obtained when only a short-range repulsive potential is trained to the data, with the electronic parameters being those of Auorg [38]. For these models, we use “Repulsive” as a prefix, e.g. “Repulsive CC 20000”.
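The parameter-set labels and the far-transfer split of Section 2.1 can be sketched in a few lines. This is only an illustration of the convention as stated in the text; the function names and the representation of the dataset as a list of heavy-atom counts are hypothetical and not taken from the DFTBML code.

```python
ALLOWED_PREFIXES = {"DFTBML", "Transfer", "Repulsive"}
ALLOWED_TARGETS = {"DFT", "CC"}  # DFT: wB97x/def2-TZVPP, CC: CCSD(T)*/CBS


def parameter_set_label(prefix, target, n_train):
    """Build a label such as 'DFTBML CC 20000' from its three components."""
    if prefix not in ALLOWED_PREFIXES:
        raise ValueError(f"unknown prefix: {prefix}")
    if target not in ALLOWED_TARGETS:
        raise ValueError(f"unknown energy target: {target}")
    return f"{prefix} {target} {n_train}"


def far_transfer_split(heavy_atom_counts):
    """Return (train, test) index lists: 1-5 heavy atoms train, 6-8 heavy atoms test."""
    train = [i for i, n in enumerate(heavy_atom_counts) if 1 <= n <= 5]
    test = [i for i, n in enumerate(heavy_atom_counts) if 6 <= n <= 8]
    return train, test
```

For near-transfer, both sets would instead be drawn from molecules with 1–8 heavy atoms, so no size-based partition of this kind is needed.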

2.2 Effects of regularization on model performance

A challenge with developing the DFTBML model was creating an effective regularization scheme that would prevent overfitting without degrading model performance by being too restrictive. Without regularization, the resulting functions show highly oscillatory behavior (left column of Figure 2). Previous work [34,39] penalized deviations from a set of physically-derived reference parameters, e.g. deviation from the Auorg parameter set of DFTB. This approach to regularization is problematic because it may overly bias the training towards the reference parameters and does not prevent non-physical behaviors such as oscillation of a trained function around the smooth form of the reference function [39]. A commonly used approach for smoothing splines applies a penalty to the magnitude of the second derivative [40,41]. However, for DFTBML, such a smoothing penalty substantially degrades performance of the models because there is no reason to expect the second derivative to have a limited magnitude.

We instead adapt an approach from Akshay et al. [42], which is motivated by the shape of the functions in reference parameter sets, such as those of Auorg in Figure 2. For the Hamiltonian (H1) matrix elements, the functions decay smoothly to zero and have an upward curvature. To enforce this behavior, we apply a “convex” penalty that enforces the second derivative of the trained potentials, evaluated on a dense grid of 500 points, to have a physically motivated sign. For overlaps (S), there can also be an inflection point associated with nodes in the atomic orbitals (upper panels of Figure 2). We therefore extend the convex penalty to allow a single inflection point, whose location is optimized during training. The results indicate that, although inclusion of an inflection point improves model performance, the results are not sensitive to its precise location (see Section S12.3 of the Supporting Information). The magnitude of the weighting factor for these convex penalties does not require fine tuning beyond being large enough to prevent violations of the constraints without being so large that it leads to numerical instabilities in gradient descent optimization. The convex penalty successfully removes oscillatory behavior (middle column of Figure 2). However, the resulting functions exhibit non-physical, piecewise-linear behavior, which is more pronounced in the overlap integrals but also present in the Hamiltonian matrix elements (see inset in Figure 2).
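The idea behind the convex penalty can be illustrated with a small sketch. Several choices here are assumptions rather than details from the paper: the penalty is taken to be the sum of squared sign violations, the second derivative is approximated by central finite differences instead of being taken from the spline analytically, and the curve is assumed to be concave before the inflection point and convex after it. All function names are hypothetical.

```python
import math


def uniform_grid(r_min, r_max, n_points=500):
    """Dense, uniformly spaced evaluation grid (500 points as in the text)."""
    step = (r_max - r_min) / (n_points - 1)
    return [r_min + j * step for j in range(n_points)]


def convex_penalty(grid, values, inflection=None):
    """Sum of squared second-derivative values that have the wrong sign.

    Without an inflection point the curve must be convex everywhere.
    With one, it is assumed concave for r < inflection and convex beyond it.
    """
    spacing = grid[1] - grid[0]
    penalty = 0.0
    for j in range(1, len(grid) - 1):
        d2 = (values[j - 1] - 2.0 * values[j] + values[j + 1]) / spacing ** 2
        required_sign = 1.0
        if inflection is not None and grid[j] < inflection:
            required_sign = -1.0
        violation = min(0.0, required_sign * d2)  # zero when the sign is correct
        penalty += violation ** 2
    return penalty


GRID = uniform_grid(1.0, 5.0)
smooth_decay = [math.exp(-r) for r in GRID]                            # convex, decays to zero
oscillating = [math.exp(-r) + 0.05 * math.sin(20.0 * r) for r in GRID]  # non-physical wiggles
```

In this sketch `convex_penalty(GRID, smooth_decay)` vanishes while `convex_penalty(GRID, oscillating)` is positive, so multiplying the penalty by a large weight drives the optimizer away from oscillatory solutions without constraining the magnitude of the curvature.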

To remove this piecewise-linear behavior, we apply a “smoothing” penalty to the third derivative, based on the sum of squares of the third derivative evaluated on a grid of 500 points. Our use of a fifth-order spline for H1 and S is motivated by the high order needed for the spline to have a continuous third derivative. The magnitude of the penalty is adjusted to remove the piecewise-linear behavior while minimizing degradation of the model performance (see Section S12.4 of the Supporting Information). The short-range repulsion (R) does not exhibit piecewise-linear behavior, so a smoothing penalty is not applied and we use a third-order spline for R.
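A minimal sketch of such a third-derivative smoothing penalty follows. The sum-of-squares form on a 500-point grid comes from the text; the use of a central finite-difference stencil in place of the analytic spline derivative, the test curves, and the function names are assumptions made for illustration.

```python
import math


def smoothing_penalty(grid, values):
    """Sum of squares of a finite-difference third derivative on a uniform grid."""
    spacing = grid[1] - grid[0]
    penalty = 0.0
    for j in range(2, len(grid) - 2):
        d3 = (values[j + 2] - 2.0 * values[j + 1]
              + 2.0 * values[j - 1] - values[j - 2]) / (2.0 * spacing ** 3)
        penalty += d3 ** 2
    return penalty


N_POINTS = 500
GRID = [1.0 + 4.0 * j / (N_POINTS - 1) for j in range(N_POINTS)]

# A smooth decaying curve and a convex but piecewise-linear one (kink at r = 2.5).
smooth_curve = [math.exp(-r) for r in GRID]
kinked_curve = [max(3.0 - r, 1.0 - 0.2 * r) for r in GRID]

smooth_cost = smoothing_penalty(GRID, smooth_curve)
kinked_cost = smoothing_penalty(GRID, kinked_curve)
```

The kinked curve is convex everywhere, so a sign constraint on the second derivative alone would accept it; its third-derivative penalty, however, is far larger than that of the smooth curve, which is the behavior the smoothing term is meant to suppress.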

It is somewhat surprising, given the highly non-physical behavior observed without regularization, that the effects of regularization on model performance are not more dramatic (Table 1). For near-transfer, the performance of the unregularized model (4.97 kcal/mol) is a factor of two better than the Auorg reference model (10.55 kcal/mol). This is despite the highly oscillatory behavior of the functions and the fact that the test data and training data have molecules with disjoint empirical formulas. This suggests coupling between
