Semiempirical Hamiltonians learned from data can have
accuracy comparable to Density Functional Theory
Frank Hu, Francis He, and David J. Yaron
Department of Chemistry, Carnegie Mellon University, Pittsburgh, PA
E-mail: frankhu@stanford.edu; yaron@cmu.edu
Phone: (412)-951-4516
Abstract
Quantum chemistry provides chemists with invaluable information, but the high computational
cost limits the size and type of systems that can be studied. Machine learning (ML) has emerged as a
means to dramatically lower cost while maintaining high accuracy. However, ML models often sacrifice
interpretability by using components, such as the artificial neural networks of deep learning, that
function as black boxes. These components impart the flexibility needed to learn from large volumes
of data but make it difficult to gain insight into the physical or chemical basis for the predictions. Here,
we demonstrate that semiempirical quantum chemical (SEQC) models can learn from large volumes of
data without sacrificing interpretability. The SEQC model is that of Density Functional based Tight
Binding (DFTB) with fixed atomic orbital energies and interactions that are one-dimensional functions
of interatomic distance. This model is trained to ab initio data in a manner that is analogous to that
used to train deep learning models. Using benchmarks that reflect the accuracy of the training data,
we show that the resulting model maintains a physically reasonable functional form while achieving an
accuracy, relative to coupled cluster energies with a complete basis set extrapolation (CCSD(T)*/CBS),
that is comparable to that of density functional theory (DFT). This suggests that trained SEQC models
can achieve low computational cost and high accuracy without sacrificing interpretability. Use of a
physically-motivated model form also substantially reduces the amount of ab initio data needed to
train the model compared to that required for deep learning models.
1 Introduction
A substantial challenge for quantum chemistry is lowering the computational cost1–6 to enable accurate
predictions on large systems such as those of interest in biological and material applications. Molecular
systems have two properties that provide the basis for approximations that lower computational costs:
nearsightedness and molecular similarity. Nearsightedness provides the chemical basis for methods that have,
over the past few decades, substantially reduced computational cost without large sacrifices in accuracy. In
particular, large reductions in cost can be achieved by replacing detailed Coulomb interactions, required
at short range, with increasingly coarse-grained multipolar interactions at long range.7–10 Methods have
also been developed that use molecular similarity to achieve dramatic reductions in computational cost,
including molecular mechanics11,12 and semiempirical quantum chemistry (SEQC).13 Unfortunately, these
cost reductions have typically come with a substantial decrease in accuracy. More recently, machine learning
(ML) has emerged as a means to leverage molecular similarity to develop models that are both low-cost and
accurate.14–18 However, current applications of ML in chemistry often incorporate little physics and function
as black boxes that are difficult to interpret. Here, we combine ML with SEQC to create physics-based models
that achieve high accuracy and computational efficiency without sacrificing interpretability.
The ability of ML to leverage molecular similarity stems from the use of highly flexible model forms such
as the artificial neural networks (NNs)19–26 of deep learning. This flexibility enables ML models to learn
from large volumes of training data. For example, the accuracy of the ANI-1 neural network potential20
improves as it is shown more training data, approaching chemical accuracy27–31 of 1 kcal/mol when trained
to ab initio results on millions of molecular configurations. However, this flexibility of ML models is a
double-edged sword. It leads to high accuracy, but it also makes it difficult to gain insight into the physical
or chemical basis for the predictions.
SEQC provides alternative model forms that are capable of learning from data. Traditional SEQC model
forms such as PM332 have only a handful of parameters, which limits their ability to take advantage of
large volumes of data.33 Replacing these individual parameters with NNs imparts the flexibility to learn from
large volumes of data;34 however, the NNs function as black boxes and so decrease interpretability. Here,
we increase the flexibility of SEQC models so that they can take advantage of larger volumes of data while
retaining a purely physics-based form. This is operationalized using the Density Functional based Tight
Binding (DFTB)35,36 Hamiltonian with model parameters that can be expressed in the Slater-Koster File
(SKF) format.37 DFTB includes only valence electrons and uses a minimal atomic orbital basis. The atomic
orbital energies are constants that can be adjusted during training, and the interactions and overlaps between
atomic orbitals are one-dimensional functions of interatomic distance. We will refer to this as the SKF-DFTB
model form and to our resulting trained models as DFTBML.
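As a concrete, minimal sketch of this model form (with hypothetical names and placeholder numbers, not actual Slater-Koster data), the following Python snippet builds a toy two-orbital Hamiltonian in the SKF-DFTB spirit: on-site energies are constants on the diagonal, and the off-diagonal interaction is interpolated from a tabulated one-dimensional function of interatomic distance.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# On-site atomic orbital energies: constants adjusted during training.
# (Values here are illustrative placeholders, in Hartree.)
onsite_energy = {"H_s": -0.24, "C_s": -0.50}

# A distance-dependent interaction, tabulated on a radial grid as in a
# Slater-Koster file, then interpolated to arbitrary distances.
r_grid = np.linspace(0.5, 10.0, 100)              # Bohr
h_table = -0.3 * np.exp(-0.8 * r_grid)            # placeholder decay curve
h_ss_sigma = CubicSpline(r_grid, h_table)

def two_center_hamiltonian(r):
    """2x2 Hamiltonian for an H s orbital and a C s orbital at distance r."""
    h = np.empty((2, 2))
    h[0, 0] = onsite_energy["H_s"]                # diagonal: orbital energies
    h[1, 1] = onsite_energy["C_s"]
    h[0, 1] = h[1, 0] = h_ss_sigma(r)             # off-diagonal: 1-D function of r
    return h

print(two_center_hamiltonian(2.0))
```

Training DFTBML amounts to adjusting these constants and tabulated curves; the downstream DFTB machinery is unchanged.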
The flexibility of DFTBML lies primarily in the one-dimensional functions. Over the distances present
in typical molecules, the interactions described by these functions vary by hundreds of kcal/mol. Because
the molecular energy arises from many such interactions, changes of a few tenths of a kcal/mol can have
significant effects on the total energy. For the model to learn effectively from data, we need a functional
form with the sensitivity to fine-tune these interactions while preventing oscillations and other non-physical
behaviors. Here, the flexibility and sensitivity are provided through splines, i.e., piecewise polynomials, with a
high polynomial order (five) and a large number of knots (100). To prevent oscillations and other non-physical
behaviors, a strong regularization scheme is developed and implemented in our training of DFTBML.
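To make this concrete, here is a minimal sketch, assuming SciPy's B-spline interface (the actual DFTBML implementation is not shown in this excerpt): a degree-5 spline through 100 knots, with second and third derivatives evaluated on the dense 500-point grid that the regularization penalties of Section 2.2 operate on.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# A distance-dependent interaction represented as a degree-5 spline
# with 100 knots (placeholder curve; real values come from training).
r_knots = np.linspace(0.6, 7.0, 100)               # interatomic distance grid
values = -0.4 * np.exp(-1.1 * r_knots)             # placeholder interaction

spline = make_interp_spline(r_knots, values, k=5)  # quintic: continuous 3rd derivative

# Dense grid on which the derivative-based penalties are evaluated.
r_dense = np.linspace(r_knots[0], r_knots[-1], 500)
d2 = spline.derivative(2)(r_dense)                 # used by the "convex" penalty
d3 = spline.derivative(3)(r_dense)                 # used by the smoothing penalty
```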
The DFTBML models explored here are trained to the ANI-1CCX dataset,23 which includes results
from a number of different ab initio methods on organic molecules comprised of C, N, O and H. The
DFTBML models can reproduce the predictions of CCSD(T)*/CBS to about 3 kcal/mol, which is comparable
to the accuracy of DFT (see Figure 1). We also show that 20000 molecular configurations are sufficient
to train the model. This saturation of performance with increasing data suggests that the accuracy is
limited by the SKF-DFTB model form itself, not by the amount of training data. The data requirements of
DFTBML are considerably below the 1M data points typically used to train deep learning models, which
is significant given that the generation of ab initio training data is a primary computational bottleneck in
model development. This opens the possibility of using trained SEQC models as replacements for DFT,
substantially reducing computational cost without, as in traditional SEQC models, sacrificing accuracy or,
as in many ML models, sacrificing interpretability.
2 Results and discussion
2.1 Experimental design
To explore the performance of DFTBML, we train the model under various conditions. To aid comparisons,
it is useful to introduce a standard notation for the resulting parameter sets. To evaluate the generalization
of the DFTBML models, we consider both near- and far-transfer, which differ in whether the model is
applied to systems larger than those in its training data. For near-transfer, where the training and testing
data contain systems with 1 - 8 heavy atoms, we use “DFTBML” followed by the energy target (DFT for
ωB97x/def2-TZVPP; CC for CCSD(T)*/CBS) and the number of configurations in the training set, e.g.
“DFTBML CC 20000”. For far-transfer, where the training data has molecules with 1 - 5 heavy atoms while
the test data has molecules with 6 - 8 heavy atoms, we use “Transfer” as the prefix, e.g. “Transfer CC 20000”. We also consider results obtained when only a short-range repulsive potential is trained to the data, with the electronic parameters being those of Auorg.38 For these models, we use “Repulsive” as a prefix, e.g. “Repulsive CC 20000”.

Figure 1: Comparison of different quantum chemistry methods on atomization energies (see Equation 1 in Section 4). The heatmap is generated from the 230k molecular configurations in the ANI-1CCX dataset with up to eight heavy atoms, after removing configurations with incomplete entries. The DFTBML-CC/DFT parameterizations were trained to CCSD(T)*/CBS or ωB97x/def2-TZVPP energies, respectively, on 20000 molecules with up to eight heavy atoms. DFTBML improves substantially on currently published DFTB parameters (MIO35 and Auorg38), with the agreement between DFTBML-CC and CCSD(T)*/CBS being somewhat better than that between DFT (ωB97x/def2-TZVPP) and CCSD(T)*/CBS.
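Equation 1 itself appears in Section 4, outside this excerpt; we assume it follows the conventional definition of the atomization energy (up to sign convention), with $E_a$ the energy of isolated atom $a$:

$$\Delta E_{\mathrm{at}} = \sum_{a \in \mathrm{molecule}} E_a \; - \; E_{\mathrm{molecule}}$$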
2.2 Effects of regularization on model performance
A challenge with developing the DFTBML model was creating an effective regularization scheme that would
prevent overfitting without degrading model performance by being too restrictive. Without regularization,
the resulting functions show highly oscillatory behavior (left column of Figure 2). Previous work34,39 pe-
nalized deviations from a set of physically-derived reference parameters, e.g. deviation from the Auorg
parameter set of DFTB. This approach to regularization is problematic because it may overly bias the
training towards the reference parameters and does not prevent non-physical behaviors such as oscillation
of a trained function around the smooth form of the reference function.39 A commonly used approach for
smoothing splines applies a penalty to the magnitude of the second derivative.40,41 However, for DFTBML,
such a smoothing penalty substantially degrades performance of the models because there is no reason to
expect the second derivative to have a limited magnitude.
We instead adapt an approach from Akshay et al.,42 which is motivated by the shape of the functions
in reference parameter sets, such as those of Auorg in Figure 2. For the Hamiltonian (H1) matrix elements,
the functions decay smoothly to zero and have an upward curvature. To enforce this behavior, we apply
a “convex” penalty that constrains the second derivative of the trained potentials, evaluated on a dense grid
of 500 points, to have a physically motivated sign. For overlaps (S), there can also be an inflection point
associated with nodes in the atomic orbitals (upper panels of Figure 2). We therefore extend the convex
penalty to allow a single inflection point, whose location is optimized during training. The results indicate
that, although inclusion of an inflection point improves model performance, the results are not sensitive to
its precise location (see Section S12.3 of the Supporting Information). The magnitude of the weighting factor
for these convex penalties does not require fine tuning beyond being large enough to prevent violations of the
constraints without being so large that it leads to numerical instabilities in gradient descent optimization.
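A plausible minimal form for such a penalty, assuming it is a hinge-style loss on the grid-sampled second derivative (the exact expression used in DFTBML may differ), is sketched below; `sign` and `weight` are hypothetical parameter names.

```python
import numpy as np

def convex_penalty(d2_on_grid, sign=1.0, weight=100.0):
    """Penalize grid points where the second derivative has the wrong sign
    (sign=+1 enforces upward curvature). Zero when the constraint holds."""
    violations = np.clip(-sign * d2_on_grid, 0.0, None)
    return weight * np.sum(violations ** 2)

def convex_penalty_with_inflection(d2_on_grid, r_grid, r0, weight=100.0):
    """Variant for overlaps: the required curvature sign flips across a
    single inflection point r0, whose location is optimized during training."""
    signs = np.where(r_grid < r0, -1.0, 1.0)       # illustrative sign pattern
    violations = np.clip(-signs * d2_on_grid, 0.0, None)
    return weight * np.sum(violations ** 2)
```

Consistent with the text, `weight` in this sketch only needs to be large enough to suppress violations; it is not a finely tuned hyperparameter.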
The convex penalty successfully removes oscillatory behavior (middle column of Figure 2). However, the
resulting functions exhibit non-physical, piecewise-linear behavior, which is more pronounced in the overlap
integrals but also present in the Hamiltonian matrix elements (see inset in Figure 2).
To remove this piecewise-linear behavior, we apply a “smoothing” penalty to the third derivative, based
on the sum of squares of the third derivative evaluated on a grid of 500 points. Our use of a fifth-order spline
for H1 and S is motivated by the high order needed for the spline to have a continuous third derivative. The
magnitude of the penalty is adjusted to remove the piecewise-linear behavior while minimizing degradation
of the model performance (see Section S12.4 of the Supporting Information). The short-range repulsion (R)
does not exhibit piecewise-linear behavior, so a smoothing penalty is not applied and we use a third-order
spline for R.
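The smoothing penalty admits an equally short sketch, assuming it is simply the sum of squared third derivatives on the same 500-point grid (the weight value is hypothetical):

```python
import numpy as np

def smoothing_penalty(d3_on_grid, weight=1e-2):
    """Sum of squared third derivatives on the dense grid: penalizes
    piecewise-linear 'kinks' without bounding the curvature itself."""
    return weight * np.sum(d3_on_grid ** 2)

# Illustrative composite loss combining the data term with both penalties:
# loss = energy_mse + convex_penalty(d2) + smoothing_penalty(d3)
```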
It is somewhat surprising, given the highly non-physical behavior observed without regularization, that
the effects of regularization on model performance are not more dramatic (Table 1). For near-transfer, the
performance of the unregularized model (4.97 kcal/mol) is a factor of two better than the Auorg reference
model (10.55 kcal/mol). This is despite the highly oscillatory behavior of the functions and the fact that the
test data and training data have molecules with disjoint empirical formulas. This suggests coupling between