A basic electro-topological descriptor for the prediction of organic molecule geometries by simple machine learning

Carlos Manuel de Armas-Morejón,†,‡,¶ Ask Hjorth Larsen,† Luis A. Montero-Cabrera,¶ Angel Rubio,‡ and Joaquim Jornet-Somoza†,‡
†Nano-Bio Spectroscopy Group and ETSF Scientific Development Centre, Department of Materials Physics, University of the Basque Country, CFM CSIC-UPV/EHU-MPC and DIPC, Tolosa Hiribidea 72, E-20018 Donostia-San Sebastián
‡Theory Department, Max Planck Institute for the Structure and Dynamics of Matter and Center for Free-Electron Laser Science, Luruper Chaussee 149, 22761 Hamburg, Germany
¶Laboratorio de Química Computacional y Teórica, Facultad de Química, Universidad de La Habana, 10400 La Habana, Cuba
E-mail: carlosdearmasm@gmail.com; asklarsen@gmail.com; lmc@fq.uh.cu;
angel.rubio@mpsd.mpg.de; j.jornet.somoza@gmail.com
Abstract
This paper proposes a machine learning (ML) method to predict stable molecular geometries from their chemical composition. The method is useful for generating molecular conformations that can serve as initial geometries, saving time in expensive quantum mechanical structure optimizations of large molecules. Conformations are found by predicting the local arrangement around each atom in the molecule, after training on a database of previously optimized small molecules. The method works by dividing each molecule in the database into minimal building blocks of different types. The algorithm is then trained to predict bond lengths and angles for each type of building block using an electro-topological fingerprint as descriptor. A conformation is then generated by joining the predicted blocks. Our model gives promising results for optimized molecular geometries from basic knowledge of the chemical formula and connectivity. The method tends to reproduce interatomic distances within test blocks with an RMSD under 0.05 Å.
arXiv:2210.10700v1 [cond-mat.mtrl-sci] 19 Oct 2022
1 Introduction
This work addresses the problem of generating reliable conformers of molecules from proposed chemical compositions. Realistic initial bond lengths and angles are essential for efficient geometry optimizations. They are normally the starting point of the usual computational workflow, in which the atomic coordinates of a molecule are varied systematically and the potential energy and forces of the system are calculated in order to find a minimum of the potential energy, which indicates a theoretically optimized conformation. Moreover, approximate and reliable molecular geometries serve many modelling and process simulation purposes, from docking to molecular dynamics.
The function relating all possible variations of the geometry to the potential energy forms a high-dimensional surface, the well-known potential energy surface (PES). All possible conformers of a given compound correspond to minima of the appropriate PES. Two processes are required to obtain a model of the lowest-energy conformer of a given molecule: (1) a procedure to obtain a plausible PES for it and (2) a method to navigate that PES in search of minima. Procedures for obtaining a PES and computing potential energies vary in efficiency depending on several factors, chiefly the number of atoms in the compound. Options range from easily computed empirical force fields based on classical considerations of bodies in a molecule,¹,² through fast semi-empirical quantum mechanical calculations parameterised for specific scenarios,³ to more general and reliable but computationally expensive ab initio and DFT calculations.⁴ Navigating the PES can also be computationally expensive, depending on how fast the global minimum can be found.⁵ Clearly, if the initial guess conformation is close to the final optimized structure, the corresponding PES minimum will be reached faster. This is crucial for accurate geometry optimizations of large molecules.
Machine learning (ML) has recently been used for a variety of topics within the field of quantum mechanics.⁶⁻³⁷ For molecular geometry optimizations, progress has been made mostly by using neural networks to parameterise classical force fields.³⁸,³⁹
This work proposes to use ML, specifically the Kernel Ridge Regression algorithm, to predict molecular conformations by producing the local arrangement of each atom belonging to a molecule. This can be achieved by using a large and reliable database of optimised small molecules⁴⁰ as the source of both the training and testing sets. For this purpose, certain molecular structural blocks are characterised and defined by an electro-topological descriptor¹⁸,⁴¹ from the structures of the previously optimised molecules. Blocks are then reconstructed by applying ML tools. Using ETKDG⁴²⁻⁴⁴ (Experimental Torsion Knowledge Distance Geometry) as implemented in RDKit⁴⁵ with a new ab initio torsion angle database, we join all predicted blocks to produce the desired molecular structures with reasonable reliability. The results promise fast and reliable predictions of molecular geometries and conformations from their formulas taken as structural graphs.
2 Learning data
The database used in this work is part of a larger one⁴⁶,⁴⁷ of quantum PES minimum geometries of small organic molecules containing up to 8 C, O, N and/or F atoms. The optimised geometries in this database are reported to have been obtained using DFT/B3LYP⁴⁸ with the 6-31G(2df,p) basis set, a commonly accepted reliable PES. We will refer to this database as 8CONF. The resulting database subset contains 21k molecules.⁴⁰
To facilitate predictions based on this data, we seek a representation that minimizes the amount of redundant information in the learning set and groups together similar kinds of data.
First, each molecule is split into blocks. A block is the main building unit of our model, and it is characterised by (1) a central atom with more than one bond and (2) the first neighbors of that central atom. Figure 1 shows a block decomposition for an example molecule from the 8CONF database. Note that each atom can normally be included in multiple blocks: the block centered on itself as well as each of the blocks centered on its neighboring atoms. Atoms with only one neighbor do not define a block. Blocks therefore have from two to four neighboring atoms in the selected molecular sets, where all atoms belong to the first and second rows of the periodic table.
Figure 1: Block decomposition for the molecule C3H8 (atoms 1 to 3 are C, atoms 4 to 11 are H). Each C atom is bonded to multiple atoms and hence defines a block. The molecule can therefore be divided in 3 blocks, two of which (for atoms 1 and 3) belong to the same block-class. In each block we have redundant atoms to indicate where blocks join together.
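The block decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_blocks` is hypothetical, the molecule is given as a plain list of bonded atom pairs, and the assignment of hydrogens 4 to 11 to the three carbons is read off the atom labels of Figure 1 rather than taken from the authors' code.

```python
from collections import defaultdict

def find_blocks(bonds):
    """Map each central atom (more than one bond) to its sorted first neighbors."""
    neighbors = defaultdict(set)
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Atoms with a single neighbor (here, the hydrogens) do not define a block.
    return {atom: sorted(nbrs)
            for atom, nbrs in sorted(neighbors.items())
            if len(nbrs) > 1}

# Propane (C3H8): atoms 1-3 are C, atoms 4-11 are H.
# The H-to-C assignment is an assumption based on the labels in Figure 1.
propane_bonds = [(1, 2), (2, 3),
                 (1, 4), (1, 5), (1, 6),
                 (2, 7), (2, 8),
                 (3, 9), (3, 10), (3, 11)]

blocks = find_blocks(propane_bonds)
for center, nbrs in blocks.items():
    print(center, nbrs)
```

Each carbon appears both as the center of its own block and as a neighbor in the adjacent block, which is the redundancy the caption refers to.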
A single Cartesian representation is not well suited for predictions because coordinate values depend on the chosen reference center. Instead, we represent local coordinates within a block by (1) subtracting the molecular Cartesian coordinates of the block's central atom, so that it becomes the local coordinate origin, and (2) computing the matrix B of scalar products $\mathbf{a}_i \cdot \mathbf{a}_j$ between each pair $(i, j)$ of local position vectors corresponding to the non-central atoms in the block. This matrix is the feature we use for training and predictions.
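As a minimal sketch of this construction, the NumPy snippet below builds B for a block and recovers bond lengths and angles from it, using the identities B_ii = |a_i|^2 and B_ij = |a_i||a_j| cos(theta_ij). The function names and the toy coordinates are hypothetical and chosen only for illustration.

```python
import numpy as np

def block_gram_matrix(center, neighbors):
    """Matrix B of scalar products a_i . a_j of the local neighbor vectors."""
    local = np.asarray(neighbors, dtype=float) - np.asarray(center, dtype=float)
    return local @ local.T

def lengths_and_angles(B):
    """Bond lengths |a_i| and pairwise angles (degrees) encoded in B."""
    lengths = np.sqrt(np.diag(B))
    cosines = B / np.outer(lengths, lengths)
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return lengths, angles

# Toy block (made-up coordinates): two neighbors at right angles.
center = [1.0, 1.0, 1.0]
neighbors = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0]]

B = block_gram_matrix(center, neighbors)
lengths, angles = lengths_and_angles(B)

# Shifting the whole block leaves B unchanged.
shift = np.array([5.0, -2.0, 0.5])
B_shifted = block_gram_matrix(np.array(center) + shift,
                              np.array(neighbors) + shift)
print(B, lengths, angles[0, 1], np.allclose(B, B_shifted))
```

Because only differences with respect to the central atom enter, the result does not depend on where the molecule sits in space.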
The matrix B of scalar products is symmetric and at most 4 × 4 in size, and therefore has up to ten unique degrees of freedom. B contains enough information to rebuild the set of molecular Cartesian coordinates for a block (see Appendix 8.1) up to translations, rotations, and chirality, under which the matrix is invariant.
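One standard way to recover coordinates from a matrix of scalar products is an eigendecomposition, sketched below. The appendix is not reproduced in this section, so this should be read as an assumption about a workable procedure, not as the authors' algorithm; the function name `coords_from_gram` is hypothetical.

```python
import numpy as np

def coords_from_gram(B):
    """Return local coordinates X (n x 3) such that X @ X.T reproduces B."""
    B = np.asarray(B, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:3]       # keep the three largest
    vals = np.clip(eigvals[order], 0.0, None)   # guard against tiny negatives
    coords = eigvecs[:, order] * np.sqrt(vals)
    if coords.shape[1] < 3:                     # blocks with two neighbors
        coords = np.pad(coords, ((0, 0), (0, 3 - coords.shape[1])))
    return coords

# Toy tetrahedral block (made-up, unnormalized local vectors).
local = np.array([[1.0, 1.0, 1.0],
                  [1.0, -1.0, -1.0],
                  [-1.0, 1.0, -1.0],
                  [-1.0, -1.0, 1.0]])
B = local @ local.T
rebuilt = coords_from_gram(B)

# The rebuilt block has the same scalar products as the original one.
print(np.allclose(rebuilt @ rebuilt.T, B))
```

The rebuilt coordinates generally differ from the originals by a rotation or a reflection, which is consistent with the invariances stated above.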
The non-central atoms in a block have no natural ordering. Hence a way must be chosen to assign an index $i$ to each of them with a minimum of ambiguity for ML training and testing. To this end we define an equivalence relation on the set of all blocks, i.e., each block belongs to a single, specific equivalence class or block-class. The ML algorithm is then trained independently for each block-class.
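Training one model per block-class can be illustrated with a small closed-form kernel ridge regression in NumPy. Everything specific here is an assumption made for the sketch: the Gaussian kernel, the values of `sigma` and `lam`, the one-dimensional descriptors, and the numerical targets are all made up, and the class and function names are hypothetical.

```python
from collections import defaultdict
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

class KernelRidge:
    """Closed-form kernel ridge regression (assumed kernel and hyperparameters)."""
    def __init__(self, sigma=1.0, lam=1e-8):
        self.sigma = sigma
        self.lam = lam

    def fit(self, X, Y):
        self.X_train = np.asarray(X, dtype=float)
        K = gaussian_kernel(self.X_train, self.X_train, self.sigma)
        # Solve (K + lam * I) alpha = Y for the regression weights.
        self.alpha = np.linalg.solve(K + self.lam * np.eye(len(K)),
                                     np.asarray(Y, dtype=float))
        return self

    def predict(self, X):
        K = gaussian_kernel(np.asarray(X, dtype=float), self.X_train, self.sigma)
        return K @ self.alpha

def train_per_class(samples):
    """Fit one independent model per block-class label."""
    grouped = defaultdict(lambda: ([], []))
    for block_class, descriptor, target in samples:
        grouped[block_class][0].append(descriptor)
        grouped[block_class][1].append(target)
    return {cls: KernelRidge().fit(X, Y) for cls, (X, Y) in grouped.items()}

# Made-up (block_class, descriptor, target) triples for illustration only.
samples = [
    ("O-H-H", [0.0], [0.96]),
    ("O-H-H", [1.0], [0.97]),
    ("O-H-H", [2.0], [0.98]),
    ("C-O-H-H", [0.0], [1.43]),
    ("C-O-H-H", [1.0], [1.42]),
]
models = train_per_class(samples)
print(models["O-H-H"].predict([[1.0]]))
```

In a realistic setting the targets would be the unique entries of B for each block and the descriptors would be the electro-topological fingerprints, with hyperparameters chosen by validation.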
Two blocks belong to the same class if (1) the species of the central atom in each block is the same, (2) the species of each neighboring atom is the same (some ambiguity is resolved by sorting by atomic number) and (3) the arrangement of the atoms in space is the same, i.e. either tetrahedral (TH), triangular (TR) or linear (L).
The definition of block-classes can be applied with different levels of restrictions. As a result,
the number of different block-classes can vary, as blocks with the same atoms can appear in very
different environments.
We denote a block-class by a series of chemical symbols followed by certain indicators of
spatial configurations when necessary. The first symbol is that of the central atom, followed by the
symbols of the neighboring atoms ordered with higher atomic numbers first. Figure 2 shows the
distribution of blocks inside the database.
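The naming convention can be sketched as a short helper. The function name is hypothetical, and the way the spatial indicator is appended (in parentheses after the symbols) is an assumption, since the text does not specify its exact format.

```python
# Atomic numbers of the elements present in the 8CONF database.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9}

def block_class_label(central, neighbors, geometry=None):
    """Central symbol first, then neighbors ordered by decreasing atomic number."""
    ordered = sorted(neighbors, key=lambda s: ATOMIC_NUMBER[s], reverse=True)
    label = "-".join([central] + ordered)
    # The suffix format for TH/TR/L is an assumption made for this sketch.
    return f"{label} ({geometry})" if geometry else label

print(block_class_label("C", ["H", "C", "H", "C"]))
print(block_class_label("O", ["H", "H"]))
print(block_class_label("C", ["H", "O", "H"], geometry="TR"))
```

Sorting the neighbors before joining them means that two blocks with the same atoms listed in a different order receive the same label.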
Figure 2: Distribution of block-classes within all molecules in the database (axis: number of blocks in database). A few common and uncommon block-classes, such as O-H-H and C-O-H-H, are labelled; the plot also labels C-C-C-H-H and C-C-H-H-H.
When choosing the definition of block-classes there is a tradeoff: We can make predictions
easier by maximizing the amount of chemical knowledge which defines a block. It can be achieved
by dividing the blocks into a large number of classes each of which contains very similar blocks.