A basic electro-topological descriptor for the prediction of organic molecule geometries by simple machine learning

Carlos Manuel de Armas-Morejón,∗,†,‡,¶ Ask Hjorth Larsen,∗,† Luis A. Montero-Cabrera,∗,¶ Angel Rubio,∗,‡ and Joaquim Jornet-Somoza∗,†,‡

†Nano-Bio Spectroscopy Group and ETSF Scientific Development Centre, Department of Materials Physics, University of the Basque Country, CFM CSIC-UPV/EHU-MPC and DIPC, Tolosa Hiribidea 72, E-20018 Donostia-San Sebastián

‡Theory Department, Max Planck Institute for the Structure and Dynamics of Matter and Center for Free-Electron Laser Science, Luruper Chaussee 149, 22761 Hamburg, Germany

¶Laboratorio de Química Computacional y Teórica, Facultad de Química, Universidad de La Habana, 10400. La Habana, Cuba.

E-mail: carlosdearmasm@gmail.com; asklarsen@gmail.com; lmc@fq.uh.cu; angel.rubio@mpsd.mpg.de; j.jornet.somoza@gmail.com

Abstract

This paper proposes a machine learning (ML) method to predict stable molecular geometries from their chemical composition. The method is useful for generating molecular conformations that can serve as initial geometries, saving time during expensive structure optimizations of large molecules by quantum mechanical calculations. Conformations are found by predicting the local arrangement around each atom in the molecule after training on a database of previously optimized small molecules. The method works by dividing each molecule in the database into minimal building blocks of different types. The algorithm is then trained to predict bond lengths and angles for each type of building block using an electro-topological fingerprint as descriptor. A conformation is then generated by joining the predicted blocks. Our model gives promising results for optimized molecular geometries from the basic knowledge of the chemical formula and connectivity. The method tends to reproduce interatomic distances within test blocks with RMSD under 0.05 Å.

arXiv:2210.10700v1 [cond-mat.mtrl-sci] 19 Oct 2022

1 Introduction

This work addresses the problem of generating reliable conformers of molecules from proposed chemical compositions. Realistic initial bond lengths and angles are essential for efficient geometry optimizations. They are normally the first step of the usual computational workflow, in which the atomic coordinates inside a molecule are systematically varied and the potential energy and forces of the system are calculated in order to find a minimum of the potential energy, which indicates a theoretically optimized conformation. Moreover, approximate and reliable molecular geometries serve many modelling and process simulation purposes, from docking to molecular dynamics.

The function relating all possible variations of the geometry to the potential energy forms a high-dimensional surface, the well-known potential energy surface (PES). All possible conformers of a given compound correspond to minima of the appropriate PES. Two processes are required to obtain a model of the lowest-energy conformer of a given molecule: (1) a procedure to obtain a plausible PES and (2) a method to navigate it in search of minima. Procedures for obtaining a PES and then computing potential energies vary in efficiency depending on several factors, chiefly the number of atoms in the compound. Options range from easily computed empirical force fields based on classical considerations of bodies in a molecule,1,2 through fast semi-empirical quantum mechanical calculations parameterised for specific scenarios,3 to more general and reliable but computationally expensive ab-initio and DFT calculations.4 The method used to navigate the PES can also be computationally expensive, depending on how fast the global minimum can be found.5 Clearly, if the initial guess conformation is close to the final optimized structure, the corresponding PES minimum will be reached faster. This is crucial for accurate geometry optimizations of large molecules.

Machine learning (ML) has recently been used in a variety of topics within the field of quantum mechanics.6–37 For molecular geometry optimizations, progress has been made mostly by using neural networks to parameterise classical force fields.38,39

The proposal made in this work is to use ML, and specifically the Kernel Ridge Regression algorithm, to predict molecular conformations by producing the local arrangement of each atom belonging to a molecule. This can be achieved by using a large and reliable database of optimised small molecules40 as a source for both the training and testing sets. For this purpose, certain molecular structural blocks are characterised and defined by an electro-topological descriptor18,41 from the structures of the previously optimised molecules. Blocks are then reconstructed by applying ML tools. Using ETKDG42–44 (Experimental Torsion Knowledge Distance Geometry), available in RDKit,45 with a new ab-initio torsion angle database, we join all predicted blocks to produce the desired molecular structures with reasonable reliability. The results promise fast and reliable predictions of molecular geometries and conformations from their formulas taken as structural graphs.
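
As a rough illustration of the regression step named above, the following is a minimal Kernel Ridge Regression sketch in pure Python. The Gaussian kernel, the kernel width `sigma`, the regularisation strength `lam`, and the one-dimensional toy data are all assumptions made for this example; they are not taken from the paper. In the setting described here, the inputs would be descriptor vectors and the targets would be geometric quantities of a block.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def krr_fit(X, y, sigma=1.0, lam=1e-6):
    """Return weights alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    K = [[gaussian_kernel(X[i], X[j], sigma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_linear(K, y)

def krr_predict(X_train, alpha, x, sigma=1.0):
    """Predict the target for a new feature vector x."""
    return sum(a * gaussian_kernel(xt, x, sigma) for a, xt in zip(alpha, X_train))

# Toy data (illustrative only): a scalar feature mapped to a bond-length-like value.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [1.09, 1.10, 1.12, 1.15]
alpha = krr_fit(X_train, y_train)
pred = krr_predict(X_train, alpha, [1.0])
```

With a small `lam`, the model nearly interpolates the training targets; larger values trade training accuracy for smoothness.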

2 Learning data

The database used in this work is part of a larger one46,47 of quantum PES minimum geometries of small organic molecules containing up to 8 C, O, N and/or F atoms. The optimised geometries in this database are reported to have been obtained using DFT/B3LYP48 with the 6-31G(2df,p) basis set, a commonly accepted reliable PES. We will refer to this database as 8CONF. The resulting database subset contains 21k molecules.40

To facilitate predictions based on this data, we seek a representation that minimizes the amount of redundant information in the learning set and groups together similar kinds of data.

First, each molecule is split into blocks. A block is the main building unit of our model, and it is characterised by (1) a central atom with more than one bond and (2) the first neighbors of that central atom. Figure 1 shows a block decomposition for an example molecule from the 8CONF database. Note that each atom is normally included in multiple blocks: the block centered on itself as well as each of the blocks centered on its neighboring atoms. Atoms with only one neighbor do not define a block. Blocks therefore have from two to four neighboring atoms in the selected molecular sets, where all atoms belong to the first and second rows of the periodic table.
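
The block definition above can be sketched directly on a connectivity graph. The snippet below is a minimal illustration, not the authors' implementation: the function name `find_blocks`, the dictionary layout, and the 0-based atom numbering for propane are assumptions made for this example.

```python
def find_blocks(symbols, bonds):
    """Return one block per atom that has more than one bonded neighbor.

    symbols: list of element symbols, indexed by atom.
    bonds:   list of (i, j) index pairs describing connectivity.
    """
    neighbors = {i: [] for i in range(len(symbols))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    blocks = []
    for center, nbrs in neighbors.items():
        if len(nbrs) > 1:  # atoms with a single neighbor define no block
            blocks.append({"center": center, "neighbors": sorted(nbrs)})
    return blocks

# Propane, C3H8: atoms 0-2 are carbons, atoms 3-10 are hydrogens
# (0-based numbering chosen for this sketch).
symbols = ["C", "C", "C"] + ["H"] * 8
bonds = [(0, 1), (1, 2),
         (0, 3), (0, 4), (0, 5),
         (1, 6), (1, 7),
         (2, 8), (2, 9), (2, 10)]
blocks = find_blocks(symbols, bonds)
for b in blocks:
    print(symbols[b["center"]], [symbols[n] for n in b["neighbors"]])
```

Each carbon yields one block, and the central carbon appears both as the center of its own block and as a neighbor in the two terminal blocks, which is the redundancy used later to join blocks.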

Figure 1: Block decomposition for the molecule C3H8. Each C atom is bonded to multiple atoms and hence defines a block. The molecule can therefore be divided into 3 blocks, two of which (for atoms 1 and 3) belong to the same block-class. Each block includes redundant atoms to indicate where blocks join together.

A unique Cartesian representation is not well suited for predictions because coordinate values depend on the chosen reference center. Instead, we represent local coordinates within a block by (1) subtracting the molecular Cartesian coordinates of the central atom of that block, so that it defines the local coordinate origin, and (2) computing the matrix $B$ of scalar products $\mathbf{a}_i \cdot \mathbf{a}_j$ between each pair $(i, j)$ of local position vectors corresponding to the non-central atoms in the block. This matrix is the feature we use for training and predictions.
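
The two steps above amount to building a Gram matrix of the neighbor vectors. The following sketch shows this under stated assumptions: the function name `gram_matrix` and the water-like coordinates are hypothetical and serve only to illustrate the construction and its invariance.

```python
def gram_matrix(center, neighbors):
    """Matrix B of scalar products between local neighbor position vectors."""
    # Step 1: move the origin to the central atom.
    local = [[p[k] - center[k] for k in range(3)] for p in neighbors]
    # Step 2: scalar products between every pair of local vectors.
    return [[sum(u[k] * v[k] for k in range(3)) for v in local] for u in local]

# Hypothetical O-H-H block (coordinates in angstrom, illustrative only).
center = [0.0, 0.0, 0.0]
neighbors = [[0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
B = gram_matrix(center, neighbors)

# Translating the whole block leaves B unchanged.
shift = [1.5, -2.0, 0.7]
B_shifted = gram_matrix([c + s for c, s in zip(center, shift)],
                        [[p[k] + shift[k] for k in range(3)] for p in neighbors])

# Rotating the block by 90 degrees about z, (x, y, z) -> (-y, x, z), also leaves B unchanged.
B_rotated = gram_matrix(center, [[-p[1], p[0], p[2]] for p in neighbors])
```

Because only relative positions enter the scalar products, the same block yields the same matrix wherever it sits in the molecule.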

The matrix $B$ of scalar products is symmetric and at most $4 \times 4$ in size, and therefore has up to ten unique degrees of freedom. $B$ contains enough information to rebuild the set of molecular Cartesian coordinates for a block (see Appendix 8.1), up to translations, rotations, and chirality, under which the matrix is invariant.
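
To make the count of ten concrete: a symmetric $n \times n$ matrix has $n(n+1)/2$ unique entries, and for a four-neighbor block these correspond to four bond lengths (from the diagonal) and six angles (from the off-diagonal entries). The sketch below extracts both from a given matrix. It is not the reconstruction procedure of Appendix 8.1, and the function names and the ideal tetrahedral test matrix are assumptions for this example.

```python
import math

def lengths_and_angles(B):
    """Bond lengths from the diagonal of B, angles (degrees) from off-diagonals."""
    n = len(B)
    lengths = [math.sqrt(B[i][i]) for i in range(n)]
    angles = {}
    for i in range(n):
        for j in range(i + 1, n):
            cos_ij = B[i][j] / (lengths[i] * lengths[j])
            cos_ij = max(-1.0, min(1.0, cos_ij))  # guard against rounding
            angles[(i, j)] = math.degrees(math.acos(cos_ij))
    return lengths, angles

def count_unique_entries(n):
    """Number of independent entries of a symmetric n x n matrix."""
    return n * (n + 1) // 2

# Ideal tetrahedral block with unit bond lengths: B_ii = 1, B_ij = -1/3.
B = [[1.0 if i == j else -1.0 / 3.0 for j in range(4)] for i in range(4)]
lengths, angles = lengths_and_angles(B)
```

For this idealised input every angle equals the tetrahedral angle, about 109.47 degrees.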

The non-central atoms in a block have no natural ordering. Hence a way must be chosen to assign an index $i$ to each of them with a minimum of ambiguity for ML training and testing. To this end we define an equivalence relation on the set of all blocks, i.e., each block belongs to a single, specific equivalence class or block-class. The ML algorithm is then trained independently for each block-class.

Two blocks belong to the same class if (1) the species of the central atom in each block is the same, (2) the species of each neighboring atom is the same (some ambiguity is resolved by sorting by atomic number) and (3) the arrangement of the atoms in space is the same, i.e. either tetrahedral (TH), triangular (TR) or linear (L).

The definition of block-classes can be applied with different levels of restriction. As a result, the number of different block-classes can vary, as blocks with the same atoms can appear in very different environments.

We denote a block-class by a series of chemical symbols followed, when necessary, by indicators of the spatial configuration. The first symbol is that of the central atom, followed by the symbols of the neighboring atoms ordered with higher atomic numbers first. Figure 2 shows the distribution of blocks in the database.
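
The naming rule can be sketched as a small labelling function. The function name `block_class_label` is hypothetical, and the way the optional geometry tag (TH, TR or L) is appended to the label is an assumption made for this sketch, since the exact format of that indicator is not given here.

```python
# Atomic numbers of the elements present in the database.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9}

def block_class_label(center, neighbors, geometry=None):
    """Central symbol first, then neighbors sorted by decreasing atomic number."""
    ordered = sorted(neighbors, key=lambda s: ATOMIC_NUMBER[s], reverse=True)
    label = "-".join([center] + ordered)
    if geometry is not None:
        label += "-" + geometry  # hypothetical suffix format for the spatial tag
    return label

print(block_class_label("C", ["H", "C", "H", "C"]))
print(block_class_label("O", ["H", "H"]))
```

Sorting the neighbors makes the label independent of the order in which atoms are listed, so two blocks with the same composition always receive the same name.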

[Figure 2 plot: number of blocks in database for each block-class; labeled classes include C-C-C-H-H, C-C-H-H-H, O-H-H and C-O-H-H.]

Figure 2: Distribution of block-classes within all molecules in the database. A few common and uncommon block-classes, such as O-H-H and C-O-H-H, are labeled.

When choosing the definition of block-classes there is a tradeoff. We can make predictions easier by maximizing the amount of chemical knowledge that defines a block. This can be achieved by dividing the blocks into a large number of classes, each of which contains very similar blocks.
