
CHEMALGEBRA: ALGEBRAIC REASONING ON CHEMICAL REACTIONS

Andrea Valenti

Department of Computer Science

University of Pisa

Pisa, Italy

andrea.valenti@phd.unipi.it

Davide Bacciu

Department of Computer Science

University of Pisa

Pisa, Italy

davide.bacciu@unipi.it

Antonio Vergari

School of Informatics

University of Edinburgh

Edinburgh, Scotland

avergari@ed.ac.uk

ABSTRACT

While showing impressive performance on various kinds of learning tasks, it is yet unclear whether deep learning models have the ability to robustly tackle reasoning tasks. Measuring the robustness of reasoning in machine learning models is challenging as one needs to provide a task that cannot be easily shortcut by exploiting spurious statistical correlations in the data, while operating on complex objects and constraints. To address this issue, we propose CHEMALGEBRA, a benchmark for measuring the reasoning capabilities of deep learning models through the prediction of stoichiometrically-balanced chemical reactions. CHEMALGEBRA requires manipulating sets of complex discrete objects – molecules represented as formulas or graphs – under algebraic constraints such as the mass preservation principle. We believe that CHEMALGEBRA can serve as a useful test bed for the next generation of machine reasoning models and as a promoter of their development.

1 INTRODUCTION

Deep learning models, and Transformer architectures in particular, currently achieve the state-of-the-art for a number of application domains such as natural language and audio processing, computer vision, and computational chemistry (Lin et al., 2021; Khan et al., 2021; Brașoveanu & Andonie, 2020). Given enough data and enough parameters to fit, these models are able to learn intricate correlations (Brown et al., 2020). This impressive performance on machine learning tasks suggests that they could be suitable candidates for machine reasoning tasks (Helwe et al., 2021).

Reasoning is the ability to manipulate a knowledge representation into a form that is more suitable

to solve a new problem (Bottou,2014;Garcez et al.,2019). In particular, algebraic reasoning

includes a set of reasoning manipulations such as abstraction, arithmetic operations, and systematic

composition over complex objects. Algebraic reasoning is related to the ability of a learning system

to perform systematic generalization (Marcus,2003;Bahdanau et al.,2018;Sinha et al.,2019), i.e. to

robustly make predictions beyond the data distribution it has been trained on. This is inherently more

challenging than discovering correlations from data, as it requires the learning system to actually

capture the true underlying mechanism for the speciﬁc task (Pearl,2009;Marcus,2018).

Lately, much attention has been put on training Transformers to learn how to reason (Helwe et al., 2021; Al-Negheimish et al., 2021; Storks et al., 2019; Gontier et al., 2020). This is usually done by embedding an algebraic reasoning problem in a natural language formulation. Natural language, despite its flexibility, is imprecise and prone to shortcuts (Geirhos et al., 2020). As a result, it is often difficult to determine whether the models' performance on reasoning tasks is genuine or merely due to the exploitation of spurious statistical correlations in the data. Several works in this direction (Agrawal et al., 2016; Jia & Liang, 2017; Helwe et al., 2021) suggest that the latter is probably the case.

arXiv:2210.02095v1 [cs.LG] 5 Oct 2022

In order to effectively assess the reasoning capabilities of deep learning models, we need to carefully design tasks that i) operate on complex objects, ii) require algebraic reasoning to be carried out, and iii) cannot be shortcut by exploiting latent correlations in the data. We identify chemical reaction prediction as a suitable candidate for these desiderata. First, chemical reactions can be naturally interpreted as transformations over bags of complex objects: reactant molecules are turned into product molecules by manipulating their graph structures while abiding by certain constraints such as the law of mass conservation. Second, these transformations can be analysed as algebraic operations over (sub-)graphs (e.g., by observing bonds forming and dissolving (Bradshaw et al., 2019)), and balancing them to satisfy mass conservation can be formalised as solving a linear system of equations, as we will show in Section 2. Third, the language of chemical molecules and reactions is much less ambiguous than natural language, and by controlling the stoichiometric coefficients, i.e., the molecule multiplicities, at training and test time we can more precisely measure systematic generalization. Lastly, Transformers already excel at learning reaction predictions (Tetko et al., 2020; Irwin et al., 2022). Therefore, we think this can be a solid test bed to measure the current gap between learning and reasoning capabilities of modern deep learning models.

The main contributions of this paper are the following:

• We cast chemical reaction prediction as a reasoning task where the learner has not only to predict a set of products but also correct stoichiometric coefficient variations (Section 2).

• We evaluate the current state-of-the-art Transformers for chemical reaction predictions, showing that they fail to robustly generalise when reasoning on simple variants of the chemical reaction dataset they have been trained on (Section 3).

• We introduce CHEMALGEBRA as a novel challenging benchmark for machine reasoning, in which we can more precisely measure the ability of deep learning models to algebraically reason over bags of graphs in in-, cross- and out-of-distribution settings (Section 4).

2 PREDICTING CHEMICAL REACTIONS AS ALGEBRAIC REASONING

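To illustrate our point, let us consider the Sabatier reaction: it yields methane ($\mathrm{CH_4}$) and water ($\mathrm{H_2O}$) out of hydrogen ($\mathrm{H_2}$) and carbon dioxide ($\mathrm{CO_2}$), in the presence of nickel ($\mathrm{Ni}$) as a catalyst. In chemical formulas:

$$1\,\mathrm{CO_2} + 4\,\mathrm{H_2} \xrightarrow{\mathrm{Ni}} 1\,\mathrm{CH_4} + 2\,\mathrm{H_2O} \tag{1}$$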

where formulas encode complex graph structures in which atoms are nodes and chemical bonds are edges.

A reaction prediction learning task hence consists of outputting a bag of graphs, the products (right

hand side), given the bag of graphs consisting of reactants (left hand side) and reagents (i.e. the

catalysts, over the reaction’s arrow).

The multiplicities of the molecules, also called their stoichiometric coefficients, express the relative proportions of reactants needed to yield a certain proportion of products. For example, one needs a 4:1 ratio of hydrogen molecules to carbon dioxide molecules to produce a 1:2 ratio of methane to water. A reaction is (mass) balanced when its stoichiometric coefficients are chosen such that, for every element, the total number of atoms across the products is the same as that across the reactants, i.e., it satisfies the principle of mass conservation (Whitaker, 1975). Unbalanced reactions, on the other hand, would be chemically implausible.
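The mass-balance condition can be checked mechanically by counting atoms on each side. The following is a minimal sketch, not part of the benchmark code: the `COMPOSITION` table and the helper names `atom_totals` and `is_balanced` are assumptions introduced here for illustration, and molecules are identified by their formula strings.

```python
from collections import Counter

# Element counts per molecule, written out by hand for this example
# (hypothetical lookup table, not taken from the paper).
COMPOSITION = {
    "CO2": Counter(C=1, O=2),
    "H2": Counter(H=2),
    "CH4": Counter(C=1, H=4),
    "H2O": Counter(H=2, O=1),
}


def atom_totals(side):
    """Sum element counts over a bag of molecules given as {formula: coefficient}."""
    totals = Counter()
    for molecule, coefficient in side.items():
        for element, count in COMPOSITION[molecule].items():
            totals[element] += coefficient * count
    return totals


def is_balanced(reactants, products):
    """A reaction is mass balanced when both sides contain the same atoms."""
    return atom_totals(reactants) == atom_totals(products)


# Sabatier reaction with the coefficients of Eq. (1): balanced.
print(is_balanced({"CO2": 1, "H2": 4}, {"CH4": 1, "H2O": 2}))
# Same molecules with all coefficients set to one: not balanced.
print(is_balanced({"CO2": 1, "H2": 1}, {"CH4": 1, "H2O": 1}))
```

The catalyst is left out of the check because it appears unchanged on both sides and therefore never affects the balance.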

This constraint over the atoms of the molecules reflects the true chemical mechanism behind reactions: bonds between atoms break and form under certain conditions but atoms do not change. In reasoning terms, this is a symbol-manipulating process where bags of graphs are deconstructed into other bags of graphs. A machine reasoning system that had learned this true chemical mechanism would be able to perfectly solve the chemical reaction prediction task for all balanced reactions and for all possible variations of stoichiometric coefficients.

1 An extended overview of the related works in chemical reaction prediction is given in Appendix A.

As humans, we can balance fairly complex chemical reactions quite easily. For machines, this process can be formalised as finding a solution of a potentially underdetermined system of linear equations. For example, we can write the Sabatier reaction as:

$$r_1 \cdot \mathrm{CO_2} + r_2 \cdot \mathrm{H_2} + r_3 \cdot \mathrm{Ni} = p_1 \cdot \mathrm{CH_4} + p_2 \cdot \mathrm{H_2O} + p_3 \cdot \mathrm{Ni} \tag{2}$$

where the variables $r_1, r_2, r_3, p_1, p_2, p_3$ represent the stoichiometric coefficients of the molecules they refer to. Molecules of reagents act as a special kind of confounders: since they are not changed during the reaction, they must appear on both sides of the equation with the same coefficient (i.e., $r_3 = p_3$). Under this rewriting, it becomes even more evident how a full chemical reaction can be interpreted as an algebraic equation where stoichiometric coefficients are the unknown variables.

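Then, we can represent the molecule of $\mathrm{CO_2}$ with the vector $[1, 0, 2, 0]$, indicating one atom of carbon, zero hydrogens, two oxygens, and zero nickels. Analogously, $\mathrm{H_2}$ can be encoded as $[0, 2, 0, 0]$, and so on. Therefore, Eq. (1) can be rewritten as the following linear system:

$$
\begin{bmatrix}
1 & 0 & 0 \\
0 & 2 & 0 \\
2 & 0 & 0 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 \\
4 & 2 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} p_1 \\ p_2 \\ p_3 \end{bmatrix}. \tag{3}
$$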

It is straightforward to verify that the minimum-norm positive-integer solution of Eq. (3) is $r = [1, 4, 1]$, $p = [1, 2, 1]$, thus conforming to Eq. (1). We now devise a set of reasoning tasks exploiting this perspective.
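The linear-system view can be made concrete with a small solver. The sketch below is an illustration under stated assumptions rather than the procedure used in the paper: the function name `balance` is hypothetical, it uses exact Gauss-Jordan elimination over rationals, and it assumes the reaction is balanced up to a single free scale factor. For that reason the catalyst is omitted, since $r_3 = p_3$ would add a second free parameter. Molecules use the same (C, H, O, Ni) composition vectors as above.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def balance(reactant_vectors, product_vectors):
    """Return the smallest positive integer coefficients (r, p).

    Solves sum_i r_i * R_i = sum_j p_j * P_j, where each vector lists the
    atom counts of one molecule. Assumes exactly one free scale factor.
    """
    n_r = len(reactant_vectors)
    # Move the products to the left-hand side: [R | -P] x = 0.
    columns = list(reactant_vectors) + [[-x for x in v] for v in product_vectors]
    n_cols = len(columns)
    # One row per element, one column per molecule.
    rows = [[Fraction(col[e]) for col in columns] for e in range(len(columns[0]))]

    pivots = []
    for c in range(n_cols):
        r = len(pivots)
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it is a free variable
        rows[r], rows[pivot] = rows[pivot], rows[r]
        pivot_value = rows[r][c]
        rows[r] = [x / pivot_value for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)

    free = [c for c in range(n_cols) if c not in pivots]
    if len(free) != 1:
        raise ValueError("expected exactly one free coefficient")

    # Set the free coefficient to 1 and read off the others.
    solution = [Fraction(0)] * n_cols
    solution[free[0]] = Fraction(1)
    for row, c in zip(rows, pivots):
        solution[c] = -row[free[0]]

    # Scale by the least common multiple of the denominators.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (s.denominator for s in solution), 1)
    coefficients = [int(s * scale) for s in solution]
    if any(x <= 0 for x in coefficients):
        raise ValueError("no strictly positive solution")
    return coefficients[:n_r], coefficients[n_r:]


# Composition vectors in (C, H, O, Ni) order.
CO2, H2 = [1, 0, 2, 0], [0, 2, 0, 0]
CH4, H2O = [1, 4, 0, 0], [0, 2, 1, 0]
print(balance([CO2, H2], [CH4, H2O]))
```

For the Sabatier reaction this recovers the coefficients of Eq. (1); the catalyst coefficient can then be set to any common value on both sides.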

2.1 “TYPE 1” AND “TYPE 2” CHEMICAL REACTION REASONING TASKS

We propose to cast chemical reaction prediction as a reasoning task, where the model has to predict

both the graphs corresponding to the product molecules (i.e. the right hand side of the reaction) and

the exact multiplicities of such molecules, given a particular input consisting of reagents and reactants

equipped with varying stoichiometric coefﬁcients. This is in stark contrast with the vanilla reaction

prediction learning setting: in it, models are trained on reactions without stoichiometric information,

largely on unbalanced reactions (see Section 3.2); in our setting, stoichiometric coefﬁcients are not

only present but, at prediction time, can greatly differ from those the model has seen at training time.

By doing so, we can better control and measure systematic generalization.

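For example, given a reference reaction, we can multiply all the stoichiometric coefficients in it by a factor. For the Sabatier reaction, we would obtain the following input-output pair for a factor of two:

$$\underbrace{2\,\mathrm{CO_2} + 8\,\mathrm{H_2} + 2\,\mathrm{Ni}}_{\text{INPUT}} \rightarrow \underbrace{2\,\mathrm{CH_4} + 4\,\mathrm{H_2O} + 2\,\mathrm{Ni}}_{\text{OUTPUT}}. \tag{4}$$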

In the rest of this paper, we will refer to this type of reasoning task as a Type 1 task. Alternatively, we can just add a certain number of molecules on the left hand side. Some of these additional molecules might not take part in the reaction, since there might not be enough reactants to bond with. For example, if we add two $\mathrm{CO_2}$ and two $\mathrm{Ni}$ to the Sabatier reaction, we expect a model to predict them on the right hand side in addition to its usual outputs:

$$\underbrace{3\,\mathrm{CO_2} + 4\,\mathrm{H_2} + 3\,\mathrm{Ni}}_{\text{INPUT}} \rightarrow \underbrace{\mathrm{CH_4} + 2\,\mathrm{H_2O} + 3\,\mathrm{Ni} + 2\,\mathrm{CO_2}}_{\text{OUTPUT}}. \tag{5}$$

We refer to this as a Type 2 task. Type 2 reactions are harder to reason with than Type 1, but should still be easy for a machine learning model that has learned the true underlying chemical mechanism.
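Both task types can be described by one rule: fire the reference reaction as many times as the input bag allows, and copy everything else to the output. The sketch below illustrates that rule under simplifying assumptions (a single known reference reaction, molecules identified by formula strings, reagents passed through unchanged); the function name `expected_output` is hypothetical and not part of the benchmark code.

```python
from collections import Counter


def expected_output(input_bag, reactants, products):
    """Expected right-hand side for an input bag of molecules.

    `reactants` and `products` hold the balanced reference coefficients.
    Anything in `input_bag` that is not consumed (catalysts, excess
    reactants) is copied to the output unchanged.
    """
    bag = Counter(input_bag)
    # Number of times the reference reaction can take place.
    times = min(bag[m] // n for m, n in reactants.items())
    for molecule, n in reactants.items():
        bag[molecule] -= times * n
    for molecule, n in products.items():
        bag[molecule] += times * n
    return +bag  # unary plus drops molecules whose count fell to zero


SABATIER_IN = {"CO2": 1, "H2": 4}
SABATIER_OUT = {"CH4": 1, "H2O": 2}

# Type 1: all coefficients doubled, as in Eq. (4).
print(expected_output({"CO2": 2, "H2": 8, "Ni": 2}, SABATIER_IN, SABATIER_OUT))
# Type 2: two extra CO2 and two extra Ni, as in Eq. (5).
print(expected_output({"CO2": 3, "H2": 4, "Ni": 3}, SABATIER_IN, SABATIER_OUT))
```

In the Type 2 case hydrogen is the limiting reactant, so the reaction fires once and the two surplus carbon dioxide molecules reappear in the output.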

There is one final aspect to consider: if the test coefficients are sampled from the same data distribution as the training coefficients, there is still a chance that the model would be able to predict them just by pattern matching. Conversely, learning the actual algebraic reasoning behind the stoichiometry of chemical reactions, e.g., by solving the linear system of Eq. (3), empowers the model to solve any stoichiometry problem, regardless of the actual values of the coefficients observed during training.

2 As humans, we learn to balance reactions from very few examples, e.g. a handful of reactions taken from chemistry textbooks in high school. Without following an explicit algorithm, we can usually perform balancing in an intuitive way, by leveraging our quick arithmetic skills to count the atoms for an element and match the numbers on both sides of the equation, iteratively changing the stoichiometric coefficients until all elements are balanced.

Table 1: Statistics of the training, validation and test splits of the USPTO dataset before and after our rebalancing. Percentages in parentheses are w.r.t. the original dataset.

SPLIT            TOTAL    BALANCED (%)    RE-BALANCED (%)    USPTO-BAL (%)
TRAINING SET     409035   18815 (4.59)    162220 (39.66)     181035 (44.26)
VALIDATION SET   30000    1292 (4.31)     11868 (39.56)      13160 (43.87)
TEST SET         40000    1809 (4.52)     15973 (39.93)      17782 (44.45)

To control for these aspects, in addition to the usual in-distribution scenario where both training and test coefficients are sampled from the same set of integer numbers $S_{\text{in}}$, we also consider an out-of-distribution setting where the training and test reactions are instantiated with coefficients coming from the disjoint sets of integers $S_{\text{in}}$ and $S_{\text{out}}$, respectively. Analogously, we can test a cross-distribution scenario, where the training set is divided into two halves. The reactions of the first half are instantiated with coefficients selected from $S_{\text{in}}$, while the second half's coefficients are selected from $S_{\text{out}}$. The cross-distribution test set will contain the same reactions as the training set but with "swapped stoichiometry": that is, the first half of the test set will have coefficients from $S_{\text{out}}$ and the second half from $S_{\text{in}}$. We argue that a well-trained reasoning model should have no problem semantically disentangling the stoichiometric coefficients from the molecules, thus achieving the same performance in the in-distribution, cross-distribution, and out-of-distribution scenarios.
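The three scenarios differ only in which set each reaction's coefficients are drawn from. The following sketch shows one way to assign them; it is an assumption-laden illustration, not the benchmark's generation code. The concrete contents of `S_IN` and `S_OUT`, the function name `assign_factors`, and the choice to draw one multiplicative factor per reaction are all hypothetical.

```python
import random

# Hypothetical disjoint coefficient sets; the paper's actual sets may differ.
S_IN = [1, 2, 3, 4]
S_OUT = [5, 6, 7, 8]


def assign_factors(n_reactions, scenario, rng):
    """Return (train_factors, test_factors), one factor per reaction.

    Index i refers to the same underlying reaction in both lists.
    """
    def draw(pool, k):
        return [rng.choice(pool) for _ in range(k)]

    half = n_reactions // 2
    if scenario == "in":
        return draw(S_IN, n_reactions), draw(S_IN, n_reactions)
    if scenario == "out":
        return draw(S_IN, n_reactions), draw(S_OUT, n_reactions)
    if scenario == "cross":
        # First half trained on S_IN and tested on S_OUT, second half swapped.
        train = draw(S_IN, half) + draw(S_OUT, n_reactions - half)
        test = draw(S_OUT, half) + draw(S_IN, n_reactions - half)
        return train, test
    raise ValueError(f"unknown scenario: {scenario}")


train, test = assign_factors(6, "cross", random.Random(0))
print(train, test)
```

In the cross-distribution case every coefficient value is seen during training, but never paired with the same reaction at test time, which is what separates it from the out-of-distribution case.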

3 CAN TRANSFORMERS PERFORM ALGEBRAIC REASONING?

In this section, we question whether current deep learning models can effectively solve the algebraic

reasoning tasks induced by the stoichiometry of chemical reactions, as discussed in Section 2. In

order to do so, we ﬁrst need to build a suitable dataset of chemical reactions that can be processed by

state-of-the-art models for reaction predictions. We then evaluate these models on some meaningful

variations of the task and discuss their limitations.

3.1 HOW TO BUILD A BENCHMARK OF BALANCED REACTIONS

A natural candidate is the USPTO dataset (Lowe, 2012), a large collection of chemical reactions

extracted from US patents from 1976 to 2016 and represented in the SMILES format (Weininger,

1988). SMILES reactions are text strings encoding a linearization of the molecular graphs of each

molecule participating in the reaction.

We employ the version popularized by Jin et al. (2017),

often referred to as USPTO-MIT. This version contains a subset of polished reactions, obtained

by removing duplicate and incorrect ones. For simplicity, in the rest of this paper we will refer

to “USPTO-MIT” as simply “USPTO”. This dataset is composed of 409k training reactions, 30k

validation reactions and 40k test reactions (see Table 1).

Unfortunately, reactions in the USPTO dataset are mostly unbalanced: we found that less than 4.6%

of the reactions in the dataset are mass balanced and readily usable for our tasks. For the remaining

reactions, in many cases only the major product of the reaction was recorded, while disregarding

minor byproducts, such as $\mathrm{H_2O}$ or $\mathrm{HCl}$, probably not deemed interesting from a patent perspective.

Examples of these reactions are illustrated in Fig. 1. While this does not impact patents, it can affect

the generalization capability of the learned models, as we will show next.

Fortunately, the missing byproducts can sometimes be deduced from the unbalanced reaction: if the

set of missing atoms of one side of the reaction is enough to unambiguously reconstruct one (or more)

valid molecules on the other side, we add such reconstructed molecules to that side of the reaction.

The rightmost column in Table 1 shows the number of reactions contained in the balanced subset of

USPTO that we used as the basis for the augmentation procedures described in Section 4. Using this

strategy, we were able to re-balance about 44% of the original reactions for the training, validation

and test sets. We denote this re-balanced version of USPTO as USPTO-BAL and use it next.
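The re-balancing idea can be illustrated at the level of molecular formulas. The sketch below is a simplification and not the authors' procedure: it works on formula strings rather than SMILES, it only tries to explain the missing atoms with a single molecule from a small candidate list, and the names `BYPRODUCTS`, `parse_formula`, `side_atoms` and `infer_byproduct` are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no brackets or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_atoms(formulas):
    """Total atom counts over a list of formulas."""
    total = Counter()
    for formula in formulas:
        total += parse_formula(formula)
    return total


# Hypothetical candidate byproducts; a real list would be longer.
BYPRODUCTS = {name: parse_formula(name) for name in ("H2O", "HCl")}


def infer_byproduct(reactants, products):
    """Return the single candidate that accounts for the missing atoms, else None."""
    left, right = side_atoms(reactants), side_atoms(products)
    if right - left:
        return None  # products contain atoms the reactants cannot supply
    missing = left - right  # Counter subtraction keeps positive counts only
    matches = [name for name, atoms in BYPRODUCTS.items() if atoms == missing]
    return matches[0] if len(matches) == 1 else None


# Acetic acid + ethanol -> ethyl acetate, recorded without its byproduct.
print(infer_byproduct(["C2H4O2", "C2H6O"], ["C4H8O2"]))
# Acetyl chloride + methylamine -> N-methylacetamide.
print(infer_byproduct(["C2H3ClO", "CH5N"], ["C3H7NO"]))
```

Reactions whose missing atoms match no candidate, or that are already balanced, return `None` and would be left untouched by such a procedure.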

3 Note that SMILES strings represent hydrogen implicitly, when the resulting molecule is not ambiguous.

Figure 1: Examples of re-balanced reactions of the USPTO-BAL dataset. The inferred byproducts are

shown between brackets (the reagents have been omitted for simplicity).

Table 2: Results of state-of-the-art Transformers for reaction prediction on the USPTO dataset and the BAL, T1 and T2 variations. We report the top-1 accuracy (ACC) for all models. We also report the at-least-one (ALO) accuracy for the T1 and T2 variants. All metrics are reported as percentages.

                                       USPTO   USPTO-BAL    USPTO-T1                USPTO-T2
STATE-OF-THE-ART MODEL                 ACC     ACC          ACC          ALO        ACC          ALO
MOLTRANS. (SCHWALLER ET AL., 2019)     90.4    1.39         0.04         56.66      0.03         28.32
CHEMFORMER (IRWIN ET AL., 2022)        92.8    $<10^{-2}$   0.23         34.41      $<10^{-2}$   2.61
G2S (DGCN) (TU & COLEY, 2021)          90.3    1.37         $<10^{-2}$   83.02      $<10^{-2}$   40.79
G2S (DGAT) (TU & COLEY, 2021)          90.3    1.40         $<10^{-2}$   84.21      $<10^{-2}$   41.43

3.2 TYPE 1 AND TYPE 2 VARIANTS OF USPTO

We build two variants of the original USPTO dataset, called USPTO-T1 and USPTO-T2, as instances

of the Type 1 and Type 2 reasoning tasks introduced in Section 2.1. For T1, we multiply all coefﬁcients

by two, in practice duplicating the molecule representations in the SMILES string of a reaction (as

the SMILES format does not allow stoichiometric coefficients to be represented). For T2 we add a

randomly-selected molecule to the reactants. The detailed procedure is discussed in Appendix B.
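Because a reaction SMILES string has the shape `reactants>reagents>products` with molecules separated by dots, both variants reduce to simple string manipulation. The sketch below is an illustration under assumptions, not the procedure of Appendix B: the function names `make_t1` and `make_t2` are hypothetical, reagents are duplicated along with the other molecules in the T1 case, and the extra T2 molecule is supplied by the caller rather than sampled.

```python
def make_t1(reaction_smiles, factor=2):
    """Repeat every molecule `factor` times to emulate multiplied coefficients."""
    def repeat(side):
        return ".".join(side.split(".") * factor) if side else side

    reactants, reagents, products = reaction_smiles.split(">")
    return ">".join([repeat(reactants), repeat(reagents), repeat(products)])


def make_t2(reaction_smiles, extra_molecule):
    """Add one molecule to the reactants and copy it unchanged to the products."""
    reactants, reagents, products = reaction_smiles.split(">")
    return ">".join([
        reactants + "." + extra_molecule,
        reagents,
        products + "." + extra_molecule,
    ])


# Sabatier reaction: CO2 + 4 H2 -> CH4 + 2 H2O over Ni.
sabatier = "O=C=O.[H][H].[H][H].[H][H].[H][H]>[Ni]>C.O.O"
print(make_t1(sabatier))
print(make_t2(sabatier, "O=C=O"))
```

The expected T2 output here copies the extra molecule verbatim, which is only correct when that molecule cannot trigger a further reaction with what is left in the input.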

These two variants should not, in theory, pose signiﬁcant challenges for chemical reaction models.

In the case of USPTO-T1, the reactants contain all the required additional molecules to perform the

reaction twice, so the output should correspond to the same multiple of the original products. This is

even easier considering that the original reactions in USPTO generally involve a single molecule per

type. On the other hand, in the case of USPTO-T2, the added molecule is not sufﬁcient to trigger

multiple reactions, so it should just be copied to the output by the model. As we will show in the next

section, in practice this is not the case.

3.3 CURRENT CHEMICAL MODELS ARE ACTUALLY ALCHEMICAL MODELS

We now evaluate the systematic generalization of state-of-the-art models for reaction prediction

trained on USPTO on the chemically-sound variation and the reasoning variants introduced in

Section 3.1 and Section 3.2. We focus on Transformer-based language models that operate on

SMILES representations (Schwaller et al., 2019; Irwin et al., 2022) as well as Transformers that employ

graph neural networks (GNNs) to parse the molecular structures and be permutation invariant (Tu &

Coley,2021). For additional details about these models, we refer the reader to Appendix B.
