CHEMALGEBRA: ALGEBRAIC REASONING ON
CHEMICAL REACTIONS
Andrea Valenti
Department of Computer Science
University of Pisa
Pisa, Italy
andrea.valenti@phd.unipi.it
Davide Bacciu
Department of Computer Science
University of Pisa
Pisa, Italy
davide.bacciu@unipi.it
Antonio Vergari
School of Informatics
University of Edinburgh
Edinburgh, Scotland
avergari@ed.ac.uk
ABSTRACT
While deep learning models show impressive performance on various kinds of learning tasks, it is yet
unclear whether they have the ability to robustly tackle reasoning tasks. Measuring the robustness
of reasoning in machine learning models is challenging as one needs to provide a task that cannot be easily shortcut by exploiting
spurious statistical correlations in the data, while operating on complex objects and
constraints. To address this issue, we propose CHEMALGEBRA, a benchmark for
measuring the reasoning capabilities of deep learning models through the predic-
tion of stoichiometrically-balanced chemical reactions. CHEMALGEBRA requires
manipulating sets of complex discrete objects – molecules represented as formulas
or graphs – under algebraic constraints such as the mass preservation principle. We
believe that CHEMALGEBRA can serve as a useful test bed for the next generation
of machine reasoning models and as a promoter of their development.
1 INTRODUCTION
Deep learning models, and Transformer architectures in particular, currently achieve the state-of-the-
art for a number of application domains such as natural language and audio processing, computer
vision, and computational chemistry (Lin et al., 2021; Khan et al., 2021; Braşoveanu & Andonie,
2020). Given enough data and enough parameters to fit, these models are able to learn intricate
correlations (Brown et al., 2020). This impressive performance on machine learning tasks suggests
that they could be suitable candidates for machine reasoning tasks (Helwe et al., 2021).
Reasoning is the ability to manipulate a knowledge representation into a form that is more suitable
to solve a new problem (Bottou, 2014; Garcez et al., 2019). In particular, algebraic reasoning
includes a set of reasoning manipulations such as abstraction, arithmetic operations, and systematic
composition over complex objects. Algebraic reasoning is related to the ability of a learning system
to perform systematic generalization (Marcus, 2003; Bahdanau et al., 2018; Sinha et al., 2019), i.e. to
robustly make predictions beyond the data distribution it has been trained on. This is inherently more
challenging than discovering correlations from data, as it requires the learning system to actually
capture the true underlying mechanism for the specific task (Pearl, 2009; Marcus, 2018).
Lately, much attention has been put on training Transformers to learn how to reason (Helwe et al.,
2021; Al-Negheimish et al., 2021; Storks et al., 2019; Gontier et al., 2020). This is usually done
by embedding an algebraic reasoning problem in a natural language formulation. Natural language,
despite its flexibility, is imprecise and prone to shortcuts (Geirhos et al., 2020). As a result, it is often
difficult to determine whether the models' performance on reasoning tasks is genuine or merely
due to the exploitation of spurious statistical correlations in the data. Several works in this direction
(Agrawal et al., 2016; Jia & Liang, 2017; Helwe et al., 2021) suggest the latter is probably the case.
In order to effectively assess the reasoning capabilities of deep learning models, we need to accurately
design tasks that i) operate on complex objects, ii) require algebraic reasoning to be carried out, and
iii) cannot be shortcut by exploiting latent correlations in the data. We identify chemical reaction
prediction as a suitable candidate for these desiderata. First, chemical reactions can be naturally
interpreted as transformations over bags of complex objects: reactant molecules are turned into
product molecules by manipulating their graph structures while abiding by certain constraints such as the
law of mass conservation. Second, these transformations can be analysed as algebraic operations over
(sub-)graphs (e.g., by observing bonds forming and dissolving (Bradshaw et al., 2019)), and balancing
them to preserve mass conservation can be formalised as solving a linear system of equations, as
we will show in Section 2. Third, the language of chemical molecules and reactions is much less
ambiguous than natural language and by controlling the stoichiometric coefficients, i.e., the molecule
multiplicities, at training and test time we can more precisely measure systematic generalization.
Lastly, Transformers already excel at learning reaction predictions (Tetko et al., 2020; Irwin et al.,
2022).¹ Therefore, we think this can be a solid test bed to measure the current gap between learning
and reasoning capabilities of modern deep learning models.

¹An extended overview of the related works in chemical reaction prediction is given in Appendix A.
The main contributions of this paper are the following:

1. We cast chemical reaction prediction as a reasoning task where the learner has to predict not only
a set of products but also the correct stoichiometric coefficient variations (Section 2).

2. We evaluate the current state-of-the-art Transformers for chemical reaction predictions,
showing that they fail to robustly generalise when reasoning on simple variants of the
chemical reaction dataset they have been trained on (Section 3).

3. We introduce CHEMALGEBRA as a novel challenging benchmark for machine reasoning, in
which we can more precisely measure the ability of deep learning models to algebraically
reason over bags of graphs in in-, cross- and out-of-distribution settings (Section 4).
2 PREDICTING CHEMICAL REACTIONS AS ALGEBRAIC REASONING
To illustrate our point, let us consider the Sabatier reaction: it yields methane (CH4) and water (H2O)
out of hydrogen (H2) and carbon dioxide (CO2), in the presence of nickel (Ni) as a catalyst. In
chemical formulas:

$$1\,\mathrm{CO_2} + 4\,\mathrm{H_2} \;\xrightarrow{\;\mathrm{Ni}\;}\; 1\,\mathrm{CH_4} + 2\,\mathrm{H_2O} \tag{1}$$
where formulas encode complex graph structures in which atoms are nodes and chemical bonds are edges.
A reaction prediction learning task hence consists of outputting a bag of graphs, the products (right
hand side), given the bag of graphs consisting of reactants (left hand side) and reagents (i.e. the
catalysts, over the reaction’s arrow).
The multiplicities of the molecules, also called their stoichiometric coefficients, express the
proportions of reactants needed to yield a certain proportion of products. For example, one needs a 4:1 ratio
of hydrogen molecules and carbon dioxide to produce a 1:2 ratio of methane and water. A reaction is
(mass) balanced when its stoichiometric coefficients are set such that, for every element, the total
number of atoms across the products equals that across the reactants, i.e., it satisfies
the principle of mass conservation (Whitaker, 1975). Unbalanced reactions, on the other hand, would
be chemically implausible.
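As an illustrative sketch of this balance condition (a Python example of ours, not part of the benchmark code), each molecule can be encoded as a map from element symbols to atom counts, and a reaction is balanced when the coefficient-weighted atom totals agree on both sides; all names below are hypothetical.

```python
from collections import Counter

def side_atom_counts(side):
    """Sum atom counts over one side of a reaction.

    `side` is a list of (coefficient, formula) pairs, where each formula
    maps an element symbol to its atom count in a single molecule.
    """
    totals = Counter()
    for coeff, formula in side:
        for element, count in formula.items():
            totals[element] += coeff * count
    return totals

def is_balanced(reactants, products):
    """A reaction is mass-balanced when every element occurs the same
    number of times on both sides."""
    return side_atom_counts(reactants) == side_atom_counts(products)

# Sabatier reaction: 1 CO2 + 4 H2 -> 1 CH4 + 2 H2O (Eq. 1, catalyst omitted).
CO2, H2 = {"C": 1, "O": 2}, {"H": 2}
CH4, H2O = {"C": 1, "H": 4}, {"H": 2, "O": 1}
print(is_balanced([(1, CO2), (4, H2)], [(1, CH4), (2, H2O)]))   # True
```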
This constraint over the atoms of the molecules underlies the true chemical mechanism behind reactions:
bonds between atoms break and form under certain conditions, but atoms do not change. In reasoning
terms, this is a symbol-manipulating process where bags of graphs are deconstructed into other bags
of graphs. A machine reasoning system that had learned this true chemical mechanism would
be able to perfectly solve the chemical reaction prediction task for all balanced reactions and for all
possible variations of stoichiometric coefficients.
As humans, we can balance fairly complex chemical reactions quite easily.² For machines, this
process can be formalised as finding a solution of a potentially underdetermined system of linear
equations. For example, we can write the Sabatier reaction as:
$$r_1 \cdot \mathrm{CO_2} + r_2 \cdot \mathrm{H_2} + r_3 \cdot \mathrm{Ni} = p_1 \cdot \mathrm{CH_4} + p_2 \cdot \mathrm{H_2O} + p_3 \cdot \mathrm{Ni} \tag{2}$$
where the variables $r_1, r_2, r_3, p_1, p_2, p_3$ represent the stoichiometric coefficients of the molecules
they refer to. Molecules of reagents act as a special kind of confounders: since they are not changed
during the reaction, they must appear on both sides of the equation with the same coefficient (i.e.,
$r_3 = p_3$). Under this rewriting, it becomes even more evident how a full chemical reaction can be
interpreted as an algebraic equation where the stoichiometric coefficients are the unknown variables.
Then, we can represent the molecule of CO2 with the vector $[1, 0, 2, 0]$, indicating one atom of carbon,
zero hydrogens, two oxygens, and zero nickels. Analogously, H2 can be encoded as $[0, 2, 0, 0]$, and
so on. Therefore, Eq. (1) can be rewritten as the following linear system:

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 2 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 4 & 2 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \\ p_3 \end{bmatrix} \tag{3}$$

It is straightforward to verify that the minimum norm solution of Eq. (3) is $r = [1, 4, 1]$, $p = [1, 2, 1]$,
thus conforming to Eq. (1). We now devise a set of reasoning tasks exploiting this perspective.
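Before turning to those tasks, the following sketch (an illustration of ours using numpy, not code from the paper) shows how Eq. (3) can be solved numerically: factoring out the catalyst, whose coefficient is identical on both sides, the remaining coefficients span the null space of a small element-by-molecule matrix.

```python
import numpy as np

# Eq. (3) without the catalyst: reagents appear unchanged on both sides
# (r3 = p3), so we can balance CO2 + H2 -> CH4 + H2O alone and re-attach Ni.
# Rows: elements C, H, O; columns: CO2, H2, -CH4, -H2O (product columns are
# negated so that A @ x = 0 expresses mass conservation for x = [r1, r2, p1, p2]).
A = np.array([
    [1, 0, -1,  0],   # carbon
    [0, 2, -4, -2],   # hydrogen
    [2, 0,  0, -1],   # oxygen
], dtype=float)

_, _, vt = np.linalg.svd(A)
x = vt[-1]                      # basis of the (here 1-dimensional) null space
x /= np.abs(x).min()            # rescale so the smallest coefficient is 1
# A chemically valid solution has all coefficients of the same sign,
# so taking absolute values resolves the sign ambiguity of the SVD.
x = np.rint(np.abs(x)).astype(int)

r, p = x[:2], x[2:]
print("r =", [*r, 1], "p =", [*p, 1])   # expected: r = [1, 4, 1], p = [1, 2, 1]
```

Keeping the catalyst columns would make the null space two-dimensional (one extra dimension for the free amount of catalyst), which is why the reagent is factored out before solving.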
2.1 "TYPE 1" AND "TYPE 2" CHEMICAL REACTION REASONING TASKS
We propose to cast chemical reaction prediction as a reasoning task, where the model has to predict
both the graphs corresponding to the product molecules (i.e. the right hand side of the reaction) and
the exact multiplicities of such molecules, given a particular input consisting of reagents and reactants
equipped with varying stoichiometric coefficients. This is in stark contrast with the vanilla reaction
prediction learning setting: in it, models are trained on reactions without stoichiometric information,
largely on unbalanced reactions (see Section 3.2); in our setting, stoichiometric coefficients are not
only present but, at prediction time, can greatly differ from those the model has seen at training time.
By doing so, we can better control and measure systematic generalization.
For example, given a reference reaction, we can multiply all of its stoichiometric coefficients by a factor.
For the Sabatier reaction, we would obtain the following input-output pair for a factor of two:

$$2\,\mathrm{CO_2} + 8\,\mathrm{H_2} + 2\,\mathrm{Ni}\ (\text{INPUT}) \;\longrightarrow\; 2\,\mathrm{CH_4} + 4\,\mathrm{H_2O} + 2\,\mathrm{Ni}\ (\text{OUTPUT}) \tag{4}$$
In the rest of this paper, we will refer to this type of reasoning task as a Type 1 task. Alternatively, we
can just add a certain number of molecules on the left-hand side. Some of these additional molecules
might not take part in the reaction, since there might not be enough reactants to bond with. For
example, if we add two CO2 and two Ni to the Sabatier reaction, we expect a model to predict them on the
right-hand side in addition to its usual outputs:

$$3\,\mathrm{CO_2} + 4\,\mathrm{H_2} + 3\,\mathrm{Ni}\ (\text{INPUT}) \;\longrightarrow\; \mathrm{CH_4} + 2\,\mathrm{H_2O} + 3\,\mathrm{Ni} + 2\,\mathrm{CO_2}\ (\text{OUTPUT}) \tag{5}$$
We refer to this as a Type 2 task. Type 2 reactions are harder to reason about than Type 1, but should
still be easy for a machine learning model that has learned the true underlying chemical mechanism.
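As a minimal sketch of how such variants can be derived from a balanced reaction (an illustration of ours; the benchmark's exact construction is described later in the paper), a reaction can be stored as two multisets of molecules: a Type 1 instance scales every multiplicity by a factor, and a Type 2 instance adds spectator molecules to both sides.

```python
from collections import Counter

# A balanced reaction as multisets of molecule identifiers (here formulas);
# the reagent Ni appears on both sides because it is unchanged.
sabatier_in = Counter({"CO2": 1, "H2": 4, "Ni": 1})
sabatier_out = Counter({"CH4": 1, "H2O": 2, "Ni": 1})

def type1(reaction_in, reaction_out, factor=2):
    """Type 1: multiply every stoichiometric coefficient by `factor`."""
    scale = lambda side: Counter({m: c * factor for m, c in side.items()})
    return scale(reaction_in), scale(reaction_out)

def type2(reaction_in, reaction_out, extra):
    """Type 2: add spectator molecules that cannot react; they must be
    copied unchanged to the product side."""
    new_in, new_out = reaction_in.copy(), reaction_out.copy()
    new_in.update(extra)
    new_out.update(extra)
    return new_in, new_out

print(type1(sabatier_in, sabatier_out))                        # matches Eq. (4)
print(type2(sabatier_in, sabatier_out, {"CO2": 2, "Ni": 2}))   # matches Eq. (5)
```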
There is one final aspect to consider: if the test coefficients are sampled from the same data distribution
as the training coefficients, there is still a chance that the model would be able to predict them just by
pattern matching. Conversely, learning the actual algebraic reasoning behind the stoichiometry of
chemical reactions, e.g., by solving the linear system of Eq. (3), empowers the model to solve any
stoichiometry problem, regardless of the actual values of the coefficients observed during training.

²We learn to do it from very few examples, e.g. a handful of reactions taken from chemistry textbooks in
high school. Without following an explicit algorithm, we can usually perform balancing in an intuitive way, by
leveraging our quick arithmetic skills to count the atoms of an element and match the numbers on both sides of
the equation, iteratively changing the stoichiometric coefficients until all elements are balanced.

Table 1: Statistics of the training, validation and test splits of the USPTO dataset before and after our
rebalancing. Percentages in parentheses are w.r.t. the original dataset.

                   TOTAL     BALANCED (%)    RE-BALANCED (%)    USPTO-BAL (%)
TRAINING SET       409035    18815 (4.59)    162220 (39.66)     181035 (44.26)
VALIDATION SET     30000     1292 (4.31)     11868 (39.56)      13160 (43.87)
TEST SET           40000     1809 (4.52)     15973 (39.93)      17782 (44.45)
To control for these aspects, in addition to the usual in-distribution scenario where both training
and test coefficients are sampled from the same set of integer numbers $S_{\text{in}}$, we also consider an
out-of-distribution setting where the training and test reactions are instantiated with coefficients
coming from the disjoint sets of integers $S_{\text{in}}$ and $S_{\text{out}}$, respectively. Analogously, we can test a
cross-distribution scenario, where the training set is divided into two halves. The reactions of the
first half are instantiated with coefficients selected from $S_{\text{in}}$, while the second half's coefficients are
selected from $S_{\text{out}}$. The cross-distribution test set will contain the same reactions as the training set
but with "swapped stoichiometry": that is, the first half of the test set will have coefficients from $S_{\text{out}}$,
the second half from $S_{\text{in}}$. We argue that a well-trained reasoning model should have no problem
semantically disentangling the stoichiometric coefficients from the molecules, thus achieving the same
performance in the in-distribution, cross-distribution, and out-of-distribution scenarios.
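The three regimes can be sketched as follows, assuming two disjoint coefficient pools; the concrete pools and the exact instantiation procedure used by CHEMALGEBRA are defined in Section 4, so the pool values and function names below are illustrative only.

```python
import random

S_IN = [1, 2, 3, 4, 5]       # illustrative in-distribution coefficient pool
S_OUT = [6, 7, 8, 9, 10]     # illustrative out-of-distribution pool (disjoint)

def instantiate(reaction, pool, rng):
    """Scale a balanced (reactants, products) pair by a multiplier drawn
    from `pool`, so the instantiated reaction stays balanced."""
    k = rng.choice(pool)
    reactants, products = reaction
    scale = lambda side: {m: c * k for m, c in side.items()}
    return scale(reactants), scale(products)

def make_splits(reactions, seed=0):
    rng = random.Random(seed)
    half = len(reactions) // 2
    # in-distribution: train and test coefficients from the same pool
    train_in = [instantiate(r, S_IN, rng) for r in reactions]
    test_in = [instantiate(r, S_IN, rng) for r in reactions]
    # out-of-distribution: test coefficients from the disjoint pool
    test_out = [instantiate(r, S_OUT, rng) for r in reactions]
    # cross-distribution: the two training halves use different pools,
    # and the test set swaps them ("swapped stoichiometry")
    train_x = ([instantiate(r, S_IN, rng) for r in reactions[:half]] +
               [instantiate(r, S_OUT, rng) for r in reactions[half:]])
    test_x = ([instantiate(r, S_OUT, rng) for r in reactions[:half]] +
              [instantiate(r, S_IN, rng) for r in reactions[half:]])
    return train_in, test_in, test_out, train_x, test_x
```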
3 CAN TRANSFORMERS PERFORM ALGEBRAIC REASONING?
In this section, we question whether current deep learning models can effectively solve the algebraic
reasoning tasks induced by the stoichiometry of chemical reactions, as discussed in Section 2. In
order to do so, we first need to build a suitable dataset of chemical reactions that can be processed by
state-of-the-art models for reaction predictions. We then evaluate these models on some meaningful
variations of the task and discuss their limitations.
3.1 HOW TO BUILD A BENCHMARK OF BALANCED REACTIONS
A natural candidate is the USPTO-MIT dataset (Lowe, 2012), a large collection of chemical reactions
extracted from US patents from 1976 to 2016 and represented in the SMILES format (Weininger,
1988). SMILES reactions are text strings encoding a linearization of the molecular graph of each
molecule participating in the reaction.³ We employ the version popularized by Jin et al. (2017),
often referred to as USPTO-MIT. This version contains a subset of polished reactions, obtained
by removing duplicate and incorrect ones. For simplicity, in the rest of this paper we will refer
to "USPTO-MIT" as simply "USPTO". This dataset is composed of 409k training reactions, 30k
validation reactions and 40k test reactions (see Table 1).
Unfortunately, reactions in the USPTO dataset are mostly unbalanced: we found that less than 4.6%
of the reactions in the dataset are mass balanced and readily usable for our tasks. For the remaining
reactions, in many cases only the major product of the reaction was recorded, while disregarding
minor byproducts, such as H2O or HCl, probably because they were not deemed interesting from a patent perspective.
Examples of these reactions are illustrated in Fig. 1. While this does not impact patents, it can affect
the generalization capability of the learned models, as we will show next.
Fortunately, the missing byproducts can sometimes be deduced from the unbalanced reaction: if the
set of missing atoms on one side of the reaction is enough to unambiguously reconstruct one (or more)
valid molecules on the other side, we add such reconstructed molecules to that side of the reaction.
The rightmost column in Table 1 shows the number of reactions contained in the balanced subset of
USPTO that we used as the basis for the augmentation procedures described in Section 4. Using this
strategy, we were able to re-balance about 44% of the original reactions in the training, validation
and test sets. We denote this re-balanced version of USPTO as USPTO-BAL and use it next.
³Note that SMILES strings represent hydrogen implicitly, when the resulting molecule is not ambiguous.
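The re-balancing idea can be sketched roughly as follows (a simplified illustration of ours, not the actual script: it only adds byproducts to the product side and only checks a small hand-written byproduct list): compute the per-element atom deficit between the two sides and test whether it matches an integer multiple of a known small byproduct.

```python
from collections import Counter

# A few small byproducts that patents often omit (illustrative list).
KNOWN_BYPRODUCTS = {
    "H2O": Counter({"H": 2, "O": 1}),
    "HCl": Counter({"H": 1, "Cl": 1}),
}

def atom_totals(side):
    """Sum atom counts over a list of per-molecule element->count dicts."""
    totals = Counter()
    for formula in side:
        totals.update(formula)
    return totals

def rebalance(reactant_formulas, product_formulas):
    """Try to re-balance a reaction by adding copies of one known byproduct
    to the product side. Returns (byproduct, multiplicity) or None."""
    missing = atom_totals(reactant_formulas) - atom_totals(product_formulas)
    if not missing:
        return None                      # nothing missing on the product side
    for name, formula in KNOWN_BYPRODUCTS.items():
        for n in range(1, 5):            # try a few multiplicities
            if missing == Counter({e: n * c for e, c in formula.items()}):
                return (name, n)
    return None                          # cannot be unambiguously re-balanced

# Example: acetic acid + methanol -> methyl acetate, with H2O missing on the right.
acetic_acid = Counter({"C": 2, "H": 4, "O": 2})
methanol = Counter({"C": 1, "H": 4, "O": 1})
methyl_acetate = Counter({"C": 3, "H": 6, "O": 2})
print(rebalance([acetic_acid, methanol], [methyl_acetate]))   # ('H2O', 1)
```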
Figure 1: Examples of re-balanced reactions from the USPTO-BAL dataset. The inferred byproducts are
shown between brackets (the reagents have been omitted for simplicity).
Table 2: Results of state-of-the-art Transformers for reaction prediction on the USPTO dataset and
the BAL, T1 and T2 variations. We report the top-1 accuracy (ACC) for all models. We also report
the at-least-one (ALO) accuracy for the T1 and T2 variants. All metrics are reported as percentages.

                                         USPTO   USPTO-BAL   USPTO-T1          USPTO-T2
STATE-OF-THE-ART MODEL                   ACC     ACC         ACC      ALO      ACC      ALO
MOLTRANS. (SCHWALLER ET AL., 2019)       90.4    1.39        0.04     56.66    0.03     28.32
CHEMFORMER (IRWIN ET AL., 2022)          92.8    <10^-2      0.23     34.41    <10^-2   2.61
G2S (DGCN) (TU & COLEY, 2021)            90.3    1.37        <10^-2   83.02    <10^-2   40.79
G2S (DGAT) (TU & COLEY, 2021)            90.3    1.40        <10^-2   84.21    <10^-2   41.43
3.2 TYPE 1 AND TYPE 2 VARIANTS OF USPTO
We build two variants of the original USPTO dataset, called USPTO-T1 and USPTO-T2, as instances
of the Type 1 and Type 2 reasoning tasks introduced in Section 2.1. For T1, we multiply all coefficients
by two, in practice duplicating the molecule representations in the SMILES string of a reaction (as
the SMILES format does not allow representing stoichiometric coefficients). For T2, we add a
randomly-selected molecule to the reactants. The detailed procedure is discussed in Appendix B.
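At the SMILES level these constructions reduce to simple string manipulations, since molecules within a reaction SMILES are separated by '.' and the three sides (reactants, reagents, products) by '>'; the sketch below is our own illustration of this idea and may differ from the exact procedure in Appendix B.

```python
import random

def to_type1(rxn_smiles, factor=2):
    """Duplicate every molecule `factor` times, since SMILES cannot express
    stoichiometric coefficients explicitly."""
    def dup(part):
        mols = part.split(".") if part else []
        return ".".join(m for m in mols for _ in range(factor))
    reactants, reagents, products = rxn_smiles.split(">")
    return ">".join([dup(reactants), dup(reagents), dup(products)])

def to_type2(rxn_smiles, pool, rng=None):
    """Add one randomly chosen spectator molecule to the reactants; it must
    also appear unchanged among the products."""
    rng = rng or random.Random(0)
    extra = rng.choice(pool)
    reactants, reagents, products = rxn_smiles.split(">")
    return ">".join([reactants + "." + extra, reagents, products + "." + extra])

# Sabatier reaction in reaction-SMILES form (Ni as the reagent).
sabatier = "O=C=O.[H][H].[H][H].[H][H].[H][H]>[Ni]>C.O.O"
print(to_type1(sabatier))
print(to_type2(sabatier, pool=["O=C=O"]))
```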
These two variants should not, in theory, pose significant challenges for chemical reaction models.
In the case of USPTO-T1, the reactants contain all the required additional molecules to perform the
reaction twice, so the output should correspond to the same multiple of the original products. This is
even easier considering that the original reactions in USPTO generally involve a single molecule per
type. On the other hand, in the case of USPTO-T2, the added molecule is not sufficient to trigger
multiple reactions, so it should just be copied to the output by the model. As we will show in the next
section, in practice this is not the case.
3.3 CURRENT CHEMICAL MODELS ARE ACTUALLY ALCHEMICAL MODELS
We now evaluate the systematic generalization of state-of-the-art models for reaction prediction
trained on USPTO on the chemically-sound variation and the reasoning variants introduced in
Section 3.1 and Section 3.2. We focus on Transformer-based language models that operate on
SMILES representations (Schwaller et al., 2019; Irwin et al., 2022) as well as Transformers that employ
graph neural networks (GNNs) to parse the molecule structures and be permutation invariant (Tu &
Coley, 2021). For additional details about these models, we refer the reader to Appendix B.