
Table 1: Statistics of the training, validation and test splits of the USPTO dataset before and after our
rebalancing. Percentages in parentheses are relative to the corresponding split of the original dataset.
                 TOTAL    BALANCED (%)    RE-BALANCED (%)    USPTO-BAL (%)
TRAINING SET     409035   18815 (4.59)    162220 (39.66)     181035 (44.26)
VALIDATION SET   30000    1292 (4.31)     11868 (39.56)      13160 (43.87)
TEST SET         40000    1809 (4.52)     15973 (39.93)      17782 (44.45)
chemical reactions, e.g., by solving the linear system of Eq. (3), empowers the model to solve any
stoichiometry problem, regardless of the actual values of the coefficients observed during training.
To control for these aspects, in addition to the usual in-distribution scenario, where both training
and test coefficients are sampled from the same set of integers S_in, we also consider an
out-of-distribution setting, where the training and test reactions are instantiated with coefficients
coming from the disjoint sets of integers S_in and S_out, respectively. Analogously, we can test a
cross-distribution scenario, where the training set is divided into two halves. The reactions of the
first half are instantiated with coefficients selected from S_in, while the second half’s coefficients are
selected from S_out. The cross-distribution test set contains the same reactions as the training set
but with “swapped stoichiometry”: that is, the first half of the test set has coefficients from S_out,
the second half from S_in. We argue that a well-trained reasoning model should have no problem
semantically disentangling the stoichiometric coefficients from the molecules, thus achieving the same
performance in the in-distribution, cross-distribution, and out-of-distribution scenarios.
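The three scenarios above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the procedure used in the paper: each reaction is paired with a single integer coefficient as a stand-in for its full stoichiometry, and the names `instantiate` and `build_splits` are hypothetical.

```python
import random

def instantiate(reactions, coeff_set, rng):
    """Pair each reaction with one coefficient drawn from coeff_set.

    Assumption: a single integer per reaction stands in for the full
    set of stoichiometric coefficients.
    """
    choices = sorted(coeff_set)
    return [(rxn, rng.choice(choices)) for rxn in reactions]

def build_splits(train_rxns, test_rxns, s_in, s_out, seed=0):
    """Build in-, out-of- and cross-distribution splits (hypothetical helper)."""
    if set(s_in) & set(s_out):
        raise ValueError("S_in and S_out must be disjoint")
    rng = random.Random(seed)
    half = len(train_rxns) // 2
    first, second = train_rxns[:half], train_rxns[half:]
    return {
        # In-distribution: train and test coefficients both come from S_in.
        "train_in": instantiate(train_rxns, s_in, rng),
        "test_in": instantiate(test_rxns, s_in, rng),
        # Out-of-distribution: train on S_in, test coefficients come from S_out.
        "test_out": instantiate(test_rxns, s_out, rng),
        # Cross-distribution: same reactions in train and test, swapped sets.
        "train_cross": instantiate(first, s_in, rng) + instantiate(second, s_out, rng),
        "test_cross": instantiate(first, s_out, rng) + instantiate(second, s_in, rng),
    }
```

In this sketch the out-of-distribution scenario trains on `train_in` and evaluates on `test_out`, while the cross-distribution scenario trains on `train_cross` and evaluates on `test_cross`.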
3 CAN TRANSFORMERS PERFORM ALGEBRAIC REASONING?
In this section, we question whether current deep learning models can effectively solve the algebraic
reasoning tasks induced by the stoichiometry of chemical reactions, as discussed in Section 2. In
order to do so, we first need to build a suitable dataset of chemical reactions that can be processed by
state-of-the-art models for reaction prediction. We then evaluate these models on some meaningful
variations of the task and discuss their limitations.
3.1 HOW TO BUILD A BENCHMARK OF BALANCED REACTIONS
A natural candidate is the USPTO dataset (Lowe, 2012), a large collection of chemical reactions
extracted from US patents from 1976 to 2016 and represented in the SMILES format (Weininger,
1988). SMILES reactions are text strings encoding a linearization of the molecular graph of each
molecule participating in the reaction.³ We employ the version popularized by Jin et al. (2017),
often referred to as USPTO-MIT. This version contains a curated subset of reactions, obtained
by removing duplicate and incorrect ones. For simplicity, in the rest of this paper we will refer
to “USPTO-MIT” as simply “USPTO”. This dataset is composed of 409k training reactions, 30k
validation reactions and 40k test reactions (see Table 1).
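To make the SMILES reaction format concrete, the sketch below splits a reaction string into its molecules. It assumes the common `reactants>agents>products` layout of reaction SMILES, with `.` separating molecules on each side; the helper name `split_reaction_smiles` and the example reaction are illustrative and not taken from the dataset.

```python
def split_reaction_smiles(rxn):
    """Split 'reactants>agents>products' into three lists of molecule SMILES."""
    reactants, agents, products = rxn.split(">")

    def to_molecules(side):
        # An empty side (e.g. no agents) yields an empty list.
        return [mol for mol in side.split(".") if mol]

    return to_molecules(reactants), to_molecules(agents), to_molecules(products)

# Illustrative esterification: acetic acid + ethanol -> ethyl acetate.
# The water byproduct is deliberately left out, as in many patent records.
reactants, agents, products = split_reaction_smiles("CC(=O)O.CCO>>CC(=O)OCC")
```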
Unfortunately, reactions in the USPTO dataset are mostly unbalanced: we found that fewer than 4.6%
of the reactions in the dataset are mass balanced and readily usable for our tasks. For the remaining
reactions, in many cases only the major product of the reaction was recorded, while minor byproducts
such as H₂O or HCl were disregarded, probably because they were not deemed interesting from a
patent perspective. Examples of these reactions are illustrated in Fig. 1. While this does not impact
patents, it can affect the generalization capability of the learned models, as we will show next.
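A mass-balance check amounts to comparing atom counts on the two sides of a reaction. The following minimal sketch works on flat molecular formulas (e.g. `C2H6O`) rather than SMILES, because SMILES leaves most hydrogens implicit and would require a cheminformatics toolkit to count them. The function names are hypothetical, and formulas with parentheses or charges are not handled.

```python
import re
from collections import Counter

def atom_counts(formula):
    """Count atoms in a flat molecular formula such as 'C2H4O2' or 'HCl'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def side_counts(formulas):
    """Total atom counts over all molecules on one side of a reaction."""
    total = Counter()
    for formula in formulas:
        total += atom_counts(formula)
    return total

def is_mass_balanced(reactants, products):
    """True if every element appears equally often on both sides."""
    return side_counts(reactants) == side_counts(products)

# Acetic acid + ethanol -> ethyl acetate, recorded without the water byproduct.
unbalanced = is_mass_balanced(["C2H4O2", "C2H6O"], ["C4H8O2"])
# The same reaction with H2O added to the products.
balanced = is_mass_balanced(["C2H4O2", "C2H6O"], ["C4H8O2", "H2O"])
```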
Fortunately, the missing byproducts can sometimes be deduced from the unbalanced reaction: if the
atoms present on one side of the reaction but missing from the other are enough to unambiguously
reconstruct one (or more) valid molecules, we add the reconstructed molecules to the side that lacks them.
The rightmost column in Table 1 shows the number of reactions contained in the balanced subset of
USPTO that we used as the basis for the augmentation procedures described in Section 4. Using this
strategy, we were able to re-balance about 40% of the original reactions in each of the training,
validation and test sets; together with the reactions that were already balanced, the balanced subset
covers about 44% of the original data. We denote this re-balanced version of USPTO as USPTO-BAL
and use it next.
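The re-balancing idea can be illustrated with a deliberately simplified sketch. It is not the authors' procedure: it works on flat molecular formulas instead of SMILES, it only recovers a single missing molecule, and it looks that molecule up in a tiny hypothetical table `CANDIDATE_BYPRODUCTS`. A reaction is left untouched (the function returns `None`) whenever the missing atoms do not match exactly one candidate.

```python
import re
from collections import Counter

# Hypothetical table of small byproducts, given as flat molecular formulas.
CANDIDATE_BYPRODUCTS = ["H2O", "HCl"]

def atom_counts(formula):
    """Count atoms in a flat molecular formula such as 'C2H4O2' or 'HCl'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def missing_atoms(reactants, products):
    """Return (atoms missing from the products, atoms missing from the reactants)."""
    left = sum((atom_counts(f) for f in reactants), Counter())
    right = sum((atom_counts(f) for f in products), Counter())
    # Counter subtraction keeps only positive counts.
    return left - right, right - left

def rebalance(reactants, products):
    """Add one unambiguous byproduct to the deficient side, or return None."""
    missing_right, missing_left = missing_atoms(reactants, products)
    if not missing_right and not missing_left:
        return reactants, products  # already balanced
    if missing_right and missing_left:
        return None  # both sides lack atoms: not handled in this sketch
    deficit = missing_right or missing_left
    matches = [m for m in CANDIDATE_BYPRODUCTS if atom_counts(m) == deficit]
    if len(matches) != 1:
        return None  # no candidate, or an ambiguous reconstruction
    if missing_right:
        return reactants, products + matches
    return reactants + matches, products
```

For example, under these assumptions `rebalance(["C2H4O2", "C2H6O"], ["C4H8O2"])` adds `H2O` to the product side, whereas a reaction whose missing atoms match no candidate is rejected.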
³ Note that SMILES strings represent hydrogen implicitly, when the resulting molecule is not ambiguous.