
mathematical reasoning (Rabe et al., 2021), symbolic reasoning (Yang and Deng, 2021), and commonsense reasoning (Li et al., 2019). Impressive signs of progress have been reported in teaching PLMs to gain reasoning ability rather than just memorising knowledge facts (Kassner et al., 2020; Talmor et al., 2020), suggesting that PLMs could serve as effective reasoners for identifying analogies and inferring facts not explicitly/directly seen in the data (Kassner et al., 2020; Ushio et al., 2021).
In particular, deductive reasoning² is one of the most promising directions (Sanyal et al., 2022; Talmor et al., 2020; Li et al., 2019). By definition, deduction yields valid conclusions, which must be true given that their premises are true (Johnson-Laird, 1999). In the NLP community, given all the premises as natural language statements, some large-scale PLMs have been shown to be able to deductively draw appropriate conclusions under proper training schemes (Clark et al., 2020; Talmor et al., 2020). Figure 1 shows an example of the training and inference processes of deductive reasoning.
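To make the setup in Figure 1 concrete, the sketch below illustrates one common way such deductive reasoning is operationalised: natural-language facts and rules are concatenated with a candidate conclusion and scored by an encoder PLM with a binary true/false classification head. The model name, example statements, and label mapping are illustrative assumptions rather than the exact configuration used in our experiments, and the classification head below is untrained; in practice it would first be fine-tuned on labelled reasoning examples.

# A minimal sketch (not the exact experimental setup): premises and rules are
# concatenated with a hypothesis and scored by a PLM with a binary
# true/false classification head. The head is randomly initialised here, so
# the output is only meaningful after fine-tuning on reasoning data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-base"  # illustrative choice of encoder PLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

premises = (
    "Alan is the father of Bob. Bob is the father of Charlie. "
    "If X is the father of Y and Y is the father of Z, then X is the grandfather of Z."
)
hypothesis = "Alan is the grandfather of Charlie."  # a valid deduction from the premises

inputs = tokenizer(premises, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print("P(conclusion holds):", probs[0, 1].item())  # assumes label 1 = "true"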
Despite the promising applications of PLMs, some recent studies have pointed out that they can only perform a shallow level of reasoning on textual data (Helwe et al., 2021). Indeed, PLMs are easily affected by mispriming (Misra et al., 2020) and still hardly differentiate between positive and negative statements (i.e., the so-called negation issue) (Ettinger, 2020). However, given both the evidence that PLMs can learn factual knowledge beyond mere rote memorisation (Heinzerling and Inui, 2021) and the evidence of their limitations (Helwe et al., 2021), it is natural to ask, "Can the current PLMs potentially serve as reliable deductive reasoners over factual knowledge?" To answer this question, as the main contribution of this work, we conduct a comprehensive experimental study testing the learnable deductive reasoning capability of PLMs.
In particular, we test various reasoning training approaches on two knowledge reasoning datasets. Our experimental results indicate that such deductive reasoning training of PLMs (e.g., BERT and RoBERTa) yields strong results on the standard benchmarks, but the trained models inadequately generalise the learned logic rules to unseen cases. That is, they perform inconsistently under simple surface-form perturbations (e.g., synonym substitution, paraphrasing, or negation insertion), advocating a careful rethinking of the details behind the seemingly flawless empirical performance of deductive reasoning with PLMs. We hope our work will inspire further research on probing and improving the deductive reasoning capabilities of PLMs. Our code and data are available online at https://github.com/cambridgeltl/deductive_reasoning_probing.

² This type of reasoning is also often referred to as explicit reasoning in the literature (Broome, 2013; Aditya et al., 2018).
2 Related Work
Knowledge Probing, Infusing, and Editing with PLMs. PLMs appear to memorise (world) knowledge facts during pretraining, and such captured knowledge is useful for knowledge-intensive tasks (Petroni et al., 2019, 2021). A body of recent research has aimed to understand (i) how much knowledge PLMs store, i.e., knowledge probing (Petroni et al., 2019; Meng et al., 2022); (ii) how to inject external knowledge into them, i.e., knowledge infusing (Wang et al., 2021b; Meng et al., 2021); and (iii) how to edit the stored knowledge, i.e., knowledge editing (De Cao et al., 2021). In particular, De Cao et al. (2021) have shown that it is possible to modify a single knowledge fact without affecting the rest of the stored knowledge. However, some empirical evidence suggests that existing PLMs generalise poorly to unseen sentences and are easily misled (Kassner and Schütze, 2020).³ Moreover, this body of research focuses only on investigating how to recall or expose the factual and commonsense knowledge that has been encoded in PLMs, rather than exploring their capability of deriving previously unknown knowledge via deductive reasoning, as done in this work.
Knowledge Reasoning with PLMs. In recent years, PLMs have also achieved impressive progress in knowledge reasoning (Helwe et al., 2021). For example, PLMs can infer a conclusion from a set of knowledge statements and rules (Talmor et al., 2020; Clark et al., 2020), with both the knowledge and the rules being stated explicitly in natural language in the model input. Some generative PLMs, such as T5 (Raffel et al., 2020), are even able to generate natural language proofs that support implications over logical rules expressed in natural language (Tafjord et al., 2021). In particular, some large PLMs, such as LaMDA (Thop-
³ For instance, if we add the talk token into the statement "Birds can [MASK]." (i.e., "Talk. Birds can [MASK]."), the PLM might be misled by the added token and predict talk rather than the originally predicted fly token (Kassner and Schütze, 2020).
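The mispriming probe described in this footnote can be reproduced with an off-the-shelf masked language model; the minimal sketch below uses the Hugging Face fill-mask pipeline, where the choice of bert-base-uncased is an illustrative assumption rather than necessarily the model used in the cited study.

# A minimal sketch of the mispriming probe from footnote 3, assuming a
# BERT-style masked LM accessed via the Hugging Face fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prompt in ["Birds can [MASK].", "Talk. Birds can [MASK]."]:
    predictions = fill_mask(prompt, top_k=3)
    print(prompt, "->", [p["token_str"] for p in predictions])
# A misprimed model may rank "talk" above the originally predicted
# token (e.g., "fly") for the second, primed prompt.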