Can Pretrained Language Models (Yet) Reason Deductively?
Zhangdie Yuan♦∗, Songbo Hu♠∗, Ivan Vulić♠, Anna Korhonen♠, Zaiqiao Meng♦♥†

♦Department of Computer Science and Technology, University of Cambridge
♠Language Technology Lab, University of Cambridge
♥School of Computing Science, University of Glasgow

{zy317,sh2091,iv250,alk23}@cam.ac.uk
zaiqiao.meng@glasgow.ac.uk

∗Equal contribution. †Corresponding author.
Abstract
Acquiring factual knowledge with Pretrained Language Models (PLMs) has attracted increasing attention, showing promising performance in many knowledge-intensive tasks. Their good performance has led the community to believe that the models do possess a modicum of reasoning competence rather than merely memorising the knowledge. In this paper, we conduct a comprehensive evaluation of the learnable deductive (also known as explicit) reasoning capability of PLMs. Through a series of controlled experiments, we posit two main findings: (i) PLMs inadequately generalise learned logic rules and perform inconsistently against simple adversarial surface form edits; (ii) while deductive reasoning fine-tuning of PLMs does improve their performance on reasoning over unseen knowledge facts, it results in catastrophic forgetting of the previously learnt knowledge. Our main results suggest that PLMs cannot yet perform reliable deductive reasoning, demonstrating the importance of controlled examinations and probing of PLMs' deductive reasoning abilities; we reach beyond (misleading) task performance, revealing that PLMs are still far from robust reasoning capabilities, even for simple deductive tasks.
1 Introduction
Pretrained Language Models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have orchestrated tremendous progress in NLP across a large variety of downstream applications. For knowledge-intensive tasks in particular, these large-scale PLMs are surprisingly good at memorising factual knowledge presented in pretraining corpora (Petroni et al., 2019; Jiang et al., 2020b) and infusing knowledge from external sources (Wang et al., 2021a; Zhou et al., 2022, among others), demonstrating their effectiveness in learning and capturing knowledge.

Figure 1: Training and inference for deductive reasoning. Given the explicit premises (a), the input BERT model is trained to be transformed into a reasoner model (R-BERT) by deductively predicting a previously unseen conclusion (b). This inference process requires R-BERT to understand factual knowledge and interpret rules (e.g. taxonomic relations) intervening directly in the deduction process.
Automatic reasoning, a systematic process of deriving previously unknown conclusions from given formal representations of knowledge (Lenat et al., 1990; Newell and Simon, 1956), has been a long-standing goal of AI research. In the NLP community, a modern view of this problem (Clark et al., 2020), where the formal representations of knowledge are substituted by natural language statements, has recently received increasing attention,¹ yielding multiple exploratory research directions: mathematical reasoning (Rabe et al., 2021), symbolic reasoning (Yang and Deng, 2021), and commonsense reasoning (Li et al., 2019). Impressive signs of progress have been reported in teaching PLMs to gain reasoning ability rather than just memorising knowledge facts (Kassner et al., 2020; Talmor et al., 2020), suggesting that PLMs could serve as effective reasoners for identifying analogies and inferring facts not explicitly/directly seen in the data (Kassner et al., 2020; Ushio et al., 2021).

¹ Following Clark et al. (2020), we also define natural language rules as linguistic expressions of conjunctive implications, $condition~[\wedge~condition]^{*} \rightarrow conclusion$, with the semantics of logic programs with negations (Apt et al., 1988).
In particular, deductive reasoning² is one of the most promising directions (Sanyal et al., 2022; Talmor et al., 2020; Li et al., 2019). By definition, deduction yields valid conclusions, which must be true given that their premises are true (Johnson-Laird, 1999). In the NLP community, given all the premises as natural language statements, some large-scale PLMs have been shown to be able to deductively draw appropriate conclusions under proper training schemes (Clark et al., 2020; Talmor et al., 2020). Figure 1 shows an example of the training and inference processes of deductive reasoning.

² This type of reasoning is also often referred to as explicit reasoning in the literature (Broome, 2013; Aditya et al., 2018).
Despite promising applications of PLMs, some recent studies have pointed out that they can only perform a shallow level of reasoning on textual data (Helwe et al., 2021). Indeed, PLMs can be easily affected by mispriming (Misra et al., 2020) and still hardly differentiate between positive and negative statements (i.e., the so-called negation issue) (Ettinger, 2020). However, given both the evidence that PLMs can learn factual knowledge beyond mere rote memorisation (Heinzerling and Inui, 2021) and their known limitations (Helwe et al., 2021), it is natural to ask: "Can the current PLMs potentially serve as reliable deductive reasoners over factual knowledge?" To answer this question, as the main contribution of this work, we conduct a comprehensive experimental study testing the learnable deductive reasoning capability of PLMs.
In particular, we test various reasoning training approaches on two knowledge reasoning datasets. Our experimental results indicate that such deductive reasoning training of PLMs (e.g., BERT and RoBERTa) yields strong results on the standard benchmarks, but the models inadequately generalise learned logic rules to unseen cases. That is, they perform inconsistently against simple surface form perturbations (e.g., simple synonym substitution, paraphrasing, or negation insertion), advocating a careful rethinking of the details behind the seemingly flawless empirical performance of deductive reasoning with PLMs. We hope our work will inspire further research on probing and improving the deductive reasoning capabilities of PLMs. Our code and data are available online at https://github.com/cambridgeltl/deductive_reasoning_probing.
2 Related Work
Knowledge Probing, Infusing, and Editing with PLMs. PLMs appear to memorise (world) knowledge facts during pretraining, and such captured knowledge is useful for knowledge-intensive tasks (Petroni et al., 2019, 2021). A body of recent research has aimed to (i) understand how much knowledge PLMs store, i.e., knowledge probing (Petroni et al., 2019; Meng et al., 2022); (ii) inject external knowledge into them, i.e., knowledge infusing (Wang et al., 2021b; Meng et al., 2021); and (iii) edit the stored knowledge, i.e., knowledge editing (De Cao et al., 2021). In particular, De Cao et al. (2021) have shown that it is possible to modify a single knowledge fact without affecting all the other stored knowledge. However, some empirical evidence suggests that existing PLMs generalise poorly to unseen sentences and are easily misled (Kassner and Schütze, 2020).³ Moreover, this body of research focuses only on investigating how to recall or expose the factual and commonsense knowledge that has been encoded in PLMs, rather than exploring their capabilities of deriving previously unknown knowledge via deductive reasoning, as done in this work.

³ For instance, if we add the token talk to the statement "Birds can [MASK]." (i.e. "Talk. Birds can [MASK]."), the PLM might be misled by the added token and predict talk rather than the originally predicted fly token (Kassner and Schütze, 2020).
Knowledge Reasoning with PLMs. In recent years, PLMs have also achieved impressive progress in knowledge reasoning (Helwe et al., 2021). For example, PLMs can infer a conclusion from a set of knowledge statements and rules (Talmor et al., 2020; Clark et al., 2020), with both the knowledge and the rules mentioned explicitly and linguistically in the model input. Some generative PLMs, such as T5 (Raffel et al., 2020), are even able to generate natural language proofs that support implications over logical rules expressed in natural language (Tafjord et al., 2021). In particular, some large PLMs, such as LaMDA (Thoppilan et al., 2022), have been shown to be able to conduct multi-step reasoning under chain-of-thought prompting (Wei et al., 2022) or with a suitable simple prompting template (Kojima et al., 2022). Although the generated 'reasoning' statements potentially benefit some downstream tasks, there is currently no evidence that these statements are produced via deductive reasoning rather than obtained via pure memorisation. Generative reasoning models are also difficult to evaluate, since doing so requires a huge manual assessment effort (Bostrom et al., 2021).

Although some research has demonstrated that PLMs can learn to effectively perform inference which involves taxonomic and world knowledge, chaining, and counting (Talmor et al., 2020), preliminary experiments on a single test set in more recent research have revealed that fine-tuning PLMs for editing knowledge might negatively affect the previously acquired knowledge (De Cao et al., 2021). Our work performs systematic and controlled examinations of the deductive reasoning capabilities of PLMs and reaches beyond (sometimes misleading) task performance.
3 Deductive Reasoning
What is Deductive Reasoning? Psychologists define reasoning as a process of thought that yields a conclusion from percepts, thoughts, or assertions (Johnson-Laird, 1999). Three main schools describe what people may compute to derive this conclusion: relying on factual knowledge (Anderson, 2014; Newell, 1990), formal rules (Braine, 1998; Braine and O'Brien, 1991), mental models (Johnson-Laird, 1983), or some mixture of them (Falmagne and Gonsalves, 1995). Our experimental study focuses on a 'computational' aspect of reasoning, namely whether PLMs trained for reasoning inadequately generalise learned logic rules and perform inconsistently against simple adversarial reasoning examples.
We investigate deductive reasoning in the context of NLP and neural PLMs. In particular, the goal of this deductive reasoning task is to train a PLM (e.g. BERT) over a set of reasoning examples (each with a set of premises and a conclusion) so that it becomes a potential reasoner (e.g. R-BERT, as illustrated in Figure 1). The trained reasoner can then be used to infer deductive conclusions consistently over explicit premises, where the derived conclusions are usually unseen during the PLM pretraining/training. This inference process requires the underlying PLMs to understand factual knowledge and interpret rules intervening in the deduction process. In this paper, we focus only on encoder-based PLMs (e.g. BERT and RoBERTa), as they can be evaluated under more controllable conditions and scrutinised via automatic evaluation. In particular, we investigate two task formulations of deductive reasoning training: 1) classification-based and 2) prompt-based reasoning, as follows.

Figure 2: Different reasoning training approaches: (a) CLS-BERT, (b) MLM-BERT, (c) Cloze-BERT.
3.1 Classification-based Reasoning
The classification-based approach formulates the deductive reasoning task as a sequence classification task. Let $\mathcal{D} = \{\mathcal{D}^{(1)}, \mathcal{D}^{(2)}, \dots, \mathcal{D}^{(n)}\}$ be a reasoning dataset, where $n$ is the number of examples. Each example $\mathcal{D}^{(i)} \in \mathcal{D}$ contains a set of premises $\mathcal{P}^{(i)} = \{p^{(i)}_{1}, p^{(i)}_{2}, \dots, p^{(i)}_{j}\}$, a hypothesis $h^{(i)}$, and a binary label $l^{(i)} \in \{0, 1\}$. A classification-based reasoner takes $\mathcal{P}^{(i)}$ and $h^{(i)}$ as input, then outputs a binary label $l^{(i)}$ indicating the faithfulness of $h^{(i)}$, given that $\mathcal{P}^{(i)}$ is hypothetically factual.
The goal of classification-based reasoning training is to build a statistical model, parameterised by $\theta$, that characterises $P_{\theta}(l^{(i)} \mid h^{(i)}, \mathcal{P}^{(i)})$. PLMs built on the transformer encoder architecture, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), can be used as the backbone of such a classification-based reasoner. Figure 2(a) shows an example of using the BERT model to train a classification-based reasoner (CLS-BERT). In particular, given a training example $\mathcal{D}^{(i)} = \{l^{(i)}, h^{(i)}, \mathcal{P}^{(i)}\}$, the BERT model is trained to predict the hypothesis label by encoding $[h^{(i)}; \mathcal{P}^{(i)}]$ and computing $P_{\theta}(l^{(i)} \mid h^{(i)}, \mathcal{P}^{(i)})$. To do so, the contextualised representation of the '[CLS]' token is projected down to two logits and passed through a softmax layer to form a Bernoulli distribution indicating whether the hypothesis is true or false.
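To make the formulation concrete, the following is a minimal sketch of such a classification-based reasoner built on BERT with the Hugging Face transformers library; the model name, the way the premises are joined into a single segment, and the single optimisation step are illustrative assumptions rather than our exact training configuration.

```python
# Minimal sketch of a classification-based reasoner (CLS-BERT); assumes the
# Hugging Face transformers library. Not the exact training code of this paper.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

premises = ["A bird can fly.", "A raven is a bird."]   # P^(i)
hypothesis = "A raven can fly."                        # h^(i)
label = torch.tensor([1])                              # l^(i): 1 = hypothesis holds

# Encode [h^(i); P^(i)]: hypothesis and concatenated premises as a sentence pair.
inputs = tokenizer(hypothesis, " ".join(premises), return_tensors="pt")

# One training step: the [CLS] representation is projected to two logits and a
# cross-entropy loss on P_theta(l | h, P) is backpropagated.
model.train()
loss = model(**inputs, labels=label).loss
loss.backward()

# Inference: softmax over the two logits gives a Bernoulli distribution.
model.eval()
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(f"P(hypothesis is true) = {probs[0, 1].item():.3f}")
```

In practice, this single step would be wrapped in a standard training loop with an optimiser over the full training set of reasoning examples.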
3.2 Prompt-based Reasoning
Deductive reasoning can also be approached as a cloze-completion task by formulating a valid conclusion as a cloze test. Specifically, given a reasoning example $\mathcal{D}^{(i)}$ with its premises $\mathcal{P}^{(i)}$ and a cloze prompt $c^{(i)}$ (e.g. "A [MASK] can fly"), instead of predicting a binary label, the cloze-completion task is to predict the masked token $a^{(i)}$ (e.g. raven) that answers the cloze question $c^{(i)}$.
BERT-based models have been widely used in prompt-based reasoning tasks (Helwe et al., 2021; Liu et al., 2022), by concatenating the premises and the prompt as input and predicting the masked token based on the bidirectional context. In general, there are two training objectives for the prompt-based reasoning task: the masked language modelling (MLM) objective and the task-specific (cloze-filling) objective. For MLM, the given PLMs are trained over the reasoning examples using their original pretraining MLM objective to impose deductive reasoning ability; see Figure 2(b) for an example of the BERT reasoner MLM-BERT. For the cloze-filling objective, the PLMs are trained with a task-specific cloze-filling objective: as illustrated in Figure 2(c), Cloze-BERT is trained to predict the masked token in the cloze prompt by computing the probability $P_{\theta}(a^{(i)} \mid c^{(i)}, \mathcal{P}^{(i)})$. We note that, unlike the original pretraining MLM objective, where 15% of the input tokens are masked randomly, the cloze-filling objective masks only the answer token $a^{(i)}$ in the cloze prompt $c^{(i)}$.
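As a concrete illustration, the sketch below implements this cloze-filling objective with a masked language model, computing the loss only at the [MASK] position that holds the answer token; the library calls and the single training step are assumptions for illustration, not our actual code.

```python
# Minimal sketch of the cloze-filling objective (Cloze-BERT): only the answer
# token a^(i) in the cloze prompt c^(i) is masked and contributes to the loss.
# Assumes the Hugging Face transformers library; not the paper's actual code.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

premises = "A bird can fly. A raven is a bird."   # P^(i)
cloze = "A [MASK] can fly."                       # c^(i)
answer = "raven"                                  # a^(i)

inputs = tokenizer(f"{premises} {cloze}", return_tensors="pt")

# Label tensor: -100 everywhere (ignored by the loss) except at the [MASK] slot.
labels = torch.full_like(inputs["input_ids"], -100)
mask_positions = inputs["input_ids"] == tokenizer.mask_token_id
labels[mask_positions] = tokenizer.convert_tokens_to_ids(answer)

# One training step on P_theta(a | c, P).
loss = model(**inputs, labels=labels).loss
loss.backward()
```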
This prompt-based reasoning task matches the mask-filling nature of BERT. In this way, we can probe the native reasoning ability of BERT without any further fine-tuning and evaluate the contribution of reasoning training to the PLMs' reasoning ability. Foreshadowing, our experimental results in Section 5 indicate that reasoning training impacts the model both positively and negatively.
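The corresponding zero-shot probe is then simply mask filling with the off-the-shelf PLM, as in the hedged sketch below (using the transformers fill-mask pipeline as an assumed tool): the premises and the cloze prompt are concatenated, and the untuned model's top predictions for the masked slot are inspected.

```python
# Minimal sketch of probing the native (untuned) reasoning ability of BERT via
# cloze completion; assumes the Hugging Face fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

premises = "A bird can fly. A raven is a bird."
cloze = "A [MASK] can fly."

# Concatenate premises and cloze prompt, then rank candidate answer tokens.
for pred in fill_mask(f"{premises} {cloze}", top_k=5):
    print(f"{pred['token_str']:>10s}  {pred['score']:.3f}")
```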
4 Experiments and Results
Recent PLMs have shown surprisingly near-perfect performance in deductive reasoning (Zhou et al., 2020). However, we argue that high performance does not mean that PLMs have mastered reasoning skills. To validate this, we run controlled experiments to examine whether PLM-based reasoners genuinely understand the natural language context, produce conclusions robustly against lexical and syntactic variance in surface forms, and apply learned rules to unseen cases.
4.1 Datasets
Two datasets are used to examine the PLM-based reasoners: the Leap of Thought (LoT) dataset (Talmor et al., 2020) and the WikiData (WD) dataset (Vrandecic and Krötzsch, 2014). LoT was originally proposed for conducting classification-based experiments on deductive reasoning (Talmor et al., 2020) and has been used as a standard (and sole) benchmark to probe the deductive reasoning capabilities of PLMs (Tafjord et al., 2021; Helwe et al., 2021). The dataset is automatically generated by prompting knowledge graphs, including ConceptNet (Speer et al., 2017), WordNet (Fellbaum, 1998), and WikiData (Vrandecic and Krötzsch, 2014). LoT contains 30,906 training instances and 1,289 instances each for the validation and test sets. Each data point in LoT also contains a set of distractors that are similar but irrelevant to deriving the conclusion.
For the prompt-based reasoning task, we reformulate the LoT dataset to fit our cloze-completion task. Instead of having a set of premises $\mathcal{P}$, a hypothesis $h$, and a binary label $l$, we rewrite the hypothesis in LoT into a cloze $c$ and an answer $a$ (e.g. "A raven can fly." → "A [MASK] can fly."). Note that we only generate cloze questions from the positive examples. Consequently, the results across these two tasks are not directly comparable.
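The rewriting itself is mechanical once the answer entity of each positive hypothesis is known; the sketch below shows one way this reformulation could be done. The field names and the assumption that the answer entity is stored alongside each example are hypothetical simplifications, not the actual LoT format.

```python
# Illustrative sketch of rewriting a positive hypothesis into a cloze question
# and its answer. The example structure (keys "hypothesis", "answer", "label")
# is a hypothetical simplification of the actual LoT data format.
def to_cloze(example: dict) -> tuple[str, str] | None:
    """Turn 'A raven can fly.' into ('A [MASK] can fly.', 'raven')."""
    if example["label"] != 1:          # only positive examples are converted
        return None
    answer = example["answer"]
    cloze = example["hypothesis"].replace(answer, "[MASK]", 1)
    return cloze, answer

example = {"hypothesis": "A raven can fly.", "answer": "raven", "label": 1}
print(to_cloze(example))   # ('A [MASK] can fly.', 'raven')
```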
The WD dataset is an auxiliary reasoning dataset which we generated and extracted from Wikidata5m (Wang et al., 2021b). Similar to previous work (Petroni et al., 2019; Talmor et al., 2020), we converted a set of knowledge graph triples into linguistic statements using manually designed prompts. The full description of the dataset construction is provided in Appendix C. The final WD dataset contains 4,124 training instances, 413 validation instances, and 314 test instances. WD contains only positive examples; therefore, we use this dataset only for the cloze-completion task.
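For intuition, converting a knowledge graph triple into a linguistic statement with a manually designed prompt can look like the sketch below; the relation names and templates shown are hypothetical examples for illustration, not the actual templates listed in Appendix C.

```python
# Illustrative sketch of converting Wikidata-style triples into linguistic
# statements via manually designed prompt templates. The templates here are
# hypothetical examples, not the actual ones used for the WD dataset.
TEMPLATES = {
    "instance_of": "{subject} is a {object}.",
    "capital_of": "{subject} is the capital of {object}.",
}

def triple_to_statement(subject: str, relation: str, obj: str) -> str:
    # Look up the relation's template and fill in the entity surface forms.
    return TEMPLATES[relation].format(subject=subject, object=obj)

print(triple_to_statement("A raven", "instance_of", "bird"))
# -> "A raven is a bird."
```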
4.2 Adversarial Probing
Previous work demonstrates that PLMs can achieve near-perfect empirical results in reasoning tasks. For example, RoBERTa-based models record a near-perfect accuracy of 99.7% on the deductive reasoning task over LoT (Talmor et al., 2020). However, another recent study shows that in some natural language inference benchmarks, PLMs are still not robust to negation examples (Hossain et al., 2020), while humans can handle negations with