Can Pretrained Language Models (Yet) Reason Deductively?
Zhangdie Yuan♦∗, Songbo Hu♠∗, Ivan Vulić♠, Anna Korhonen♠, Zaiqiao Meng♦♥†

♦Department of Computer Science and Technology, University of Cambridge
♠Language Technology Lab, University of Cambridge
♥School of Computing Science, University of Glasgow

{zy317,sh2091,iv250,alk23}@cam.ac.uk
zaiqiao.meng@glasgow.ac.uk

∗Equal contribution. †Corresponding author.
Abstract
Acquiring factual knowledge with Pretrained Language Models (PLMs) has attracted increasing attention, showing promising performance in many knowledge-intensive tasks. Their good performance has led the community to believe that the models do possess a modicum of reasoning competence rather than merely memorising the knowledge. In this paper, we conduct a comprehensive evaluation of the learnable deductive (also known as explicit) reasoning capability of PLMs. Through a series of controlled experiments, we posit two main findings: (i) PLMs inadequately generalise learned logic rules and perform inconsistently against simple adversarial surface form edits; (ii) while deductive reasoning fine-tuning of PLMs does improve their performance on reasoning over unseen knowledge facts, it results in catastrophic forgetting of the previously learnt knowledge. Our main results suggest that PLMs cannot yet perform reliable deductive reasoning, demonstrating the importance of controlled examinations and probing of PLMs' deductive reasoning abilities; we reach beyond (misleading) task performance, revealing that PLMs are still far from robust reasoning capabilities, even for simple deductive tasks.
1 Introduction
Pretrained Language Models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have orchestrated tremendous progress in NLP across a large variety of downstream applications. For knowledge-intensive tasks in particular, these large-scale PLMs are surprisingly good at memorising factual knowledge presented in pretraining corpora (Petroni et al., 2019; Jiang et al., 2020b) and infusing knowledge from external sources (Wang et al., 2021a; Zhou et al., 2022, among others), demonstrating their effectiveness in learning and capturing knowledge.

Figure 1: Training and inference for deductive reasoning. Given the explicit premises (a), the input BERT model is trained to be transformed into a reasoner model (R-BERT) by deductively predicting a previously unseen conclusion (b). This inference process requires R-BERT to understand factual knowledge and interpret rules (e.g. taxonomic relations) intervening directly in the deduction process.
Automatic reasoning, a systematic process of deriving previously unknown conclusions from given formal representations of knowledge (Lenat et al., 1990; Newell and Simon, 1956), has been a long-standing goal of AI research. In the NLP community, a modern view of this problem (Clark et al., 2020), where the formal representations of knowledge are substituted by natural language statements, has recently received increasing attention,¹ yielding multiple exploratory research directions: mathematical reasoning (Rabe et al., 2021), symbolic reasoning (Yang and Deng, 2021), and commonsense reasoning (Li et al., 2019). Impressive signs of progress have been reported in teaching PLMs to gain reasoning ability rather than just memorising knowledge facts (Kassner et al., 2020; Talmor et al., 2020), suggesting that PLMs could serve as effective reasoners for identifying analogies and inferring facts not explicitly/directly seen in the data (Kassner et al., 2020; Ushio et al., 2021).

¹ Following Clark et al. (2020), we also define natural language rules as linguistic expressions of conjunctive implications, $condition~[\wedge~condition]^{*} \rightarrow conclusion$, with the semantics of logic programs with negations (Apt et al., 1988).
In particular, deductive reasoning² is one of the most promising directions (Sanyal et al., 2022; Talmor et al., 2020; Li et al., 2019). By definition, deduction yields valid conclusions, which must be true given that their premises are true (Johnson-Laird, 1999). In the NLP community, given all the premises as natural language statements, some large-scale PLMs have been shown to be able to deductively draw appropriate conclusions under proper training schemes (Clark et al., 2020; Talmor et al., 2020). Figure 1 shows an example of the training and inference processes of deductive reasoning.

² This type of reasoning is also often referred to as explicit reasoning in the literature (Broome, 2013; Aditya et al., 2018).
Despite promising applications of PLMs, some recent studies have pointed out that they can only perform a shallow level of reasoning on textual data (Helwe et al., 2021). Indeed, PLMs can be easily affected by mispriming (Misra et al., 2020) and still hardly differentiate between positive and negative statements (i.e., the so-called negation issue) (Ettinger, 2020). However, given both the evidence that PLMs can learn factual knowledge beyond mere rote memorisation (Heinzerling and Inui, 2021) and their known limitations (Helwe et al., 2021), it is natural to ask: "Can the current PLMs potentially serve as reliable deductive reasoners over factual knowledge?" To answer this question, as the main contribution of this work, we conduct a comprehensive experimental study testing the learnable deductive reasoning capability of PLMs.
In particular, we test various reasoning training approaches on two knowledge reasoning datasets. Our experimental results indicate that such deductive reasoning training of PLMs (e.g., BERT and RoBERTa) yields strong results on the standard benchmarks, but the models inadequately generalise learned logic rules to unseen cases. That is, they perform inconsistently against simple surface form perturbations (e.g., simple synonym substitution, paraphrasing, or negation insertion), advocating a careful rethinking of the details behind the seemingly flawless empirical performance of deductive reasoning with PLMs. We hope our work will inspire further research on probing and improving the deductive reasoning capabilities of PLMs. Our code and data are available online at https://github.com/cambridgeltl/deductive_reasoning_probing.
2 Related Work
Knowledge Probing, Infusing, and Editing with PLMs. PLMs appear to memorise (world) knowledge facts during pretraining, and such captured knowledge is useful for knowledge-intensive tasks (Petroni et al., 2019, 2021). A body of recent research has aimed to (i) understand how much knowledge PLMs store, i.e., knowledge probing (Petroni et al., 2019; Meng et al., 2022); (ii) inject external knowledge into them, i.e., knowledge infusing (Wang et al., 2021b; Meng et al., 2021); and (iii) edit the stored knowledge, i.e., knowledge editing (De Cao et al., 2021). In particular, De Cao et al. (2021) have shown that it is possible to modify a single knowledge fact without affecting all the other stored knowledge. However, some empirical evidence suggests that existing PLMs generalise poorly to unseen sentences and are easily misled (Kassner and Schütze, 2020).³ Moreover, this body of research focuses only on investigating how to recall or expose the factual and commonsense knowledge that has been encoded in PLMs, rather than exploring their capabilities of deriving previously unknown knowledge via deductive reasoning, as done in this work.

³ For instance, if we add the token talk to the statement "Birds can [MASK]." (i.e. "Talk. Birds can [MASK]."), the PLM might be misled by the added token and predict talk rather than the originally predicted fly token (Kassner and Schütze, 2020).
Knowledge Reasoning with PLMs. In recent years, PLMs have also achieved impressive progress in knowledge reasoning (Helwe et al., 2021). For example, PLMs can infer a conclusion from a set of knowledge statements and rules (Talmor et al., 2020; Clark et al., 2020), with both the knowledge and the rules mentioned explicitly and linguistically in the model input. Some generative PLMs, such as T5 (Raffel et al., 2020), are even able to generate natural language proofs that support implications over logical rules expressed in natural language (Tafjord et al., 2021). In particular, some large PLMs, such as LaMDA (Thoppilan et al., 2022), have been shown to be able to conduct multi-step reasoning under chain-of-thought prompting (Wei et al., 2022) or with a suitable simple prompting template (Kojima et al., 2022). Although the generated 'reasoning' statements potentially benefit some downstream tasks, there is currently no evidence that these statements are produced via deductive reasoning rather than obtained via pure memorisation. Generative reasoning models are also difficult to evaluate, since doing so requires a huge manual assessment effort (Bostrom et al., 2021).

Although some research has demonstrated that PLMs can learn to effectively perform inference which involves taxonomic and world knowledge, chaining, and counting (Talmor et al., 2020), preliminary experiments on a single test set in more recent research have revealed that fine-tuning PLMs for editing knowledge might negatively affect the previously acquired knowledge (De Cao et al., 2021). Our work performs systematic and controlled examinations of the deductive reasoning capabilities of PLMs and reaches beyond (sometimes misleading) task performance.
3 Deductive Reasoning
What is Deductive Reasoning? Psychologists define reasoning as a process of thought that yields a conclusion from percepts, thoughts, or assertions (Johnson-Laird, 1999). Three main schools describe what people may compute to derive this conclusion: relying on factual knowledge (Anderson, 2014; Newell, 1990), formal rules (Braine, 1998; Braine and O'Brien, 1991), mental models (Johnson-Laird, 1983), or some mixture of them (Falmagne and Gonsalves, 1995). Our experimental study focuses on a 'computational' aspect of reasoning, namely whether PLMs trained for reasoning inadequately generalise learned logic rules and perform inconsistently against simple adversarial reasoning examples.
We investigate deductive reasoning in the context of NLP and neural PLMs. In particular, the goal of this deductive reasoning task is to train a PLM (e.g. BERT) over a set of reasoning examples (each with a set of premises and a conclusion) so that it becomes a potential reasoner (e.g. R-BERT, as illustrated in Figure 1). The trained reasoner can then be used to infer deductive conclusions consistently over explicit premises, where the derived conclusions are usually unseen during the PLM pretraining/training. This inference process requires the underlying PLMs to understand factual knowledge and interpret rules intervening in the deduction process. In this paper, we focus only on encoder-based PLMs (e.g. BERT and RoBERTa), as they can be evaluated under more controllable conditions and scrutinised via automatic evaluation. In particular, we investigate two task formulations of deductive reasoning training: 1) classification-based and 2) prompt-based reasoning, as follows.

Figure 2: Different reasoning training approaches: (a) CLS-BERT, (b) MLM-BERT, (c) Cloze-BERT.
3.1 Classification-based Reasoning
The classification-based approach formulates the deductive reasoning task as a sequence classification task. Let $\mathcal{D} = \{\mathcal{D}^{(1)}, \mathcal{D}^{(2)}, \dots, \mathcal{D}^{(n)}\}$ be a reasoning dataset, where $n$ is the number of examples. Each example $\mathcal{D}^{(i)} \in \mathcal{D}$ contains a set of premises $\mathcal{P}^{(i)} = \{p^{(i)}_{1}, p^{(i)}_{2}, \dots, p^{(i)}_{j}\}$, a hypothesis $h^{(i)}$, and a binary label $l^{(i)} \in \{0, 1\}$. A classification-based reasoner takes $\mathcal{P}^{(i)}$ and $h^{(i)}$ as input, then outputs a binary label $l^{(i)}$ indicating the faithfulness of $h^{(i)}$, given that $\mathcal{P}^{(i)}$ is hypothetically factual.
The goal of classification-based reasoning training is to build a statistical model, parameterised by $\theta$, that characterises $P_{\theta}(l^{(i)} \mid h^{(i)}, \mathcal{P}^{(i)})$. PLMs built on the transformer encoder architecture, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), can be used as the backbone of such a classification-based reasoner. Figure 2(a) shows an example of using the BERT model to train a classification-based reasoner (CLS-BERT). In particular, given a training example $\mathcal{D}^{(i)} = \{l^{(i)}, h^{(i)}, \mathcal{P}^{(i)}\}$, the BERT model is trained to predict the hypothesis label by encoding $[h^{(i)}; \mathcal{P}^{(i)}]$ and computing $P_{\theta}(l^{(i)} \mid h^{(i)}, \mathcal{P}^{(i)})$. To do so, the contextualised representation of the '[CLS]' token is projected down to two logits and passed through a softmax layer to form a Bernoulli distribution indicating whether the hypothesis is true or false.
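To make the formulation concrete, the following is a minimal sketch of such a classification-based reasoner built on BERT with the Hugging Face transformers library; the model name, the way the premises are joined into a single segment, and the single optimisation step are illustrative assumptions rather than our exact training configuration.

```python
# Minimal sketch of a classification-based reasoner (CLS-BERT); assumes the
# Hugging Face transformers library. Not the exact training code of this paper.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

premises = ["A bird can fly.", "A raven is a bird."]   # P^(i)
hypothesis = "A raven can fly."                        # h^(i)
label = torch.tensor([1])                              # l^(i): 1 = hypothesis holds

# Encode [h^(i); P^(i)]: hypothesis and concatenated premises as a sentence pair.
inputs = tokenizer(hypothesis, " ".join(premises), return_tensors="pt")

# One training step: the [CLS] representation is projected to two logits and a
# cross-entropy loss on P_theta(l | h, P) is backpropagated.
model.train()
loss = model(**inputs, labels=label).loss
loss.backward()

# Inference: softmax over the two logits gives a Bernoulli distribution.
model.eval()
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(f"P(hypothesis is true) = {probs[0, 1].item():.3f}")
```

In practice, this single step would be wrapped in a standard training loop with an optimiser over the full training set of reasoning examples.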
3.2 Prompt-based Reasoning
Deductive reasoning can also be approached as a cloze-completion task by formulating a valid conclusion as a cloze test. Specifically, given a reasoning example $\mathcal{D}^{(i)}$ with its premises $\mathcal{P}^{(i)}$ and a cloze prompt $c^{(i)}$ (e.g. "A [MASK] can fly"), instead of predicting a binary label, the cloze-completion task is to predict the masked token $a^{(i)}$ (e.g. raven) that answers the cloze question $c^{(i)}$.
BERT-based models have been widely used in prompt-based reasoning tasks (Helwe et al., 2021; Liu et al., 2022), by concatenating the premises and the prompt as input and predicting the masked token based on the bidirectional context. In general, there are two training objectives for the prompt-based reasoning task: the masked language modelling (MLM) objective and the task-specific (cloze-filling) objective. For MLM, the given PLMs are trained over the reasoning examples using their original pretraining MLM objective to impose deductive reasoning ability; see Figure 2(b) for an example of the BERT reasoner MLM-BERT. For the cloze-filling objective, the PLMs are trained with a task-specific cloze-filling objective: as illustrated in Figure 2(c), Cloze-BERT is trained to predict the masked token in the cloze prompt by computing the probability $P_{\theta}(a^{(i)} \mid c^{(i)}, \mathcal{P}^{(i)})$. We note that, unlike the original pretraining MLM objective, where 15% of the input tokens are masked randomly, the cloze-filling objective masks only the answer token $a^{(i)}$ in the cloze prompt $c^{(i)}$.
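As a concrete illustration, the sketch below implements this cloze-filling objective with a masked language model, computing the loss only at the [MASK] position that holds the answer token; the library calls and the single training step are assumptions for illustration, not our actual code.

```python
# Minimal sketch of the cloze-filling objective (Cloze-BERT): only the answer
# token a^(i) in the cloze prompt c^(i) is masked and contributes to the loss.
# Assumes the Hugging Face transformers library; not the paper's actual code.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

premises = "A bird can fly. A raven is a bird."   # P^(i)
cloze = "A [MASK] can fly."                       # c^(i)
answer = "raven"                                  # a^(i)

inputs = tokenizer(f"{premises} {cloze}", return_tensors="pt")

# Label tensor: -100 everywhere (ignored by the loss) except at the [MASK] slot.
labels = torch.full_like(inputs["input_ids"], -100)
mask_positions = inputs["input_ids"] == tokenizer.mask_token_id
labels[mask_positions] = tokenizer.convert_tokens_to_ids(answer)

# One training step on P_theta(a | c, P).
loss = model(**inputs, labels=labels).loss
loss.backward()
```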
This prompt-based reasoning task matches the mask-filling nature of BERT. In this way, we can probe the native reasoning ability of BERT without any further fine-tuning and evaluate the contribution of reasoning training to the PLMs' reasoning ability. Foreshadowing, our experimental results in Section 5 indicate that reasoning training impacts the model both positively and negatively.
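The corresponding zero-shot probe is then simply mask filling with the off-the-shelf PLM, as in the hedged sketch below (using the transformers fill-mask pipeline as an assumed tool): the premises and the cloze prompt are concatenated, and the untuned model's top predictions for the masked slot are inspected.

```python
# Minimal sketch of probing the native (untuned) reasoning ability of BERT via
# cloze completion; assumes the Hugging Face fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

premises = "A bird can fly. A raven is a bird."
cloze = "A [MASK] can fly."

# Concatenate premises and cloze prompt, then rank candidate answer tokens.
for pred in fill_mask(f"{premises} {cloze}", top_k=5):
    print(f"{pred['token_str']:>10s}  {pred['score']:.3f}")
```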
4 Experiments and Results
Recent PLMs have shown surprisingly near-perfect performance in deductive reasoning (Zhou et al., 2020). However, we argue that high performance does not mean that PLMs have mastered reasoning skills. To validate this, we run controlled experiments to examine whether PLM-based reasoners genuinely understand the natural language context, produce conclusions robustly against lexical and syntactic variance in surface forms, and apply learned rules to unseen cases.
4.1 Datasets
Two datasets are used to examine the PLM-based reasoners: the Leap of Thought (LoT) dataset (Talmor et al., 2020) and the WikiData (WD) dataset (Vrandecic and Krötzsch, 2014). LoT was originally proposed for conducting classification-based experiments on deductive reasoning (Talmor et al., 2020) and has been used as a standard (and sole) benchmark to probe the deductive reasoning capabilities of PLMs (Tafjord et al., 2021; Helwe et al., 2021). The dataset is automatically generated by prompting knowledge graphs, including ConceptNet (Speer et al., 2017), WordNet (Fellbaum, 1998), and WikiData (Vrandecic and Krötzsch, 2014). LoT contains 30,906 training instances and 1,289 instances each for the validation and test sets. Each data point in LoT also contains a set of distractors that are similar but irrelevant to deriving the conclusion.
For the prompt-based reasoning task, we reformulate the LoT dataset to fit our cloze-completion task. Instead of having a set of premises $\mathcal{P}$, a hypothesis $h$, and a binary label $l$, we rewrite the hypothesis in LoT into a cloze $c$ and an answer $a$ (e.g. "A raven can fly." → "A [MASK] can fly."). Note that we only generate cloze questions from the positive examples. Consequently, the results across these two tasks are not directly comparable.
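The rewriting itself is mechanical once the answer entity of each positive hypothesis is known; the sketch below shows one way this reformulation could be done. The field names and the assumption that the answer entity is stored alongside each example are hypothetical simplifications, not the actual LoT format.

```python
# Illustrative sketch of rewriting a positive hypothesis into a cloze question
# and its answer. The example structure (keys "hypothesis", "answer", "label")
# is a hypothetical simplification of the actual LoT data format.
def to_cloze(example: dict) -> tuple[str, str] | None:
    """Turn 'A raven can fly.' into ('A [MASK] can fly.', 'raven')."""
    if example["label"] != 1:          # only positive examples are converted
        return None
    answer = example["answer"]
    cloze = example["hypothesis"].replace(answer, "[MASK]", 1)
    return cloze, answer

example = {"hypothesis": "A raven can fly.", "answer": "raven", "label": 1}
print(to_cloze(example))   # ('A [MASK] can fly.', 'raven')
```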
The WD dataset is an auxiliary reasoning dataset which we generated and extracted from Wikidata5m (Wang et al., 2021b). Similar to previous work (Petroni et al., 2019; Talmor et al., 2020), we converted a set of knowledge graph triples into linguistic statements using manually designed prompts. The full description of the dataset construction is provided in Appendix C. The final WD dataset contains 4,124 training instances, 413 validation instances, and 314 test instances. WD contains only positive examples; therefore, we use this dataset only for the cloze-completion task.
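For intuition, converting a knowledge graph triple into a linguistic statement with a manually designed prompt can look like the sketch below; the relation names and templates shown are hypothetical examples for illustration, not the actual templates listed in Appendix C.

```python
# Illustrative sketch of converting Wikidata-style triples into linguistic
# statements via manually designed prompt templates. The templates here are
# hypothetical examples, not the actual ones used for the WD dataset.
TEMPLATES = {
    "instance_of": "{subject} is a {object}.",
    "capital_of": "{subject} is the capital of {object}.",
}

def triple_to_statement(subject: str, relation: str, obj: str) -> str:
    # Look up the relation's template and fill in the entity surface forms.
    return TEMPLATES[relation].format(subject=subject, object=obj)

print(triple_to_statement("A raven", "instance_of", "bird"))
# -> "A raven is a bird."
```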
4.2 Adversarial Probing
Previous work demonstrates that PLMs can achieve near-perfect empirical results in reasoning tasks. For example, RoBERTa-based models record a near-perfect accuracy of 99.7% on the deductive reasoning task over LoT (Talmor et al., 2020). However, another recent study shows that in some natural language inference benchmarks, PLMs are still not robust to negation examples (Hossain et al., 2020), while humans can handle negations with