
supervised LM methods along with various discrete entailment graphs (EGs) (Hosseini et al., 2018, 2021; Chen et al., 2022; Li et al., 2022). We find that the performance of supervised LM-prompting methods is indifferent to directional supervision, and is generally less competitive than suggested on LevyHolt; on the other hand, EGs reach decent precisions with their strongest edges, but are hit by sparsity and noisy unsupervised signals.
Our contributions can be summarized as follows: 1) We show that LevyHolt, the common directional predicate entailment benchmark, is infested with artefacts, allowing supervised methods to perform well by overfitting; 2) We verify that LMs, with supervised fine-tuning, show limited ability to learn directional entailments; 3) We present BoOQA, a robust, extrinsic, multilingual evaluation benchmark for directional predicate entailments, where various baselines are provided and analysed.[1]
2 Background and Setup
Language models have been used under a pretrain-finetune paradigm: the semantics of a token in context are learnt during pre-training and reflected in the dense encodings; when fine-tuning with a task-specific dataset, the model learns which area of its encoding space to look at. Therefore, if a pre-trained LM cannot be fine-tuned to solve a task, we cannot reject the null hypothesis that it does not encode the task. In §3, we look into RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019) in particular, and examine whether they can be fine-tuned to learn directional predicate entailments.
Model
We adapt the supervised SOTA (Schmitt and Schütze, 2021b), a prompt fine-tuning method, for examining LMs.[2] We call it S&S here and below. S&S fits each premise-hypothesis pair into a few natural language prompts, such as “John shopped in IKEA, which means that John went to IKEA”; it then converts the task into sentence classification over instantiated prompts. It is a simple SOTA with few additional parameters, and the architecture allows directional judgements. Thus, it is an ideal “gauge” for the directional ability of LMs.
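To make the setup concrete, below is a minimal sketch of prompt-based entailment scoring in the spirit of S&S; the prompt templates, classification head, and score averaging are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of prompt-based entailment scoring in the spirit of S&S
# (Schmitt and Schütze, 2021b). Templates, head, and averaging are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-base"  # §3 examines RoBERTa and BERT
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

# Hypothetical prompt templates: each turns a premise-hypothesis pair into one
# natural-language sentence to be classified as entailment / non-entailment.
TEMPLATES = [
    "{premise}, which means that {hypothesis}.",
    "If {premise}, then {hypothesis}.",
]

def entailment_score(premise: str, hypothesis: str) -> float:
    """Average the positive-class probability over all instantiated prompts."""
    scores = []
    for template in TEMPLATES:
        text = template.format(premise=premise, hypothesis=hypothesis)
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        scores.append(torch.softmax(logits, dim=-1)[0, 1].item())
    return sum(scores) / len(scores)

# The two directions are scored independently, allowing directional judgements.
print(entailment_score("John shopped in IKEA", "John went to IKEA"))
print(entailment_score("John went to IKEA", "John shopped in IKEA"))
```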
Dataset
So far there are two popular predicate entailment datasets: LevyHolt (Levy and Dagan, 2016; Holt, 2019) and Sherliic (Schmitt and Schütze, 2019). We use LevyHolt in our §3 experiments, as it contains data entries (p ⊨ q?) together with their converses (q ⊨ p?), making the ground truth directionality annotations available. We use the train/dev/test split as in Schmitt and Schütze (2021b).[3] In each data split, we further classify the entries into the following 4 sub-groups; the sizes of each sub-group in each split (train / dev / test) are given in parentheses:
• DirTrue (251 / 64 / 892): directional true entailments where the premise entails the hypothesis, but not vice versa; for instance, Person shopped in Location ⊨ Person went to Location;
• DirFalse (251 / 64 / 892): directional non-entailments where the hypothesis entails the premise, but not vice versa; for instance, Person went to Location ⊭ Person shopped in Location;
• Paraphrases (615 / 155 / 1939): symmetric paraphrases where the premise and the hypothesis entail each other; for instance, Person arrived at Location ≡ Person got to Location;
• Unrelated (3255 / 831 / 9198): unrelated predicate pairs where the premise and the hypothesis have no entailment relations; for instance, Person shopped in Location ⊭ Person fell ill in Location.
We define various subsets with pairs of sub-
groups, which we introduce and discuss in §3.
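As a concrete reference, the following is a minimal sketch of the sub-grouping above, assuming each LevyHolt entry is paired with its converse and both carry boolean gold labels; the function and argument names are hypothetical.

```python
# A minimal sketch of the sub-grouping above: given gold labels for an entry
# (p ⊨ q?) and its converse (q ⊨ p?), assign one of the 4 sub-groups.
def sub_group(forward_entails: bool, converse_entails: bool) -> str:
    if forward_entails and not converse_entails:
        return "DirTrue"      # p ⊨ q but q ⊭ p
    if converse_entails and not forward_entails:
        return "DirFalse"     # q ⊨ p but p ⊭ q
    if forward_entails and converse_entails:
        return "Paraphrases"  # p ≡ q
    return "Unrelated"        # no entailment in either direction

# Examples mirroring the predicate pairs listed above.
assert sub_group(True, False) == "DirTrue"      # shopped in ⊨ went to
assert sub_group(False, True) == "DirFalse"     # went to ⊭ shopped in
assert sub_group(True, True) == "Paraphrases"   # arrived at ≡ got to
assert sub_group(False, False) == "Unrelated"   # shopped in ⊭ fell ill in
```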
Evaluation Metric
In predicate entailment detection, Area-Under-the-Curve with precision > 50% (AUC_50%) has been the metric in use (Hosseini et al., 2018; Schmitt and Schütze, 2021b).
It is a solid metric for comparison on the same
dataset; however, we are comparing between differ-
ent subsets, each with a different random baseline
precision (i.e. the ratio of true entailments). If we
were to set a common precision lower-bound, we
would be biased toward those datasets with higher
random baseline precisions. To make performance
on different datasets comparable, we propose the
metric of Normalized AUC (AUC_norm):

\[
\mathrm{AUC}_{\mathrm{norm}} = \frac{\mathrm{AUC}_{\xi} - \xi}{1 - \xi} \tag{1}
\]

where ξ denotes the random baseline precision of the dataset and AUC_ξ is the AUC with the precision lower-bound set to ξ.
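As an illustration, the following is a minimal sketch of how AUC_norm could be computed from model scores; restricting the precision-recall curve to precisions above ξ is one plausible reading of AUC_ξ and should be treated as an assumption, not the exact implementation.

```python
# A minimal sketch of AUC_norm (Equation 1); the handling of AUC_xi is an assumption.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def auc_norm(y_true, y_score, xi):
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    keep = precision > xi                  # high-precision segment of the curve
    if keep.sum() < 2:
        return 0.0                         # curve never rises above the baseline
    order = np.argsort(recall[keep])       # integrate over increasing recall
    auc_xi = auc(recall[keep][order], precision[keep][order])
    return (auc_xi - xi) / (1.0 - xi)      # 0 for a random classifier, 1 for a perfect one

# xi is the random baseline precision, i.e. the ratio of true entailments.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.7, 0.4, 0.3, 0.6, 0.5]
print(auc_norm(y_true, y_score, xi=sum(y_true) / len(y_true)))
```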
[1] Our code and datasets are published at https://github.com/Teddy-Li/LM-DirctionalInference.
[2] There is a follow-up work (Schmitt and Schütze, 2021a) to this, but we found it to have inferior generalisation performance; see Appendix B for details and a brief introduction.
[3] Except when an entry and its converse appear in different splits (e.g. one in train, the other in dev), in which case we randomly assign both to the same split, so as to avoid information leakage.