Language Models Are Poor Learners of Directional Inference
Tianyi Li, Mohammad Javad Hosseini, Sabine Weber, Mark Steedman
School of Informatics, University of Edinburgh
{tianyi.li, javad.hosseini}@ed.ac.uk
s.weber@sms.ed.ac.uk, steedman@inf.ed.ac.uk
Abstract
We examine LMs’ competence of direc-
tional predicate entailments by supervised fine-
tuning with prompts. Our analysis shows
that contrary to their apparent success on stan-
dard NLI, LMs show limited ability to learn
such directional inference; moreover, exist-
ing datasets fail to test directionality, and/or
are infested by artefacts that can be learnt as
proxy for entailments, yielding over-optimistic
results. In response, we present BoOQA
(Boolean Open QA), a robust multi-lingual
evaluation benchmark for directional predicate
entailments, extrinsic to existing training sets.
On BoOQA, we establish baselines and show
evidence of existing LM-prompting models be-
ing incompetent directional entailment learn-
ers, in contrast to entailment graphs, however
limited by sparsity.
1 Introduction
Pre-trained language models have shown impres-
sive performance in natural language understand-
ing (NLU), where prompting methods are widely
used for fine-tuning (Raffel et al., 2020; Brown et al., 2020; Schick and Schütze, 2021).
In this paper, we specifically investigate predi-
cate entailment detection, an important sub-task of
NLU and specifically, NLI. The task is to predict,
given that predicate p holds between arguments <a, b>, whether it can be
inferred that predicate q also holds between <a, b>. For instance, “John
shopped in IKEA” entails “John went to IKEA”, but not “John drove to IKEA”.
The primary distinction between predicate entailments and semantic
similarity, apart from being focused on predicates, is that the former
involves directional entailments as well as symmetric ones. Directional
entailments are those p ⊨ q (p entails q) where q ⊭ p; conversely,
symmetric entailments are those p ⊨ q where q ⊨ p as well (namely p ≡ q).
Now at Google Research.
Directional entailments are important for ques-
tion answering, since they help filter out the spuri-
ous connections from knowledge sources to ques-
tions: knowing that John went to IKEA, it is unsafe
to infer that he shopped in IKEA, as he may have
been there for other reasons. By symmetric sim-
ilarity (i.e. paraphrase), the two events would be
considered related, so a spurious inference chain
would emerge; by directional entailments, it would
be concluded that while the two events are related,
the entailment holds only in the reverse direction,
so the spurious connection would be avoided.
Current LM-prompting methods have reported
positive results on predicate entailment detection
(Schmitt and Schütze,2021b,a). Since the masked-
language-modelling objective naturally enables
LMs to separate related and unrelated tokens, they
are expected to be good paraphrase detectors; on
the other hand, it is less clear whether they also
distinguish the directionality of entailments.
To answer this question, we adapt the SOTA LM-
prompting model (Schmitt and Schütze,2021b) as
a gauge for the competence of its LMs, in particu-
lar RoBERTa (Liu et al.,2019) and BERT (Devlin
et al.,2019). We apply it to various subsets of the
common benchmark LevyHolt (Levy and Dagan,
2016;Holt,2019). We find that while it scores
highly on the directional subset by Holt (2019), it
otherwise shows poor ability in learning the direc-
tionality of predicate entailments. We find instead
that the LevyHolt directional subset is infested with
artefacts, to which LMs are overfitting.
These observations show that we need a robust
evaluation benchmark for directional predicate en-
tailments, independent of training sets. Inspired by
McKenna et al. (2021) and Chen et al. (2017), we
present BoOQA, a Boolean Open QA dataset in En-
glish and Chinese, which is closer to applications,
adversarial to artefacts in supervision, and demands
sensitivity to the directionality of entailments.
On BoOQA, we re-examine supervised and un-
supervised LM methods along with various dis-
crete entailment graphs (EG) (Hosseini et al.,2018,
2021;Chen et al.,2022;Li et al.,2022). We find
that the performances of supervised LM-prompting
methods are indifferent to directional supervision,
and are generally less competitive than suggested
on LevyHolt; on the other hand, EGs reach decent
precisions with their strongest edges, but are hit by
sparsity and noisy unsupervised signals.
Our contributions can be summarized as follows:
1) We show that LevyHolt, the common directional
predicate entailments benchmark, is infested by
artefacts, allowing supervised methods to perform
well by overfitting; 2) We verify that LMs, with su-
pervised fine-tuning, show limited ability to learn
directional entailments; 3) We present BoOQA,
a robust, extrinsic, multilingual evaluation bench-
mark for directional predicate entailments, where
various baselines are provided and analysed.1
2 Background and Setup
Language models have been used under a pretrain-
finetune paradigm: the semantics of a token in
context are learnt during pre-training and reflected
in the dense encodings; when fine-tuning with a
task-specific dataset, the model learns which area
of its encoding space to look at. Therefore, if a pre-
trained LM cannot be fine-tuned to solve a task, we cannot reject the null
hypothesis that it does not encode the task. In §3, we look into RoBERTa
(Liu et al.,2019) and BERT (Devlin et al.,2019) in
particular, and examine whether they can be fine-
tuned to learn directional predicate entailments.
Model
We adapt the supervised SOTA (Schmitt and Schütze, 2021b), a prompt
fine-tuning method, for examining LMs.2 We call it S&S here and be-
low. S&S fits each premise-hypothesis pair into
a few natural language prompts, such as “John
shopped in IKEA, which means that John went to
IKEA”; they then convert the task into sentence
classification over instantiated prompts. It is a sim-
ple SOTA with few additional parameters, and the
architecture allows directional judgements. Thus,
it is an ideal “gauge” for the directional ability of LMs.
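To make the setup concrete, the following is a minimal sketch of prompt-based
entailment scoring in the spirit of S&S: the premise-hypothesis pair is slotted
into a natural-language pattern and the instantiated sentence is classified.
The pattern, label convention and classification head here are illustrative
assumptions, not the authors' exact implementation.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Illustrative prompt pattern; S&S use several hand-crafted patterns.
    PROMPT = "{premise}, which means that {hypothesis}."

    tok = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=2)  # fine-tuned on LevyHolt in practice

    def entailment_score(premise: str, hypothesis: str) -> float:
        """Probability that the instantiated prompt is a correct statement."""
        text = PROMPT.format(premise=premise, hypothesis=hypothesis)
        enc = tok(text, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits
        return torch.softmax(logits, dim=-1)[0, 1].item()

    # The asymmetry of the prompt is what makes directional judgements
    # possible in principle:
    entailment_score("John shopped in IKEA", "John went to IKEA")  # should be high
    entailment_score("John went to IKEA", "John shopped in IKEA")  # should be low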
Dataset
So far there are two popular predicate entailment datasets: LevyHolt
(Levy and Dagan, 2016; Holt, 2019) and Sherliic (Schmitt and Schütze,
2019). We use LevyHolt in our §3 experiments, as it contains data
entries (p ⊨ q?) with their converses (q ⊨ p?), making the ground
truth directionality annotations available. We use the train/dev/test
split as in Schmitt and Schütze (2021b).3

1 Our code and datasets are published at
https://github.com/Teddy-Li/LM-DirctionalInference.
2 There is a follow-up work (Schmitt and Schütze, 2021a) to this, but we
found it to have inferior generalisation performance; see Appendix B for
details and a brief introduction.
In each data split, we further classify the entries into the following
4 sub-groups, where the sizes of each sub-group in each split are given
in parentheses:
• DirTrue (251 / 64 / 892): directional true entailments where the premise
  entails the hypothesis, but not vice versa; for instance, Person shopped
  in Location ⊨ Person went to Location;

• DirFalse (251 / 64 / 892): directional non-entailments where the
  hypothesis entails the premise, but not vice versa; for instance, Person
  went to Location ⊭ Person shopped in Location;

• Paraphrases (615 / 155 / 1939): symmetric paraphrases where the premise
  and the hypothesis entail each other; for instance, Person arrived at
  Location ≡ Person got to Location;

• Unrelated (3255 / 831 / 9198): unrelated predicate pairs where the premise
  and the hypothesis have no entailment relations; for instance, Person
  shopped in Location ⊭ Person fell ill in Location.
We define various subsets with pairs of sub-
groups, which we introduce and discuss in §3.
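Because every entry appears in LevyHolt together with its converse, the
sub-group of a premise-hypothesis pair follows directly from the two gold
labels. Below is a minimal sketch of this partition, assuming the converse
label has already been looked up:

    def subgroup(label_pq: bool, label_qp: bool) -> str:
        """Map the gold labels of an entry (p -> q?) and its converse (q -> p?)
        to one of the four sub-groups."""
        if label_pq and not label_qp:
            return "DirTrue"      # p entails q, but not vice versa
        if label_qp and not label_pq:
            return "DirFalse"     # q entails p, but not vice versa
        if label_pq and label_qp:
            return "Paraphrases"  # mutual entailment
        return "Unrelated"        # no entailment in either direction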
Evaluation Metric
In predicate entailment detection, Area-Under-the-Curve with precision > 50%
(AUC50%) has been the metric in use (Hosseini et al., 2018; Schmitt and
Schütze, 2021b).
It is a solid metric for comparison on the same
dataset; however, we are comparing between differ-
ent subsets, each with a different random baseline
precision (i.e. the ratio of true entailments). If we
were to set a common precision lower-bound, we
would be biased toward those datasets with higher
random baseline precisions. To make performance
on different datasets comparable, we propose the
metric of Normalized AUC (AUCnorm ):
AUCnorm = (AUCξ − ξ) / (1 − ξ)    (1)

where ξ is the random baseline precision. Intuitively, AUCnorm measures the
ratio of the area above random (1 − ξ) that falls below the precision-recall
curve (AUCξ − ξ); see Appendix A for a graphic illustration. AUCnorm allows
us to take into account all gains against the random baseline, and level
performance on all datasets to the same scale.

3 Except when an entry and its converse appear in different splits (e.g. one
in train, the other in dev), where we randomly assign both into the same
split, so as to avoid information leakage.
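A minimal sketch of computing AUCnorm with scikit-learn, under one plausible
reading of Eq. (1) in which the precision-recall curve is clipped at the
baseline ξ before integration; the figure in Appendix A is the authoritative
definition.

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def auc_norm(y_true, y_score):
        """AUCnorm of Eq. (1): (AUC_xi - xi) / (1 - xi)."""
        y_true = np.asarray(y_true)
        xi = y_true.mean()                     # random-baseline precision
        prec, rec, _ = precision_recall_curve(y_true, y_score)
        # Count only the part of the curve above the random baseline,
        # analogous to AUC with precision > 50% but with threshold xi.
        prec = np.maximum(prec, xi)
        order = np.argsort(rec)
        auc_xi = np.trapz(prec[order], rec[order])  # area over recall in [0, 1]
        return (auc_xi - xi) / (1.0 - xi)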
3 Prompt Fine-tuning LM with LevyHolt
In this section, we test for LMs’ ability to learn
directional entailments with the S&S prompt-based
gauge model. We use RoBERTa-base as the pri-
mary subject, as it is used by the SOTA of Schmitt and
Schütze (2021b), and is sufficiently lightweight for
experiments to run efficiently. In Appendix C, we
also report results on RoBERTa-large and BERT
models for key experiments, where results are con-
sistent. We use S&S_subset to denote the S&S model fine-tuned on each subset.
Experiments are graphically summarized in Fig-
ure 1. Each edge denotes a LevyHolt subset made
of the two sub-groups; the number on each edge
is the
AUCnorm
that S&S achieves on each sub-
set. For separating an Entailed sub-group from a
Non-entailed one, the original labels are used; for
separating two Entailed or two Non-entailed sub-
groups, the one with more similar predicates (more
paraphrastic) is assigned label “1”, the other “0”.4
Note that we fit S&S to a number of different
subsets, so we cannot simply re-use the original
hyper-parameters. Instead, to provide a level play-
ing field, we follow Schmitt and Schütze (2021b) in
log-uniformly sampling 100 hyper-parameter sets
from their specified ranges; for each subset, we
choose the one with best dev set result.
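A sketch of the log-uniform hyper-parameter search; the parameter names and
ranges below are placeholders for illustration, the actual ranges are the
ones specified by Schmitt and Schütze (2021b).

    import numpy as np

    rng = np.random.default_rng(0)

    def log_uniform(low, high, n):
        """Sample n values log-uniformly from [low, high]."""
        return np.exp(rng.uniform(np.log(low), np.log(high), n))

    # Placeholder ranges; the experiments use the ranges of Schmitt and
    # Schütze (2021b). Each subset gets its own 100-configuration search,
    # and the configuration with the best dev-set score is kept.
    configs = [
        {"learning_rate": lr, "batch_size": int(round(bs)), "weight_decay": wd}
        for lr, bs, wd in zip(
            log_uniform(1e-6, 1e-4, 100),
            log_uniform(8, 64, 100),
            log_uniform(1e-4, 1e-1, 100),
        )
    ]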
If a method is insensitive to directional entail-
ments, then it would treat entailments as similarity
between unordered pairs of predicates; it would
model Paraphrases, DirTrue and DirFalse simi-
larly, where DirTrue and DirFalse entries are con-
ceptually somewhat “semi-paraphrastic”.
If a method is sensitive to directional entailments,
it should be able to discriminate between each pair
of the four sub-groups. Particularly, it should ad-
ditionally be able to separate sub-groups in the up-
per triangle of the mesh, coloured red in Figure 1.

Figure 1: S&S models on mesh of pairs of sub-groups, results in AUCnorm.

Figure 2: Hypothesis-only artefact baselines on mesh of pairs of sub-groups,
results in AUCnorm.

4 We acknowledge that Schmitt and Schütze (2021b) use hand-crafted prompts
tuned for entailment detection, so they may be sub-optimal for separating
same-label sub-groups; we argue that fixed-prompt LM tuning models are not
too sensitive to their specific prompts (Logan IV et al., 2021; Webson and
Pavlick, 2022); nonetheless, we also report results from a continuous-prompt
model (Schmitt and Schütze, 2021a) in Appendix B, where results are very
similar.
Among these three tests of directionality, DirTrue-
DirFalse is the most challenging: a method with
no sensitivity to directionality should be at chance
and get 0%
AUCnorm
; this is traditionally called
the directional subset (Holt,2019). For the other
two subsets, a symmetric measure would do above
random by identifying entries in DirTrue /DirFalse
as statistically less similar than Paraphrases.
Below we discuss findings around the mesh
and the triangle. The S&S model yields 77.7%
AUCnorm
when trained and tested on full Levy-
Holt, which we provide for readers’ reference.
3.1 The S&S Triangle
The red triangle in Figure 1 presents mixed mes-
sages about the directionality of RoBERTa LM:
on the most challenging DirTrue-DirFalse sub-
set, it achieves an apparently excellent
AUCnorm
of 82.9%; however, on the other subsets, it gets
mediocre results at 26.9% and 45.8% respectively.
For the directional subset (DirTrue-DirFalse),
the 82.9%
AUCnorm
not only surpasses the 77.7%
for Full LevyHolt, but is also on par with the 79.1%
for its mirroring Symmetric subset (Paraphrases-
Unrelated), which should be easier by human
judgement. Paradoxically for such a challenging subset, the Directional
subset enjoys the best performance in the mesh.

Train/Dev \ Test    Directional    Symmetric    Full
Directional                82.9          9.9    24.7
Symmetric                   0.2         79.1    61.3
Full                       46.4         84.8    77.7

Table 1: Generalization performance of the RoBERTa-base S&S classifier on the
Directional and Symmetric subsets of LevyHolt. Values are in % of AUCnorm.

AUCnorm (%)        S&S     S&S (symmetric prompts)
Para-DirTrue      26.9     19.8 (-7.1)
Para-DirFalse     45.8     35.6 (-10.2)

Table 2: Comparison between S&S with regular prompts and S&S with symmetric
prompts, on the Paraphrases-DirTrue and Paraphrases-DirFalse subsets.
To understand this result, we did a generalisation
experiment between Directional and Symmetric,
the two disjoint, complementary subsets of Levy-
Holt. As results in Table 1 show, classifiers from
the two subsets do not generalise to each other,
and neither does S&S_Directional generalise to full
LevyHolt. That is to say, either the two kinds of
“entailments”, Directional and Symmetric, are dif-
ferent tasks from the LM’s perspective, or the S&S
classifier is overfitting to the directional subset.
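The generalisation experiment of Table 1 is a simple train/test grid over the
two complementary subsets and the full dataset. A sketch of the loop follows,
where train_fn, score_fn and metric_fn (e.g. the auc_norm above) are supplied
by the experiment code and are not part of the original release:

    def cross_subset_grid(splits, train_fn, score_fn, metric_fn):
        """Fine-tune on each subset and evaluate on every subset.
        splits: {name: {"train": ..., "dev": ..., "test": (examples, labels)}}"""
        grid = {}
        for train_name, d in splits.items():
            model = train_fn(d["train"], d["dev"])   # e.g. fine-tune S&S
            for test_name, t in splits.items():
                examples, labels = t["test"]
                scores = score_fn(model, examples)
                grid[(train_name, test_name)] = metric_fn(labels, scores)
        return grid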
For the Paraphrases-DirX subsets, results are far
less impressive. For reference, we train two strictly-
symmetric S&S models, one on Paraphrases-
DirTrue, the other on Paraphrases-DirFalse. For
these strictly-symmetric models we enforce all
prompts to be in pairs of reverses (e.g. for the
example in §2, we would add “John went to IKEA,
which means that John shopped in IKEA”). That
way we guarantee from the input that no directional
judgements can be made. We call these the symmetric-prompt S&S models (the
right column of Table 2). From the results in Table 2, we find that for both
Paraphrases-DirTrue and Paraphrases-DirFalse, there is only a modest
difference between the performance of the regular S&S classifier and its
symmetric-prompt counterpart. This shows that despite the results from
the Directional Subset, RoBERTa LM shows lim-
ited ability to detect directional entailments.
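A sketch of how the symmetric-prompt constraint can be enforced: every pattern
is instantiated in both directions, so the classifier input is identical for
(p, q) and (q, p) and no directional signal survives. The single pattern below
is an illustrative stand-in for the S&S pattern set.

    # Illustrative pattern; the real models use the S&S hand-crafted patterns.
    PATTERNS = ["{a}, which means that {b}."]

    def symmetric_prompts(premise: str, hypothesis: str):
        """Instantiate every pattern in both directions (pairs of reverses)."""
        prompts = []
        for pat in PATTERNS:
            prompts.append(pat.format(a=premise, b=hypothesis))
            prompts.append(pat.format(a=hypothesis, b=premise))  # the reverse
        return prompts

    symmetric_prompts("John shopped in IKEA", "John went to IKEA")
    # -> ["John shopped in IKEA, which means that John went to IKEA.",
    #     "John went to IKEA, which means that John shopped in IKEA."]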
3.2 The Artefacts Triangle
From previous discussions, we notice that the
score for the directional subset is anomalously
high. Below we reveal that this anomaly is an ef-
fect of dataset artefacts, and that the artefacts in
question are quite specific to the directional subset
and generally less prominent in the other subsets.
Artefacts aside, the S&S classifiers do not show
strong abilities to learn directional entailments.
It is difficult to identify sources of artefacts
by manual inspection; on the other hand, Poliak
et al. (2018) have shown that hypothesis-only (H-
only) models can be a strong proxy for an artefact
baseline. Inspired by their findings, we instead
use H-only model as a proxy for the aggregated
strength of artefacts. For H-only model we use a
restricted version of S&S classifier, where we mask
all premises with the word “true”.5
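A sketch of the hypothesis-only restriction, reusing the illustrative pattern
from §2: the premise slot is filled with the word “true” before instantiation,
so any signal the classifier picks up can only come from the hypothesis side.

    def h_only_prompt(hypothesis: str,
                      pattern: str = "{a}, which means that {b}.") -> str:
        """Mask the premise with "true": the instantiated prompt is correct
        exactly when the hypothesis is, so only hypothesis-side artefacts
        can be exploited."""
        return pattern.format(a="true", b=hypothesis)

    h_only_prompt("John went to IKEA")
    # -> 'true, which means that John went to IKEA.'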
We report the H-only model’s results on the same
mesh of subsets in Figure 2. For every subset, the
H-only model still trails behind the S&S classifiers.
These gaps are partly explained by the fact that the
H-only model does not capture all existing artefacts,
but is merely a proxy to their strengths.
As shown, the Directional subset indeed has par-
ticularly strong artefacts to exploit, with the H-only
model reaching 48.9%
AUCnorm
, far above the
other subsets. Between Paraphrases-DirTrue and Paraphrases-DirFalse, the
relative performance of the S&S model is aligned with their relative
strengths of artefacts; this means that, for RoBERTa, Paraphrases is in fact
similarly separable from DirTrue and DirFalse, which is in line with
expectation.
Also interesting is the comparison between the
directional and symmetric subsets. The two subsets
had similar S&S performances; however, there is a
difference of over 20% between their hypothesis-
only artefact strengths. That means the S&S clas-
sifier is after all far better at the symmetric subset
than the directional one.
For a crude ranking, we inspect each sub-
set according to the FullModel-HOnly ratio by
AUCnorm (the lower the ratio, the stronger the artefacts). We
find at rock bottom the Paraphrases-DirFalse and
DirTrue-DirFalse subsets with this ratio at 1.55 and
1.70 respectively, indicating that their full-model
scores are the heaviest over-estimations; next up
is the 2.10 for DirFalse-Unrelated, all the other
5 We use “true” to mask the premise because, when the
premise is always true, the correctness of instantiated prompts
depends solely on the hypothesis.