
supervised LM methods along with various discrete entailment graphs (EGs) (Hosseini et al., 2018, 2021; Chen et al., 2022; Li et al., 2022). We find that the performance of supervised LM-prompting methods is indifferent to directional supervision, and is generally less competitive than suggested on LevyHolt; on the other hand, EGs reach decent precisions with their strongest edges, but are hit by sparsity and noisy unsupervised signals.
Our contributions can be summarized as follows: 1) We show that LevyHolt, the common directional predicate entailment benchmark, is infested with artefacts, allowing supervised methods to perform well by overfitting; 2) We verify that LMs, with supervised fine-tuning, show limited ability to learn directional entailments; 3) We present BoOQA, a robust, extrinsic, multilingual evaluation benchmark for directional predicate entailments, where various baselines are provided and analysed.[1]
2 Background and Setup
Language models have been used under a pretrain-finetune paradigm: the semantics of a token in context are learnt during pre-training and reflected in the dense encodings; when fine-tuning with a task-specific dataset, the model learns which area of its encoding space to look at. Therefore, if a pre-trained LM cannot be fine-tuned to solve a task, we cannot reject the null hypothesis that it does not encode the task. In §3, we look into RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019) in particular, and examine whether they can be fine-tuned to learn directional predicate entailments.
Model
We adapt the supervised SOTA (Schmitt and Schütze, 2021b), a prompt fine-tuning method, for examining LMs.[2] We call it S&S here and below. S&S fits each premise-hypothesis pair into a few natural language prompts, such as “John shopped in IKEA, which means that John went to IKEA”; it then converts the task into sentence classification over instantiated prompts. It is a simple SOTA with few additional parameters, and the architecture allows directional judgements. Thus, it is an ideal “gauge” for the directional ability of LMs.
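To make the setup concrete, below is a minimal sketch of prompt-based entailment scoring in the spirit of S&S; the prompt templates, classification head, and score averaging are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of prompt-based entailment scoring in the spirit of S&S
# (Schmitt and Schütze, 2021b). Templates, head, and averaging are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-base"  # §3 examines RoBERTa and BERT
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

# Hypothetical prompt templates: each turns a premise-hypothesis pair into one
# natural-language sentence to be classified as entailment / non-entailment.
TEMPLATES = [
    "{premise}, which means that {hypothesis}.",
    "If {premise}, then {hypothesis}.",
]

def entailment_score(premise: str, hypothesis: str) -> float:
    """Average the positive-class probability over all instantiated prompts."""
    scores = []
    for template in TEMPLATES:
        text = template.format(premise=premise, hypothesis=hypothesis)
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        scores.append(torch.softmax(logits, dim=-1)[0, 1].item())
    return sum(scores) / len(scores)

# The two directions are scored independently, allowing directional judgements.
print(entailment_score("John shopped in IKEA", "John went to IKEA"))
print(entailment_score("John went to IKEA", "John shopped in IKEA"))
```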
Dataset
So far there are two popular predicate entailment datasets: LevyHolt (Levy and Dagan, 2016; Holt, 2019) and Sherliic (Schmitt and Schütze, 2019). We use LevyHolt in our §3 experiments, as it contains data entries (p ⊨ q?) together with their converses (q ⊨ p?), making the ground truth directionality annotations available. We use the train/dev/test split as in Schmitt and Schütze (2021b).[3] In each data split, we further classify the entries into the following 4 sub-groups; the sizes of each sub-group in each split (train / dev / test) are given in parentheses:
• DirTrue (251 / 64 / 892): directional true entailments where the premise entails the hypothesis, but not vice versa; for instance, Person shopped in Location ⊨ Person went to Location;
• DirFalse (251 / 64 / 892): directional non-entailments where the hypothesis entails the premise, but not vice versa; for instance, Person went to Location ⊭ Person shopped in Location;
• Paraphrases (615 / 155 / 1939): symmetric paraphrases where the premise and the hypothesis entail each other; for instance, Person arrived at Location ≡ Person got to Location;
• Unrelated (3255 / 831 / 9198): unrelated predicate pairs where the premise and the hypothesis have no entailment relations; for instance, Person shopped in Location ⊭ Person fell ill in Location.
We define various subsets with pairs of sub-
groups, which we introduce and discuss in §3.
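As a concrete reference, the following is a minimal sketch of the sub-grouping above, assuming each LevyHolt entry is paired with its converse and both carry boolean gold labels; the function and argument names are hypothetical.

```python
# A minimal sketch of the sub-grouping above: given gold labels for an entry
# (p ⊨ q?) and its converse (q ⊨ p?), assign one of the 4 sub-groups.
def sub_group(forward_entails: bool, converse_entails: bool) -> str:
    if forward_entails and not converse_entails:
        return "DirTrue"      # p ⊨ q but q ⊭ p
    if converse_entails and not forward_entails:
        return "DirFalse"     # q ⊨ p but p ⊭ q
    if forward_entails and converse_entails:
        return "Paraphrases"  # p ≡ q
    return "Unrelated"        # no entailment in either direction

# Examples mirroring the predicate pairs listed above.
assert sub_group(True, False) == "DirTrue"      # shopped in ⊨ went to
assert sub_group(False, True) == "DirFalse"     # went to ⊭ shopped in
assert sub_group(True, True) == "Paraphrases"   # arrived at ≡ got to
assert sub_group(False, False) == "Unrelated"   # shopped in ⊭ fell ill in
```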
Evaluation Metric
In predicate entailment detection, Area-Under-the-Curve with precision > 50% (AUC_50%) has been the metric in use (Hosseini et al., 2018; Schmitt and Schütze, 2021b).
It is a solid metric for comparison on the same
dataset; however, we are comparing between differ-
ent subsets, each with a different random baseline
precision (i.e. the ratio of true entailments). If we
were to set a common precision lower-bound, we
would be biased toward those datasets with higher
random baseline precisions. To make performance
on different datasets comparable, we propose the
metric of Normalized AUC (AUC_norm):

\[
\mathrm{AUC}_{\mathrm{norm}} = \frac{\mathrm{AUC}_{\xi} - \xi}{1 - \xi} \tag{1}
\]

where ξ denotes the random baseline precision of the dataset and AUC_ξ is the AUC with the precision lower-bound set to ξ.
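As an illustration, the following is a minimal sketch of how AUC_norm could be computed from model scores; restricting the precision-recall curve to precisions above ξ is one plausible reading of AUC_ξ and should be treated as an assumption, not the exact implementation.

```python
# A minimal sketch of AUC_norm (Equation 1); the handling of AUC_xi is an assumption.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def auc_norm(y_true, y_score, xi):
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    keep = precision > xi                  # high-precision segment of the curve
    if keep.sum() < 2:
        return 0.0                         # curve never rises above the baseline
    order = np.argsort(recall[keep])       # integrate over increasing recall
    auc_xi = auc(recall[keep][order], precision[keep][order])
    return (auc_xi - xi) / (1.0 - xi)      # 0 for a random classifier, 1 for a perfect one

# xi is the random baseline precision, i.e. the ratio of true entailments.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.7, 0.4, 0.3, 0.6, 0.5]
print(auc_norm(y_true, y_score, xi=sum(y_true) / len(y_true)))
```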
[1] Our code and datasets are published at https://github.com/Teddy-Li/LM-DirctionalInference.
[2] There is a follow-up work (Schmitt and Schütze, 2021a) to this, but we found it to have inferior generalisation performance; see Appendix B for details and a brief introduction.
[3] Except when an entry and its converse appear in different splits (e.g. one in train, the other in dev), in which case we randomly assign both to the same split, so as to avoid information leakage.