
sai, 2019; Derby et al., 2021); and the other which uses cloze-testing, in which LMs are tasked to fill in the blank in prompts that describe specific properties/factual knowledge about the world (Petroni et al., 2019; Weir et al., 2020). We argue that both approaches—though insightful—have key limitations for evaluating property knowledge, and that minimal pair testing overcomes these limitations to a beneficial extent.
Apart from ongoing debates surrounding the validity of probing classifiers (see Hewitt and Liang, 2019; Ravichander et al., 2021; Belinkov, 2022), the probing setup does not allow the testing of property knowledge in a precise manner. Specifically, several properties are often perfectly correlated in datasets such as the one we use here (see §2.2). For example, the properties of being an animal, being able to breathe, being able to grow, etc., are all perfectly correlated with one another. Even if the model's true knowledge of these properties is highly variable, probing its representations for them would yield the exact same result, leading to conclusions that overestimate the model's capacity for some properties while underestimating it for others. Evaluation using minimal pair sentences overcomes this limitation by allowing us to explicitly represent the properties of interest in language form, thereby allowing precise testing of property knowledge.
Similarly, standard cloze-testing of PLMs (Petroni et al., 2019; Weir et al., 2020; Jiang et al., 2021) also faces multiple limitations. First, it does not allow for testing of multi-word expressions, as by definition it involves prediction of a single word/token. Second, it does not yield faithful conclusions about one-to-many or many-to-many relations: e.g., the cloze prompts "Ravens can ___." and "___ can fly." do not have a single correct answer. This makes our conclusions about models' knowledge contingent on the choice of one correct completion over another. The minimal pair evaluation paradigm overcomes these issues by generalizing the cloze-testing method to multi-word expressions—by focusing on entire sentences—and at the same time pairing every prompt with a negative instance. This allows for a straightforward way to assess correctness: the choice between multiple correct completions is transformed into one between correct and incorrect, at the cost of having several different instances (pairs) for testing knowledge of the same property. Additionally, the minimal pairs paradigm also allows us to shed light on how the nature of negative samples affects model behavior, which has been missing in approaches using probing and cloze-testing. The usage of minimal pairs is a well-established practice in the literature, having been widely used in works that analyze syntactic knowledge of LMs (Marvin and Linzen, 2018; Futrell et al., 2019; Warstadt et al., 2020). We complement this growing literature by introducing minimal-pair testing to the study of conceptual knowledge in PLMs.
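As an illustration of the paradigm (a minimal sketch, not our actual evaluation code), the comparison underlying a single minimal pair can be implemented with an off-the-shelf causal LM from the HuggingFace transformers library as follows; the checkpoint name and the example pair are illustrative placeholders rather than COMPS items:

```python
# Minimal sketch: score both members of a minimal pair with a causal LM and
# check whether the acceptable sentence receives the higher log-probability.
# The model checkpoint and the sentence pair below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM checkpoint would do here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def sentence_log_prob(sentence: str) -> float:
    """Summed log-probability assigned to `sentence` by the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels returns the mean cross-entropy over predicted tokens;
        # multiplying back by the number of predictions gives a summed log-prob.
        loss = model(ids, labels=ids).loss
    n_predicted = ids.size(1) - 1  # the first token has no prediction target
    return -loss.item() * n_predicted


acceptable = "A robin can fly."      # property attributed to an acceptable concept
unacceptable = "A penguin can fly."  # same property, unacceptable concept

correct = sentence_log_prob(acceptable) > sentence_log_prob(unacceptable)
print(f"Model prefers the acceptable sentence: {correct}")
```

For masked LMs, an analogous comparison can be made using pseudo-log-likelihood-style scores; all the paradigm requires is a relative score for each member of the pair.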
Our property inheritance analyses closely relate to the 'Leap-of-Thought' (LoT) framework of Talmor et al. (2020). In particular, LoT holds the taxonomic relations between concepts implicit and tests whether models can abstract over them to make property inferences—e.g., testing the extent to which models assign Whales have bellybuttons the 'True' label, given that Mammals have bellybuttons (with the implicit knowledge here being Whales are mammals). With COMPS-WUGS (and COMPS-WUGS-DIST), we instead explicitly provide the relevant taxonomic knowledge in the context and target whether PLMs can behave consistently with knowledge they have already demonstrated (in the base case, COMPS-BASE) and attribute the property in question to the correct subordinate concept. This also relates to recent work that measures consistency of PLMs' word prediction capacities in eliciting factual knowledge (Elazar et al., 2021; Ravichander et al., 2020).
2.2 Ground-truth Property Knowledge Data
For our ground-truth property knowledge resource, we use a subset of the CSLB property norms collected by Devereux et al. (2014), which was further extended by Misra et al. (2022). The original dataset was constructed by asking 123 human participants to generate properties for 638 everyday concepts. Contemporary work has used this dataset by taking, for each property, all concepts for which the property was generated as positive instances, and the rest as negative instances (Lucy and Gauthier, 2017; Da and Kasai, 2019, etc.).
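To make this construction concrete, a minimal sketch is given below (assuming, hypothetically, a long-format table with columns "concept" and "property"; this is not the dataset's actual schema):

```python
# Sketch of the positive/negative split used in prior work on property norms.
# The toy table and the column names are hypothetical placeholders; the real
# norms are far larger and richer than this.
import pandas as pd

norms = pd.DataFrame(
    {
        "concept":  ["dog",         "dog",           "canary",  "sofa"],
        "property": ["can breathe", "has four legs", "can fly", "has legs"],
    }
)

all_concepts = set(norms["concept"])


def split_concepts(prop: str) -> tuple[set, set]:
    """Concepts that did vs. did not have `prop` generated for them."""
    positives = set(norms.loc[norms["property"] == prop, "concept"])
    negatives = all_concepts - positives  # every remaining concept is a negative
    return positives, negatives


pos, neg = split_concepts("can breathe")
print(pos, neg)  # positives: {'dog'}; negatives: all remaining concepts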
While this dataset has been popularly used in related literature, Misra et al. (2022) recently discovered striking gaps in coverage among the properties included in the dataset.[1] For example, the property can breathe was only generated for 6 out of 152 animal concepts, despite being applicable

[1] See also Sommerauer and Fokkens (2018) and Sommerauer (2022), who also discuss this limitation.