
Addressing both disambiguated representations
and sparsity resulting from free-text redundancy,
WebChild (Tandon et al., 2014) proposes a CKG,
grounded on WordNet senses, assembled from la-
bel propagation and pattern matching on Web cor-
pora. WebChild features a large CKG (over 4M
triples), but it predates large contextual LMs and
the ensuing progress in WSD, making this resource
unreliable by current standards. Recent works on
CKGs also focus on other aspects besides size and
accuracy, such as salience (Chalier et al., 2020) or alternatives to triples (Nguyen et al., 2021).
Our work is most related to LAMA (Petroni et al., 2019), which compiles masked assertions based on triples from ConceptNet and other resources, and measures how many triples can be accurately recovered when masking the object term. However, LAMA was designed for single-token masked prediction based on the intersection of the subword or byte-level token vocabularies used by the particular set of LMs considered in that work.3 Consequently, LAMA is limited by design to a total of 21k prediction candidates.
LAMA is an important early result of LM prob-
ing, but besides the previously mentioned technical
limitations, its findings have also been challenged
in later works. Kassner and Schütze (2020) demon-
strated that LMs are susceptible to mispriming and
often unable to handle negation. Poerner et al.
(2020) further showed that LMs could be biased by
the surface form of entity names. Moreover, Dufter
et al. (2021) found that static embeddings used with a nearest neighbors (k-NN) approach can outperform LMs on the LAMA benchmark, casting doubt on the presumed advantages of large LMs for the task. Still, LAMA inspired others to use knowledge graphs (KGs) generated by LMs for intrinsic evaluation. Swamy et al. (2021) propose extracting KGs from LMs to support interpretability and direct comparison between different LMs or training stages. Aspillaga et al. (2021) follow a similar direction but propose evaluating extracted KGs by concept relatedness, using hypernymy relations from WordNet and sense-tagged glosses.
Our approach overcomes the vocabulary limitations of LAMA while outperforming a comparable k-NN baseline. We also explore using extracted CKGs to evaluate LMs, alongside the generation of novel CKGs.
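For intuition on what such a baseline looks like, below is a minimal k-NN sketch over toy static embeddings (hypothetical vectors and vocabulary; not the exact setup of Dufter et al., 2021, nor our own baseline implementation):

```python
import numpy as np

# Toy static embeddings (random for illustration); in practice these
# would be pretrained vectors such as fastText.
vocab = ["france", "germany", "spain", "river"]
emb = {w: np.random.randn(300) for w in vocab}

def knn_predict(query_vec, k=1):
    # Rank the candidate vocabulary by cosine similarity to the query
    # and return the k nearest words as predictions.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(vocab, key=lambda w: cos(query_vec, emb[w]), reverse=True)
    return ranked[:k]

# A cloze query is reduced to a vector (e.g., the subject word's embedding)
# and answered by its nearest neighbors in embedding space.
print(knn_predict(emb["france"], k=2))
```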
3 This limitation stems from the fact that each word may be split into several tokens; predictions are thus restricted to words whose token count matches the masked positions, and this splitting is specific to each LM's tokenizer.
3 SenseLAMA
We begin by describing our probing task to evaluate the commonsense knowledge learned during LM pre-training. SenseLAMA features verbalized relations4 between word senses, from triples sourced from WordNet, WikiData, and ConceptNet. In the following, we describe how we compiled SenseLAMA using these resources, including mapping triples to specific WordNet senses (i.e., synsets).
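As a simple illustration, a triple can be verbalized by instantiating a relation-specific template and masking the object term; the template wording below is hypothetical, while the actual templates are listed in Appendix A:

```python
# Hypothetical relation templates; the actual templates appear in Appendix A.
templates = {
    "hypernymy": "{subj} is a kind of [MASK].",
    "part_of":   "{subj} is a part of [MASK].",
}

def verbalize(subj_lemma, relation, obj_lemma):
    # The object term is replaced by [MASK] and kept as the gold answer.
    return templates[relation].format(subj=subj_lemma), obj_lemma

print(verbalize("dog", "hypernymy", "canine"))
# -> ('dog is a kind of [MASK].', 'canine')
```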
Unlike other works (e.g., Feng et al., 2020), we
do not merge similar relations. Since our approach
is unsupervised, we do not benefit from additional
examples per relation. Thus, we prefer preserving
performance metrics specific to each source.
We use the core WordNet synsets, initially defined by Boyd-Graber et al. (2005), to create an easier subset of SenseLAMA. While the full WordNet covers over 117k synsets, core synsets are restricted to the 5k most frequently occurring word senses,5 dramatically reducing the number of prediction candidates. Thus, our ‘Core’ subset is derived from the ‘Full’ SenseLAMA, including only instances where both arguments of the triple belong to the set of core WordNet synsets. If this filter leaves a relation with fewer than ten instances, that relation is discarded from the ‘Core’ subset.
Table 1 reports counts for each source and relation
in SenseLAMA.
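A minimal sketch of this filtering step, assuming the full set of triples and the core synset identifiers are already loaded (names are illustrative):

```python
from collections import defaultdict

def build_core_subset(triples, core_synsets, min_instances=10):
    # `triples` holds (subject_synset, relation, object_synset) entries;
    # `core_synsets` is the set of core WordNet synset identifiers.
    by_relation = defaultdict(list)
    for subj, rel, obj in triples:
        # Keep a triple only if both arguments are core synsets.
        if subj in core_synsets and obj in core_synsets:
            by_relation[rel].append((subj, rel, obj))
    # Discard relations left with fewer than ten instances.
    return {rel: items for rel, items in by_relation.items()
            if len(items) >= min_instances}
```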
WordNet
Our base ontology already contains
several relations which arguably fall under the
scope of commonsense knowledge, such as hy-
pernymy, meronymy, or antonymy. Since these
relations already target synsets within WordNet, no
additional mapping or disambiguation is required.
Very frequent relations are capped at 10k samples.
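For illustration, such relations can be read directly from WordNet, e.g. with NLTK's interface; the sketch below covers only hypernymy and meronymy and applies the 10k cap (details are illustrative, not our exact extraction code):

```python
import random
from collections import defaultdict
from nltk.corpus import wordnet as wn  # WordNet 3.0 via NLTK

def wordnet_triples(cap=10_000, seed=0):
    # Collect a subset of WordNet relations as (synset, relation, synset)
    # triples; only hypernymy and meronymy are shown here.
    by_relation = defaultdict(list)
    for syn in wn.all_synsets():
        for hyper in syn.hypernyms():
            by_relation["hypernymy"].append((syn.name(), "hypernymy", hyper.name()))
        for part in syn.part_meronyms():
            by_relation["meronymy"].append((syn.name(), "meronymy", part.name()))
    # Very frequent relations are capped at 10k samples.
    random.seed(seed)
    return {rel: random.sample(items, min(cap, len(items)))
            for rel, items in by_relation.items()}
```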
WikiData
This vast resource contains millions
of triples for thousands of relations. We only con-
sider a few select relations most associated with
commonsense knowledge. Furthermore, we only
admit triples for which the head and tail can be
mapped to WordNet v3.0, either via the direct link
available in WikiData’s item properties or through
linking to BabelNet, which we map to WordNet us-
ing the mapping from Navigli and Ponzetto (2012).
Alternatively, we map some triples via hapax linking (McCrae and Cillessen, 2021) when the triple's arguments correspond to unambiguous words.
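This mapping can be viewed as a simple priority cascade; the sketch below assumes hypothetical lookup tables for the direct WordNet links, the BabelNet-to-WordNet mapping, and the hapax links:

```python
def map_to_wordnet(item_id, direct_wn, wd_to_bn, bn_to_wn, hapax_wn):
    # All arguments after `item_id` are hypothetical lookup tables.
    # 1) Direct WordNet 3.0 link stored in the WikiData item's properties.
    if item_id in direct_wn:
        return direct_wn[item_id]
    # 2) Link to BabelNet, then the BabelNet-to-WordNet mapping
    #    (Navigli and Ponzetto, 2012).
    bn_id = wd_to_bn.get(item_id)
    if bn_id is not None and bn_id in bn_to_wn:
        return bn_to_wn[bn_id]
    # 3) Hapax linking (McCrae and Cillessen, 2021): the item's label
    #    corresponds to an unambiguous word with a single synset.
    return hapax_wn.get(item_id)

def map_triple(head_id, tail_id, *tables):
    # A triple is admitted only if both arguments map to a synset.
    head, tail = map_to_wordnet(head_id, *tables), map_to_wordnet(tail_id, *tables)
    return (head, tail) if head and tail else None
```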
4 Appendix A shows handcrafted templates used for WordNet and WikiData triples, following Petroni et al. (2019).
5 Only 4,960 synsets can be mapped to WordNet v3.0.