
banana
  ConceptNet:  yellow, good to eat
  COMET-2020:  one of the main ingredients, eaten as a snack, one of many fruits, found in garden, black
  Ascent++:    rich, ripe, yellow, green, brown, sweet, great, black, useful, safe, delicious, healthy, nutricious, ...

lion
  ConceptNet:  a feline
  COMET-2020:  found in jungle, one of many animals, one of many species, two legs, very large
  Ascent++:    free, extinct, hungry, close, unique, active, nocturnal, old, dangerous, great, happy, right, ...

airplane
  ConceptNet:  good for quickly travelling long distances
  COMET-2020:  flying, air travel, flying machine, very small, flight
  Ascent++:    heavy, new, important, white, safe, unique, full, larger, clean, slow, low, unstable, electric, ...

Table 1: Properties of some example concepts, according to three commonsense knowledge resources.
predicts triples using a generative language model
that was trained on several commonsense knowl-
edge graphs, and Ascent++ (Nguyen et al., 2021),
which is a commonsense knowledge base that was
extracted from web text. Given the noisy nature of
such resources, we instead rely on a database of hyper-
nyms. The underlying intuition is that hy-
pernyms can be extracted from text relatively easily,
while fine-grained hypernyms often implicitly de-
scribe commonsense properties. For instance, Mi-
crosoft Concept Graph (Ji et al., 2019) lists potas-
sium rich food as a hypernym of banana and large
and dangerous carnivore as a hypernym of lion.
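To make this intuition concrete, the sketch below shows one way of turning fine-grained hypernyms into concept-property pairs; the toy hypernym dictionary and the simple head-noun heuristic are illustrative assumptions, not the exact extraction procedure used in our experiments.

    # Illustrative sketch: deriving (concept, property) pre-training pairs from
    # fine-grained hypernyms. The toy data stands in for a resource such as the
    # Microsoft Concept Graph; the head-noun heuristic is an assumption.
    hypernyms = {
        "banana": ["potassium rich food", "tropical fruit"],
        "lion": ["large and dangerous carnivore", "big cat"],
    }

    def pairs_from_hypernyms(hypernyms):
        """Treat the final token of a hypernym phrase as the head noun and the
        preceding modifiers as an implicit commonsense property."""
        pairs = []
        for concept, phrases in hypernyms.items():
            for phrase in phrases:
                tokens = phrase.split()
                if len(tokens) > 1:               # skip bare hypernyms such as "food"
                    prop = " ".join(tokens[:-1])  # e.g. "potassium rich"
                    pairs.append((concept, prop))
        return pairs

    print(pairs_from_hypernyms(hypernyms))
    # [('banana', 'potassium rich'), ('banana', 'tropical'),
    #  ('lion', 'large and dangerous'), ('lion', 'big')]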
We also experiment with GenericsKB (Bhakthavat-
salam et al., 2020), a large collection of generic
sentences (e.g. “Coffee contains minerals and an-
tioxidants which help prevent diabetes”), to ob-
tain concept-property pairs for pre-training. Given
such pre-training data, we then train a concept encoder Φ_con and a property encoder Φ_prop such that σ(Φ_con(c) · Φ_prop(p)) indicates the probability that concept c has property p.
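The following is a minimal sketch of this bi-encoder scoring function; the choice of bert-base-uncased as the underlying encoder and the use of the [CLS] vector as the phrase representation are illustrative assumptions rather than the exact configuration used in our experiments.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Bi-encoder sketch: separate encoders map a concept and a property to
    # vectors; a sigmoid over their dot product scores the pair.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    concept_encoder = AutoModel.from_pretrained("bert-base-uncased")   # Φ_con
    property_encoder = AutoModel.from_pretrained("bert-base-uncased")  # Φ_prop

    def encode(encoder, text):
        inputs = tokenizer(text, return_tensors="pt")
        # Use the [CLS] token embedding as the phrase representation.
        return encoder(**inputs).last_hidden_state[:, 0, :]

    def property_probability(concept, prop):
        c_vec = encode(concept_encoder, concept)    # Φ_con(c)
        p_vec = encode(property_encoder, prop)      # Φ_prop(p)
        return torch.sigmoid((c_vec * p_vec).sum(dim=-1))  # σ(Φ_con(c) · Φ_prop(p))

    print(property_probability("banana", "rich in potassium"))

A natural way to fit such a model is to fine-tune both encoders on positive and sampled negative concept-property pairs with a binary cross-entropy objective, although that training choice is an assumption here rather than a detail stated above.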
In summary, our main contributions are as fol-
lows: (i) we propose a new evaluation setting which
is more realistic than the standard benchmarks for
predicting commonsense properties; (ii) we anal-
yse the potential of hypernymy datasets and generic
sentences to act as pre-training data; and (iii) we de-
velop a simple but effective bi-encoder architecture
for modelling commonsense properties.
2 Related Work
Several authors have analysed the extent to which
language models such as BERT capture common-
sense knowledge. As already mentioned, Forbes
et al. (2019) evaluated the ability of BERT to
predict commonsense properties from the McRae
dataset (McRae et al., 2005), which we also use
in our experiments. The same dataset was used by
Weir et al. (2020) to analyse whether BERT-based
language models could generate concept names
from their associated properties; e.g. given the
input “A ⟨mask⟩ has fur, is big, and has claws”,
the model is expected to predict that ⟨mask⟩ cor-
responds to the word bear. Conversely, Apidi-
anaki and Garí Soler (2021) considered the problem
of generating adjectival properties from prompts
such as “mittens are generally ⟨mask⟩”. Note that
the latter two works evaluated pre-trained models
directly, without fine-tuning, whereas the experi-
ments of Forbes et al. (2019) involved fine-tuning the
language model on a task-specific training set first.
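As a small illustration of this kind of zero-shot probing, a masked language model can simply be asked to fill the placeholder in such a prompt; the specific model and prompt wording below are illustrative choices rather than the exact setup of the works discussed above.

    from transformers import pipeline

    # Zero-shot probe in the style of the prompts above: ask a pre-trained
    # masked language model for likely adjectival properties of a concept.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for prediction in fill_mask("mittens are generally [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))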
When the main motivation is to probe the abilities
of language models, avoiding fine-tuning has the
advantage that any observed abilities reflect what is
captured by the pre-trained language model itself,
rather than learned during the fine-tuning phase.
However, Li et al. (2021) argue that the extent to
which pre-trained language models capture com-
monsense knowledge is limited, suggesting that
some form of fine-tuning is essential in practice.
Interestingly, this remains the case for large language models: for instance, the model studied by Li et al. (2021) had 7 billion parameters, while West et al. (2021) report that the predictions from GPT-3 (Brown et al., 2020)
had to be filtered by a so-called critic model when
distilling a commonsense knowledge graph.
The strategy taken by COMET (Bosselut et al.,
2019) is to fine-tune a GPT model (Radford et al.)
on triples from commonsense knowledge graphs.
Being based on an autoregressive language model,
COMET can be used to predict concepts that take
the form of short phrases, which is often needed
when reasoning about events (e.g. to express moti-
vations or effects). However, as illustrated in Table
1, COMET is less suitable for modelling the com-
monsense properties of concepts. Other approaches