Probing Commonsense Knowledge in Pre-trained Language Models
with Sense-level Precision and Expanded Vocabulary
Daniel Loureiro♦♣, Alípio Mário Jorge♣
♦Cardiff NLP, School of Computer Science and Informatics, Cardiff University, UK
♣LIAAD - INESC TEC, Faculty of Sciences, University of Porto, Portugal
boucanovaloureirod@cardiff.ac.uk, amjorge@fc.up.pt
Abstract
Progress on commonsense reasoning is usually measured from performance improvements on Question Answering tasks designed to require commonsense knowledge. However, fine-tuning large Language Models (LMs) on these specific tasks does not directly evaluate commonsense learned during pre-training. The most direct assessments of commonsense knowledge in pre-trained LMs are arguably cloze-style tasks targeting commonsense assertions (e.g., A pen is used for [MASK].). However, this approach is restricted by the LM’s vocabulary available for masked predictions, and its precision is subject to the context provided by the assertion. In this work, we present a method for enriching LMs with a grounded sense inventory (i.e., WordNet) available at the vocabulary level, without further training. This modification augments the prediction space of cloze-style prompts to the size of a large ontology while enabling finer-grained (sense-level) queries and predictions. In order to evaluate LMs with higher precision, we propose SenseLAMA, a cloze-style task featuring verbalized relations from disambiguated triples sourced from WordNet, WikiData, and ConceptNet. Applying our method to BERT, producing a WordNet-enriched version named SynBERT, we find that LMs can learn non-trivial commonsense knowledge from self-supervision, covering numerous relations, and more effectively than comparable similarity-based approaches.
1 Introduction
A relatively new direction for benchmarking Language Models (LMs) is tasks designed to require commonsense knowledge and reasoning. These tasks usually target commonsense concepts under a Question Answering (QA) format (Mihaylov et al., 2018; Talmor et al., 2019; Bisk et al., 2020; Nie et al., 2020) and follow scaling trends: increasing the model’s parameters leads to improved results, especially in few-shot learning settings (Chowdhery et al., 2022). Hybrid methods, particularly those fusing LMs with Graph Neural Networks, have shown that Commonsense Knowledge Graphs (CKGs) can help improve performance on these tasks (Xu et al., 2021; Yasunaga et al., 2021; Zhang et al., 2022). The results obtained by these works, using relatively small LMs, suggest that CKGs can be an alternative (or complement) to increasing model size, with the added benefit of supporting more interpretable results.
Nevertheless, the QA approach provides only an indirect measure of a pre-trained model’s ability to understand and reason with commonsense concepts. The models attaining the best results on these tasks are often too large for thorough analysis, and the QA format can promote shallow learning from annotation artifacts or spurious cues unrelated to commonsense (Branco et al., 2021).
There are more direct ways of evaluating commonsense knowledge in LMs, such as scoring generated triples (Davison et al., 2019), infilling cloze-style statements (Petroni et al., 2019), or fine-tuning for explicit generation of commonsense statements (Bosselut et al., 2019). However, these approaches are either limited by each LM’s particular vocabulary or biased by the available training data (Wang et al., 2021). Additionally, existing tasks and methods do not target grounded representations, which is essential for high-precision CKGs (Tandon et al., 2014; Dalvi Mishra et al., 2017), and context-independent reference (Eyal et al., 2022).
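To make the cloze-style probing discussed above concrete, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline; the model name and example assertion are our illustrative choices, not the exact setup evaluated in this work.

```python
from transformers import pipeline

# Query a pre-trained masked LM with a verbalized commonsense assertion.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("A pen is used for [MASK].", top_k=5):
    # Each candidate is a single token from the LM's own vocabulary,
    # which is precisely the restriction discussed above.
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```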
Commonsense tasks and approaches typically leverage ConceptNet (Speer et al., 2017), a popular CKG built from an extensive crowdsourcing effort (Storks et al., 2019). Although ConceptNet is arguably the most popular CKG available, its nodes are composed of free-form text rather than disambiguated (canonical) representations, allowing for misleading associations and aggravating the network’s sparsity (Li et al., 2016; Jastrzębski et al., 2018; Wang et al., 2020). The WordNet (Miller, 1992) sense inventory is a natural choice for a set of ontologically grounded concept-level representations, having been curated by experts over decades and spanning various knowledge domains and syntactic categories of the English language. Recent developments on Word Sense Disambiguation (WSD) and Uninformed Sense Matching (USM) have shown that WordNet senses can be mapped to naturally occurring sentences with high precision (Loureiro et al., 2022), including at higher-abstraction levels (e.g., ‘Marlon Brando’ to actor¹ₙ). WordNet’s utility for commonsense tasks is limited by its narrow set of relations, focused on lexical relations (mostly hypernymy). However, its smaller size, compared to WikiData (Vrandečić and Krötzsch, 2014) or BabelNet (Navigli and Ponzetto, 2012), for example, also presents an opportunity for effective expansion with reduced sparsity, which is important for symbolic reasoning (Huang et al., 2021).
In this work, we propose that a LM augmented with explicit sense-level representations (see Figure 1) may present a solution for precise evaluation of commonsense knowledge learned during pre-training that is not limited by the LM’s vocabulary. Additionally, we explore how this enriched model can be used for grounded commonsense relation extraction towards precise and unbiased (w.r.t. commonsense training data) CKG construction that hybrid approaches may use. Considering there is currently no set of grounded assertions available to assess progress in this direction, we propose a cloze-style probing task targeting specific senses and commonsense relations, inspired by Petroni et al. (2019). Our contributions¹ are the following:

• A BERT² model with 117k new sense-specific embeddings added to its vocabulary, based on the model’s own internal states (SynBERT).
• The SenseLAMA probing task targeting wide-ranging and precise commonsense – based on WordNet, WikiData, and ConceptNet.
• Analyses on the impact of different input types for eliciting accurate commonsense knowledge from BERT.
• A new CKG grounded on WordNet with 23k unseen triples over 18 commonsense relations (e.g., UsedFor) generated by prompting.

¹ https://github.com/danlou/synbert
² While we focus on BERT and WordNet, our methods are broadly applicable to LMs and alternative representations.
[Figure 1 diagram: sense embeddings for WordNet synsets (e.g., mouse⁴ₙ) are injected into BERT’s vocabulary embeddings (steps 1 and 2), and the masked LM head then predicts sense-level fillers (e.g., computer_accessory¹ₙ, fluorocarbon_plastic¹ₙ) for verbalized cloze prompts such as “mouse⁴ₙ is a kind of [MASK].” covering relations like IsA, MadeOf, UsedFor, and AtLocation (step 3).]
Figure 1: Our 3-step method for extracting unsupervised commonsense relations between concepts (i.e., word senses) from pre-trained language models. Relations are expressed as verbalizations that may be exchanged to target any other property of interest.
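As a rough illustration of the injection step depicted in Figure 1 (adding sense-level vectors to the vocabulary so the masked LM head can score them), the sketch below uses the transformers library. The sense identifiers and random vectors are placeholders; the actual SynBERT embeddings are derived from the model’s own internal states, as described in this paper.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder sense identifiers and vectors (one vector per WordNet synset).
sense_ids = ["mouse.n.04", "computer_accessory.n.01"]
sense_vectors = torch.randn(len(sense_ids), model.config.hidden_size)

# 1) register the sense identifiers as new vocabulary entries,
# 2) grow the embedding matrix, 3) overwrite the new rows with sense vectors.
tokenizer.add_tokens(sense_ids)
model.resize_token_embeddings(len(tokenizer))
with torch.no_grad():
    model.get_input_embeddings().weight[-len(sense_ids):] = sense_vectors

# With BERT's default tying of input and output embeddings, masked predictions
# can now rank these sense entries alongside the original wordpieces.
```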
2 Related Work
Large LMs have featured prominently in the latest efforts to build richer and more accurate CKGs. COMET (Hwang et al., 2021) is a generative model based on BART (Lewis et al., 2020), trained on ConceptNet and ATOMIC (Sap et al., 2019), and proven capable of producing novel accurate triples for challenging relation types, such as HinderedBy. More recently, West et al. (2021) have proposed ATOMIC-10x, which leverages generated text from GPT-3 (Brown et al., 2020) in combination with a critic model to create the largest and most accurate semi-automatically constructed CKG. This accuracy was determined using both qualitative human ratings and quantitative measures. However, these works are primarily concerned with extracting large CKGs using fine-tuned or distilled LMs, and do not focus on directly evaluating the commonsense knowledge (CSK) learned during pre-training. Additionally, these works do not target grounded representations, considering only relations between free-text nodes, similarly to ConceptNet.
Addressing both disambiguated representations and sparsity resulting from free-text redundancy, WebChild (Tandon et al., 2014) proposes a CKG, grounded on WordNet senses, assembled from label propagation and pattern matching on Web corpora. WebChild features a large CKG (over 4M triples), but it predates large contextual LMs and the ensuing progress in WSD, making this resource unreliable by current standards. Recent works on CKGs also focus on other aspects besides size and accuracy, such as salience (Chalier et al., 2020) or alternatives to triples (Nguyen et al., 2021).
Our work is most related to LAMA (Petroni et al., 2019), which compiles masked assertions based on triples from ConceptNet and other resources, and measures how many triples can be accurately recovered when masking the object term. However, LAMA was designed for single-token masked prediction based on the intersection of the subword or byte-level token vocabularies used by the particular set of LMs considered in that work³. Consequently, LAMA is limited by design to a total of 21k prediction candidates.
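The single-token restriction can be observed directly from a subword tokenizer. The snippet below (our illustration, using the bert-base-uncased WordPiece vocabulary) prints how candidate answers are split into pieces; any word that does not survive as a single piece falls outside LAMA’s prediction space.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["pen", "writing", "fluorocarbon"]:
    # Words that split into several wordpieces cannot fill a single [MASK] slot,
    # so they are excluded from LAMA's ~21k single-token candidate set.
    print(word, tokenizer.tokenize(word))
```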
LAMA is an important early result of LM probing, but besides the previously mentioned technical limitations, its findings have also been challenged in later works. Kassner and Schütze (2020) demonstrated that LMs are susceptible to mispriming and often unable to handle negation. Poerner et al. (2020) further showed that LMs could be biased by the surface form of entity names. Moreover, Dufter et al. (2021) found that static embeddings using a nearest neighbors (k-NN) approach can outperform LMs on the LAMA benchmark, casting doubt on the presumed advantages of large LMs for the task. Still, LAMA inspired others to use knowledge graphs (KGs) generated by LMs for intrinsic evaluation. Swamy et al. (2021) propose extracting KGs from LMs to support interpretability and direct comparison between different LMs or training stages. Aspillaga et al. (2021) follow a similar direction but propose evaluating extracted KGs by concept relatedness, using hypernymy relations from WordNet and sense-tagged glosses.

Our approach overcomes the vocabulary limitations of LAMA while outperforming a comparable k-NN baseline. We also explore using extracted CKGs to evaluate LMs, alongside the generation of novel CKGs.
³ This limitation stems from the fact that each word may be split into several tokens; the number of tokens constrains predictions to words that match it, and this splitting is specific to each LM’s tokenizer.
3 SenseLAMA
We begin by describing our probing task to evaluate the commonsense knowledge learned during LM pre-training. SenseLAMA features verbalized relations⁴ between word senses from triples sourced from WordNet, WikiData, and ConceptNet. In the following, we describe how we compiled SenseLAMA using these resources, including mapping triples to specific WordNet senses (i.e., synsets).

Unlike other works (e.g., Feng et al., 2020), we do not merge similar relations. Since our approach is unsupervised, we do not benefit from additional examples per relation. Thus, we prefer preserving performance metrics specific to each source.
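For illustration, relation templates of the following kind can turn a (head, relation, ?) query into a cloze prompt; the templates below are hypothetical examples in the spirit of Petroni et al. (2019), while the actual templates used for SenseLAMA are listed in Appendix A.

```python
# Hypothetical templates; the paper's handcrafted templates are in its Appendix A.
TEMPLATES = {
    "IsA": "{head} is a kind of [MASK].",
    "UsedFor": "{head} is used for [MASK].",
    "MadeOf": "{head} is made of [MASK].",
}

def verbalize(head: str, relation: str) -> str:
    """Turn a (head, relation, ?) query into a cloze-style prompt."""
    return TEMPLATES[relation].format(head=head)

print(verbalize("mouse", "IsA"))  # -> "mouse is a kind of [MASK]."
```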
We use the core WordNet synsets, initially defined by Boyd-Graber et al. (2005), to create an easier subset of SenseLAMA. While the full WordNet covers over 117k synsets, core synsets are restricted to the 5k⁵ most frequently occurring word senses, dramatically reducing the number of prediction candidates. Thus, our ‘Core’ subset is derived from the ‘Full’ SenseLAMA, including only instances where both arguments of the triple belong to the set of core WordNet synsets. If this filter results in a relation with fewer than ten instances, that relation is discarded from the ‘Core’ subset. Table 1 reports counts for each source and relation in SenseLAMA.
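A minimal sketch of this ‘Core’ filtering, assuming the disambiguated triples and the set of core synset identifiers are already available as plain Python collections (the function and argument names are our own):

```python
from collections import defaultdict

def build_core_subset(triples, core_synsets, min_instances=10):
    """Keep triples whose arguments are both core synsets; drop sparse relations."""
    by_relation = defaultdict(list)
    for subj, rel, obj in triples:
        if subj in core_synsets and obj in core_synsets:
            by_relation[rel].append((subj, rel, obj))
    # Relations left with fewer than `min_instances` triples are discarded.
    return {rel: items for rel, items in by_relation.items()
            if len(items) >= min_instances}
```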
WordNet
Our base ontology already contains several relations which arguably fall under the scope of commonsense knowledge, such as hypernymy, meronymy, or antonymy. Since these relations already target synsets within WordNet, no additional mapping or disambiguation is required. Very frequent relations are capped at 10k samples.
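For the WordNet source, triples can be read directly off the sense inventory. The sketch below uses NLTK’s WordNet interface for hypernymy; random sampling is one assumed way to apply the 10k cap, not necessarily the procedure used in this work.

```python
import random
from nltk.corpus import wordnet as wn  # requires the NLTK 'wordnet' corpus

def hypernym_triples(cap=10_000, seed=0):
    """Collect (synset, IsA, hypernym) triples and cap very frequent relations."""
    triples = [(syn.name(), "IsA", hyper.name())
               for syn in wn.all_synsets()
               for hyper in syn.hypernyms()]
    random.seed(seed)
    return random.sample(triples, cap) if len(triples) > cap else triples
```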
WikiData
This vast resource contains millions of triples for thousands of relations. We only consider a few select relations most associated with commonsense knowledge. Furthermore, we only admit triples for which the head and tail can be mapped to WordNet v3.0, either via the direct link available in WikiData’s item properties or through linking to BabelNet, which we map to WordNet using the mapping from Navigli and Ponzetto (2012). Alternatively, we map some triples via hapax linking (McCrae and Cillessen, 2021), when the triple’s arguments correspond to unambiguous words.
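The hapax-linking step can be approximated as follows: a surface form is accepted only when WordNet lists exactly one sense for it. This is a minimal sketch of the idea with NLTK, not the McCrae and Cillessen (2021) implementation.

```python
from nltk.corpus import wordnet as wn

def hapax_link(word, pos=wn.NOUN):
    """Return the synset for `word` only if it is unambiguous in WordNet."""
    candidates = wn.synsets(word, pos=pos)
    return candidates[0] if len(candidates) == 1 else None
```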
⁴ Appendix A shows handcrafted templates used for WordNet and WikiData triples, following Petroni et al. (2019).
⁵ Only 4,960 synsets can be mapped to WordNet v3.0.