On the Curious Case of ℓ2 norm of Sense Embeddings
Yi Zhou
University of Liverpool
y.zhou71@liverpool.ac.uk
Danushka Bollegala
University of Liverpool, Amazon
danushka@liverpool.ac.uk
Abstract

We show that the ℓ2 norm of a static sense embedding encodes information related to the frequency of that sense in the training corpus used to learn the sense embeddings. This finding can be seen as an extension of a previously known relationship for word embeddings to sense embeddings. Our experimental results show that, in spite of its simplicity, the ℓ2 norm of sense embeddings is a surprisingly effective feature for several word-sense-related tasks such as (a) most frequent sense prediction, (b) Word-in-Context (WiC), and (c) Word Sense Disambiguation (WSD). In particular, by simply including the ℓ2 norm of a sense embedding as a feature in a classifier, we show that we can improve WiC and WSD methods that use static sense embeddings.
1 Introduction

Background: Given a text corpus, static word embedding learning methods (Pennington et al. 2014; Mikolov et al. 2013a, etc.) learn a single vector (aka embedding) to represent the meaning of a word in the corpus. In contrast, static sense embedding learning methods (Loureiro and Jorge 2019a; Scarlini et al. 2020b, etc.) learn multiple embeddings for each word, corresponding to the different senses of that word.
Arora et al. (2016) proposed a random walk model on the word co-occurrence graph and showed that if word embeddings are uniformly distributed over the unit sphere, the log-frequency of a word in a corpus is proportional to the squared ℓ2 norm of the static word embedding learnt from that corpus. Hashimoto et al. (2016) showed that, under a simple metric random walk over words where the probability of transitioning from one word to another depends only on the squared Euclidean distance between their embeddings, the log-frequency of co-occurrences between two words converges to the negative squared Euclidean distance between the corresponding word embeddings. Mu and Viswanath (2018) later showed that word embeddings are distributed in a narrow cone, hence not satisfying the uniformity assumption used by Arora et al. (2016); however, their result still holds for such anisotropic embeddings. On the other hand, Arora et al. (2018) showed that a word embedding can be represented as a linearly-weighted combination of sense embeddings. However, to the best of our knowledge, it remains unknown thus far what the relationship is between sense embeddings and the frequency of a sense, which is the central question that we study in this paper.
Contributions: First, by extending the prior results for word embeddings to sense embeddings, we show that the squared ℓ2 norm of a static sense embedding is proportional to the log-frequency of the sense in the training corpus. This finding has important practical implications. For example, it is known that assigning every occurrence of an ambiguous word in a corpus to the most frequent sense of that word (popularly known as the Most Frequent Sense (MFS) baseline) is a surprisingly strong baseline for WSD (McCarthy et al., 2004, 2007). Therefore, the theoretical relationship that we prove implies that we should be able to use the ℓ2 norm to predict the MFS of a word.
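The MFS prediction idea above reduces to a one-line rule: among the sense embeddings of an ambiguous word, pick the one with the largest squared ℓ2 norm. The following is a minimal sketch of that rule; the `predict_mfs` function, the sense-id naming scheme, and the toy vectors are all illustrative, not the paper's implementation.

```python
import numpy as np

def predict_mfs(sense_embeddings):
    """Return the sense id whose embedding has the largest squared l2 norm.

    sense_embeddings: dict mapping sense ids of one ambiguous word to their
    static embedding vectors. Because the squared l2 norm is (approximately)
    proportional to the log-frequency of the sense, the largest-norm sense
    is predicted as the Most Frequent Sense (MFS).
    """
    return max(sense_embeddings,
               key=lambda s: float(np.dot(sense_embeddings[s],
                                          sense_embeddings[s])))

# Toy example with made-up vectors for two senses of "bank".
senses = {
    "bank%financial": np.array([1.2, -0.8, 0.5]),   # frequent sense: larger norm
    "bank%river":     np.array([0.3,  0.2, -0.1]),  # rare sense: smaller norm
}
print(predict_mfs(senses))  # -> bank%financial
```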
Second, we conduct a series of experiments to empirically validate the above-mentioned relationship. We find that the relationship holds for different types of static sense embeddings learnt using methods such as GloVe (Pennington et al., 2014) and skip-gram with negative sampling (SGNS; Mikolov et al., 2013b) on SemCor (Miller et al., 1993).
Third, motivated by our finding that the ℓ2 norm of pretrained static sense embeddings encodes sense-frequency information, we use the ℓ2 norm of sense embeddings as a feature for several sense-related tasks such as (a) predicting the MFS of an ambiguous word, (b) determining whether the same sense of a word has been used in two different contexts (WiC; Pilehvar and Camacho-Collados, 2019), and (c) disambiguating the sense of a word in a sentence (WSD). We find that, despite its simplicity, the ℓ2 norm is a surprisingly effective feature, consistently improving performance on all of those benchmarks. The evaluation scripts are available at: https://github.com/LivNLP/L2norm-of-sense-embeddings.

arXiv:2210.14815v1 [cs.CL] 26 Oct 2022
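One way to use the ℓ2 norm as a classifier feature, e.g. for WiC, is to append the squared norms of the two selected sense embeddings to a standard similarity feature. This is only a sketch under assumed inputs (`wic_features` and the exact feature set are ours, not the paper's), showing the shape of such a feature vector rather than the authors' pipeline.

```python
import numpy as np

def wic_features(s1, s2):
    """Build a feature vector for a WiC-style classifier from the two sense
    embeddings s1, s2 selected for the target word in each context.
    Alongside the cosine similarity, we append each embedding's squared
    l2 norm, which carries sense-frequency information."""
    cos = float(np.dot(s1, s2) /
                (np.linalg.norm(s1) * np.linalg.norm(s2)))
    return np.array([cos,
                     float(np.dot(s1, s1)),   # squared l2 norm of sense 1
                     float(np.dot(s2, s2))])  # squared l2 norm of sense 2

f = wic_features(np.array([1.0, 0.0]), np.array([0.6, 0.8]))
print(f.shape)  # -> (3,)
```

Any off-the-shelf classifier (e.g. logistic regression) can then be trained on such vectors; the norm features cost nothing extra to compute since the embeddings are already in hand.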
2 ℓ2 norm vs. Frequency
Let us first revisit the generative model proposed by Arora et al. (2016) for static word embeddings, where the t-th word, v, in a corpus is generated at step t of a random walk of a context vector c_t, which represents what is being talked about. The probability, p(v | c_t), of emitting v at time t is modelled using a log-linear word production model, proportional to \exp(c_t^\top v). If G_v is a word co-occurrence graph, where vertices correspond to the words in the vocabulary, V, the random walker can be seen as visiting the vertices in G_v according to this probability distribution. Arora et al. (2016) showed that the partition function, Z_c, given by (1), for this probabilistic model is a constant Z, independent of the context c.

    Z_c = \sum_{v} \exp(c^\top v)    (1)
Assuming that the stationary distribution of this random walk is uniform over the unit sphere, Arora et al. (2016) proved the relationship in (2) for d-dimensional word embeddings, v \in \mathbb{R}^d.

    \log p(v) = \frac{\|v\|_2^2}{2d} - \log Z    (2)

Let the frequency of v in the corpus be f(v), and the total number of word occurrences be N = \sum_v f(v). Then p(v) can be estimated using corpus counts as f(v)/N. Because N, d, and Z are constants independent of v, (2) implies a linear relationship between \log f(v) and \|v\|_2^2.
To extend this result to sense embeddings, we observe that the word v generated at step t by the above-described random walk can be uniquely associated with a sense id, s_v, corresponding to the meaning of v as used in c_t. If we consider a second sense co-occurrence graph, G_s, where vertices correspond to the sense ids, then the above-mentioned corpus generation process corresponds to a second random walk on G_s, as shown in Figure 1.

Figure 1: Part of the word co-occurrence graph G_v (bottom) shown with the corresponding sense co-occurrence graph G_s (top). Each word in G_v is mapped to its correct sense in G_s.
Although an ambiguous word can be mapped to multiple sense ids across the corpus in different contexts, at any given time step t, a word v is mapped to only one vertex in G_s, determined by the context c_t. Indeed, WSD can be seen as the process of finding such a mapping. The two random walks over the word and sense id graphs are isomorphic and converge to the same set of final states (Bauerschmidt et al., 2021). Therefore, an analogous relationship, given by (3), can be obtained by replacing the word embeddings, v, with sense embeddings, s, in (2).

    \log p(s) = \frac{\|s\|_2^2}{2 d_s} - \log Z'    (3)
Here, d_s is the dimensionality of the sense embeddings, s \in \mathbb{R}^{d_s}. Later, in § 3, we empirically show that the normalisation coefficient, Z' = \sum_s \exp(c^\top s), for sense embeddings also satisfies the self-normalising property (Andreas and Klein, 2015), and is thus independent of c. If we abuse the notation f(s) to also denote the frequency of s in the corpus (i.e. the total number of times the random walker visits the vertex s), it follows from (3) that \log f(s) is linearly related to \|s\|_2^2.
3 Empirical Validation

The theoretical analysis described in § 2 implies a linear relationship between \log f(s) and \|s\|_2^2 for the learnt sense embeddings. To empirically verify this relationship, we learn static sense embeddings using GloVe and SGNS from SemCor, which is the largest corpus manually annotated with WordNet (Miller, 1995) sense ids. Specifically, we consider the co-occurrences of senses instead of