On the Curious Case of ℓ2 norm of Sense Embeddings
Yi Zhou
University of Liverpool
y.zhou71@liverpool.ac.uk
Danushka Bollegala
University of Liverpool, Amazon
danushka@liverpool.ac.uk
Abstract
We show that the ℓ2 norm of a static sense embedding encodes information related to the frequency of that sense in the training corpus used to learn the sense embeddings. This finding can be seen as an extension of a previously known relationship for word embeddings to sense embeddings. Our experimental results show that, in spite of its simplicity, the ℓ2 norm of sense embeddings is a surprisingly effective feature for several word sense related tasks such as (a) most frequent sense prediction, (b) Word-in-Context (WiC), and (c) Word Sense Disambiguation (WSD). In particular, by simply including the ℓ2 norm of a sense embedding as a feature in a classifier, we show that we can improve WiC and WSD methods that use static sense embeddings.
1 Introduction
Background: Given a text corpus, static word embedding learning methods (Pennington et al., 2014; Mikolov et al., 2013a, etc.) learn a single vector (aka embedding) to represent the meaning of a word in the corpus. In contrast, static sense embedding learning methods (Loureiro and Jorge, 2019a; Scarlini et al., 2020b, etc.) learn multiple embeddings for each word, corresponding to the different senses of that word.
Arora et al. (2016) proposed a random walk model on the word co-occurrence graph and showed that if word embeddings are uniformly distributed over the unit sphere, the log-frequency of a word in a corpus is proportional to the squared ℓ2 norm of the static word embedding learnt from that corpus.
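Stated schematically in our own notation (the symbols below are ours, not reproduced from Arora et al. (2016)), for a word $w$ with static embedding $\mathbf{v}_w$ this relationship takes the form
\[
\log p(w) \;\propto\; \lVert \mathbf{v}_w \rVert_2^2 ,
\]
where $p(w)$ is the frequency of $w$ in the corpus; in their model the constant of proportionality involves the embedding dimensionality and a corpus-level normalisation term.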
Hashimoto et al. (2016) showed that under a simple metric random walk over words, where the probability of transitioning from one word to another depends only on the squared Euclidean distance between their embeddings, the log-frequency of word co-occurrences between two words converges to the negative squared Euclidean distance measured between the corresponding word embeddings.
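Again in our own notation, this result can be summarised as
\[
\log p(w, w') \;\approx\; -\lVert \mathbf{v}_w - \mathbf{v}_{w'} \rVert_2^2 + \mathrm{const.},
\]
where $p(w, w')$ is the co-occurrence frequency of words $w$ and $w'$, with the approximation becoming exact in the limit considered by Hashimoto et al. (2016).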
Mu and Viswanath (2018) later showed that word embeddings are distributed in a narrow cone and hence do not satisfy the uniformity assumption made by Arora et al. (2016); however, the result still holds even for such anisotropic embeddings.
On the other hand, Arora et al. (2018) showed that a word embedding can be represented as a linearly-weighted combination of the embeddings of its senses.
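In symbols (ours, not the original paper's), for an ambiguous word $w$ with $k$ senses,
\[
\mathbf{v}_w \;\approx\; \sum_{i=1}^{k} \alpha_i \, \mathbf{s}_{w,i},
\]
where $\mathbf{s}_{w,i}$ is the embedding of the $i$-th sense of $w$ and the $\alpha_i$ are nonnegative weights, which in their analysis are related to the relative frequencies of the senses.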
However, to the best of our knowledge, it remains unknown thus far what the relationship is between sense embeddings and the frequency of a sense; this is the central question that we study in this paper.
Contributions: First, by extending the prior results for word embeddings to sense embeddings, we show that the squared ℓ2 norm of a static sense embedding is proportional to the log-frequency of the corresponding sense in the training corpus.
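In our notation, writing $\mathbf{s}$ for the embedding of a sense $s$ and $f(s)$ for the frequency of that sense in the training corpus, the claimed relationship is
\[
\lVert \mathbf{s} \rVert_2^2 \;\propto\; \log f(s).
\]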
This finding has important practical implications. For example, it is known that assigning every occurrence of an ambiguous word in a corpus to the most frequent sense of that word (popularly known as the Most Frequent Sense (MFS) baseline) is a surprisingly strong baseline for WSD (McCarthy et al., 2004, 2007). Therefore, the theoretical relationship which we prove implies that we should be able to use the ℓ2 norm to predict the MFS of a word.
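To make this concrete, the following is a minimal sketch (our illustration, not the authors' code) of how the MFS of a word could be predicted from pretrained static sense embeddings; the sense IDs and vectors are hypothetical toy values.

```python
import numpy as np

def predict_mfs(sense_embeddings: dict) -> str:
    """Predict the most frequent sense as the one whose static sense
    embedding has the largest squared l2 norm."""
    return max(sense_embeddings,
               key=lambda s: float(np.dot(sense_embeddings[s],
                                          sense_embeddings[s])))

# Toy example with two hypothetical senses of "bank":
senses = {
    "bank_river":   np.array([0.9, 1.2, -0.3]),  # made-up vector
    "bank_finance": np.array([1.8, -2.1, 0.7]),  # made-up vector
}
print(predict_mfs(senses))  # -> "bank_finance"
```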
Second, we conduct a series of experiments to empirically validate the above-mentioned relationship. We find that the relationship holds for different types of static sense embeddings learnt using methods such as GloVe (Pennington et al., 2014) and skip-gram with negative sampling (SGNS; Mikolov et al., 2013b) on SemCor (Miller et al., 1993).
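One simple way to test such a relationship (our sketch of a plausible protocol, not necessarily the paper's exact one) is to measure the correlation between log sense frequency and squared embedding norm:

```python
import numpy as np
from scipy.stats import pearsonr

def norm_frequency_correlation(freqs: dict, embeddings: dict):
    """Pearson correlation between log sense frequency and the squared
    l2 norm of the corresponding sense embedding.

    freqs:      sense ID -> frequency in the training corpus (> 0)
    embeddings: sense ID -> sense embedding (np.ndarray)
    """
    senses = sorted(freqs)
    log_freq = np.log([freqs[s] for s in senses])
    sq_norm = np.array([float(np.sum(embeddings[s] ** 2)) for s in senses])
    return pearsonr(log_freq, sq_norm)  # (correlation, p-value)
```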
Third, motivated by our finding that the ℓ2 norm of pretrained static sense embeddings encodes sense-frequency related information, we use the ℓ2 norm of sense embeddings as a feature for several sense-related tasks.
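For instance, for a task such as WiC, the norms can simply be appended to whatever features a classifier already uses; the following sketch (our illustration, with hypothetical inputs) shows one way to do so:

```python
import numpy as np

def wic_features(s1: np.ndarray, s2: np.ndarray) -> np.ndarray:
    """Feature vector for a pair of sense embeddings, augmented with
    their squared l2 norms as two extra dimensions."""
    return np.concatenate([
        s1, s2,                                          # the embeddings
        [float(np.dot(s1, s2))],                         # inner product
        [float(np.dot(s1, s1)), float(np.dot(s2, s2))],  # squared l2 norms
    ])
```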