Word Sense Induction with Hierarchical Clustering and Mutual
Information Maximization
Hadi Abdine1, Moussa Kamal Eddine1, Michalis Vazirgiannis1,2, Davide Buscaldi3
1École Polytechnique, 2AUEB, 3Université Sorbonne Paris Nord
Abstract
Word sense induction (WSI) is a difficult problem in natural language processing that involves the unsupervised automatic detection of a word's senses (i.e. meanings). Recent work achieves significant results on the WSI task by pre-training a language model dedicated exclusively to disambiguating word senses, whereas other approaches employ previously pre-trained language models in conjunction with additional strategies to induce senses. In this paper, we propose a novel unsupervised method based on hierarchical clustering and invariant information clustering (IIC). IIC is used to train a small model that maximizes the mutual information between two vector representations of a target word occurring in a pair of synthetic paraphrases. This model is later used in inference mode to extract a higher-quality vector representation for the hierarchical clustering. We evaluate our method on two WSI tasks and in two distinct clustering configurations (fixed and dynamic number of clusters). We empirically demonstrate that, in certain cases, our approach outperforms prior state-of-the-art WSI methods, while in others it achieves competitive performance.
1 Introduction
The automatic identification of a word's senses is an open problem in natural language processing, known as "word sense induction" (WSI). The task of word sense induction is closely related to the task of word sense disambiguation (WSD), which relies on a predefined sense inventory (e.g. WordNet (Fellbaum, 1998; Wallace, 2007; Feinerer and Hornik, 2020)) and aims to resolve a word's ambiguity in context. In WSI, given a target word, we focus on clustering a collection of sentences that use this word according to its senses. For example, Figure 1 shows the different clusters obtained by applying RoBERTa-LARGE (Liu et al., 2019) to 3,000 sentences containing the word bank, collected from Wikipedia. We can see five different clusters whose centroids represent the 2D PCA projection of the average contextual word vectors of the word bank. The clusters are obtained using agglomerative clustering with cosine affinity and average linkage.
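To make this setup concrete, the following is a minimal sketch (not the authors' released code) of the clustering step just described: agglomerative clustering of contextual word vectors with cosine distance and average linkage, followed by a 2D PCA projection of the cluster centroids. The random matrix stands in for real RoBERTa-LARGE vectors of the word bank; note that the cosine parameter is named metric in scikit-learn >= 1.2 (affinity in older versions).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder for the contextual embeddings of the target word in
# 3,000 sentences (RoBERTa-LARGE hidden size is 1024).
X = rng.normal(size=(3000, 1024))

# Five clusters, cosine distance, average linkage (as in Figure 1).
clustering = AgglomerativeClustering(
    n_clusters=5, metric="cosine", linkage="average"
)
labels = clustering.fit_predict(X)

# Average the vectors of each cluster and project the centroids to 2D.
centroids = np.stack([X[labels == k].mean(axis=0) for k in range(5)])
centroids_2d = PCA(n_components=2).fit_transform(centroids)
print(centroids_2d.shape)  # (5, 2)
```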
Word senses are more beneficial than simple word forms for a variety of tasks, including Information Retrieval, Machine Translation and others (Pantel and Lin, 2002). The former are typically represented as a fixed list of definitions from a manually constructed lexical database. However, lexical databases are missing important domain-specific senses. For example, these databases often lack explicit semantic or contextual links between concepts and definitions (Agirre et al., 2009). Hand-crafted lexical databases also frequently fail to convey the precise meaning of a target word in a specific context (Véronis, 2004). To address these issues, WSI intends to learn the various meanings of a given word in an unsupervised manner.
This paper includes the following contributions:
1) We propose a new unsupervised method using contextual word embeddings (i.e. RoBERTa, BERT and DeBERTa (He et al., 2021)) that are updated with more sense-related information by maximizing the mutual information between two instances of the same cluster. To achieve this, we generate a randomly perturbed replicate of the given sentence while preserving its meaning; we thus extract two representations of the same target word in two similar contexts (a sketch of this objective is given after this list). This method presents competitive results on WSI tasks.
2) We apply, for the first time, a method to compute a dynamic number of senses for each word, relying on a recent word polysemy score function (Xypolopoulos et al., 2020).
3) We study the sense information per hidden layer for four different pretrained language models. We share, for all models, the layers with the best performance on sense-related tasks (a second sketch below illustrates layer-wise extraction).
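As a concrete illustration of contribution 1, below is a minimal sketch of the invariant information clustering objective (Ji et al., 2019) used to train the small model: it maximizes the mutual information between the soft cluster assignments of a target word's vector in a sentence and in its synthetic paraphrase. The head architecture, hidden sizes, and number of clusters are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def iic_loss(p: torch.Tensor, p_prime: torch.Tensor, eps: float = 1e-8):
    """Negative mutual information between two batches of soft assignments.

    p, p_prime: (batch, n_clusters) softmax outputs for the two views
    (the original sentence and its perturbed paraphrase).
    """
    # Joint distribution over cluster pairs, symmetrized and normalized.
    joint = p.t() @ p_prime / p.size(0)          # (C, C)
    joint = (joint + joint.t()) / 2
    joint = joint.clamp(min=eps)
    joint = joint / joint.sum()
    # Marginal distributions of each view.
    pi = joint.sum(dim=1, keepdim=True)          # (C, 1)
    pj = joint.sum(dim=0, keepdim=True)          # (1, C)
    # I(z; z') = sum_ij joint * (log joint - log pi - log pj); minimize -I.
    return -(joint * (joint.log() - pi.log() - pj.log())).sum()

# Hypothetical small head mapping contextual word vectors to cluster logits.
head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10))

# Stand-ins for the target word's vectors in the two paraphrases.
v, v_prime = torch.randn(32, 1024), torch.randn(32, 1024)
loss = iic_loss(F.softmax(head(v), dim=1), F.softmax(head(v_prime), dim=1))
loss.backward()
```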
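For contribution 3, the sketch below shows one way (an assumed probing setup, not necessarily the paper's exact pipeline) to extract the per-layer contextual vectors of a target word with the Hugging Face transformers library; any chosen layer can then be fed to the clustering step described above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large", output_hidden_states=True)
model.eval()

sentence = "She sat on the bank of the river."
enc = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.hidden_states: tuple of embeddings + 24 layer outputs, each of shape
# (batch, seq_len, hidden_size). Locate the subword positions of "bank"
# and average them to get one vector per layer for the target word.
target_ids = tokenizer(" bank", add_special_tokens=False)["input_ids"]
positions = [i for i, t in enumerate(enc["input_ids"][0].tolist())
             if t in target_ids]
per_layer = [h[0, positions].mean(dim=0) for h in out.hidden_states]
print(len(per_layer), per_layer[0].shape)  # 25 vectors of size 1024
```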