Word Sense Induction with Hierarchical Clustering and Mutual
Information Maximization
Hadi Abdine1, Moussa Kamal Eddine1, Michalis Vazirgiannis1,2, Davide Buscaldi3
1École Polytechnique, 2AUEB, 3Université Sorbonne Paris Nord
Abstract
Word sense induction (WSI) is a difficult problem in natural language processing that involves the unsupervised automatic detection of a word's senses (i.e. meanings). Recent work achieves significant results on the WSI task by pre-training a language model that can exclusively disambiguate word senses, whereas others employ previously pre-trained language models in conjunction with additional strategies to induce senses. In this paper, we propose a novel unsupervised method based on hierarchical clustering and invariant information clustering (IIC). IIC is used to train a small model to optimize the mutual information between two vector representations of a target word occurring in a pair of synthetic paraphrases. This model is later used in inference mode to extract a higher-quality vector representation to be used in the hierarchical clustering. We evaluate our method on two WSI tasks and in two distinct clustering configurations (fixed and dynamic number of clusters). We empirically demonstrate that, in certain cases, our approach outperforms prior WSI state-of-the-art methods, while in others, it achieves a competitive performance.
1 Introduction
The automatic identification of a word's senses is an open problem in natural language processing, known as "word sense induction" (WSI). The task is closely related to word sense disambiguation (WSD), which relies on a predefined sense inventory (i.e. WordNet (Fellbaum, 1998; Wallace, 2007; Feinerer and Hornik, 2020)) and aims to resolve a word's ambiguity in context. In WSI, given a target word, we focus on clustering a collection of sentences that use this word according to its senses. For example, Figure 1 shows the clusters obtained by applying RoBERTa-LARGE (Liu et al., 2019) to 3,000 sentences containing the word bank, collected from Wikipedia. We can see five different clusters, whose centroids are the 2D PCA projections of the average contextual word vectors of the word bank. The clusters are obtained using agglomerative clustering with cosine affinity and average linkage. Word senses are more beneficial than simple word forms for a variety of tasks, including information retrieval, machine translation and others (Pantel and Lin, 2002). Word senses are typically represented as a fixed list of definitions from a manually constructed lexical database. However, lexical databases miss important domain-specific senses. For example, these databases often lack explicit semantic or contextual links between concepts and definitions (Agirre et al., 2009). Hand-crafted lexical databases also frequently fail to convey the precise meaning of a target word in a specific context (Véronis, 2004). To address these issues, WSI intends to learn the various meanings of a given word in an unsupervised manner.
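To make the setup behind Figure 1 concrete, the following is a minimal sketch (not the paper's exact code) of extracting the contextual vector of a target word with a pretrained RoBERTa model and grouping the vectors with agglomerative clustering using cosine distance and average linkage. The toy sentences, the use of the last hidden layer, and the number of clusters are illustrative assumptions.

```python
# Sketch: contextual target-word vectors + agglomerative clustering (cosine, average linkage).
import torch
from transformers import RobertaTokenizerFast, RobertaModel
from sklearn.cluster import AgglomerativeClustering

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-large")
model = RobertaModel.from_pretrained("roberta-large").eval()

def target_vector(sentence: str, target: str) -> torch.Tensor:
    """Average the last-layer hidden states of the sub-word tokens covering `target`."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]          # character spans per token
    start = sentence.lower().index(target.lower())
    end = start + len(target)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    mask = [(s < end and e > start and e > s) for s, e in offsets.tolist()]
    return hidden[torch.tensor(mask)].mean(dim=0)

sentences = [
    "She sat on the bank of the river.",
    "He deposited the cheque at the bank.",
    "The bank raised its interest rates.",
]
X = torch.stack([target_vector(s, "bank") for s in sentences]).numpy()

# Fixed number of clusters for illustration; older scikit-learn versions use
# `affinity="cosine"` instead of `metric="cosine"`.
clustering = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
print(clustering.fit_predict(X))
```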
This paper includes the following contributions:
1) We propose a new unsupervised method using contextual word embeddings (i.e. RoBERTa, BERT and DeBERTa (He et al., 2021)) that are updated with more sense-related information by maximizing the mutual information between two instances of the same cluster. To achieve this, we generate a randomly perturbed replicate of the given sentence while preserving its meaning, and thus extract two representations of the same target word in two similar contexts (a sketch of this objective follows this list). This method presents competitive results on WSI tasks.
2) We apply, for the first time, a method to compute a dynamic number of senses for each word, relying on a recent word polysemy score function (Xypolopoulos et al., 2020).
3) We study the sense information per hidden layer for four different pretrained language models and report, for all models, the layers with the best performance on sense-related tasks.
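The following sketch illustrates contribution 1: an invariant information clustering (IIC) style objective that maximizes the mutual information between the soft cluster assignments of a target word's vector in a sentence and in its meaning-preserving perturbation. The head architecture, hidden size and number of clusters are assumptions for illustration, not the paper's reported configuration.

```python
# Sketch of an IIC-style mutual information objective over paired target-word vectors.
import torch
import torch.nn as nn

class ClusterHead(nn.Module):
    """Small model mapping a contextual word vector to soft cluster assignments."""
    def __init__(self, dim: int = 1024, n_clusters: int = 7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_clusters), nn.Softmax(dim=-1))
    def forward(self, x):
        return self.net(x)

def iic_loss(p: torch.Tensor, p_prime: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative mutual information between paired cluster assignments.

    p, p_prime: (batch, n_clusters) soft assignments of the target word in a
    sentence and in its meaning-preserving perturbation.
    """
    joint = p.T @ p_prime / p.size(0)          # empirical joint distribution
    joint = (joint + joint.T) / 2              # symmetrize
    joint = joint.clamp(min=eps)
    marg_i = joint.sum(dim=1, keepdim=True)    # marginal over rows
    marg_j = joint.sum(dim=0, keepdim=True)    # marginal over columns
    mi = (joint * (joint.log() - marg_i.log() - marg_j.log())).sum()
    return -mi                                 # minimize the negative MI

# Usage: v and v_prime stand in for target-word vectors from the two paraphrases.
head = ClusterHead()
v, v_prime = torch.randn(32, 1024), torch.randn(32, 1024)
loss = iic_loss(head(v), head(v_prime))
loss.backward()
```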
Figure 1: An illustration of the different sense-based clusters of the word bank, with the most frequent words used in the corresponding contexts. These clusters are obtained using agglomerative clustering on a set of RoBERTa vectors of the word bank extracted from 3000 sentences collected from Wikipedia. The centre of each cluster is the 2D PCA vector of the average 'bank' vectors of the cluster. The size of each point is proportional to the frequency of the corresponding word in the contexts of its sense-based cluster.
2 Related Work
Previous works on WSI use generative statistical models to solve this task. Mainly, they approach it as a topic modeling problem using Latent Dirichlet Allocation (LDA) (Lau et al., 2012; Chang et al., 2014; Goyal and Hovy, 2014; Wang et al., 2015; Komninos and Manandhar, 2016). AutoSense (Amplayo et al., 2018), one of the best-performing recent LDA methods, is based on two principles: first, senses are represented as a distribution over topics; second, the model generates a pair composed of the target word and its neighboring word, thus separating the topic distributions into fine-grained senses based on lexical semantics. AutoSense discards garbage senses by removing topic distributions that do not belong to any instance, and adds new ones according to the generated (target, neighbor) pairs, so the model does not require the number of senses to be fixed in advance. While most WSI methods fix the number of clusters for all words, in our work we explore two setups for the number of clusters, fixed and dynamic. Other works (Song et al., 2016; Corrêa and Amancio, 2018) use the static word embedding Word2Vec (Mikolov et al., 2013) to obtain representations of polysemous words before applying the clustering method.
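As a point of reference for the LDA-based line of work described above, the following is a minimal sketch of the topics-as-senses view: each context of the target word is treated as a document and the learned topics are read as candidate senses. This is a plain LDA illustration, not AutoSense; the contexts, vectorizer settings and topic count are illustrative assumptions.

```python
# Sketch: LDA over target-word contexts, with topics interpreted as candidate senses.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

contexts = [
    "sat on the bank of the river watching the water",
    "deposited money at the bank before the branch closed",
    "the bank raised interest rates on savings accounts",
    "fishing from the muddy bank after the flood",
]
counts = CountVectorizer(stop_words="english").fit_transform(contexts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).argmax(axis=1))   # most probable sense/topic per context
```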
After the emergence of contextual word embeddings, pretrained language models such as ELMo (Peters et al., 2018) (based on BiLSTMs) and BERT (Devlin et al., 2019) (based on the Transformer (Vaswani et al., 2017)) are used with additional techniques to induce the senses of a target word. Amrami and Goldberg (2018) and Amrami and Goldberg (2019) use ELMo and BERT-LARGE, respectively, to predict probable substitutes for the target words. Next, each instance is given k representatives, where each one contains multiple possible substitutes drawn randomly from the word distribution predicted by the language model. Each representative is encoded as a TF-IDF vector. The representatives are then clustered using agglomerative clustering, where the number of clusters is fixed to 7. Finally, each instance is assigned to one or multiple clusters according to the corresponding cluster of each of its representatives.
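A hedged sketch of the substitute-based pipeline summarized above: a masked language model proposes substitutes for the (masked) target word, each instance receives several representatives sampled from that distribution, the representatives are TF-IDF vectorized and clustered into 7 groups, and each instance inherits the clusters of its representatives. The model choice, masking of the target, sampling sizes and toy sentences are simplifying assumptions rather than the original recipe.

```python
# Sketch: substitute-based representatives + TF-IDF + agglomerative clustering.
import torch
from transformers import BertTokenizer, BertForMaskedLM
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-large-uncased").eval()

def substitute_distribution(sentence: str) -> torch.Tensor:
    """Probability distribution over the vocabulary at the [MASK] position."""
    enc = tokenizer(sentence, return_tensors="pt")
    mask_pos = (enc.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_pos]
    return torch.softmax(logits, dim=-1)

sentences = [
    "she sat on the [MASK] of the river",
    "he deposited the cheque at the [MASK]",
    "the [MASK] raised its interest rates",
]
k, n_subs = 5, 4   # representatives per instance, substitutes per representative
reps, owner = [], []
for i, sent in enumerate(sentences):
    dist = substitute_distribution(sent)
    top_p, top_ids = dist.topk(100)            # restrict sampling to likely substitutes
    for _ in range(k):
        picks = top_ids[torch.multinomial(top_p, n_subs, replacement=False)]
        reps.append(" ".join(tokenizer.convert_ids_to_tokens(picks.tolist())))
        owner.append(i)

vectors = TfidfVectorizer().fit_transform(reps)
labels = AgglomerativeClustering(n_clusters=7, metric="cosine",
                                 linkage="average").fit_predict(vectors.toarray())
# Each instance inherits the clusters of its representatives (possibly several).
for i in range(len(sentences)):
    print(i, {int(l) for l, o in zip(labels, owner) if o == i})
```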
Instead of using the word substitutes approach, our work uses the contextual word embeddings extracted from pre-