Lexical Semantics Enhanced Neural Word Embeddings
Dongqiang Yang, Ning Li, Li Zou*, Hongwei Ma*
School of Computer Science and Technology
Shandong Jianzhu University, China
Abstract
Current breakthroughs in natural language processing have benefited dramatically from neural language
models, through which distributional semantics can leverage neural data representations to facilitate
downstream applications. Since neural embeddings use context prediction on word co-occurrences to yield
dense vectors, they are inevitably prone to capture more semantic association than semantic similarity. To
improve vector space models in deriving semantic similarity, we post-process neural word embeddings
through deep metric learning, through which we can inject lexical-semantic relations, including
syn/antonymy and hypo/hypernymy, into a distributional space. We introduce hierarchy-fitting, a novel
semantic specialization approach to modelling semantic similarity nuances inherently stored in the IS-A
hierarchies. Hierarchy-fitting attains state-of-the-art results on the common- and rare-word benchmark
datasets for deriving semantic similarity from neural word embeddings. It also incorporates an asymmetric
distance function to specialize hypernymy's directionality explicitly, through which it significantly improves
vanilla embeddings in multiple evaluation tasks of detecting hypernymy and directionality without negative
impacts on semantic similarity judgement. The results demonstrate the efficacy of hierarchy-fitting in
specializing neural embeddings with semantic relations in late fusion, potentially expanding its applicability
to aggregating heterogeneous data and various knowledge resources for learning multimodal semantic spaces.
1. Introduction
Neural language models employ context-predicting patterns rather than the traditional
context-counting statistics to yield continuous word embeddings for distributional
semantics. Neural word embeddings (NNEs), working either on the character level (Bojanowski et al. 2017) or on the word level, whether static (Mikolov et al. 2013a, Mikolov et al. 2013b) or contextualized (Devlin et al. 2018, Peters et al. 2018), have become a new
paradigm for achieving state-of-the-art performances in the benchmark evaluations such as
GLUE (Wang et al. 2018) and SuperGLUE (Wang et al. 2019a). Notably, in a broad set of
lexical-semantic tasks such as synonym and analogy detection (Baroni et al. 2014), NNEs
have significantly improved distributional semantics compared to the traditional co-
occurrence counting. For example, after linear vector arithmetic on word2vec (Mikolov et
al. 2013a), queen was found distributionally close to the composition result of king − man + woman in a distributional space.
* Co-corresponding author: zouli20|mahongwei@sdjzu.edu.cn.
However, calculating distributional similarity in NNEs usually yields semantic
association or relatedness rather than semantic similarity (Hill et al. 2015), inevitably
caused by sharing co-occurrence patterns in a context window during self-supervised
learning. For example, after calculation of the cosine similarity on word embeddings such
as the word2vec Skip-gram with Negative Sampling (SGNS) (Mikolov et al. 2013a,
Mikolov et al. 2013b), GloVe (Pennington et al. 2014), and fastText (Bojanowski et al.
2017), we find that the most distributionally similar word to man is woman, and vice versa.
In SGNS, queen is one of the top 10 similar words to king, and vice versa; in GloVe and
fastText, king is one of the top 10 similar words to queen. Although man vs woman and king vs queen are antonym pairs, each pair is scored as highly similar in the embeddings.
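Such neighbourhood behaviour is easy to inspect; the sketch below is a minimal check, assuming gensim is installed and a locally downloaded copy of the word2vec GoogleNews binary (the file name here is illustrative, not prescriptive), that lists the top-10 cosine neighbours of a query word.

```python
# Hypothetical check of cosine nearest neighbours in pre-trained embeddings.
# Assumes gensim and a local copy of the GoogleNews word2vec binary at the
# path below (the path is an illustrative assumption).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

for query in ("man", "king", "queen"):
    # most_similar ranks the vocabulary by cosine similarity to the query word
    neighbours = vectors.most_similar(query, topn=10)
    print(query, "->", [word for word, score in neighbours])
```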
Semantic relatedness contains various semantic relationships, whereas semantic similarity
usually manifests lexical entailment or the IS-A relationship. As hand-crafted knowledge
bases (KBs) such as WordNet (Miller 1995, Fellbaum 1998) and BabelNet (Navigli and
Ponzetto 2012) mainly consist of IS-A taxonomies, along with synonymy and antonymy,
they are often used for computing semantic similarity (Pedersen et al. 2004, Yang and Yin
2021). Distributional semantics needs to fuse semantic relations in KBs to enhance the
semantic content in NNEs, which is necessary for improving the generalization of neural
language models.
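For comparison, taxonomy-based measures of the kind implemented by Pedersen et al. (2004) score word pairs through the IS-A hierarchy rather than through co-occurrence; the sketch below is an illustrative check with NLTK's WordNet interface, where the chosen sense identifiers (e.g. king.n.01) are assumptions made only for this example.

```python
# Illustrative IS-A based similarity scores from WordNet via NLTK.
# Assumes nltk is installed and its 'wordnet' corpus has been downloaded;
# the sense identifiers below are assumptions for the sake of the example.
from nltk.corpus import wordnet as wn

king, queen, man, woman = (wn.synset(s) for s in
                           ("king.n.01", "queen.n.02", "man.n.01", "woman.n.01"))

# Path similarity scores a pair by the shortest IS-A path connecting the senses,
# so taxonomic neighbours score higher than merely associated words.
print(king.path_similarity(queen))
print(man.path_similarity(woman))
print(king.wup_similarity(queen))   # Wu-Palmer similarity, depth-weighted
```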
Current studies often employ joint-training and post-processing methods to harvest word usage knowledge from distributional semantics and human-curated concept
relations from KBs. Most joint-training methods directly impose semantic constraints on
their loss functions while jointly optimizing the weighting parameters of neural language
models (Yu and Dredze 2014, Nguyen et al. 2017, Alsuhaibani et al. 2018). Another way
of joint training is to revise the architecture of neural networks either through training
Graph Convolutional Networks with syntactic dependencies and semantic relationships
(Vashishth et al. 2019) or by introducing attention mechanisms (Yang and Mitchell 2017,
Peters et al. 2019). Joint training can tailor NNEs to specific needs of applications, albeit
with an excessive training workload in early fusion. In contrast, the post-processing
methods such as retrofitting (Faruqui et al. 2015), counter-fitting (Mrkšić et al. 2016) and
LEAR (Vulic and Mrkšić 2018) can avoid such burdensome training processes,
semantically specializing NNEs via optimizing a distance metric in late fusion.
Semantically enhanced NNEs can facilitate downstream applications, e.g. lexical
entailment detection (Nguyen et al. 2017, Vulic and Mrkšić 2018), sentiment analysis
(Faruqui et al. 2015, Arora et al. 2020), and dialogue state tracking (Mrkšić et al. 2016,
Mrkšić et al. 2017).
Inspired by previous works (Faruqui et al. 2015, Mrkšić et al. 2016, Vulic and Mrkšić
2018) on semantically specializing NNEs in late fusion, we investigate how to post-process
NNEs through merging symmetric syn/antonymy and asymmetric hypo/hypernymy. We
seek to leverage the IS-A hierarchies' multi-level semantic constraints to augment
distributional semantics. By learning distance metrics in a distributional space, we can
effectively inject lexical-semantic information into NNEs, pulling similar words closer and
pushing dissimilar words further. Consistent results on lexical-semantic tasks show that
our novel specialization method can significantly improve distributional semantics in
deriving semantic similarity and detecting hypernymy and its directionality.
This paper is organized as follows: Section 2 introduces deep metric learning and
examines typical post-processing approaches to injecting semantic relations into neural
word embeddings; Section 3 describes hierarchy-fitting, our new late fusion methodology
of specializing a distributional space under different semantic constraints; Section 4
outlines our experiments on evaluating hierarchy-fitting and other popular post-processing approaches in calculating distributional semantics; Sections 5 and 6 investigate the efficacy
of hierarchy-fitting in refining neural word embeddings through deriving semantic
similarity and recognizing hypernymy and its directionality on the benchmark datasets,
respectively; Section 7 concludes with several observations and future work.
2. Metric learning
The self-supervised training objective of neural language models (NLMs) is to maximize
the prediction probability of a token given an input of its context, where cross-entropy is
often employed as a cost function for backpropagation to produce NNEs, e.g. word2vec
(Mikolov et al. 2013a, Mikolov et al. 2013b) in a simple feedforward network and BERT
(Devlin et al. 2018) in a deep transformer network. The joint-training approaches to
semantic specialization can directly refine the original training objective with hand-crafted
relations (Fried and Duh 2014, Yu and Dredze 2014, Nguyen et al. 2017, Alsuhaibani et al.
2018). To impose semantic constraints on generating neural embeddings, they can also
modify the attention mechanisms in recurrent neural networks (Yang and Mitchell 2017)
and transformers (Peters et al. 2019). Since the joint-training approaches often produce
task-specific NNEs, which are computationally demanding when learning from scratch
with massive corpora, we only investigate post-processing approaches that can work on
any distributional space.
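For orientation, the toy sketch below (not the authors' setup) shows the context-prediction objective in miniature: a centre-word embedding is trained with cross-entropy to predict an observed context word. The vocabulary size, dimensionality, and random batches are arbitrary assumptions.

```python
# Toy context-prediction objective in the word2vec spirit: given a centre
# word, predict a context word with a softmax and train by cross-entropy.
import torch
import torch.nn as nn

vocab_size, dim = 10_000, 300
embed = nn.Embedding(vocab_size, dim)            # centre-word embeddings
output = nn.Linear(dim, vocab_size, bias=False)  # context-prediction layer
loss_fn = nn.CrossEntropyLoss()

centre = torch.randint(0, vocab_size, (64,))     # a batch of centre-word ids
context = torch.randint(0, vocab_size, (64,))    # their observed context ids

logits = output(embed(centre))                   # scores over the vocabulary
loss = loss_fn(logits, context)                  # cross-entropy on prediction
loss.backward()                                  # gradients update the embeddings
```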
As for semantic specialization of pre-trained NNEs, instead of cross-entropy loss, ranking loss in deep metric learning (Kaya and Bilge 2019) is often used to learn a
Euclidean distance in a latent space under the constraints of semantic relations in KBs.
Deep metric learning has broad applications from computer vision (Schroff et al. 2015, Lu
et al. 2017) to natural language processing (Mueller and Thyagarajan 2016, Ein Dor et al.
2018, Zhu et al. 2018) to audio speech processing (Narayanaswamy et al. 2019, Wang et
al. 2019b). Given two tokens $x_i$ and $x_j$ in the original vector space of NNEs with a weighting function $f_\theta$, metric learning constructs a distance-based loss function $\mathcal{L}$ to yield the augmented embeddings through the distance $D\big(f_\theta(x_i), f_\theta(x_j)\big)$. With the help of KBs that specify the relationship between $x_i$ and $x_j$ using a similar ($y=1$) or dissimilar ($y=0$) tag $y$, metric learning continuously updates $f_\theta$ to pull similar tokens closer or push dissimilar ones farther, until $D\big(f_\theta(x_i), f_\theta(x_j)\big)$ finally arrives at its minimum for similar tokens and its maximum for dissimilar ones.
In deep metric learning, data sampling for computing ranking loss, either in Siamese
(Bromley et al. 1993) or Triplet (Hoffer and Ailon 2015) networks, plays a crucial role in
specializing neural embeddings. Correspondingly, contrastive or pairwise loss (Chopra et
al. 2005) and triplet loss (Schroff et al. 2015) are two popular cost functions in metric
learning, followed by many of their variants, such as Quadruple Loss (Ni et al. 2017) and
N-Pair Loss (Sohn 2016).
2.1 Contrastive loss
Contrastive or pairwise loss (Chopra et al. 2005) was first used for face recognition on the
hypothesis that faces of the same person should be positioned at a smaller distance in a Euclidean space and faces of different people at a larger one. It can be applied in post-processing NNEs as follows:

$$\mathcal{L}_{con} = y \, D\big(f_\theta(x_i), f_\theta(x_j)\big)^2 + (1-y)\,\max\big(0,\; m - D\big(f_\theta(x_i), f_\theta(x_j)\big)\big)^2$$

Here, for the similar tokens $x_i$ and $x_j$ with the tag $y=1$, contrastive loss regards them as a positive sample and seeks to decrease their distance $D\big(f_\theta(x_i), f_\theta(x_j)\big)$; for the dissimilar tokens with $y=0$, it treats them as a negative sample and imposes a distance margin $m$ to regularize $D$. That is to say, if $D\big(f_\theta(x_i), f_\theta(x_j)\big) \geq m$, no backpropagation is needed; otherwise, metric learning has to increase their distance. Contrastive loss only works on two token inputs when computing the loss each time.
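A minimal PyTorch sketch of this pairwise objective is given below; the margin value, the toy batch, and the squared-distance form are illustrative choices rather than fixed prescriptions.

```python
# Contrastive (pairwise) loss over token embeddings, in the spirit of
# Chopra et al. (2005); the margin value here is an illustrative assumption.
import torch

def contrastive_loss(x_i, x_j, y, margin=1.0):
    """y = 1 for similar pairs, y = 0 for dissimilar pairs."""
    d = torch.norm(x_i - x_j, dim=-1)                  # Euclidean distance
    positive = y * d.pow(2)                            # pull similar pairs together
    negative = (1 - y) * torch.clamp(margin - d, min=0).pow(2)  # push apart up to the margin
    return (positive + negative).mean()

# Example: one similar and one dissimilar pair of 300-d embeddings (toy data).
x_i = torch.randn(2, 300, requires_grad=True)
x_j = torch.randn(2, 300)
y = torch.tensor([1.0, 0.0])
contrastive_loss(x_i, x_j, y).backward()
```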
2.2 Triplet loss
Triplet loss (Schroff et al. 2015) simultaneously takes three inputs in computing ranking loss, which can be defined as follows:

$$\mathcal{L}_{tri} = \max\big(0,\; D\big(f_\theta(x_a), f_\theta(x_p)\big) - D\big(f_\theta(x_a), f_\theta(x_n)\big) + m\big)$$

For an anchor token $x_a$, $x_p$ and $x_n$ denote its positive and negative samples in a triplet input, respectively. Here, $m$ works as a margin gap to distinguish an easy negative sample from a hard negative one (Kaya and Bilge 2019), and it also serves as a distance boundary between $D\big(f_\theta(x_a), f_\theta(x_p)\big)$ and $D\big(f_\theta(x_a), f_\theta(x_n)\big)$ when selecting a triplet in metric learning. If $D\big(f_\theta(x_a), f_\theta(x_n)\big) > D\big(f_\theta(x_a), f_\theta(x_p)\big) + m$, $x_n$ is an easy negative sample: no loss is generated, and it is not necessary to push $x_n$ farther from $x_a$. If $D\big(f_\theta(x_a), f_\theta(x_n)\big) < D\big(f_\theta(x_a), f_\theta(x_p)\big)$, $x_n$ is a hard negative sample, as $x_n$ is distributionally closer to $x_a$ than $x_p$, indicating that backpropagation is needed to update $f_\theta$. Any negative token whose distance to $x_a$ lies between $D\big(f_\theta(x_a), f_\theta(x_p)\big)$ and $D\big(f_\theta(x_a), f_\theta(x_p)\big) + m$ is categorized as semi-hard and is pushed away.
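The sketch below implements the triplet objective and labels a negative as easy, hard, or semi-hard relative to the margin, mirroring the case analysis above; the margin value and dimensionality are illustrative assumptions.

```python
# Triplet loss (Schroff et al. 2015) with a rough easy/semi-hard/hard
# categorization of negatives; margin and dimensions are illustrative.
import torch

def triplet_loss(anchor, positive, negative, margin=0.5):
    d_pos = torch.norm(anchor - positive, dim=-1)
    d_neg = torch.norm(anchor - negative, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

def negative_kind(anchor, positive, negative, margin=0.5):
    d_pos = torch.norm(anchor - positive, dim=-1)
    d_neg = torch.norm(anchor - negative, dim=-1)
    if d_neg > d_pos + margin:
        return "easy"        # already far enough: no loss, no update
    if d_neg < d_pos:
        return "hard"        # negative is closer than the positive
    return "semi-hard"       # within the margin band: still pushed away

a, p, n = (torch.randn(300) for _ in range(3))
print(negative_kind(a, p, n), triplet_loss(a, p, n).item())
```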
In specializing NNEs with prior knowledge, most methods use contrastive loss and
triplet loss with different negative-sample selection policies, among which we list some
typical ones in the following sections.
2.3 Retrofitting
Faruqui et al. (2015) proposed to retrofit word embeddings with semantic lexicons,
including PPDB (Ganitkevitch et al. 2013), WordNet (Miller 1995, Fellbaum 1998), and
FrameNet (Baker et al. 1998). A positive pair $(x_i, x_j)$ should bear a corresponding semantic relationship extracted from the lexicons, including lexical paraphrasing in PPDB, synonymy and hypo/hypernymy in WordNet, along with word associations in FrameNet. These relations were organized into different graphs, in which word embeddings can be altered through belief propagation. Let $\hat{q}_i$ denote the pre-trained embedding of word $x_i$, $q_i$ its retrofitted counterpart, and $E$ the set of relation edges. The loss function for retrofitting can be articulated as:

$$\Psi(Q) = \sum_{i=1}^{n}\Big[\alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j)\in E}\beta_{ij}\,\lVert q_i - q_j \rVert^2\Big]$$

where $\alpha_i$, often set to 1, controls how strongly $q_i$ is anchored to its pre-trained vector $\hat{q}_i$, and $\beta_{ij}$, equal to the inverse degree of node $i$, is another regularizing factor for $q_j$ in propagation. Since $\Psi(Q)$ is a convex function, setting its derivative with respect to $q_i$ to zero yields the iterative update:

$$q_i = \frac{\sum_{j:(i,j)\in E}\beta_{ij}\, q_j + \alpha_i \hat{q}_i}{\sum_{j:(i,j)\in E}\beta_{ij} + \alpha_i}$$
Retrofitting works similarly to contrastive loss. Although it extracts multiple positive samples for each word, retrofitting only pulls similar or related tokens closer. Srinivasan et al. (2019) adapted the retrofitting method by introducing a WordNet-based similarity score to better account for the closeness between a word and its neighbours located within a 2-link distance in an IS-A hierarchy, and achieved competitive results in intrinsic and
extrinsic evaluations.
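A compact sketch of the closed-form retrofitting update above, iterated over a toy lexicon graph, is given below; the vocabulary, neighbour lists, number of sweeps, and the value of $\alpha$ are illustrative assumptions rather than the published configuration.

```python
# Iterative retrofitting update in the spirit of Faruqui et al. (2015):
# each word vector is moved toward the average of its lexicon neighbours
# while staying anchored to its pre-trained vector. Toy data throughout.
import numpy as np

pretrained = {w: np.random.randn(300) for w in ("king", "queen", "monarch", "ruler")}
neighbours = {"king": ["monarch", "ruler"], "queen": ["monarch"],
              "monarch": ["king", "queen", "ruler"], "ruler": ["king", "monarch"]}

retrofitted = {w: v.copy() for w, v in pretrained.items()}
alpha = 1.0                                   # strength of the pre-trained anchor
for _ in range(10):                           # a few sweeps usually suffice
    for w, nbrs in neighbours.items():
        if not nbrs:
            continue
        beta = 1.0 / len(nbrs)                # degree-normalised neighbour weight
        num = alpha * pretrained[w] + beta * sum(retrofitted[n] for n in nbrs)
        retrofitted[w] = num / (alpha + beta * len(nbrs))
```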
2.4 Counter-fitting
Inspired by retrofitting, Mrkšić et al. (2016) incorporated synonymy and antonymy in
semantically enhancing word embeddings. They linearly assembled the loss functions from
different semantic constraints while preserving distributional semantics, which are:
1. Synonymy: $\mathcal{L}_{syn} = \sum_{(x_i, x_j)\in S}\max\big(0,\; D(x_i, x_j) - \gamma_s\big)$
2. Antonymy: $\mathcal{L}_{ant} = \sum_{(x_i, x_j)\in A}\max\big(0,\; \gamma_a - D(x_i, x_j)\big)$
3. Distributional semantics: $\mathcal{L}_{vsp} = \sum_{x_i}\sum_{x_j \in N(x_i)}\max\big(0,\; D(x_i, x_j) - \hat{D}(x_i, x_j)\big)$

Here $S$ and $A$ denote the sets of synonym and antonym pairs, $D(x_i, x_j)$ is the distance between the specialized embeddings of $x_i$ and $x_j$, and $\hat{D}(x_i, x_j)$ is their distance in the original space. Mrkšić et al. (2016) set up different loss functions for synonymy and antonymy and sequentially specialized NNEs. $\gamma_s$ and $\gamma_a$ are margins for synonymy and antonymy, respectively. Besides semantic specialization of NNEs, they also preserved distributional semantics using $\mathcal{L}_{vsp}$, where $x_j$ is one of the top distributionally similar words to $x_i$ in the original space. In place of the whole vocabulary, $x_j$ also acts as a pseudo-negative word in $\mathcal{L}_{vsp}$ for efficient backpropagation. Note that the Euclidean distance is often replaced here with the cosine distance. Counter-fitting bears a close resemblance to contrastive loss, as the synonymy and antonymy constraints specialize NNEs in opposite directions.
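The three counter-fitting terms can be sketched as below with cosine distance; the margin values and the way pairs are supplied are illustrative assumptions rather than the published hyper-parameters.

```python
# The three counter-fitting terms of Mrkšić et al. (2016), sketched with
# cosine distance; margins and the pair lists are illustrative assumptions.
import torch
import torch.nn.functional as F

def cos_dist(a, b):
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def synonym_attract(pairs, gamma_s=0.0):
    # pull each synonym pair within the margin gamma_s
    return sum(torch.clamp(cos_dist(u, v) - gamma_s, min=0) for u, v in pairs)

def antonym_repel(pairs, gamma_a=1.0):
    # push each antonym pair at least gamma_a apart
    return sum(torch.clamp(gamma_a - cos_dist(u, v), min=0) for u, v in pairs)

def preserve_space(neighbour_pairs):
    # penalize a neighbour drifting farther apart than in the original space
    return sum(torch.clamp(cos_dist(u_new, v_new) - cos_dist(u_old, v_old), min=0)
               for (u_new, v_new, u_old, v_old) in neighbour_pairs)

u, v = torch.randn(300), torch.randn(300)      # toy embeddings
loss = synonym_attract([(u, v)]) + antonym_repel([(u, v)])
```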
2.5 ATTRACT-REPEL
Mrkšić et al. (2017) further improved counter-fitting with semantic constraints from mono-
and cross-lingual resources. They used triplet loss rather than contrastive loss to refine a
distributional space, i.e. attracting synonyms and repelling antonyms, therefore termed
ATTRACT-REPEL. The loss functions in ATTRACT-REPEL are listed as follows:
1. Synonymy (ATTRACT): $\mathcal{L}_{att} = \sum_{(x_i, x_j)\in S}\Big[\max\big(0,\; \delta_{att} + D(x_i, x_j) - D(x_i, t_i)\big) + \max\big(0,\; \delta_{att} + D(x_i, x_j) - D(x_j, t_j)\big)\Big]$

where $t_i$ and $t_j$ are negative examples drawn from the same mini-batch for $x_i$ and $x_j$, and $\delta_{att}$ is the attract margin.