Lexical Semantics Enhanced Neural Word Embeddings
Dongqiang Yang, Ning Li, Li Zou*, Hongwei Ma*
School of Computer Science and Technology
Shandong Jianzhu University, China
Abstract
Current breakthroughs in natural language processing have benefited dramatically from neural language
models, through which distributional semantics can leverage neural data representations to facilitate
downstream applications. Since neural embeddings use context prediction on word co-occurrences to yield
dense vectors, they are inevitably prone to capture more semantic association than semantic similarity. To
improve vector space models in deriving semantic similarity, we post-process neural word embeddings
through deep metric learning, through which we can inject lexical-semantic relations, including
syn/antonymy and hypo/hypernymy, into a distributional space. We introduce hierarchy-fitting, a novel
semantic specialization approach to modelling semantic similarity nuances inherently stored in the IS-A
hierarchies. Hierarchy-fitting attains state-of-the-art results on the common- and rare-word benchmark
datasets for deriving semantic similarity from neural word embeddings. It also incorporates an asymmetric
distance function to specialize hypernymy's directionality explicitly, through which it significantly improves
vanilla embeddings in multiple evaluation tasks of detecting hypernymy and directionality without negative
impacts on semantic similarity judgement. The results demonstrate the efficacy of hierarchy-fitting in
specializing neural embeddings with semantic relations in late fusion, potentially expanding its applicability
to aggregating heterogeneous data and various knowledge resources for learning multimodal semantic spaces.
1. Introduction
Neural language models employ context-predicting patterns rather than the traditional
context-counting statistics to yield continuous word embeddings for distributional
semantics. Neural word embeddings (NNEs), working either on the character level (Bojanowski et al. 2017) or on the word level, whether static (Mikolov et al. 2013a, Mikolov et al. 2013b) or contextualized (Devlin et al. 2018, Peters et al. 2018), have become a new
paradigm for achieving state-of-the-art performances in the benchmark evaluations such as
GLUE (Wang et al. 2018) and SuperGLUE (Wang et al. 2019a). Notably, in a broad set of
lexical-semantic tasks such as synonym and analogy detection (Baroni et al. 2014), NNEs
have significantly improved distributional semantics compared to the traditional co-
occurrence counting. For example, after linear vector arithmetic on word2vec (Mikolov et
al. 2013a), queen was found distributionally close to the composition result of king − man + woman in a distributional space.
* Co-corresponding author: zouli20|mahongwei@sdjzu.edu.cn.
However, calculating distributional similarity in NNEs usually yields semantic
association or relatedness rather than semantic similarity (Hill et al. 2015), inevitably
caused by sharing co-occurrence patterns in a context window during self-supervised
learning. For example, after calculation of the cosine similarity on word embeddings such
as the word2vec Skip-gram with Negative Sampling (SGNS) (Mikolov et al. 2013a,
Mikolov et al. 2013b), GloVe (Pennington et al. 2014), and fastText (Bojanowski et al.
2017), we find that the most distributionally similar word to man is woman, and vice versa.
In SGNS, queen is one of the top 10 similar words to king, and vice versa; in GloVe and
fastText, king is one of the top 10 similar words to queen. Although man vs woman and king vs queen are antonym pairs, each pair is scored as highly similar in the embeddings.
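Such neighbourhood behaviour is easy to inspect; the sketch below is a minimal check, assuming gensim is installed and a locally downloaded copy of the word2vec GoogleNews binary (the file name here is illustrative, not prescriptive), that lists the top-10 cosine neighbours of a query word.

```python
# Hypothetical check of cosine nearest neighbours in pre-trained embeddings.
# Assumes gensim and a local copy of the GoogleNews word2vec binary at the
# path below (the path is an illustrative assumption).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

for query in ("man", "king", "queen"):
    # most_similar ranks the vocabulary by cosine similarity to the query word
    neighbours = vectors.most_similar(query, topn=10)
    print(query, "->", [word for word, score in neighbours])
```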
Semantic relatedness contains various semantic relationships, whereas semantic similarity
usually manifests lexical entailment or the IS-A relationship. As hand-crafted knowledge
bases (KBs) such as WordNet (Miller 1995, Fellbaum 1998) and BabelNet (Navigli and
Ponzetto 2012) mainly consist of IS-A taxonomies, along with synonymy and antonymy,
they are often used for computing semantic similarity (Pedersen et al. 2004, Yang and Yin
2021). Distributional semantics needs to fuse semantic relations in KBs to enhance the
semantic content in NNEs, which is necessary for improving the generalization of neural
language models.
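For comparison, taxonomy-based measures of the kind implemented by Pedersen et al. (2004) score word pairs through the IS-A hierarchy rather than through co-occurrence; the sketch below is an illustrative check with NLTK's WordNet interface, where the chosen sense identifiers (e.g. king.n.01) are assumptions made only for this example.

```python
# Illustrative IS-A based similarity scores from WordNet via NLTK.
# Assumes nltk is installed and its 'wordnet' corpus has been downloaded;
# the sense identifiers below are assumptions for the sake of the example.
from nltk.corpus import wordnet as wn

king, queen, man, woman = (wn.synset(s) for s in
                           ("king.n.01", "queen.n.02", "man.n.01", "woman.n.01"))

# Path similarity scores a pair by the shortest IS-A path connecting the senses,
# so taxonomic neighbours score higher than merely associated words.
print(king.path_similarity(queen))
print(man.path_similarity(woman))
print(king.wup_similarity(queen))   # Wu-Palmer similarity, depth-weighted
```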
Current studies often employ joint-training and post-processing methods to harvest word usage knowledge from distributional semantics and human-curated concept
relations from KBs. Most joint-training methods directly impose semantic constraints on
their loss functions while jointly optimizing the weighting parameters of neural language
models (Yu and Dredze 2014, Nguyen et al. 2017, Alsuhaibani et al. 2018). Another way
of joint training is to revise the architecture of neural networks either through training
Graph Convolutional Networks with syntactic dependencies and semantic relationships
(Vashishth et al. 2019) or by introducing attention mechanisms (Yang and Mitchell 2017,
Peters et al. 2019). Joint training can tailor NNEs to specific needs of applications, albeit
with an excessive training workload in early fusion. In contrast, the post-processing
methods such as retrofitting (Faruqui et al. 2015), counter-fitting (Mrkšić et al. 2016) and
LEAR (Vulic and Mrkšić 2018) can avoid such burdensome training processes,
semantically specializing NNEs via optimizing a distance metric in late fusion.
Semantically enhanced NNEs can facilitate downstream applications, e.g. lexical
entailment detection (Nguyen et al. 2017, Vulic and Mrkšić 2018), sentiment analysis
(Faruqui et al. 2015, Arora et al. 2020), and dialogue state tracking (Mrkšić et al. 2016,
Mrkšić et al. 2017).
Inspired by previous works (Faruqui et al. 2015, Mrkšić et al. 2016, Vulic and Mrkšić
2018) on semantically specializing NNEs in late fusion, we investigate how to post-process
NNEs through merging symmetric syn/antonymy and asymmetric hypo/hypernymy. We
seek to leverage the IS-A hierarchies' multi-level semantic constraints to augment
distributional semantics. By learning distance metrics in a distributional space, we can
effectively inject lexical-semantic information into NNEs, pulling similar words closer and
pushing dissimilar words further. Consistent results on lexical-semantic tasks show that
our novel specialization method can significantly improve distributional semantics in
deriving semantic similarity and detecting hypernymy and its directionality.
This paper is organized as follows: Section 2 introduces deep metric learning and
examines typical post-processing approaches to injecting semantic relations into neural
word embeddings; Section 3 describes hierarchy-fitting, our new late fusion methodology
of specializing a distributional space under different semantic constraints; Section 4
outlines our experiments on evaluating hierarchy-fitting and other popular post-processing approaches in calculating distributional semantics; Sections 5 and 6 investigate the efficacy
of hierarchy-fitting in refining neural word embeddings through deriving semantic
similarity and recognizing hypernymy and its directionality on the benchmark datasets,
respectively; Section 7 concludes with several observations and future work.
2. Metric learning
The self-supervised training objective of neural language models (NLMs) is to maximize
the prediction probability of a token given an input of its context, where cross-entropy is
often employed as a cost function for backpropagation to produce NNEs, e.g. word2vec
(Mikolov et al. 2013a, Mikolov et al. 2013b) in a simple feedforward network and BERT
(Devlin et al. 2018) in a deep transformer network. The joint-training approaches to
semantic specialization can directly refine the original training objective with hand-crafted
relations (Fried and Duh 2014, Yu and Dredze 2014, Nguyen et al. 2017, Alsuhaibani et al.
2018). To impose semantic constraints on generating neural embeddings, they can also
modify the attention mechanisms in recurrent neural networks (Yang and Mitchell 2017)
and transformers (Peters et al. 2019). Since the joint-training approaches often produce
task-specific NNEs, which are computationally demanding when learning from scratch
with massive corpora, we only investigate post-processing approaches that can work on
any distributional space.
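For orientation, the toy sketch below (not the authors' setup) shows the context-prediction objective in miniature: a centre-word embedding is trained with cross-entropy to predict an observed context word. The vocabulary size, dimensionality, and random batches are arbitrary assumptions.

```python
# Toy context-prediction objective in the word2vec spirit: given a centre
# word, predict a context word with a softmax and train by cross-entropy.
import torch
import torch.nn as nn

vocab_size, dim = 10_000, 300
embed = nn.Embedding(vocab_size, dim)            # centre-word embeddings
output = nn.Linear(dim, vocab_size, bias=False)  # context-prediction layer
loss_fn = nn.CrossEntropyLoss()

centre = torch.randint(0, vocab_size, (64,))     # a batch of centre-word ids
context = torch.randint(0, vocab_size, (64,))    # their observed context ids

logits = output(embed(centre))                   # scores over the vocabulary
loss = loss_fn(logits, context)                  # cross-entropy on prediction
loss.backward()                                  # gradients update the embeddings
```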
As for semantic specialization of pre-trained NNEs, instead of cross-entropy loss, ranking loss in deep metric learning (Kaya and Bilge 2019) is often used to learn a
Euclidean distance in a latent space under the constraints of semantic relations in KBs.
Deep metric learning has broad applications from computer vision (Schroff et al. 2015, Lu
et al. 2017) to natural language processing (Mueller and Thyagarajan 2016, Ein Dor et al.
2018, Zhu et al. 2018) to audio speech processing (Narayanaswamy et al. 2019, Wang et
al. 2019b). Given two tokens $x_i$ and $x_j$ in the original vector space of NNEs with a weighting function $f_\theta$, metric learning constructs a distance-based loss function $\mathcal{L}$ to yield the augmented embeddings through the distance $D\big(f_\theta(x_i), f_\theta(x_j)\big)$. With the help of KBs that specify the relationship between $x_i$ and $x_j$ using a similar ($y=1$) or dissimilar ($y=0$) tag $y$, metric learning continuously updates $f_\theta$ to pull similar tokens closer or push dissimilar ones farther, until $D\big(f_\theta(x_i), f_\theta(x_j)\big)$ finally arrives at its minimum for similar tokens and its maximum for dissimilar ones.
In deep metric learning, data sampling for computing ranking loss, either in Siamese
(Bromley et al. 1993) or Triplet (Hoffer and Ailon 2015) networks, plays a crucial role in
specializing neural embeddings. Correspondingly, contrastive or pairwise loss (Chopra et
al. 2005) and triplet loss (Schroff et al. 2015) are two popular cost functions in metric
learning, followed by many of their variants, such as Quadruple Loss (Ni et al. 2017) and
N-Pair Loss (Sohn 2016).
2.1 Contrastive loss
Contrastive or pairwise loss (Chopra et al. 2005) was first used for face recognition on the
hypothesis that faces of the same person should be positioned at a smaller distance in a Euclidean space and faces of different people at a larger one. It can be applied in post-processing NNEs as follows:

$$\mathcal{L}_{con} = y \, D\big(f_\theta(x_i), f_\theta(x_j)\big)^2 + (1-y)\,\max\big(0,\; m - D\big(f_\theta(x_i), f_\theta(x_j)\big)\big)^2$$

Here, for the similar tokens $x_i$ and $x_j$ with the tag $y=1$, contrastive loss regards them as a positive sample and seeks to decrease their distance $D\big(f_\theta(x_i), f_\theta(x_j)\big)$; for the dissimilar tokens with $y=0$, it treats them as a negative sample and imposes a distance margin $m$ to regularize $D$. That is to say, if $D\big(f_\theta(x_i), f_\theta(x_j)\big) \geq m$, no backpropagation is needed; otherwise, metric learning has to increase their distance. Contrastive loss only works on two token inputs when computing the loss each time.
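A minimal PyTorch sketch of this pairwise objective is given below; the margin value, the toy batch, and the squared-distance form are illustrative choices rather than fixed prescriptions.

```python
# Contrastive (pairwise) loss over token embeddings, in the spirit of
# Chopra et al. (2005); the margin value here is an illustrative assumption.
import torch

def contrastive_loss(x_i, x_j, y, margin=1.0):
    """y = 1 for similar pairs, y = 0 for dissimilar pairs."""
    d = torch.norm(x_i - x_j, dim=-1)                  # Euclidean distance
    positive = y * d.pow(2)                            # pull similar pairs together
    negative = (1 - y) * torch.clamp(margin - d, min=0).pow(2)  # push apart up to the margin
    return (positive + negative).mean()

# Example: one similar and one dissimilar pair of 300-d embeddings (toy data).
x_i = torch.randn(2, 300, requires_grad=True)
x_j = torch.randn(2, 300)
y = torch.tensor([1.0, 0.0])
contrastive_loss(x_i, x_j, y).backward()
```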
2.2 Triplet loss
Triplet loss (Schroff et al. 2015) simultaneously takes three inputs in computing ranking loss, which can be defined as follows:

$$\mathcal{L}_{tri} = \max\big(0,\; D\big(f_\theta(x_a), f_\theta(x_p)\big) - D\big(f_\theta(x_a), f_\theta(x_n)\big) + m\big)$$

For an anchor token $x_a$, $x_p$ and $x_n$ denote its positive and negative samples in a triplet input, respectively. Here, $m$ works as a margin gap to distinguish an easy negative sample from a hard negative one (Kaya and Bilge 2019), and it also serves as a distance boundary between $D\big(f_\theta(x_a), f_\theta(x_p)\big)$ and $D\big(f_\theta(x_a), f_\theta(x_n)\big)$ when selecting a triplet in metric learning. If $D\big(f_\theta(x_a), f_\theta(x_n)\big) > D\big(f_\theta(x_a), f_\theta(x_p)\big) + m$, $x_n$ is an easy negative sample: no loss is generated, and it is not necessary to push $x_n$ farther from $x_a$. If $D\big(f_\theta(x_a), f_\theta(x_n)\big) < D\big(f_\theta(x_a), f_\theta(x_p)\big)$, $x_n$ is a hard negative sample, as $x_n$ is distributionally closer to $x_a$ than $x_p$, indicating that backpropagation is needed to update $f_\theta$. Any negative token whose distance to $x_a$ lies between $D\big(f_\theta(x_a), f_\theta(x_p)\big)$ and $D\big(f_\theta(x_a), f_\theta(x_p)\big) + m$ is categorized as semi-hard and is pushed away.
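The sketch below implements the triplet objective and labels a negative as easy, hard, or semi-hard relative to the margin, mirroring the case analysis above; the margin value and dimensionality are illustrative assumptions.

```python
# Triplet loss (Schroff et al. 2015) with a rough easy/semi-hard/hard
# categorization of negatives; margin and dimensions are illustrative.
import torch

def triplet_loss(anchor, positive, negative, margin=0.5):
    d_pos = torch.norm(anchor - positive, dim=-1)
    d_neg = torch.norm(anchor - negative, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

def negative_kind(anchor, positive, negative, margin=0.5):
    d_pos = torch.norm(anchor - positive, dim=-1)
    d_neg = torch.norm(anchor - negative, dim=-1)
    if d_neg > d_pos + margin:
        return "easy"        # already far enough: no loss, no update
    if d_neg < d_pos:
        return "hard"        # negative is closer than the positive
    return "semi-hard"       # within the margin band: still pushed away

a, p, n = (torch.randn(300) for _ in range(3))
print(negative_kind(a, p, n), triplet_loss(a, p, n).item())
```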
In specializing NNEs with prior knowledge, most methods use contrastive loss and
triplet loss with different negative-sample selection policies, among which we list some
typical ones in the following sections.
2.3 Retrofitting
Faruqui et al. (2015) proposed to retrofit word embeddings with semantic lexicons,
including PPDB (Ganitkevitch et al. 2013), WordNet (Miller 1995, Fellbaum 1998), and
FrameNet (Baker et al. 1998). A positive pair $(x_i, x_j)$ should bear a corresponding semantic relationship extracted from the lexicons, including lexical paraphrasing in PPDB, synonymy and hypo/hypernymy in WordNet, along with word associations in FrameNet. These relations were organized into different graphs, in which word embeddings can be altered through belief propagation. Let $\hat{q}_i$ denote the pre-trained embedding of word $x_i$, $q_i$ its retrofitted counterpart, and $E$ the set of relation edges. The loss function for retrofitting can be articulated as:

$$\Psi(Q) = \sum_{i=1}^{n}\Big[\alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{(i,j)\in E}\beta_{ij}\,\lVert q_i - q_j \rVert^2\Big]$$

where $\alpha_i$, often set to 1, controls how strongly $q_i$ is anchored to its pre-trained vector $\hat{q}_i$, and $\beta_{ij}$, equal to the inverse degree of node $i$, is another regularizing factor for $q_j$ in propagation. Since $\Psi(Q)$ is a convex function, setting its derivative with respect to $q_i$ to zero yields the iterative update:

$$q_i = \frac{\sum_{j:(i,j)\in E}\beta_{ij}\, q_j + \alpha_i \hat{q}_i}{\sum_{j:(i,j)\in E}\beta_{ij} + \alpha_i}$$
Retrofitting works similarly to contrastive loss. Although it extracts multiple positive samples for each word, retrofitting only pulls similar or related tokens closer. Srinivasan et al. (2019) adapted the retrofitting method by introducing a WordNet-based similarity score to better account for the closeness between a word and its neighbours located within a 2-link distance in an IS-A hierarchy, and achieved competitive results in intrinsic and
extrinsic evaluations.
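A compact sketch of the closed-form retrofitting update above, iterated over a toy lexicon graph, is given below; the vocabulary, neighbour lists, number of sweeps, and the value of $\alpha$ are illustrative assumptions rather than the published configuration.

```python
# Iterative retrofitting update in the spirit of Faruqui et al. (2015):
# each word vector is moved toward the average of its lexicon neighbours
# while staying anchored to its pre-trained vector. Toy data throughout.
import numpy as np

pretrained = {w: np.random.randn(300) for w in ("king", "queen", "monarch", "ruler")}
neighbours = {"king": ["monarch", "ruler"], "queen": ["monarch"],
              "monarch": ["king", "queen", "ruler"], "ruler": ["king", "monarch"]}

retrofitted = {w: v.copy() for w, v in pretrained.items()}
alpha = 1.0                                   # strength of the pre-trained anchor
for _ in range(10):                           # a few sweeps usually suffice
    for w, nbrs in neighbours.items():
        if not nbrs:
            continue
        beta = 1.0 / len(nbrs)                # degree-normalised neighbour weight
        num = alpha * pretrained[w] + beta * sum(retrofitted[n] for n in nbrs)
        retrofitted[w] = num / (alpha + beta * len(nbrs))
```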
2.4 Counter-fitting
Inspired by retrofitting, Mrkšić et al. (2016) incorporated synonymy and antonymy in
semantically enhancing word embeddings. They linearly assembled the loss functions from
different semantic constraints while preserving distributional semantics, which are:
1. Synonymy: $\mathcal{L}_{syn} = \sum_{(x_i, x_j)\in S}\max\big(0,\; D(x_i, x_j) - \gamma_s\big)$
2. Antonymy: $\mathcal{L}_{ant} = \sum_{(x_i, x_j)\in A}\max\big(0,\; \gamma_a - D(x_i, x_j)\big)$
3. Distributional semantics: $\mathcal{L}_{vsp} = \sum_{x_i}\sum_{x_j \in N(x_i)}\max\big(0,\; D(x_i, x_j) - \hat{D}(x_i, x_j)\big)$

Here $S$ and $A$ denote the sets of synonym and antonym pairs, $D(x_i, x_j)$ is the distance between the specialized embeddings of $x_i$ and $x_j$, and $\hat{D}(x_i, x_j)$ is their distance in the original space. Mrkšić et al. (2016) set up different loss functions for synonymy and antonymy and sequentially specialized NNEs. $\gamma_s$ and $\gamma_a$ are margins for synonymy and antonymy, respectively. Besides semantic specialization of NNEs, they also preserved distributional semantics using $\mathcal{L}_{vsp}$, where $x_j$ is one of the top distributionally similar words to $x_i$ in the original space. In place of the whole vocabulary, $x_j$ also acts as a pseudo-negative word in $\mathcal{L}_{vsp}$ for efficient backpropagation. Note that the Euclidean distance is often replaced here with the cosine distance. Counter-fitting bears a close resemblance to contrastive loss, as the synonymy and antonymy constraints specialize NNEs in opposite directions.
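The three counter-fitting terms can be sketched as below with cosine distance; the margin values and the way pairs are supplied are illustrative assumptions rather than the published hyper-parameters.

```python
# The three counter-fitting terms of Mrkšić et al. (2016), sketched with
# cosine distance; margins and the pair lists are illustrative assumptions.
import torch
import torch.nn.functional as F

def cos_dist(a, b):
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def synonym_attract(pairs, gamma_s=0.0):
    # pull each synonym pair within the margin gamma_s
    return sum(torch.clamp(cos_dist(u, v) - gamma_s, min=0) for u, v in pairs)

def antonym_repel(pairs, gamma_a=1.0):
    # push each antonym pair at least gamma_a apart
    return sum(torch.clamp(gamma_a - cos_dist(u, v), min=0) for u, v in pairs)

def preserve_space(neighbour_pairs):
    # penalize a neighbour drifting farther apart than in the original space
    return sum(torch.clamp(cos_dist(u_new, v_new) - cos_dist(u_old, v_old), min=0)
               for (u_new, v_new, u_old, v_old) in neighbour_pairs)

u, v = torch.randn(300), torch.randn(300)      # toy embeddings
loss = synonym_attract([(u, v)]) + antonym_repel([(u, v)])
```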
2.5 ATTRACT-REPEL
Mrkšić et al. (2017) further improved counter-fitting with semantic constraints from mono-
and cross-lingual resources. They used triplet loss rather than contrastive loss to refine a
distributional space, i.e. attracting synonyms and repelling antonyms, therefore termed
ATTRACT-REPEL. The loss functions in ATTRACT-REPEL are listed as follows:
1. Synonymy (ATTRACT): $\mathcal{L}_{att} = \sum_{(x_i, x_j)\in S}\Big[\max\big(0,\; \delta_{att} + D(x_i, x_j) - D(x_i, t_i)\big) + \max\big(0,\; \delta_{att} + D(x_i, x_j) - D(x_j, t_j)\big)\Big]$

where $t_i$ and $t_j$ are negative examples drawn from the same mini-batch for $x_i$ and $x_j$, and $\delta_{att}$ is the attract margin.