However, calculating distributional similarity in NNEs usually yields semantic
association or relatedness rather than semantic similarity (Hill et al. 2015), an
inevitable consequence of words sharing co-occurrence patterns within a context window
during self-supervised learning. For example, after computing cosine similarity over word
embeddings such as word2vec Skip-gram with Negative Sampling (SGNS) (Mikolov et al. 2013a,
Mikolov et al. 2013b), GloVe (Pennington et al. 2014), and fastText (Bojanowski et al.
2017), we find that the most distributionally similar word to man is woman, and vice versa.
In SGNS, queen is one of the top 10 similar words to king, and vice versa; in GloVe and
fastText, king is one of the top 10 similar words to queen. Although man vs. woman and king
vs. queen are antonymous pairs, each pair is scored as highly similar in the embeddings.
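This behaviour is easy to reproduce. The minimal sketch below queries pretrained vectors with gensim; the file path is a placeholder, and any publicly released SGNS, GloVe, or fastText vectors converted to word2vec text format would serve equally well.

```python
from gensim.models import KeyedVectors

# Load pretrained vectors (hypothetical path; any SGNS/GloVe/fastText vectors
# converted to word2vec text format can be used here).
vectors = KeyedVectors.load_word2vec_format("embeddings.txt", binary=False)

# Cosine similarity conflates relatedness with similarity:
# antonymous pairs such as man/woman typically score very high.
print(vectors.similarity("man", "woman"))

# Top-10 nearest neighbours by cosine similarity; in many pretrained spaces,
# queen appears among the neighbours of king (and often vice versa).
print(vectors.most_similar("king", topn=10))
```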
Semantic relatedness covers a wide range of semantic relationships, whereas semantic
similarity usually reflects lexical entailment, i.e. the IS-A relationship. Because
hand-crafted knowledge bases (KBs) such as WordNet (Miller 1995, Fellbaum 1998) and
BabelNet (Navigli and Ponzetto 2012) mainly consist of IS-A taxonomies, along with
synonymy and antonymy, they are often used for computing semantic similarity (Pedersen
et al. 2004, Yang and Yin 2021). Distributional semantics therefore needs to incorporate
the semantic relations encoded in KBs to enrich the semantic content of NNEs, which in
turn is necessary for improving the generalization of neural language models.
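By way of illustration, the sketch below uses NLTK's WordNet interface to score similarity from the IS-A taxonomy rather than from co-occurrence; the word pairs are arbitrary examples, taking the first sense of each word is a simplification, and the WordNet corpus must be downloaded via nltk beforehand.

```python
from nltk.corpus import wordnet as wn

# First synset for each word (a simplification; real systems disambiguate senses).
king, queen, man, woman = (wn.synsets(w)[0] for w in ("king", "queen", "man", "woman"))

# Wu-Palmer similarity is computed from the depths of the two senses and of their
# lowest common ancestor in the IS-A taxonomy, so it reflects taxonomic similarity
# rather than distributional relatedness.
print(king.wup_similarity(queen))
print(man.wup_similarity(woman))

# Path similarity: inverse of the shortest IS-A path length between the senses.
print(king.path_similarity(queen))
```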
Existing studies often employ joint-training and post-processing methods to combine
word-usage knowledge from distributional semantics with human-curated concept
relations from KBs. Most joint-training methods directly impose semantic constraints on
their loss functions while jointly optimizing the weighting parameters of neural language
models (Yu and Dredze 2014, Nguyen et al. 2017, Alsuhaibani et al. 2018). Another way
of joint training is to revise the architecture of the neural network, either by training
Graph Convolutional Networks with syntactic dependencies and semantic relationships
(Vashishth et al. 2019) or by introducing attention mechanisms (Yang and Mitchell 2017,
Peters et al. 2019). Joint training can tailor NNEs to the specific needs of an application,
albeit at the cost of a heavy training workload in early fusion. In contrast, post-processing
methods such as retrofitting (Faruqui et al. 2015), counter-fitting (Mrkšić et al. 2016), and
LEAR (Vulic and Mrkšić 2018) avoid such burdensome training, semantically
specializing NNEs by optimizing a distance metric in late fusion.
Semantically enhanced NNEs can facilitate downstream applications, e.g. lexical
entailment detection (Nguyen et al. 2017, Vulic and Mrkšić 2018), sentiment analysis
(Faruqui et al. 2015, Arora et al. 2020), and dialogue state tracking (Mrkšić et al. 2016,
Mrkšić et al. 2017).
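To make the late-fusion idea concrete, the following is a minimal retrofitting-style sketch in the spirit of Faruqui et al. (2015), not their released implementation: the lexicon format, uniform edge weights, and fixed iteration count are simplifying assumptions.

```python
import numpy as np

def retrofit(vectors, lexicon, iterations=10, alpha=1.0):
    """Retrofitting-style post-processing in the spirit of Faruqui et al. (2015).

    vectors: dict mapping word -> np.ndarray, the original NNEs (kept fixed).
    lexicon: dict mapping word -> list of semantically related words,
             e.g. WordNet synonyms or hypernyms.
    """
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    shared = [w for w in lexicon if w in vectors]
    for _ in range(iterations):
        for w in shared:
            neighbours = [n for n in lexicon[w] if n in vectors]
            if not neighbours:
                continue
            beta = 1.0 / len(neighbours)  # uniform edge weights
            # Weighted average of the original vector and the neighbours' current
            # vectors: pulls lexically related words closer while keeping each
            # word near its distributional estimate.
            total = alpha * vectors[w] + beta * sum(new_vecs[n] for n in neighbours)
            new_vecs[w] = total / (alpha + beta * len(neighbours))
    return new_vecs
```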
Inspired by previous work (Faruqui et al. 2015, Mrkšić et al. 2016, Vulic and Mrkšić
2018) on semantically specializing NNEs in late fusion, we investigate how to post-process
NNEs by merging symmetric syn/antonymy with asymmetric hypo/hypernymy. We
seek to leverage the multi-level semantic constraints of IS-A hierarchies to augment
distributional semantics. By learning distance metrics in the distributional space, we can
effectively inject lexical-semantic information into NNEs, pulling similar words closer
together and pushing dissimilar words further apart. Consistent results on lexical-semantic
tasks show that our novel specialization method significantly improves distributional
semantics in deriving semantic similarity and in detecting hypernymy and its directionality.
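For intuition, the sketch below shows a generic margin-based attract-repel objective of the kind used in this line of work (e.g. Mrkšić et al. 2016); it illustrates pulling and pushing word pairs under a distance metric and is not the specialization method proposed here, with the margin values and pair lists being assumptions.

```python
import numpy as np

def attract_repel_loss(vecs, attract_pairs, repel_pairs,
                       attract_margin=0.6, repel_margin=0.0):
    """Generic margin-based objective over unit-normalized embeddings.

    attract_pairs: index pairs that should become more similar
                   (e.g. synonyms, hyponym-hypernym pairs).
    repel_pairs:   index pairs that should become less similar (e.g. antonyms).
    """
    def cos(i, j):
        return float(vecs[i] @ vecs[j])  # cosine similarity for unit vectors

    # Attract: penalize pairs whose similarity falls below the attract margin.
    loss = sum(max(0.0, attract_margin - cos(i, j)) for i, j in attract_pairs)
    # Repel: penalize pairs whose similarity exceeds the repel margin.
    loss += sum(max(0.0, cos(i, j) - repel_margin) for i, j in repel_pairs)
    return loss
```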