Leveraging knowledge graphs to update scientific word embeddings using latent semantic imputation

Jason Hoelscher-Obermaier*, Edward Stevinson*, Valentin Stauber, Ivaylo Zhelev, Victor Botev†, Ronin Wu†, Jeremy Minton†
Iris AI, Bekkestua, Norway
jason@iris.ai

* Co-first authors   † Co-PIs
Abstract

The most interesting words in scientific texts will often be novel or rare. This presents a challenge for scientific word embedding models to determine quality embedding vectors for useful terms that are infrequent or newly emerging. We demonstrate how latent semantic imputation (LSI) can address this problem by imputing embeddings for domain-specific words from up-to-date knowledge graphs while otherwise preserving the original word embedding model. We use the Medical Subject Headings (MeSH) knowledge graph to impute embedding vectors for biomedical terminology without retraining and evaluate the resulting embedding model on a domain-specific word-pair similarity task. We show that LSI can produce reliable embedding vectors for rare and out-of-vocabulary (OOV) terms in the biomedical domain.
1 Introduction

Word embeddings are powerful representations of the semantic and syntactic properties of words that facilitate high performance in natural language processing (NLP) tasks. Because these models rely entirely on a training corpus, they can struggle to reliably represent words which are infrequent, or missing entirely, in that corpus. The latter will happen for any new terminology emerging after training is complete.

Rapid emergence of new terminology and a long tail of highly significant but rare words are characteristic of technical domains, and these terms are often of particular importance to NLP tasks within those domains. This drives a need for methods to generate reliable embeddings of rare and novel words. At the same time, there are efforts in many scientific fields to construct large, highly specific and continuously updated knowledge graphs that capture information about these exact terms. Can we leverage these knowledge graphs to mitigate the shortcomings of word embeddings on rare, novel and domain-specific words?
We investigate one method for achieving this information transfer, latent semantic imputation (LSI) (Yao et al., 2019). In LSI the embedding vector for a given word, $w$, is imputed as a weighted average of existing embedding vectors, where the weights are inferred from the local neighborhood structure of a corresponding embedding vector, $w_d$, in a domain-specific embedding space. We study how to apply LSI in the context of the biomedical domain using the Medical Subject Headings (MeSH) knowledge graph (Lipscomb, 2000), but expect the methodology to be applicable to other scientific domains.
2 Related work

Embeddings for rare/out-of-vocabulary (OOV) words. Early methods for embedding rare words relied on explicitly provided morphological information (Alexandrescu and Kirchhoff, 2006; Sak et al., 2010; Lazaridou et al., 2013; Botha and Blunsom, 2014; Luong and Manning, 2016; Qiu et al., 2014). More recent approaches avoid dependence on explicit morphological information by learning representations for fixed-length character n-grams that do not have a direct linguistic interpretation (Bojanowski et al., 2017; Zhao et al., 2018). Alternatively, the subword structure used for generalization beyond a fixed vocabulary can be learnt from data using techniques such as byte-pair encoding (Sennrich et al., 2016; Gage, 1994) or the WordPiece algorithm (Schuster and Nakajima, 2012). Embeddings for arbitrary strings can also be generated using character-level recurrent networks (Ling et al., 2015; Xie et al., 2016; Pinter et al., 2017). These approaches, as well as the transformer-based methods mentioned below, provide some OOV generalization capability but are unlikely to be a general solution since they will struggle with novel terms whose meaning is not implicit in the subword structure, such as eponyms. Note that we experimented with fastText and it performed worse than our approach.
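To make the subword mechanism concrete, the following toy sketch (illustrative only; not the experiment referred to above) uses gensim's fastText implementation, with a placeholder corpus and hyperparameters:

```python
# Toy demonstration of subword-based OOV handling (illustrative only;
# the corpus and hyperparameters are placeholders, not the paper's setup).
from gensim.models import FastText

sentences = [["latent", "semantic", "imputation"],
             ["word", "embeddings", "for", "biomedical", "text"]]
model = FastText(sentences, vector_size=32, min_count=1, min_n=3, max_n=5)

# "imputations" never occurs in the corpus, but fastText still returns a
# vector, synthesized from the character n-grams it shares with
# in-vocabulary words such as "imputation".
vec = model.wv["imputations"]
```

This illustrates both the strength and the limitation: the OOV vector is entirely determined by shared character n-grams, which is exactly why such methods struggle with terms like eponyms whose meaning is unrelated to their surface form.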
Word embeddings for the biomedical domain. Much research has focused on how to best generate biomedical-specific embeddings and provide models to improve performance on downstream NLP tasks (Major et al., 2018; Pyysalo et al., 2013; Chiu et al., 2016; Zhang et al., 2019). Work in the biomedical domain has investigated optimal hyperparameters for embedding training (Chiu et al., 2016), the influence of the training corpus (Pakhomov et al., 2016; Wang et al., 2018; Lai et al., 2016), and the advantage of subword-based embeddings (Zhang et al., 2019). Word embeddings for clinical applications have been proposed (Ghosh et al., 2016; Fan et al., 2019) and an overview was provided in Kalyan and Sangeetha (2020). More recently, transformer models have been successfully adapted to the biomedical domain, yielding contextual, domain-specific embedding models (Peng et al., 2019; Lee et al., 2019; Beltagy et al., 2019; Phan et al., 2021). Whilst these works highlight the benefits of domain-specific training corpora, this class of approaches requires retraining to address the OOV problem.
Improving word embeddings using domain information. Our problem setting requires improving a provided embedding model for a given domain, without detrimental effects on other domains. Zhang et al. (2019) use random walks over the MeSH headings knowledge graph to generate additional training text to be used during word embedding training. A related idea is to add regularization terms that leverage an existing embedding during training on a new corpus, in order to preserve information from the original embedding (Yang et al., 2017). Of course, these methods require the complete training of one or more embedding models.

Faruqui et al. (2014) achieve a similar result more efficiently by defining a convex objective function that balances preserving an existing embedding against decreasing the distance between related vectors, based on external data sources such as a lexicon. This technique has been applied in the biomedical domain (Yu et al., 2016, 2017), but has limited ability to infer new vocabulary because, without the contribution from the original embedding, it reduces to an average of related vectors.
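For concreteness, here is a minimal sketch of the iterative update that solves this kind of convex objective, in the spirit of Faruqui et al. (2014) but with all objective weights simplified to 1 (so this is our reading of the technique, not a faithful reimplementation):

```python
# Minimal retrofitting sketch in the spirit of Faruqui et al. (2014),
# with all objective weights set to 1 for brevity. `vectors` holds the
# original embedding; `lexicon` maps a word to its related words.
import numpy as np

def retrofit(vectors, lexicon, n_iter=10):
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(n_iter):
        for w, related in lexicon.items():
            nbrs = [r for r in related if r in new]
            if w not in new or not nbrs:
                continue
            # Closed-form coordinate update: balance the original vector
            # against the current vectors of the related words.
            new[w] = (vectors[w] + sum(new[r] for r in nbrs)) / (len(nbrs) + 1)
    return new
```

The limitation discussed above is visible directly in the update: for a word with no original vector, the `vectors[w]` term is unavailable and the update degenerates to a plain average of the lexicon neighbours.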
Another approach is to extend the embedding dimension to create space for encoding new information. This can be as simple as vector concatenation with another embedding (Yang et al., 2017), possibly followed by dimensionality reduction (Shalaby et al., 2018). Alternatively, new dimensions can be derived from existing vectors based on external information like synonym pairs (Jo and Choi, 2018). Again, this has limited ability to infer new vocabulary.

All of these methods change the original embedding, which limits applicability in use-cases where the original embedding quality must be retained or where incremental updates from many domains are required. The optimal alignment of two partially overlapping word embedding spaces has been studied in the literature on multilingual word embeddings (Nakashole and Flauger, 2017; Jawanpuria et al., 2019; Alaux et al., 2019) and provides a mechanism to patch an existing embedding with information from a domain-specific embedding. Unfortunately, it assumes the embedding spaces have the same structure, meaning it is not suitable when the two embeddings encode different types of information, such as semantic information from text and relational information from a knowledge base.
3 Latent Semantic Imputation

LSI, the approach pursued in this paper, represents embedding vectors for new words as weighted averages over existing word embedding vectors, with the weights derived from a domain-specific feature matrix (Yao et al., 2019). This process draws insights from Locally Linear Embedding (Roweis and Saul, 2000). Specifically, a local neighborhood in a high-dimensional word embedding space $E_s$ ($s$ for semantic) can be approximated by a lower-dimensional manifold embedded in that space. Hence, an embedding vector $w_s$ for a word $w$ in that local neighborhood can be approximated as a weighted average over a small number of neighboring vectors.

This would be useful for constructing a vector for a new word $w$ if we could determine the weights for the average over neighboring terms. But since, by assumption, we do not know $w$'s word embedding vector $w_s$, we also do not know its neighborhood in $E_s$. The main insight of LSI is that we can use the local neighborhood of $w$'s embedding $w_d$ in a domain-specific space, $E_d$, as a proxy for that neighborhood in the semantic space of our word-embedding model, $E_s$. The weights used for constructing an embedding for $w$ in $E_s$ are calculated from the domain space as shown in Fig. 1: a k-nearest-neighbors minimum-spanning-tree (kNN-MST) is built from the domain space features. Then the L2-distance between $w_d$ and a weighted average over its neighbors in the kNN-MST is minimized using non-negative least squares. The resulting weights are used to impute the missing embedding vectors in $E_s$ using the power iteration method. This procedure crucially relies on the existence of words with good representations in both $E_s$ and $E_d$, referred to as anchor terms, which serve as the data from which the positions of the derived embedding vectors are constructed.
Figure 1: Latent Semantic Imputation. $R_d$ is the domain space and $R_s$ is the semantic space.
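The following sketch shows the core of this procedure, assuming numpy, scipy and scikit-learn. For brevity it uses a plain kNN graph rather than the kNN-MST union, and the variable names ($E_d$, $E_s$, anchor_idx) are ours:

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.neighbors import NearestNeighbors

def lsi_impute(E_d, E_s_anchors, anchor_idx, k=10, n_iter=100, tol=1e-6):
    """Impute semantic-space vectors for the non-anchor rows of E_d.

    E_d:          (n, d_dom) domain-space vectors for all n terms
    E_s_anchors:  (len(anchor_idx), d_sem) semantic vectors for anchors
    anchor_idx:   indices of anchor terms within E_d
    """
    n = E_d.shape[0]
    # 1. k-nearest-neighbor graph in the domain space (the paper
    #    additionally unions this with a minimum spanning tree to
    #    guarantee connectivity; omitted here for brevity).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(E_d)
    _, nbr = nn.kneighbors(E_d)  # nbr[i, 0] is i itself

    # 2. Non-negative least squares: reconstruct each domain vector from
    #    its neighbors, then row-normalize the weights. Dense matrix for
    #    clarity; use a sparse matrix at realistic scale.
    W = np.zeros((n, n))
    for i in range(n):
        js = nbr[i, 1:]
        coef, _ = nnls(E_d[js].T, E_d[i])
        if coef.sum() > 0:
            W[i, js] = coef / coef.sum()

    # 3. Power iteration: anchors stay fixed while each unknown row is
    #    repeatedly replaced by the weighted average of its neighbors.
    E_s = np.zeros((n, E_s_anchors.shape[1]))
    E_s[anchor_idx] = E_s_anchors
    unknown = np.setdiff1d(np.arange(n), anchor_idx)
    for _ in range(n_iter):
        prev = E_s[unknown].copy()
        E_s[unknown] = W[unknown] @ E_s
        if np.linalg.norm(E_s[unknown] - prev) < tol:
            break
    return E_s
```

The MST union omitted above matters in practice: it ensures every unknown term is connected, possibly through intermediate unknowns, to at least one anchor, so that the power iteration converges to non-trivial positions.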
4 Methodology

We extend the original LSI procedure described above in a few key ways. Instead of using a numeric data matrix as the domain data source for LSI, we use a node embedding model trained on a domain-specific knowledge graph to obtain $E_d$. As knowledge graphs are used as a source of structured information in many fields, we expect our method to be applicable to many scientific domains. Knowledge graphs are prevalent in scientific fields because they serve as a means to organise and store scientific data, as well as to aid downstream tasks such as reasoning and exploration. Their structure and ability to represent different relationship types make it relatively easy to integrate new data, meaning they can evolve to reflect changes in a field as new data becomes available.

We use the 2021 RDF dump of the MeSH knowledge graph (available at https://id.nlm.nih.gov/mesh/). The complete graph consists of 2,327,188 nodes and 4,272,681 edges, which we reduce to a simpler, smaller, undirected graph to be fed into a node embedding algorithm. We extract a subgraph consisting solely of the nodes of type "ns0__TopicalDescriptor" and the nodes of type "ns0__Concept" that are directly connected to the topical descriptors via any relationship type. The relationship types and directionality are removed. This results in 58,695 nodes and 113,094 edges.
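A sketch of this extraction step, assuming rdflib and networkx; the class URIs below follow the published MeSH RDF vocabulary, which we take to be what the "ns0__" labels refer to, and the file name is a placeholder:

```python
# Sketch of the MeSH subgraph extraction (file name is a placeholder).
import networkx as nx
from rdflib import Graph, RDF, URIRef

MESHV = "http://id.nlm.nih.gov/mesh/vocab#"

g = Graph()
g.parse("mesh2021.nt", format="nt")

descriptors = set(g.subjects(RDF.type, URIRef(MESHV + "TopicalDescriptor")))
concepts = set(g.subjects(RDF.type, URIRef(MESHV + "Concept")))

# Keep every edge with a topical descriptor on one side and a descriptor
# or concept on the other; drop relation types and directionality.
sub = nx.Graph()
for s, _, o in g:
    if s in descriptors and (o in descriptors or o in concepts):
        sub.add_edge(s, o)
    elif o in descriptors and s in concepts:
        sub.add_edge(s, o)
```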
We use the node2vec graph embedding algorithm (Grover and Leskovec, 2016) on this subgraph to produce an embedding matrix of 58,695 vectors with dimension 200 (orange squares in Fig. 2). The hyperparameters are given in Appendix 8.1. These node embeddings form the domain-specific space, $E_d$, described in the previous section. We note that in preliminary experiments the adjacency matrix of the knowledge graph was used directly as $E_d$, but this yielded imputed embeddings that performed poorly.
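A minimal sketch of this step, using the community node2vec package (which wraps gensim's word2vec) on the subgraph `sub` from the previous sketch; the walk and window values shown are placeholders, as the actual hyperparameters are listed in Appendix 8.1:

```python
# Placeholder walk/window values; see Appendix 8.1 for the actual ones.
from node2vec import Node2Vec

n2v = Node2Vec(sub, dimensions=200, walk_length=30, num_walks=10, workers=4)
node_model = n2v.fit(window=5, min_count=1)

E_d = node_model.wv.vectors  # 58,695 x 200 domain-space matrix
```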
To provide the mapping between the MeSH nodes and the word embedding vocabulary, we normalize the human-readable "rdfs__label" node property by replacing spaces with hyphens and lower-casing. The anchor terms are then identified as the normalized words that match between the graph labels and the vocabulary of the word-embedding model, resulting in 12,676 anchor terms. As an example, "alpha-2-hs-glycoprotein" appears both as a node in the reduced graph and in the word-embedding model, along with its neighbors in the kNN-MST, which include "neoglycoproteins" and "alpha-2-antiplasmin". These anchors serve to stabilise the positions of the unknown word embedding vectors for domain space nodes which did not match any term in the word-embedding vocabulary.
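The anchor-matching step might look like the following sketch, reusing `g` and `node_model` from above and assuming the pretrained word embedding is available as a gensim KeyedVectors object `word_vectors` (a name of our choosing):

```python
# Normalize MeSH labels and intersect them with the word-embedding
# vocabulary to obtain the anchor terms.
from rdflib import RDFS

def normalize(label):
    return str(label).lower().replace(" ", "-")

anchors = {}
for node, label in g.subject_objects(RDFS.label):
    term = normalize(label)
    # Anchor terms need a vector in both spaces: the node embedding
    # (domain space) and the word embedding (semantic space).
    if str(node) in node_model.wv and term in word_vectors.key_to_index:
        anchors[str(node)] = term

print(len(anchors))  # the paper reports 12,676 anchor terms
```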