be a general solution since they will struggle with
novel terms whose meaning is not implicit in the
subword structure, such as eponyms. Note that
we experimented with fastText and it performed
worse than our approach.
Word embeddings for the biomedical domain.
Much research has focused on how to best generate biomedical-specific embeddings and provide models to improve performance on downstream NLP tasks (Major et al., 2018; Pyysalo et al., 2013; Chiu et al., 2016; Zhang et al., 2019). Work in the biomedical domain has investigated optimal hyperparameters for embedding training (Chiu et al., 2016), the influence of the training corpus (Pakhomov et al., 2016; Wang et al., 2018; Lai et al., 2016), and the advantage of subword-based embeddings (Zhang et al., 2019). Word embeddings for clinical applications have been proposed (Ghosh et al., 2016; Fan et al., 2019), and an overview was provided by Kalyan and Sangeetha (2020). More recently, transformer models have been successfully adapted to the biomedical domain, yielding contextual, domain-specific embedding models (Peng et al., 2019; Lee et al., 2019; Beltagy et al., 2019; Phan et al., 2021). Whilst these works highlight the benefits of domain-specific training corpora, this class of approaches requires retraining to address the OOV problem.
Improving word embeddings using domain
information.
Our task requires improving
a provided embedding model for a given domain,
without detrimental effects on other domains.
Zhang et al. (2019) use random walks over the MeSH headings knowledge graph to generate additional training text for word embedding training. Similar ideas have led to regularization terms that anchor the new embedding to an existing one during training on a new corpus, so as to preserve information from the original embedding (Yang et al., 2017). Of course, these methods require the complete training of one or more embedding models.
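One simple instantiation of this regularization idea, sketched here in generic notation and not necessarily the exact formulation of Yang et al. (2017), adds a penalty that ties words shared with the original vocabulary to their original vectors while the new corpus is fitted:
\[
\mathcal{L} \;=\; \mathcal{L}_{\text{corpus}} \;+\; \lambda \sum_{w \,\in\, V_{\text{orig}} \cap V_{\text{new}}} \big\lVert \mathbf{v}_w - \mathbf{v}_w^{\text{orig}} \big\rVert^2 .
\]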
Faruqui et al. (2014) achieve a similar result more efficiently by defining a convex objective function that balances preserving an existing embedding with decreasing the distance between related vectors, where relatedness is taken from external data sources such as a lexicon. This technique has been applied in the biomedical domain (Yu et al., 2016, 2017), but it has limited ability to infer vectors for new vocabulary because, without the contribution from the original embedding, the objective reduces to an average of related vectors.
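To make this limitation concrete, the retrofitting objective can be sketched, in simplified notation, as
\[
\Psi(Q) \;=\; \sum_{i=1}^{n} \Big[\, \alpha_i \lVert q_i - \hat{q}_i \rVert^2 \;+\; \sum_{j:(i,j)\in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \Big],
\]
where $\hat{q}_i$ is the original vector of word $i$, $E$ is the set of related word pairs from the lexicon, and $\alpha_i$, $\beta_{ij}$ are weights. For a word without an original vector the first term vanishes, and minimizing the remainder with the neighboring vectors held fixed yields exactly a (weighted) average of the related vectors.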
Another approach is to extend the embedding dimension to create space for encoding new information. This can be as simple as vector concatenation from another embedding (Yang et al., 2017), possibly followed by dimensionality reduction (Shalaby et al., 2018). Alternatively, new dimensions can be derived from existing vectors based on external information like synonym pairs (Jo and Choi, 2018). Again, this has limited ability to infer vectors for new vocabulary.
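A minimal sketch of the concatenation-plus-reduction variant described above, assuming the two embeddings are row-aligned over a shared vocabulary (all names are hypothetical):

```python
import numpy as np

def concat_and_reduce(E_general, E_domain, target_dim=300):
    """Extend each word's vector by concatenating a second, row-aligned
    embedding, then reduce back to target_dim with a PCA-style projection."""
    E_concat = np.concatenate([E_general, E_domain], axis=1)
    E_centered = E_concat - E_concat.mean(axis=0)        # center for PCA
    _, _, Vt = np.linalg.svd(E_centered, full_matrices=False)
    return E_centered @ Vt[:target_dim].T                # top principal directions
```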
All of these methods change the original embedding, which limits applicability in use-cases where the original embedding quality must be retained or where incremental updates from many domains are required. The optimal alignment of two partially overlapping word embedding spaces has been studied in the literature on multilingual word embeddings (Nakashole and Flauger, 2017; Jawanpuria et al., 2019; Alaux et al., 2019) and provides a mechanism to patch an existing embedding with information from a domain-specific embedding. Unfortunately, it assumes the embedding spaces have the same structure, meaning it is not suitable when the two embeddings encode different types of information, such as semantic information from text and relational information from a knowledge base.
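To illustrate the alignment mechanism, the following is a minimal orthogonal-Procrustes sketch over the anchor vocabulary shared by both embeddings; it is not the formulation of any particular cited work, and the variable names are hypothetical:

```python
import numpy as np

def procrustes_align(X_domain, X_general):
    """Find the orthogonal map R minimizing ||X_domain @ R - X_general||_F.
    Rows are vectors of anchor words present in both embeddings; note that
    this assumes both spaces have the same dimensionality and geometry."""
    U, _, Vt = np.linalg.svd(X_domain.T @ X_general)
    return U @ Vt

# Hypothetical usage: map a domain-specific vector for an OOV word into the
# general space to patch the existing embedding.
# R = procrustes_align(anchor_vecs_domain, anchor_vecs_general)
# patched_vector = oov_vec_domain @ R
```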
3 Latent Semantic Imputation
LSI, the approach pursued in this paper, represents embedding vectors for new words as weighted averages over existing word embedding vectors, with the weights derived from a domain-specific feature matrix (Yao et al., 2019). This process draws insights from Locally Linear Embedding (Roweis and Saul, 2000). Specifically, a local neighborhood in a high-dimensional word embedding space $E_s$ ($s$ for semantic) can be approximated by a lower-dimensional manifold embedded in that space. Hence, an embedding vector $w_s$ for a word $w$ in that local neighborhood can be approximated as a weighted average over a small number of neighboring vectors.
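The sketch below shows how such reconstruction weights can be computed in the style of Locally Linear Embedding; it illustrates the underlying idea rather than the exact LSI procedure, and all names are hypothetical:

```python
import numpy as np

def reconstruction_weights(x, neighbors, reg=1e-3):
    """LLE-style weights: minimize ||x - sum_j w_j * neighbors[j]||^2
    subject to sum_j w_j = 1, where neighbors is a (k, d) matrix."""
    k = len(neighbors)
    diffs = x - neighbors                  # displacement to each neighbor, (k, d)
    gram = diffs @ diffs.T                 # local Gram matrix, (k, k)
    gram += reg * np.trace(gram) * np.eye(k) + 1e-12 * np.eye(k)  # stabilize
    w = np.linalg.solve(gram, np.ones(k))  # solve G w = 1 (up to scaling)
    return w / w.sum()                     # enforce the sum-to-one constraint

# Hypothetical usage: weights derived from a domain-specific feature space,
# applied to the same neighbors' known vectors in the semantic space E_s.
# w = reconstruction_weights(x_d, domain_vecs_of_neighbors)
# w_s_imputed = w @ semantic_vecs_of_neighbors
```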
This would be useful to construct a vector for a new word $w$ if we could determine the weights for the average over neighboring terms. But since, by assumption, we do not know $w$'s word embedding vector $w_s$, we also do not know its neighborhood in $E_s$. The main insight of LSI is that we can use the local neighborhood of $w$'s embedding $w_d$ in