be a general solution since they will struggle with
novel terms whose meaning is not implicit in the
subword structure, such as eponyms. Note that
we experimented with fastText and it performed
worse than our approach.
Word embeddings for the biomedical domain.
Much research has focused on how to best generate biomedical-specific embeddings and provide models to improve performance on downstream NLP tasks (Major et al., 2018; Pyysalo et al., 2013; Chiu et al., 2016; Zhang et al., 2019). Work in the biomedical domain has investigated optimal hyperparameters for embedding training (Chiu et al., 2016), the influence of the training corpus (Pakhomov et al., 2016; Wang et al., 2018; Lai et al., 2016), and the advantage of subword-based embeddings (Zhang et al., 2019). Word embeddings for clinical applications have been proposed (Ghosh et al., 2016; Fan et al., 2019), and an overview was provided by Kalyan and Sangeetha (2020). More recently, transformer models have been successfully adapted to the biomedical domain, yielding contextual, domain-specific embedding models (Peng et al., 2019; Lee et al., 2019; Beltagy et al., 2019; Phan et al., 2021). Whilst these works highlight the benefits of domain-specific training corpora, this class of approaches requires retraining to address the OOV problem.
Improving word embeddings using domain
information.
Our task requires improving
a provided embedding model for a given domain,
without detrimental effects on other domains.
Zhang et al. (2019) use random walks over the MeSH headings knowledge graph to generate additional training text for word embedding training. Similar ideas have led to regularization terms that anchor the new embedding to an existing one during training on a new corpus, so as to preserve information from the original embedding (Yang et al., 2017). Of course, these methods require the complete training of one or more embedding models.
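One simple instantiation of this regularization idea, sketched here in generic notation and not necessarily the exact formulation of Yang et al. (2017), adds a penalty that ties words shared with the original vocabulary to their original vectors while the new corpus is fitted:
\[
\mathcal{L} \;=\; \mathcal{L}_{\text{corpus}} \;+\; \lambda \sum_{w \,\in\, V_{\text{orig}} \cap V_{\text{new}}} \big\lVert \mathbf{v}_w - \mathbf{v}_w^{\text{orig}} \big\rVert^2 .
\]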
Faruqui et al. (2014) achieve a similar result more efficiently by defining a convex objective function that balances preserving an existing embedding with decreasing the distance between related vectors, where relatedness is taken from external data sources such as a lexicon. This technique has been applied in the biomedical domain (Yu et al., 2016, 2017), but it has limited ability to infer vectors for new vocabulary because, without the contribution from the original embedding, the objective reduces to an average of related vectors.
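To make this limitation concrete, the retrofitting objective can be sketched, in simplified notation, as
\[
\Psi(Q) \;=\; \sum_{i=1}^{n} \Big[\, \alpha_i \lVert q_i - \hat{q}_i \rVert^2 \;+\; \sum_{j:(i,j)\in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \Big],
\]
where $\hat{q}_i$ is the original vector of word $i$, $E$ is the set of related word pairs from the lexicon, and $\alpha_i$, $\beta_{ij}$ are weights. For a word without an original vector the first term vanishes, and minimizing the remainder with the neighboring vectors held fixed yields exactly a (weighted) average of the related vectors.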
Another approach is to extend the embedding dimension to create space for encoding new information. This can be as simple as vector concatenation from another embedding (Yang et al., 2017), possibly followed by dimensionality reduction (Shalaby et al., 2018). Alternatively, new dimensions can be derived from existing vectors based on external information like synonym pairs (Jo and Choi, 2018). Again, this has limited ability to infer vectors for new vocabulary.
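A minimal sketch of the concatenation-plus-reduction variant described above, assuming the two embeddings are row-aligned over a shared vocabulary (all names are hypothetical):

```python
import numpy as np

def concat_and_reduce(E_general, E_domain, target_dim=300):
    """Extend each word's vector by concatenating a second, row-aligned
    embedding, then reduce back to target_dim with a PCA-style projection."""
    E_concat = np.concatenate([E_general, E_domain], axis=1)
    E_centered = E_concat - E_concat.mean(axis=0)        # center for PCA
    _, _, Vt = np.linalg.svd(E_centered, full_matrices=False)
    return E_centered @ Vt[:target_dim].T                # top principal directions
```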
All of these methods change the original embedding, which limits applicability in use-cases where the original embedding quality must be retained or where incremental updates from many domains are required. The optimal alignment of two partially overlapping word embedding spaces has been studied in the literature on multilingual word embeddings (Nakashole and Flauger, 2017; Jawanpuria et al., 2019; Alaux et al., 2019) and provides a mechanism to patch an existing embedding with information from a domain-specific embedding. Unfortunately, it assumes the embedding spaces have the same structure, meaning it is not suitable when the two embeddings encode different types of information, such as semantic information from text and relational information from a knowledge base.
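To illustrate the alignment mechanism, the following is a minimal orthogonal-Procrustes sketch over the anchor vocabulary shared by both embeddings; it is not the formulation of any particular cited work, and the variable names are hypothetical:

```python
import numpy as np

def procrustes_align(X_domain, X_general):
    """Find the orthogonal map R minimizing ||X_domain @ R - X_general||_F.
    Rows are vectors of anchor words present in both embeddings; note that
    this assumes both spaces have the same dimensionality and geometry."""
    U, _, Vt = np.linalg.svd(X_domain.T @ X_general)
    return U @ Vt

# Hypothetical usage: map a domain-specific vector for an OOV word into the
# general space to patch the existing embedding.
# R = procrustes_align(anchor_vecs_domain, anchor_vecs_general)
# patched_vector = oov_vec_domain @ R
```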
3 Latent Semantic Imputation
LSI, the approach pursued in this paper, represents embedding vectors for new words as weighted averages over existing word embedding vectors, with the weights derived from a domain-specific feature matrix (Yao et al., 2019). This process draws insights from Locally Linear Embedding (Roweis and Saul, 2000). Specifically, a local neighborhood in a high-dimensional word embedding space $E_s$ ($s$ for semantic) can be approximated by a lower-dimensional manifold embedded in that space. Hence, an embedding vector $w_s$ for a word $w$ in that local neighborhood can be approximated as a weighted average over a small number of neighboring vectors.
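The sketch below shows how such reconstruction weights can be computed in the style of Locally Linear Embedding; it illustrates the underlying idea rather than the exact LSI procedure, and all names are hypothetical:

```python
import numpy as np

def reconstruction_weights(x, neighbors, reg=1e-3):
    """LLE-style weights: minimize ||x - sum_j w_j * neighbors[j]||^2
    subject to sum_j w_j = 1, where neighbors is a (k, d) matrix."""
    k = len(neighbors)
    diffs = x - neighbors                  # displacement to each neighbor, (k, d)
    gram = diffs @ diffs.T                 # local Gram matrix, (k, k)
    gram += reg * np.trace(gram) * np.eye(k) + 1e-12 * np.eye(k)  # stabilize
    w = np.linalg.solve(gram, np.ones(k))  # solve G w = 1 (up to scaling)
    return w / w.sum()                     # enforce the sum-to-one constraint

# Hypothetical usage: weights derived from a domain-specific feature space,
# applied to the same neighbors' known vectors in the semantic space E_s.
# w = reconstruction_weights(x_d, domain_vecs_of_neighbors)
# w_s_imputed = w @ semantic_vecs_of_neighbors
```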
This would be useful to construct a vector for a new word $w$ if we could determine the weights for the average over neighboring terms. But since, by assumption, we do not know $w$'s word embedding vector $w_s$, we also do not know its neighborhood in $E_s$. The main insight of LSI is that we can use the local neighborhood of $w$'s embedding $w_d$ in