Domain-Specific Word Embeddings with Structure Prediction
Stephanie Brandl1,2, David Lassner1,3, Anne Baillot4, Shinichi Nakajima1,3,5
1TU Berlin 2University of Copenhagen 3BIFOLD
4Le Mans Université 5RIKEN Center for AIP
brandl@di.ku.dk, lassner@tu-berlin.de
Authors contributed equally.
Abstract
Complementary to finding good general word embeddings, an important question for representation learning is to find dynamic word embeddings, e.g., across time or domain. Current methods do not offer a way to use or predict information on structure between sub-corpora, time or domain, and dynamic embeddings can only be compared after post-alignment. We propose novel word embedding methods that provide general word representations for the whole corpus, domain-specific representations for each sub-corpus, sub-corpus structure, and embedding alignment simultaneously. We present an empirical evaluation on New York Times articles and two English Wikipedia datasets with articles on science and philosophy. Our method, called Word2Vec with Structure Prediction (W2VPred), provides better performance than baselines on general analogy tests, domain-specific analogy tests, and multiple word similarity evaluations, as well as better structure prediction performance when no structure is given a priori. As a use case in the field of Digital Humanities, we demonstrate how to raise novel research questions for high literature from the German Text Archive.
1 Introduction
Word embeddings (Mikolov et al., 2013b; Pennington et al., 2014) are a powerful tool for word-level representation in a vector space that captures semantic and syntactic relations between words. They have been successfully used in many applications such as text classification (Joulin et al., 2016) and machine translation (Mikolov et al., 2013a). Word embeddings highly depend on their training corpus. For example, technical terms used in scientific documents can have a different meaning in other domains, and words can change their meaning over time: "apple" did not mean a tech company before Apple Inc. was founded. On the other hand, such local or domain-specific representations are also not independent of each other, because most words are expected to have a similar meaning across domains.
There are many situations where a given target corpus is considered to have some structure. For example, when analyzing news articles, one can expect that articles published in 2000 and 2001 are more similar to each other than those from 2000 and 2010. When analyzing scientific articles, uses of technical terms are expected to be similar in articles on similar fields of science. This implies that the structure of a corpus can be useful side information for obtaining better word representations.
Various approaches to analyse semantic shifts in text have been proposed, where typically individual static embeddings are first trained and then aligned afterwards (e.g., Kulkarni et al., 2015; Hamilton et al., 2016; Kutuzov et al., 2018; Tahmasebi et al., 2018). As most word embeddings are invariant with respect to rotation and scaling, it is necessary to map word embeddings from different training procedures into the same vector space in order to compare them. This procedure is usually called alignment, for which orthogonal Procrustes can be applied, as in Hamilton et al. (2016).
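This post-alignment step is not part of our method, but for concreteness, here is a minimal PyTorch sketch of orthogonal Procrustes alignment (the closed-form SVD solution); the function name and variables are ours, and we assume both embedding matrices share the same vocabulary order.

```python
import torch

def procrustes_align(U_src, U_tgt):
    """Rotate U_src (V x d) onto U_tgt (V x d) with an orthogonal map Q.

    Solves min_Q ||U_src Q - U_tgt||_F subject to Q^T Q = I
    via the closed-form SVD solution (orthogonal Procrustes).
    """
    M = U_src.T @ U_tgt                 # (d x d) cross-covariance of the two spaces
    P, _, Qt = torch.linalg.svd(M)      # M = P diag(s) Qt
    return U_src @ (P @ Qt)             # apply the optimal rotation

# Hypothetical usage: align embeddings trained on one slice to another slice
# U_aligned = procrustes_align(U_1990, U_1991)
```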
Recently, new methods to train diachronic word embeddings have been proposed where the alignment process is integrated in the training process. Bamler and Mandt (2017) propose a Bayesian approach that extends the skip-gram model (Mikolov et al., 2013b). Rudolph and Blei (2018) analyse dynamic changes in word embeddings based on exponential family embeddings. Yao et al. (2018) propose Dynamic Word2Vec, where word embeddings for each year of the New York Times corpus are trained based on individual positive pointwise mutual information matrices and aligned simultaneously.
We argue that, apart from diachronic word embeddings, there is a need to train dynamic word embeddings that not only capture temporal shifts in language but, for instance, also semantic shifts between domains or regional differences. It is therefore important that those embeddings can be trained on small datasets. To this end, we propose two generalizations of Dynamic Word2Vec. Our first method is called Word2Vec with Structure Constraint (W2VConstr), where domain-specific embeddings are learned under regularization with any kind of structure. This method performs well when a respective graph structure is given a priori. For more general cases where no structure information is given, we propose our second method, called Word2Vec with Structure Prediction (W2VPred), where domain-specific embeddings and sub-corpora structure are learned at the same time. W2VPred simultaneously solves three central problems that arise with word embedding representations:
1. Words in the sub-corpora are embedded in the same vector space, and are therefore directly comparable without post-alignment.

2. The different representations are trained simultaneously on the whole corpus as well as on the sub-corpora, which makes embeddings for both general and domain-specific words robust, due to the information exchange between sub-corpora.

3. The estimated graph structure can be used for confirmatory evaluation when a reasonable prior structure is given. W2VPred together with W2VConstr identifies the cases where the given structure is not ideal and suggests a refined structure which leads to improved embedding performance; we call this method Word2Vec with Denoised Structure Constraint. When no structure is given, W2VPred provides insights on the structure of sub-corpora, e.g., similarity between authors or scientific domains.
All our methods rely on static word embeddings, as opposed to the contextualized word embeddings that are widely used today. As we learn one representation per slice, such as year or author, thus considering a much broader context than contextualized embeddings, we are able to find a meaningful structure between the corresponding slices. Another main advantage comes from the fact that our methods do not require any pre-training and can be run on a single GPU.
We test our methods on four different datasets with different structures (sequences, trees and general graphs), domains (news, Wikipedia, high literature) and languages (English and German). We show on numerous established evaluation methods that W2VConstr and W2VPred significantly outperform baseline methods with regard to general as well as domain-specific embedding quality. We also show that W2VPred is able to predict the structure of a given corpus, outperforming all baselines. Additionally, we show robust heuristics to select hyperparameters based on proxy measurements in a setting where the true structure is not known. Finally, we show how W2VPred can be used in an explorative setting to raise novel research questions in the field of Digital Humanities. Our code is available at github.com/stephaniebrandl/domain-word-embeddings.
2 Related Work
Various approaches to track, detect and quantify semantic shifts in text over time have been proposed (Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016; Zhang et al., 2016; Marjanen et al., 2019).
This research is driven by the hypothesis that semantic shifts occur, e.g., over time (Bleich et al., 2016) and across viewpoints (Azarbonyad et al., 2017), in political debates (Reese and Lewis, 2009) or caused by cultural developments (Lansdall-Welfare et al., 2017). Analysing those shifts can be crucial in political and social studies, but also in literary studies, as we show in Section 5.
Typically, methods first train individual static embeddings for different timestamps, and then align them afterwards (e.g., Kulkarni et al., 2015; Hamilton et al., 2016; Kutuzov et al., 2018; Devlin et al., 2018; Jawahar and Seddah, 2019; Hofmann et al., 2020; see also the comprehensive survey by Tahmasebi et al., 2018). Other approaches, which deal with more general structure (Azarbonyad et al., 2017; Gonen et al., 2020) and more general applications (Zeng et al., 2017; Shoemark et al., 2019), also rely on post-alignment of static word embeddings (Grave et al., 2019). With the rise of larger language models such as Bidirectional Encoder Representations from Transformers (BERT) and, with that, contextualized embeddings, part of the research focus has shifted towards detecting language change in contextualized word embeddings (e.g., Jawahar and Seddah, 2019; Hofmann et al., 2020).
Recent methods directly learn dynamic word embeddings in a common vector space without post-alignment: Bamler and Mandt (2017) proposed a Bayesian probabilistic model that generalizes the skip-gram model (Mikolov et al., 2013b) to learn dynamic word embeddings that evolve over time. Rudolph and Blei (2018) analysed dynamic changes in word embeddings based on exponential family embeddings, a probabilistic framework that generalizes the concept of word embeddings to other types of data (Rudolph et al., 2016). Yao et al. (2018) proposed Dynamic Word2Vec (DW2V) to learn individual word embeddings for each year of the New York Times dataset (1990-2016) while simultaneously aligning the embeddings in the same vector space. Specifically, they solve the following problem for each timepoint $t = 1, \dots, T$ sequentially:
$$\min_{U_t} \; L_F + \tau L_R + \lambda L_D, \quad \text{where} \quad (1)$$

$$L_F = \|Y_t - U_t U_t^\top\|_F^2, \quad L_R = \|U_t\|_F^2, \quad L_D = \|U_{t-1} - U_t\|_F^2 + \|U_t - U_{t+1}\|_F^2 \quad (2)$$
represent the losses for data fidelity, regularization, and diachronic constraint, respectively. $U_t \in \mathbb{R}^{V \times d}$ is the matrix consisting of the $d$-dimensional embeddings of the $V$ words in the vocabulary, and $Y_t \in \mathbb{R}^{V \times V}$ represents the positive pointwise mutual information (PPMI) matrix (Levy and Goldberg, 2014). The diachronic constraint $L_D$ encourages alignment of the word embeddings, with the parameter $\lambda$ controlling how much the embeddings are allowed to be dynamic ($\lambda = 0$: no alignment; $\lambda \to \infty$: static embeddings).
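To make the objective concrete, here is a minimal PyTorch sketch of the per-timestep DW2V loss in Eqs. (1)-(2); the function and variable names are ours, and `Y_t` is assumed to be a precomputed PPMI matrix.

```python
import torch

def dw2v_loss(U_t, U_prev, U_next, Y_t, tau, lam):
    """DW2V loss for one time slice (Eqs. 1-2).

    U_prev, U_t, U_next: (V, d) embedding matrices for slices t-1, t, t+1.
    Y_t: (V, V) PPMI matrix for slice t. tau, lam: regularization weights.
    """
    L_F = torch.norm(Y_t - U_t @ U_t.T, p="fro") ** 2    # data fidelity
    L_R = torch.norm(U_t, p="fro") ** 2                  # regularization
    L_D = (torch.norm(U_prev - U_t, p="fro") ** 2        # diachronic constraint
           + torch.norm(U_t - U_next, p="fro") ** 2)
    return L_F + tau * L_R + lam * L_D
```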
3 Methods
By generalizing DW2V, we propose two methods, one for the case where the sub-corpora structure is given as prior knowledge, and the other for the case where no structure is given a priori. We also argue that combining both methods can improve the performance in cases where some prior information is available but not necessarily reliable.
3.1 Word2Vec with Structure Constraint
We reformulate the diachronic term in Eq. (1) as

$$L_D = \sum_{t'=1}^{T} W^{\mathrm{diac}}_{t,t'} \|U_t - U_{t'}\|_F^2 \quad \text{with} \quad W^{\mathrm{diac}}_{t,t'} = \mathbb{1}(|t - t'| = 1), \quad (3)$$

where $\mathbb{1}(\cdot)$ denotes the indicator function. This allows us to generalize DW2V for different neighborhood structures: instead of the chronological sequence (3), we assume $W \in \mathbb{R}^{T \times T}$ to be an arbitrary affinity matrix representing the underlying semantic structure, given as prior knowledge.
Let $D \in \mathbb{R}^{T \times T}$ be the pairwise distance matrix between embeddings such that

$$D_{t,t'} = \|U_t - U_{t'}\|_F^2, \quad (4)$$

and we impose regularization on the distances instead of on the norm of each embedding. This yields the following optimization problem:

$$\min_{U_t} \; L_F + \tau L_{RD} + \lambda L_S, \quad \text{where} \quad (5)$$

$$L_F = \|Y_t - U_t U_t^\top\|_F^2, \quad L_{RD} = \|D\|_F, \quad L_S = \sum_{t'=1}^{T} W_{t,t'} D_{t,t'}. \quad (6)$$
We call this generalization of DW2V Word2Vec
with Structure Constraint (W2VConstr).
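A minimal PyTorch sketch of the W2VConstr objective in Eqs. (4)-(6) follows; the helper names are ours, `embeddings` is assumed to be a list of the T matrices $U_1, \dots, U_T$, and `W` a given $T \times T$ affinity matrix.

```python
import torch

def pairwise_distances(embeddings):
    """D[t, t'] = ||U_t - U_t'||_F^2 for a list of (V, d) matrices (Eq. 4)."""
    rows = []
    for U_t in embeddings:
        row = [torch.norm(U_t - U_s, p="fro") ** 2 for U_s in embeddings]
        rows.append(torch.stack(row))
    return torch.stack(rows)                    # (T, T) distance matrix

def w2v_constr_loss(t, embeddings, Y_t, W, tau, lam):
    """W2VConstr loss for slice t (Eqs. 5-6), given the affinity matrix W."""
    U_t = embeddings[t]
    D = pairwise_distances(embeddings)
    L_F = torch.norm(Y_t - U_t @ U_t.T, p="fro") ** 2    # data fidelity
    L_RD = torch.norm(D, p="fro")                        # distance regularization
    L_S = (W[t] * D[t]).sum()                            # structure constraint
    return L_F + tau * L_RD + lam * L_S
```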
3.2 Word2Vec with Structure Prediction
When no structure information is given, we need to estimate the similarity matrix $W$ from the data. We define $W$ based on the similarity between embeddings. Specifically, we initialize (each entry of) the embeddings $\{U_t\}_{t=1}^{T}$ by independent uniform distributions on $[0, 1)$. Then, in each iteration, we compute the distance matrix $D$ by Eq. (4), set $\widetilde{W}$ to its (entry-wise) inverse, i.e.,

$$\widetilde{W}_{t,t'} = \begin{cases} D_{t,t'}^{-1} & \text{for } t \neq t', \\ 0 & \text{for } t = t', \end{cases} \quad (7)$$

and normalize it according to the corresponding column and row:

$$W_{t,t'} \leftarrow \frac{\widetilde{W}_{t,t'}}{\sum_{t''} \widetilde{W}_{t,t''} + \sum_{t''} \widetilde{W}_{t'',t'}}. \quad (8)$$
The structure loss (6) with the similarity matrix $W$ updated by Eqs. (7) and (8) constrains the distances between embeddings according to the similarity structure that is at the same time estimated from the distances between embeddings. We call this variant Word2Vec with Structure Prediction (W2VPred). Effectively, $W$ serves as a weighting factor that strengthens connections between close embeddings.
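The structure update of Eqs. (7)-(8) can be sketched as follows (our own helper, reusing `pairwise_distances` from the sketch above); it turns the current inter-slice distances into a normalized affinity matrix.

```python
import torch

def predict_structure(embeddings, eps=1e-8):
    """Estimate the affinity matrix W from the current embeddings (Eqs. 7-8)."""
    D = pairwise_distances(embeddings)                    # Eq. (4)
    off_diag = 1.0 - torch.eye(D.shape[0])                # zero on the diagonal
    # Eq. (7): entry-wise inverse off the diagonal; eps is our addition for stability
    W_tilde = off_diag / (D + eps)
    # Eq. (8): normalize each entry by its row sum plus its column sum
    row_sums = W_tilde.sum(dim=1, keepdim=True)           # sum over t'' of W~[t, t'']
    col_sums = W_tilde.sum(dim=0, keepdim=True)           # sum over t'' of W~[t'', t']
    return W_tilde / (row_sums + col_sums + eps)
```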
3.3 Word2Vec with Denoised Structure Constraint
We propose a third method that combines W2VConstr and W2VPred for the scenario where W2VConstr results in poor word embeddings because the a priori structure is not optimal. In this case, we suggest applying W2VPred and using the resulting structure as input for W2VConstr. This procedure needs prior knowledge of the dataset and a human in the loop to interpret the structure predicted by W2VPred in order to add or remove specific edges in the new ground-truth structure. In the experiment section, we will condense the structure predicted by W2VPred into a sparse, denoised ground-truth structure that is meaningful. We call this method Word2Vec with Denoised Structure Constraint (W2VDen).
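As a purely illustrative sketch of how a predicted structure might be condensed into a sparse graph for W2VConstr: the denoising in the paper is manual and dataset-specific, whereas the heuristic below simply keeps the k strongest edges per node and symmetrizes the result; `k` and the thresholding rule are our own assumptions.

```python
import torch

def sparsify_structure(W, k=2):
    """Keep the k strongest edges per node of a predicted affinity matrix W
    and binarize the result. This is only one possible denoising heuristic;
    the paper relies on a human in the loop for this step."""
    W_sparse = torch.zeros_like(W)
    topk = torch.topk(W, k=k, dim=1).indices        # strongest neighbours per row
    for t in range(W.shape[0]):
        W_sparse[t, topk[t]] = 1.0
    W_sparse = torch.maximum(W_sparse, W_sparse.T)  # symmetrize the edge set
    W_sparse.fill_diagonal_(0.0)                    # no self-loops
    return W_sparse
```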
3.4 Optimization
We solve problem (5) iteratively for each embedding $U_t$, keeping the other embeddings $\{U_{t'}\}_{t' \neq t}$ fixed. We define one epoch as complete when $\{U_t\}$ has been updated for all $t$. We apply gradient descent with Adam (Kingma and Ba, 2014), using the default values for the exponential decay rates given in the original paper and a learning rate of 0.1. The learning rate is reduced to 0.05 after 100 epochs and to 0.01 after 500 epochs, with a total of 1000 epochs. Both models are implemented in PyTorch. W2VPred updates $W$ by Eqs. (7) and (8) after every iteration.
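Putting the pieces together, a minimal training loop consistent with this description might look as follows; it reuses the `w2v_constr_loss` and `predict_structure` helpers sketched above, and the preparation of the per-slice PPMI matrices `Y` is assumed to happen elsewhere.

```python
import torch

def train_w2vpred(Y, V, d, T, tau, lam, epochs=1000):
    """Sketch of the alternating optimization described in Section 3.4.

    Y: list of T precomputed (V, V) PPMI matrices, one per slice.
    """
    # Uniform initialization on [0, 1), as described in Section 3.2
    embeddings = [torch.rand(V, d, requires_grad=True) for _ in range(T)]
    optimizers = [torch.optim.Adam([U], lr=0.1) for U in embeddings]

    for epoch in range(epochs):
        # Learning-rate schedule: 0.1, then 0.05 after 100 epochs, 0.01 after 500
        lr = 0.1 if epoch < 100 else (0.05 if epoch < 500 else 0.01)
        for opt in optimizers:
            for group in opt.param_groups:
                group["lr"] = lr
        for t in range(T):                           # one epoch updates every slice
            with torch.no_grad():
                W = predict_structure(embeddings)    # W2VPred: update W each iteration
            for opt in optimizers:                   # clear stale gradients everywhere
                opt.zero_grad()
            loss = w2v_constr_loss(t, embeddings, Y[t], W, tau, lam)
            loss.backward()
            optimizers[t].step()                     # update only U_t, others stay fixed
    return embeddings
```

For W2VConstr, one would simply pass a fixed, given affinity matrix `W` instead of recomputing it with `predict_structure`.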
4 Experiments on Benchmark Data
We conducted four experiments, starting with well-known settings and datasets and incrementally moving to new datasets with different structures. The first experiment focuses on the general embedding quality, the second one presents results on domain-specific embeddings, the third one evaluates the method's ability to predict structure, and the fourth one shows the method's performance on various word similarity tasks. In the following subsections, we first describe the data and preprocessing, and then present the results. Further details on implementation and hyperparameters can be found in Appendix A.
4.1 Datasets
We evaluated our methods on the following three
benchmark datasets.
Category                                 #Articles
Natural Sciences                              8536
  Chemistry                                  19164
  Computer Science                           11201
  Biology                                    10988
Engineering & Technology                     20091
  Civil Engineering                          17797
  Electrical & Electronic Engineering         6809
  Mechanical Engineering                      4978
Social Sciences                              17347
  Business & Economics                       14747
  Law                                        13265
  Psychology                                  5788
Humanities                                   15066
  Literature & Languages                     24800
  History & Archaeology                      16453
  Religion & Philosophy & Ethics             19356

Table 1: Categories and the number of articles in the WikiFoS dataset. Each cluster contains 4 categories (rows): the top one is the main category and the following 3 are subcategories. Fields joined by & originate from 2 separate categories in Wikipedia[3] but were joined according to the OECD's definition.[2]
New York Times (NYT): The New York Times dataset[1] (NYT) contains headlines, lead texts and paragraphs of English news articles published online and offline between January 1990 and June 2016, with a total of 100,945 documents. We grouped the dataset by years, with 1990-1998 as the train set and 1999-2016 as the test set.
Wikipedia Field of Science and Technology (WikiFoS): We selected categories from the OECD's list of Fields of Science and Technology[2] and downloaded the corresponding articles from the English Wikipedia. The resulting dataset, Wikipedia Field of Science and Technology (WikiFoS), contains four clusters, each of which consists of one main category and three subcategories, with 226,386 unique articles in total (see Table 1). Articles belonging to multiple categories[3] were randomly assigned to a single category in order to avoid similarity caused by overlapping texts rather than by structural similarity. In each cat-
[1] sites.google.com/site/zijunyaorutgers
[2] oecd.org/science/inno/38235147.pdf
[3] wikipedia.org/wiki/Wikipedia:Contents/Categories