Domain-Specific Word Embeddings with Structure Prediction
Stephanie Brandl1,2, David Lassner1,3, Anne Baillot4, Shinichi Nakajima1,3,5
1TU Berlin 2University of Copenhagen 3BIFOLD
4Le Mans Université 5RIKEN Center for AIP
brandl@di.ku.dk, lassner@tu-berlin.de
Authors contributed equally.
Abstract
Complementary to finding good general word embeddings, an important question for representation learning is to find dynamic word embeddings, e.g., across time or domain. Current methods do not offer a way to use or predict information on structure between sub-corpora, time or domain, and dynamic embeddings can only be compared after post-alignment. We propose novel word embedding methods that provide general word representations for the whole corpus, domain-specific representations for each sub-corpus, sub-corpus structure, and embedding alignment simultaneously. We present an empirical evaluation on New York Times articles and two English Wikipedia datasets with articles on science and philosophy. Our method, called Word2Vec with Structure Prediction (W2VPred), provides better performance than baselines on general analogy tests, domain-specific analogy tests, and multiple word similarity evaluations, as well as better structure prediction performance when no structure is given a priori. As a use case in the field of Digital Humanities, we demonstrate how to raise novel research questions for high literature from the German Text Archive.
1 Introduction
Word embeddings (Mikolov et al., 2013b; Pennington et al., 2014) are a powerful tool for word-level representation in a vector space that captures semantic and syntactic relations between words. They have been successfully used in many applications such as text classification (Joulin et al., 2016) and machine translation (Mikolov et al., 2013a). Word embeddings highly depend on their training corpus. For example, technical terms used in scientific documents can have a different meaning in other domains, and words can change their meaning over time: "apple" did not mean a tech company before Apple Inc. was founded. On the other hand, such local or domain-specific representations are also not independent of each other, because most words are expected to have a similar meaning across domains.
There are many situations where a given target corpus is considered to have some structure. For example, when analyzing news articles, one can expect that articles published in 2000 and 2001 are more similar to each other than those from 2000 and 2010. When analyzing scientific articles, uses of technical terms are expected to be similar in articles on similar fields of science. This implies that the structure of a corpus can be useful side information for obtaining better word representations.
Various approaches to analyse semantic shifts in text have been proposed, where typically individual static embeddings are first trained and then aligned afterwards (e.g., Kulkarni et al., 2015; Hamilton et al., 2016; Kutuzov et al., 2018; Tahmasebi et al., 2018). As most word embeddings are invariant with respect to rotation and scaling, it is necessary to map word embeddings from different training procedures into the same vector space in order to compare them. This procedure is usually called alignment, for which orthogonal Procrustes can be applied, as in Hamilton et al. (2016).
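This post-alignment step is not part of our method, but for concreteness, here is a minimal PyTorch sketch of orthogonal Procrustes alignment (the closed-form SVD solution); the function name and variables are ours, and we assume both embedding matrices share the same vocabulary order.

```python
import torch

def procrustes_align(U_src, U_tgt):
    """Rotate U_src (V x d) onto U_tgt (V x d) with an orthogonal map Q.

    Solves min_Q ||U_src Q - U_tgt||_F subject to Q^T Q = I
    via the closed-form SVD solution (orthogonal Procrustes).
    """
    M = U_src.T @ U_tgt                 # (d x d) cross-covariance of the two spaces
    P, _, Qt = torch.linalg.svd(M)      # M = P diag(s) Qt
    return U_src @ (P @ Qt)             # apply the optimal rotation

# Hypothetical usage: align embeddings trained on one slice to another slice
# U_aligned = procrustes_align(U_1990, U_1991)
```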
Recently, new methods to train diachronic word embeddings have been proposed where the alignment process is integrated in the training process. Bamler and Mandt (2017) propose a Bayesian approach that extends the skip-gram model (Mikolov et al., 2013b). Rudolph and Blei (2018) analyse dynamic changes in word embeddings based on exponential family embeddings. Yao et al. (2018) propose Dynamic Word2Vec, where word embeddings for each year of the New York Times corpus are trained based on individual positive pointwise mutual information matrices and aligned simultaneously.
We argue that, apart from diachronic word embeddings, there is a need to train dynamic word embeddings that not only capture temporal shifts in language but, for instance, also semantic shifts between domains or regional differences. It is therefore important that those embeddings can be trained on small datasets. To this end, we propose two generalizations of Dynamic Word2Vec. Our first method is called Word2Vec with Structure Constraint (W2VConstr), where domain-specific embeddings are learned under regularization with any kind of structure. This method performs well when a respective graph structure is given a priori. For more general cases where no structure information is given, we propose our second method, called Word2Vec with Structure Prediction (W2VPred), where domain-specific embeddings and sub-corpora structure are learned at the same time. W2VPred simultaneously solves three central problems that arise with word embedding representations:
1. Words in the sub-corpora are embedded in the same vector space, and are therefore directly comparable without post-alignment.

2. The different representations are trained simultaneously on the whole corpus as well as on the sub-corpora, which makes embeddings for both general and domain-specific words robust, due to the information exchange between sub-corpora.

3. The estimated graph structure can be used for confirmatory evaluation when a reasonable prior structure is given. W2VPred together with W2VConstr identifies the cases where the given structure is not ideal and suggests a refined structure which leads to improved embedding performance; we call this method Word2Vec with Denoised Structure Constraint. When no structure is given, W2VPred provides insights on the structure of sub-corpora, e.g., similarity between authors or scientific domains.
All our methods rely on static word embeddings, as opposed to the contextualized word embeddings that are widely used today. As we learn one representation per slice, such as year or author, thus considering a much broader context than contextualized embeddings, we are able to find a meaningful structure between the corresponding slices. Another main advantage comes from the fact that our methods do not require any pre-training and can be run on a single GPU.
We test our methods on four different datasets with different structures (sequences, trees and general graphs), domains (news, Wikipedia, high literature) and languages (English and German). We show on numerous established evaluation methods that W2VConstr and W2VPred significantly outperform baseline methods with regard to general as well as domain-specific embedding quality. We also show that W2VPred is able to predict the structure of a given corpus, outperforming all baselines. Additionally, we show robust heuristics to select hyperparameters based on proxy measurements in a setting where the true structure is not known. Finally, we show how W2VPred can be used in an explorative setting to raise novel research questions in the field of Digital Humanities. Our code is available at github.com/stephaniebrandl/domain-word-embeddings.
2 Related Work
Various approaches to track, detect and quantify semantic shifts in text over time have been proposed (Kim et al., 2014; Kulkarni et al., 2015; Hamilton et al., 2016; Zhang et al., 2016; Marjanen et al., 2019).
This research is driven by the hypothesis that semantic shifts occur, e.g., over time (Bleich et al., 2016) and across viewpoints (Azarbonyad et al., 2017), in political debates (Reese and Lewis, 2009) or caused by cultural developments (Lansdall-Welfare et al., 2017). Analysing those shifts can be crucial in political and social studies, but also in literary studies, as we show in Section 5.
Typically, methods first train individual static embeddings for different timestamps, and then align them afterwards (e.g., Kulkarni et al., 2015; Hamilton et al., 2016; Kutuzov et al., 2018; Devlin et al., 2018; Jawahar and Seddah, 2019; Hofmann et al., 2020; see also the comprehensive survey by Tahmasebi et al., 2018). Other approaches, which deal with more general structure (Azarbonyad et al., 2017; Gonen et al., 2020) and more general applications (Zeng et al., 2017; Shoemark et al., 2019), also rely on post-alignment of static word embeddings (Grave et al., 2019). With the rise of larger language models such as Bidirectional Encoder Representations from Transformers (BERT) and, with that, contextualized embeddings, part of the research focus has shifted towards detecting language change in contextualized word embeddings (e.g., Jawahar and Seddah, 2019; Hofmann et al., 2020).
Recent methods directly learn dynamic word embeddings in a common vector space without post-alignment: Bamler and Mandt (2017) proposed a Bayesian probabilistic model that generalizes the skip-gram model (Mikolov et al., 2013b) to learn dynamic word embeddings that evolve over time. Rudolph and Blei (2018) analysed dynamic changes in word embeddings based on exponential family embeddings, a probabilistic framework that generalizes the concept of word embeddings to other types of data (Rudolph et al., 2016). Yao et al. (2018) proposed Dynamic Word2Vec (DW2V) to learn individual word embeddings for each year of the New York Times dataset (1990-2016) while simultaneously aligning the embeddings in the same vector space. Specifically, they solve the following problem for each timepoint $t = 1, \dots, T$ sequentially:
$$\min_{U_t} \; L_F + \tau L_R + \lambda L_D, \quad \text{where} \quad (1)$$

$$L_F = \|Y_t - U_t U_t^\top\|_F^2, \quad L_R = \|U_t\|_F^2, \quad L_D = \|U_{t-1} - U_t\|_F^2 + \|U_t - U_{t+1}\|_F^2 \quad (2)$$
represent the losses for data fidelity, regularization, and diachronic constraint, respectively. $U_t \in \mathbb{R}^{V \times d}$ is the matrix consisting of the $d$-dimensional embeddings of the $V$ words in the vocabulary, and $Y_t \in \mathbb{R}^{V \times V}$ represents the positive pointwise mutual information (PPMI) matrix (Levy and Goldberg, 2014). The diachronic constraint $L_D$ encourages alignment of the word embeddings, with the parameter $\lambda$ controlling how much the embeddings are allowed to be dynamic ($\lambda = 0$: no alignment; $\lambda \to \infty$: static embeddings).
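To make the objective concrete, here is a minimal PyTorch sketch of the per-timestep DW2V loss in Eqs. (1)-(2); the function and variable names are ours, and `Y_t` is assumed to be a precomputed PPMI matrix.

```python
import torch

def dw2v_loss(U_t, U_prev, U_next, Y_t, tau, lam):
    """DW2V loss for one time slice (Eqs. 1-2).

    U_prev, U_t, U_next: (V, d) embedding matrices for slices t-1, t, t+1.
    Y_t: (V, V) PPMI matrix for slice t. tau, lam: regularization weights.
    """
    L_F = torch.norm(Y_t - U_t @ U_t.T, p="fro") ** 2    # data fidelity
    L_R = torch.norm(U_t, p="fro") ** 2                  # regularization
    L_D = (torch.norm(U_prev - U_t, p="fro") ** 2        # diachronic constraint
           + torch.norm(U_t - U_next, p="fro") ** 2)
    return L_F + tau * L_R + lam * L_D
```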
3 Methods
By generalizing DW2V, we propose two methods, one for the case where the sub-corpora structure is given as prior knowledge, and the other for the case where no structure is given a priori. We also argue that combining both methods can improve the performance in cases where some prior information is available but not necessarily reliable.
3.1 Word2Vec with Structure Constraint
We reformulate the diachronic term in Eq. (1) as

$$L_D = \sum_{t'=1}^{T} W^{\mathrm{diac}}_{t,t'} \|U_t - U_{t'}\|_F^2 \quad \text{with} \quad W^{\mathrm{diac}}_{t,t'} = \mathbb{1}(|t - t'| = 1), \quad (3)$$

where $\mathbb{1}(\cdot)$ denotes the indicator function. This allows us to generalize DW2V for different neighborhood structures: instead of the chronological sequence (3), we assume $W \in \mathbb{R}^{T \times T}$ to be an arbitrary affinity matrix representing the underlying semantic structure, given as prior knowledge.
Let $D \in \mathbb{R}^{T \times T}$ be the pairwise distance matrix between embeddings such that

$$D_{t,t'} = \|U_t - U_{t'}\|_F^2, \quad (4)$$

and we impose regularization on the distances instead of on the norm of each embedding. This yields the following optimization problem:

$$\min_{U_t} \; L_F + \tau L_{RD} + \lambda L_S, \quad \text{where} \quad (5)$$

$$L_F = \|Y_t - U_t U_t^\top\|_F^2, \quad L_{RD} = \|D\|_F, \quad L_S = \sum_{t'=1}^{T} W_{t,t'} D_{t,t'}. \quad (6)$$
We call this generalization of DW2V Word2Vec
with Structure Constraint (W2VConstr).
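A minimal PyTorch sketch of the W2VConstr objective in Eqs. (4)-(6) follows; the helper names are ours, `embeddings` is assumed to be a list of the T matrices $U_1, \dots, U_T$, and `W` a given $T \times T$ affinity matrix.

```python
import torch

def pairwise_distances(embeddings):
    """D[t, t'] = ||U_t - U_t'||_F^2 for a list of (V, d) matrices (Eq. 4)."""
    rows = []
    for U_t in embeddings:
        row = [torch.norm(U_t - U_s, p="fro") ** 2 for U_s in embeddings]
        rows.append(torch.stack(row))
    return torch.stack(rows)                    # (T, T) distance matrix

def w2v_constr_loss(t, embeddings, Y_t, W, tau, lam):
    """W2VConstr loss for slice t (Eqs. 5-6), given the affinity matrix W."""
    U_t = embeddings[t]
    D = pairwise_distances(embeddings)
    L_F = torch.norm(Y_t - U_t @ U_t.T, p="fro") ** 2    # data fidelity
    L_RD = torch.norm(D, p="fro")                        # distance regularization
    L_S = (W[t] * D[t]).sum()                            # structure constraint
    return L_F + tau * L_RD + lam * L_S
```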
3.2 Word2Vec with Structure Prediction
When no structure information is given, we need to estimate the similarity matrix $W$ from the data. We define $W$ based on the similarity between embeddings. Specifically, we initialize (each entry of) the embeddings $\{U_t\}_{t=1}^{T}$ by independent uniform distributions on $[0, 1)$. Then, in each iteration, we compute the distance matrix $D$ by Eq. (4), set $\widetilde{W}$ to its (entry-wise) inverse, i.e.,

$$\widetilde{W}_{t,t'} = \begin{cases} D_{t,t'}^{-1} & \text{for } t \neq t', \\ 0 & \text{for } t = t', \end{cases} \quad (7)$$

and normalize it according to the corresponding column and row:

$$W_{t,t'} \leftarrow \frac{\widetilde{W}_{t,t'}}{\sum_{t''} \widetilde{W}_{t,t''} + \sum_{t''} \widetilde{W}_{t'',t'}}. \quad (8)$$
The structure loss (6) with the similarity matrix $W$ updated by Eqs. (7) and (8) constrains the distances between embeddings according to the similarity structure that is at the same time estimated from the distances between embeddings. We call this variant Word2Vec with Structure Prediction (W2VPred). Effectively, $W$ serves as a weighting factor that strengthens connections between close embeddings.
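The structure update of Eqs. (7)-(8) can be sketched as follows (our own helper, reusing `pairwise_distances` from the sketch above); it turns the current inter-slice distances into a normalized affinity matrix.

```python
import torch

def predict_structure(embeddings, eps=1e-8):
    """Estimate the affinity matrix W from the current embeddings (Eqs. 7-8)."""
    D = pairwise_distances(embeddings)                    # Eq. (4)
    off_diag = 1.0 - torch.eye(D.shape[0])                # zero on the diagonal
    # Eq. (7): entry-wise inverse off the diagonal; eps is our addition for stability
    W_tilde = off_diag / (D + eps)
    # Eq. (8): normalize each entry by its row sum plus its column sum
    row_sums = W_tilde.sum(dim=1, keepdim=True)           # sum over t'' of W~[t, t'']
    col_sums = W_tilde.sum(dim=0, keepdim=True)           # sum over t'' of W~[t'', t']
    return W_tilde / (row_sums + col_sums + eps)
```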
3.3 Word2Vec with Denoised Structure Constraint
We propose a third method that combines W2VConstr and W2VPred for the scenario where W2VConstr results in poor word embeddings because the a priori structure is not optimal. In this case, we suggest applying W2VPred and using the resulting structure as input for W2VConstr. This procedure needs prior knowledge of the dataset and a human in the loop to interpret the structure predicted by W2VPred in order to add or remove specific edges in the new ground-truth structure. In the experiment section, we will condense the structure predicted by W2VPred into a sparse, denoised ground-truth structure that is meaningful. We call this method Word2Vec with Denoised Structure Constraint (W2VDen).
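As a purely illustrative sketch of how a predicted structure might be condensed into a sparse graph for W2VConstr: the denoising in the paper is manual and dataset-specific, whereas the heuristic below simply keeps the k strongest edges per node and symmetrizes the result; `k` and the thresholding rule are our own assumptions.

```python
import torch

def sparsify_structure(W, k=2):
    """Keep the k strongest edges per node of a predicted affinity matrix W
    and binarize the result. This is only one possible denoising heuristic;
    the paper relies on a human in the loop for this step."""
    W_sparse = torch.zeros_like(W)
    topk = torch.topk(W, k=k, dim=1).indices        # strongest neighbours per row
    for t in range(W.shape[0]):
        W_sparse[t, topk[t]] = 1.0
    W_sparse = torch.maximum(W_sparse, W_sparse.T)  # symmetrize the edge set
    W_sparse.fill_diagonal_(0.0)                    # no self-loops
    return W_sparse
```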
3.4 Optimization
We solve problem (5) iteratively for each embedding $U_t$, keeping the other embeddings $\{U_{t'}\}_{t' \neq t}$ fixed. We define one epoch as complete when $\{U_t\}$ has been updated for all $t$. We apply gradient descent with Adam (Kingma and Ba, 2014), using the default values for the exponential decay rates given in the original paper and a learning rate of 0.1. The learning rate is reduced to 0.05 after 100 epochs and to 0.01 after 500 epochs, with a total of 1000 epochs. Both models are implemented in PyTorch. W2VPred updates $W$ by Eqs. (7) and (8) after every iteration.
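Putting the pieces together, a minimal training loop consistent with this description might look as follows; it reuses the `w2v_constr_loss` and `predict_structure` helpers sketched above, and the preparation of the per-slice PPMI matrices `Y` is assumed to happen elsewhere.

```python
import torch

def train_w2vpred(Y, V, d, T, tau, lam, epochs=1000):
    """Sketch of the alternating optimization described in Section 3.4.

    Y: list of T precomputed (V, V) PPMI matrices, one per slice.
    """
    # Uniform initialization on [0, 1), as described in Section 3.2
    embeddings = [torch.rand(V, d, requires_grad=True) for _ in range(T)]
    optimizers = [torch.optim.Adam([U], lr=0.1) for U in embeddings]

    for epoch in range(epochs):
        # Learning-rate schedule: 0.1, then 0.05 after 100 epochs, 0.01 after 500
        lr = 0.1 if epoch < 100 else (0.05 if epoch < 500 else 0.01)
        for opt in optimizers:
            for group in opt.param_groups:
                group["lr"] = lr
        for t in range(T):                           # one epoch updates every slice
            with torch.no_grad():
                W = predict_structure(embeddings)    # W2VPred: update W each iteration
            for opt in optimizers:                   # clear stale gradients everywhere
                opt.zero_grad()
            loss = w2v_constr_loss(t, embeddings, Y[t], W, tau, lam)
            loss.backward()
            optimizers[t].step()                     # update only U_t, others stay fixed
    return embeddings
```

For W2VConstr, one would simply pass a fixed, given affinity matrix `W` instead of recomputing it with `predict_structure`.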
4 Experiments on Benchmark Data
We conducted four experiments, starting with well-known settings and datasets and incrementally moving to new datasets with different structures. The first experiment focuses on the general embedding quality, the second one presents results on domain-specific embeddings, the third one evaluates the method's ability to predict structure, and the fourth one shows the method's performance on various word similarity tasks. In the following subsections, we first describe the data and preprocessing, and then present the results. Further details on implementation and hyperparameters can be found in Appendix A.
4.1 Datasets
We evaluated our methods on the following three
benchmark datasets.
Category                                 #Articles
Natural Sciences                              8536
  Chemistry                                  19164
  Computer Science                           11201
  Biology                                    10988
Engineering & Technology                     20091
  Civil Engineering                          17797
  Electrical & Electronic Engineering         6809
  Mechanical Engineering                      4978
Social Sciences                              17347
  Business & Economics                       14747
  Law                                        13265
  Psychology                                  5788
Humanities                                   15066
  Literature & Languages                     24800
  History & Archaeology                      16453
  Religion & Philosophy & Ethics             19356

Table 1: Categories and the number of articles in the WikiFoS dataset. Each cluster contains 4 categories (rows): the top one is the main category and the following 3 are subcategories. Fields joined by & originate from 2 separate categories in Wikipedia[3] but were joined according to the OECD's definition.[2]
New York Times (NYT): The New York Times dataset[1] (NYT) contains headlines, lead texts and paragraphs of English news articles published online and offline between January 1990 and June 2016, with a total of 100,945 documents. We grouped the dataset by years, with 1990-1998 as the train set and 1999-2016 as the test set.
Wikipedia Field of Science and Technology (WikiFoS): We selected categories from the OECD's list of Fields of Science and Technology[2] and downloaded the corresponding articles from the English Wikipedia. The resulting dataset, Wikipedia Field of Science and Technology (WikiFoS), contains four clusters, each of which consists of one main category and three subcategories, with 226,386 unique articles in total (see Table 1). Articles belonging to multiple categories[3] were randomly assigned to a single category in order to avoid similarity caused by overlapping texts rather than by structural similarity. In each cat-
[1] sites.google.com/site/zijunyaorutgers
[2] oecd.org/science/inno/38235147.pdf
[3] wikipedia.org/wiki/Wikipedia:Contents/Categories