Predicting Long-Term Citations from Short-Term Linguistic Influence
Sandeep Soni and David Bamman
University of California, Berkeley
{sandeepsoni,dbamman}@berkeley.edu
Jacob Eisenstein
Google Research
jeisenstein@google.com
Abstract
A standard measure of the influence of a re-
search paper is the number of times it is cited.
However, papers may be cited for many rea-
sons, and citation count offers limited informa-
tion about the extent to which a paper affected
the content of subsequent publications. We
therefore propose a novel method to quantify
linguistic influence in timestamped document
collections. There are two main steps: first,
identify lexical and semantic changes using
contextual embeddings and word frequencies;
second, aggregate information about these
changes into per-document influence scores
by estimating a high-dimensional Hawkes pro-
cess with a low-rank parameter matrix. We
show that this measure of linguistic influence
is predictive of future citations: the estimate
of linguistic influence from the two years af-
ter a paper’s publication is correlated with and
predictive of its citation count in the follow-
ing three years. This is demonstrated using
an online evaluation with incremental tempo-
ral training/test splits, in comparison with a
strong baseline that includes predictors for ini-
tial citation counts, topics, and lexical features.
1 Introduction
The citation count of a paper is a standard, eas-
ily measurable proxy for its influence (Cronin,
2005). Researchers have shown that citation count is strongly correlated with the quality of scientific work (e.g., Lawani, 1986) and with the recognition that a paper or an author receives (e.g., Inhaber and Przednowek, 1976), and that it informs policy decisions such as the assessment of scientific performance (e.g., Cronin, 2005). Consequently, citation count is a ubiquitously deployed and important measure of a paper, with whole subfields of research dedicated to its analysis (Bornmann and Daniel, 2008).
However, papers may be cited (or not cited) for
many reasons, and citation count alone is insuffi-
cient to explain the emergence and the spread of
Figure 1: Research papers that are more linguistically influential within an initial time window tend to receive more citations in the long term. The x-axis shows lexical and semantic influence, binned into quantiles (see § 2); the y-axis shows the corresponding regression coefficients and standard errors, in units of Z-normalized log future citations (see § 5.3). To give a sense of scale, for papers published in 2012, being in the top decile of semantic influence corresponds to a 14.5% increase in long-term citations, as compared to control-matched papers that received the same number of short-term citations and covered similar topics but were in the bottom half by semantic influence.
research ideas and trends. For this reason, we turn
to content analysis: to what extent can the text
of a research paper be said to influence the trajec-
tory of the research community? In this paper, we
present a novel technique for estimating the influ-
ence of documents in a timestamped corpus. To
demonstrate the validity of the resulting measure
of linguistic influence, we show that it is predictive
of future citations. Specifically, we find that: (1)
papers that our metric judges as highly influential
in the short term tend to receive more citations in
the long term; (2) short-term linguistic influence
increases the ability to predict long-term citations
over strong baselines.
Our modeling approach focuses on semantic
changes, and treats the temporal usage of semantic
innovations as emissions from a parametric low-
rank Hawkes process (Hawkes, 1971). The pa-
rameters of the Hawkes process correspond to the
linguistic influence of each paper, aggregated over
thousands of linguistic changes. The changes them-
selves are identified through analysis of contex-
tual embeddings, with the goal of finding words
whose meaning has shifted over time (Traugott and
Dasher, 2001). Though there are several computational methods to detect semantic changes (e.g., Kim et al., 2014; Hamilton et al., 2016; Rosenfeld and Erk, 2018; Dubossarsky et al., 2019), including methods based on contextual embeddings (e.g., Kutuzov and Giulianelli, 2020), our proposed method
focuses on detecting smooth, non-bursty semantic
changes; we also go further than other methods by
distinguishing old and contemporary usages of an
identified semantic change.
We show through a multivariate regression that
our estimates of semantic influence of each paper
are positively correlated with their long-term cita-
tions, even after controlling for the initial citations,
the content of the paper in terms of topics, and the
lexical influence of the paper (see Figure 1). Fur-
ther, we formulate long-term citation prediction as
an online prediction task, constructing test sets for
successive years. Adding semantic influence features to the model once again improves its predictive performance over baselines.
In summary, our contributions are as follows:¹
• We empirically demonstrate a link between long-term citation count and short-term linguistic influence, using both regression analysis (§ 5.3) and an online prediction task (§ 5.4).
• We present a method to estimate semantic influence using a parametric Hawkes process (§ 2.1). To achieve this, we find semantic changes and convert the usage of each change into a cascade (§ 2.2). We also show that the method can be applied to quantify lexical influence.
• We present a method to identify monotonic semantic changes from timestamped text using contextual embeddings (see § 2.2.1).
2 Methodology
This section describes our method for estimating the linguistic influence of each document in a timestamped collection. Our work builds on the theory of point process models (Daley et al., 2003), in which the basic unit of data is a set of marked event timestamps. In our case, the events correspond to the use of an innovative word or usage; the mark corresponds to the document in which the word or usage appears. To estimate the linguistic influence of individual documents, we fit a parametric model in which per-document influence parameters explain the density of events in subsequent documents. We first describe the modeling framework in which these influence parameters are estimated (§ 2.1) and then describe how event cascades are constructed (§ 2.2) from semantic changes (§ 2.2.1) and lexical innovations (§ 2.2.2).

¹ The code and relevant data from our paper can be found at http://github.com/sandeepsoni/contextual-leadership
2.1 Estimating document influence from
timestamped events
A marked cascade is a set of marked events $\{e_1, e_2, \ldots, e_N\}$, in which each event $e_i = (t_i, p_i)$ corresponds to a tuple of a timestamp $t_i$ and a mark $p_i$. Assume a set of marked cascades, indexed by $w \in \mathcal{W}$, with each mark belonging to a finite set that is shared across all cascades. In our application, each cascade corresponds to the appearances of an individual word or word sense, and each mark is the identity of the document in which the word or word sense appears.
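To make this data model concrete, here is a minimal Python sketch of the event and cascade structures; the class and field names (MarkedEvent, Cascade) are our own illustrative choices, not from the paper's released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MarkedEvent:
    """One use of an innovative word or sense: a timestamp and a document mark."""
    timestamp: int   # discrete time (here, publication year)
    doc_id: int      # the mark: index of the document containing the usage

@dataclass
class Cascade:
    """All appearances of a single word or word sense, in temporal order."""
    word: str
    events: List[MarkedEvent]

# Example: a cascade for a hypothetical innovation "transformer"
cascade = Cascade(
    word="transformer",
    events=[MarkedEvent(2017, doc_id=3), MarkedEvent(2018, doc_id=41)],
)
```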
Point process models define probability distributions over cascades. In an inhomogeneous point process, the distribution of the count of events between any two timestamps $(t_1, t_2)$ is governed by the integral of an intensity function $\lambda(t, w)$. A Hawkes process is a special case in which the intensity function is the sum of terms associated with previous events (Hawkes, 1971). We choose the following special form,

$$\lambda(t, w) = c_w + \sum_{i \,:\, t_i^{(w)} < t} \alpha_{p_i^{(w)}} \, \kappa\big(t - t_i^{(w)}\big), \tag{1}$$

where $\kappa$ is a time-decay kernel such as the exponential kernel $\kappa(\Delta t) = e^{-\gamma \Delta t}$ and $c_w$ is a constant. The parameter of interest is $\alpha$, which quantifies the influence exerted by the document $p_i^{(w)}$ on subsequent events.²
² In the more general multivariate Hawkes process, the intensity function can depend on the identity of the "receiver" of influence. This enables the estimation of pairwise excitation parameters $\alpha_{i,j}$, as in the work of Lukasik et al. (2016), to give an example from the NLP literature. However, it would be difficult to estimate pairwise excitation between thousands of documents, as required by our setting.
Our application focuses on research papers,
which historically have been published in a few
bursts — at conferences and in journals — rather
than continuously over time. For this reason we
simplify our setting further, discretizing the times-
tamps by year. The evidence to be explained is now
of the form $n(t, w)$, the count of word or sense $w$ in year $t$. We model this count as a Poisson random variable, and estimate the parameters $c_w$ and $\alpha$ by maximum likelihood.
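The following sketch illustrates this estimation scheme on toy data: it discretizes time by year, computes the intensity of Equation 1 with a fixed exponential-kernel decay rate γ, and minimizes the Poisson negative log-likelihood with a generic optimizer. The toy data and all names are illustrative; the paper's actual implementation (optimizer, parameterization, low-rank structure) may differ.

```python
import numpy as np
from scipy.optimize import minimize

GAMMA = 1.0  # fixed decay rate of the exponential kernel (a hyperparameter here)

def kernel(dt):
    """Exponential time-decay kernel: kappa(dt) = exp(-gamma * dt)."""
    return np.exp(-GAMMA * dt)

def intensity(t, events, c_w, alpha):
    """Equation 1: lambda(t, w) = c_w + sum over earlier events of alpha[p_i] * kappa(t - t_i)."""
    rate = c_w
    for t_i, p_i in events:
        if t_i < t:
            rate += alpha[p_i] * kernel(t - t_i)
    return rate

def neg_log_likelihood(params, cascades, counts, years):
    """Poisson NLL of the yearly counts n(t, w); params hold log c_w and log alpha."""
    n_words = len(cascades)
    c = np.exp(params[:n_words])       # per-cascade base rates (kept positive)
    alpha = np.exp(params[n_words:])   # per-document influence scores (kept positive)
    nll = 0.0
    for w, events in enumerate(cascades):
        for t in years:
            lam = intensity(t, events, c[w], alpha)
            nll += lam - counts[w][t] * np.log(lam)
    return nll

# Toy example: two cascades over three documents, observed 2010-2013.
years = [2010, 2011, 2012, 2013]
cascades = [[(2010, 0), (2011, 1)],   # (year, doc_id) events for word/sense 1
            [(2011, 2), (2012, 0)]]   # ... and for word/sense 2
counts = [{2010: 1, 2011: 3, 2012: 2, 2013: 1},
          {2010: 0, 2011: 1, 2012: 4, 2013: 2}]
x0 = np.zeros(len(cascades) + 3)      # init log-params at 0, i.e. c = alpha = 1
res = minimize(neg_log_likelihood, x0, args=(cascades, counts, years))
alpha_hat = np.exp(res.x[len(cascades):])
print("estimated per-document influence:", alpha_hat)
```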
2.2 Building event cascades
To estimate the parameters in Equation 1, we re-
quire a set of timestamped events. Ideally these
events should correspond to evidence of linguistic
innovation. We consider two sources of events: se-
mantic innovations (here focusing on words whose
meaning changes over time) and lexical innova-
tions (words whose usage rate increases dramati-
cally over time).
We now introduce some notation used in the remainder of this section. Let a document be a sequence of discrete tokens from a finite vocabulary $V$, so that document $i$ is denoted $X_i = [x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n_i)}]$, with $n_i$ indicating the length of document $i$. A corpus is similarly defined as a set of $N$ documents, $\mathcal{X} = \{X_1, X_2, \ldots, X_N\}$, with each document associated with a discrete time $t_i \in \mathcal{T}$.
2.2.1 Using contextual embeddings to
identify semantic changes
We use contextual embeddings to identify words
whose meaning changes over time, following prior
work on computational historical linguistics (e.g.,
Kutuzov and Giulianelli, 2020; see § 6 for a more comprehensive review). A contextual embedding $h_i^{(k)} \in \mathbb{R}^D$ is a vector representation of token $k$ in document $i$, computed from a model such as BERT (Devlin et al., 2019). When the distribution over $h$ for a given word changes over time, this is taken as evidence for a change in the word's meaning.
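As an illustration, per-occurrence contextual embeddings can be extracted with an off-the-shelf BERT model via the HuggingFace transformers library. This sketch is our own and assumes, for simplicity, that the target word maps to a single wordpiece in the tokenizer's vocabulary.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embeddings_for_word(sentence, word):
    """Return one contextual embedding per occurrence of `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # shape: (num_tokens, D)
    target_id = tokenizer.convert_tokens_to_ids(word)
    mask = enc["input_ids"][0] == target_id          # positions matching the word
    return hidden[mask]                              # shape: (num_matches, D)

h = embeddings_for_word("the network learns a representation", "network")
print(h.shape)  # torch.Size([1, 768])
```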
Let $m_{t^-,w}$ and $m_{t^+,w}$ be the counts of the word $w$ up to and after time $t$, respectively. Specifically,

$$m_{t^-,w} = \sum_{i \,:\, t_i \le t} \sum_k^{n_i} \mathbb{1}(x_i^{(k)} = w), \qquad m_{t^+,w} = \sum_{i \,:\, t_i > t} \sum_k^{n_i} \mathbb{1}(x_i^{(k)} = w).$$

Average representations of the word $w$ up to and after time $t$, respectively, are calculated as follows:

$$v_{t^-,w} = \frac{1}{m_{t^-,w}} \sum_{i \,:\, t_i \le t} \sum_k^{n_i} h_i^{(k)} \mathbb{1}(x_i^{(k)} = w), \qquad v_{t^+,w} = \frac{1}{m_{t^+,w}} \sum_{i \,:\, t_i > t} \sum_k^{n_i} h_i^{(k)} \mathbb{1}(x_i^{(k)} = w).$$
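These counts and averages translate directly into code. A minimal sketch, assuming each occurrence of $w$ is stored as a (timestamp, doc_id, embedding) triple:

```python
import numpy as np

def pre_post_means(occurrences, t):
    """occurrences: list of (timestamp, doc_id, embedding) triples for one word w.
    Returns the counts m and mean embeddings v on either side of the split year t."""
    pre = np.array([h for (ti, _, h) in occurrences if ti <= t])
    post = np.array([h for (ti, _, h) in occurrences if ti > t])
    m_pre, m_post = len(pre), len(post)
    v_pre = pre.mean(axis=0) if m_pre > 0 else None
    v_post = post.mean(axis=0) if m_post > 0 else None
    return m_pre, m_post, v_pre, v_post
```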
Further, the variance in the contextual embeddings of the word $w$ over the entire corpus is calculated by taking the variance of each component of the embedding,

$$s_w = \frac{1}{m_w} \sum_{i,k \,:\, x_i^{(k)} = w} \big(h_i^{(k)} - \mu_w\big)^2, \tag{2}$$

with $\mu_w$ equal to the mean contextualized embedding of word $w$.
A semantic change score for a word $w$ for a time $t$ is then the variance-weighted squared norm of the difference between its average pre-$t$ and post-$t$ contextualized embeddings (also known as the squared Mahalanobis distance):

$$r(w, t) = (v_{t^-,w} - v_{t^+,w})^\top S_w^{-1} (v_{t^-,w} - v_{t^+,w}), \tag{3}$$

with $S_w = \mathrm{Diag}(s_w)$.
Correction for frequency effects. Both the mean and variance are estimated with larger samples for timestamps in the middle of $\mathcal{T}$ in comparison to the initial and final timestamps. Consequently, the distance metric suffers from high sample variance for values of $t$ near these endpoints. The discrepancy is corrected by replacing the diagonal covariance $S_w$ in Equation 3 with an alternative covariance $\tilde{S}_w$ that reflects the additional uncertainty due to sample size. Specifically, we approximate the standard error of the mean $v_{t^-}$ as $\sqrt{S/m_{t^-}}$, and analogously for $v_{t^+}$. Then $\tilde{S}_w$ is defined as the product of these two approximate standard errors,

$$\tilde{S}_w = \sqrt{\frac{S_w}{m_{t^-,w}}} \sqrt{\frac{S_w}{m_{t^+,w}}} = \frac{S_w}{\sqrt{m_{t^-,w}\, m_{t^+,w}}}. \tag{4}$$
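Putting Equations 2-4 together, the change score can be sketched as follows, reusing pre_post_means from the sketch above; the epsilon guard against zero-variance embedding dimensions is our own addition.

```python
import numpy as np

EPS = 1e-12  # guards against zero variance in any embedding dimension

def change_score(occurrences, t):
    """r(w, t): squared Mahalanobis distance between the pre- and post-t mean
    embeddings (Eq. 3), using the frequency-corrected covariance of Eq. 4."""
    H = np.array([h for (_, _, h) in occurrences])
    s_w = H.var(axis=0) + EPS                      # per-dimension variance (Eq. 2)
    m_pre, m_post, v_pre, v_post = pre_post_means(occurrences, t)
    if m_pre == 0 or m_post == 0:
        return 0.0
    s_tilde = s_w / np.sqrt(m_pre * m_post)        # diagonal entries of Eq. 4
    diff = v_pre - v_post
    return float(diff @ (diff / s_tilde))          # diagonal inverse as division

# The transition point is the year with the highest score:
# t_star = max(candidate_years, key=lambda t: change_score(occurrences, t))
```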
Finally, $t^* = \operatorname{argmax}_t r(w, t)$ is selected as the transition point for the change in meaning of $w$. The changes are identified by sorting the words by $\max_t r(w, t)$ and applying a set of basic filters explained in § 4. To give some intuition:
• If $w$ changes in meaning at time $t$, then the difference in its representation up to $t$ and after $t$ should be high. The metric in Equation 3 captures this precisely by calculating the term $v_{t^-,w} - v_{t^+,w}$.
• The difference in average embeddings can be high for seasonal or bursty changes, seen in words such as turkey, which refers to the bird more frequently around the time of American holidays (Shoemark et al., 2019). Rescaling the difference by the inverse variance encourages detection of monotonic changes.
• For rare words, the mean embeddings will be less reliable. The $m$ terms in $\tilde{S}$ have the effect of emphasizing high-frequency words, for which changes in the mean embedding are likely to be significant.
Distinguishing old and new usages.
The previ-
ous step yields semantic innovations and their tran-
sition time. Simply identifying semantic changes
is insufficient, since at any given time a word could
be used in its old or new sense with respect to its
time of transition. To categorize every usage of a semantic innovation $w$, the contextual embeddings are passed through a logistic regression classifier that predicts whether the usage is before or after the transition time. At the end of this step a sequence of embeddings for any semantic innovation is converted to a sequence of binary labels denoting their usage. For each word $w$, the cascade $(e_1^{(w)}, e_2^{(w)}, \ldots, e_{N_w}^{(w)})$ is formed by filtering the usages to those that are classified as corresponding to the newer sense, with each event $e_i^{(w)}$ containing a timestamp $t_i^{(w)}$ and a document identifier $p_i^{(w)}$. These cascades are the evidence from which we estimate the per-document semantic influence scores $\alpha^s$, as described in § 2.1.
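A minimal sketch of this step, using scikit-learn's LogisticRegression and the same (timestamp, doc_id, embedding) triples as in the sketches above; the paper does not specify the classifier's exact configuration, so the settings here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_new_sense_cascade(occurrences, t_star):
    """occurrences: (timestamp, doc_id, embedding) triples for one innovation w.
    Trains a classifier on before/after-t_star labels, then keeps the usages
    whose embeddings are classified as the newer sense."""
    X = np.array([h for (_, _, h) in occurrences])
    y = np.array([int(ti > t_star) for (ti, _, _) in occurrences])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    is_new = clf.predict(X) == 1
    # The cascade: (timestamp, doc_id) events for usages predicted as "new".
    return [(ti, doc) for (ti, doc, _), new in zip(occurrences, is_new) if new]
```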
Why contextual embeddings?
Embeddings
provide a powerful tool for understanding language
change, offering more linguistic granularity than
measures of change in the strength or composition
of latent topics (e.g., Griffiths and Steyvers, 2004; Gerow et al., 2018). Prior work has employed diachronic non-contextual embeddings (e.g., Soni et al., 2021b). Such methods require each word
to have a single shared embedding in each time
period. During periods in which a word is used in
multiple senses, the non-contextual embedding
must average across these senses, making it harder
to detect changes in progress.
2.2.2 Identifying lexical changes
Unlike semantic changes, whose identification re-
quires representations such as contextual embed-
dings, lexical changes are identified simply by com-
paring frequency changes. Specifically, for every word in a vocabulary we vary the segmentation year, say $t$, for the word and calculate the relative frequency up to and after $t$. We then take the best relative frequency ratio across the years as the score of lexical change for that word, and aggregate to form a list of changes by sorting on this score. In contrast to semantic changes, all the usages of lexical changes are used to form cascades. These cascades are the evidence from which we estimate the per-document lexical influence scores $\alpha^\ell$, again using the methods in § 2.1.
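A sketch of this scoring procedure; the paper does not state the direction of the ratio, so we assume the post-$t$ rate is divided by the pre-$t$ rate (capturing words whose usage rate increases), and all names are illustrative.

```python
def lexical_change_score(word_counts, total_counts, candidate_years):
    """Best post/pre relative-frequency ratio across segmentation years.
    word_counts: {year: count of word w}; total_counts: {year: total tokens}."""
    best = 0.0
    for t in candidate_years:
        pre_n = sum(c for y, c in word_counts.items() if y <= t)
        pre_N = sum(c for y, c in total_counts.items() if y <= t)
        post_n = sum(c for y, c in word_counts.items() if y > t)
        post_N = sum(c for y, c in total_counts.items() if y > t)
        if pre_n > 0 and pre_N > 0 and post_N > 0:
            ratio = (post_n / post_N) / (pre_n / pre_N)
            best = max(best, ratio)
    return best
```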
2.3 Overview
To summarize the method for computing semantic
influence:
1. Compute the score $r(w, t)$ for each word $w$ and time $t$ as described in Equation 3 (with the adjusted covariance term from Equation 4), and threshold to identify semantic changes.
2. For each word selected in the previous step, classify each usage as either "old" or "new", and build a cascade from the timestamps of the new usages.
3. Aggregating over all the cascades, estimate the influence parameters $\alpha_i$ for each document in the collection.
A visual summary of the entire methodological
pipeline is given in Figure 2.
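These steps can be strung together in a short driver, reusing the illustrative helpers sketched earlier (change_score and build_new_sense_cascade); here a top-k cutoff stands in for the paper's thresholding and filtering, which are described in § 4.

```python
def semantic_influence_cascades(corpus_occurrences, candidate_years, top_k):
    """End-to-end sketch: score words, select changes, build new-sense cascades.
    corpus_occurrences: {word: [(timestamp, doc_id, embedding), ...]}."""
    scored = {}
    for word, occ in corpus_occurrences.items():
        # Step 1: best transition year and its change score (Eqs. 3-4).
        t_star = max(candidate_years, key=lambda t: change_score(occ, t))
        scored[word] = (change_score(occ, t_star), t_star)
    # Keep the top-k scoring words as the identified semantic changes.
    changes = sorted(scored, key=lambda w: scored[w][0], reverse=True)[:top_k]
    # Step 2: one cascade of new-sense usages per selected word.
    cascades = [build_new_sense_cascade(corpus_occurrences[w], scored[w][1])
                for w in changes]
    # Step 3: fit the Hawkes model of Section 2.1 to these cascades to
    # recover the per-document influence parameters alpha.
    return cascades
```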