Predicting Long-Term Citations from Short-Term Linguistic Influence
Sandeep Soni and David Bamman
University of California, Berkeley
{sandeepsoni,dbamman}@berkeley.edu
Jacob Eisenstein
Google Research
jeisenstein@google.com
Abstract
A standard measure of the influence of a re-
search paper is the number of times it is cited.
However, papers may be cited for many rea-
sons, and citation count offers limited informa-
tion about the extent to which a paper affected
the content of subsequent publications. We
therefore propose a novel method to quantify
linguistic influence in timestamped document
collections. There are two main steps: first,
identify lexical and semantic changes using
contextual embeddings and word frequencies;
second, aggregate information about these
changes into per-document influence scores
by estimating a high-dimensional Hawkes pro-
cess with a low-rank parameter matrix. We
show that this measure of linguistic influence
is predictive of future citations: the estimate
of linguistic influence from the two years af-
ter a paper’s publication is correlated with and
predictive of its citation count in the follow-
ing three years. This is demonstrated using
an online evaluation with incremental tempo-
ral training/test splits, in comparison with a
strong baseline that includes predictors for ini-
tial citation counts, topics, and lexical features.
1 Introduction
The citation count of a paper is a standard, eas-
ily measurable proxy for its influence (Cronin,
2005). Researchers have shown that citation count is strongly correlated with the quality of scientific work (e.g., Lawani, 1986) and with the recognition that a paper or an author receives (e.g., Inhaber and Przednowek, 1976), and that it informs policy decisions such as the assessment of scientific performance (e.g., Cronin, 2005). Consequently, citation count is a ubiquitously deployed and important measure of a paper, with whole subfields of research dedicated to its analysis (Bornmann and Daniel, 2008).
However, papers may be cited (or not cited) for
many reasons, and citation count alone is insuffi-
cient to explain the emergence and the spread of
Figure 1: Research papers that are more linguistically influential within an initial time window tend to receive more citations in the long term. The x-axis shows lexical and semantic influence, binned into quantiles (see § 2); the y-axis shows the corresponding regression coefficients and standard errors, in units of Z-normalized log future citations (see § 5.3). To give a sense of scale, for papers published in 2012, being in the top decile of semantic influence corresponds to a 14.5% increase in long-term citations, as compared to control-matched papers that received the same number of short-term citations and covered similar topics but were in the bottom half by semantic influence.
research ideas and trends. For this reason, we turn
to content analysis: to what extent can the text
of a research paper be said to influence the trajec-
tory of the research community? In this paper, we
present a novel technique for estimating the influ-
ence of documents in a timestamped corpus. To
demonstrate the validity of the resulting measure
of linguistic influence, we show that it is predictive
of future citations. Specifically, we find that: (1)
papers that our metric judges as highly influential
in the short term tend to receive more citations in
the long term; (2) short-term linguistic influence
increases the ability to predict long-term citations
over strong baselines.
Our modeling approach focuses on semantic
changes, and treats the temporal usage of semantic
innovations as emissions from a parametric low-
rank Hawkes process (Hawkes, 1971). The pa-
rameters of the Hawkes process correspond to the
linguistic influence of each paper, aggregated over
thousands of linguistic changes. The changes them-
selves are identified through analysis of contex-
tual embeddings, with the goal of finding words
whose meaning has shifted over time (Traugott and
Dasher, 2001). Though there are several computational methods to detect semantic changes (e.g., Kim et al., 2014; Hamilton et al., 2016; Rosenfeld and Erk, 2018; Dubossarsky et al., 2019), including methods based on contextual embeddings (e.g., Kutuzov and Giulianelli, 2020), our proposed method
focuses on detecting smooth, non-bursty semantic
changes; we also go further than other methods by
distinguishing old and contemporary usages of an
identified semantic change.
We show through a multivariate regression that
our estimates of semantic influence of each paper
are positively correlated with their long-term cita-
tions, even after controlling for the initial citations,
the content of the paper in terms of topics, and the
lexical influence of the paper (see Figure 1). Fur-
ther, we formulate long-term citation prediction as
an online prediction task, constructing test sets for
successive years. Adding semantic influence features to the model once again improves its predictive performance over baselines.
In summary, our contributions are as follows:¹
• We empirically demonstrate a link between long-term citation count and short-term linguistic influence, using both regression analysis (§ 5.3) and an online prediction task (§ 5.4).
• We present a method to estimate semantic influence using a parametric Hawkes process (§ 2.1). To achieve this, we find semantic changes and convert the usage of each change into a cascade (§ 2.2). We also show that the method can be applied to quantify lexical influence.
• We present a method to identify monotonic semantic changes from timestamped text using contextual embeddings (see § 2.2.1).
2 Methodology
This section describes our method for estimating the linguistic influence of each document in a timestamped collection. Our work builds on the theory of point process models (Daley et al., 2003), in which the basic unit of data is a set of marked event timestamps. In our case, the events correspond to the use of an innovative word or usage; the mark corresponds to the document in which the word or usage appears. To estimate the linguistic influence of individual documents, we fit a parametric model in which per-document influence parameters explain the density of events in subsequent documents. We first describe the modeling framework in which these influence parameters are estimated (§ 2.1) and then describe how event cascades are constructed (§ 2.2) from semantic changes (§ 2.2.1) and lexical innovations (§ 2.2.2).

¹ The code and relevant data from our paper can be found at http://github.com/sandeepsoni/contextual-leadership
2.1 Estimating document influence from
timestamped events
A marked cascade is a set of marked events $\{e_1, e_2, \ldots, e_N\}$, in which each event $e_i = (t_i, p_i)$ corresponds to a tuple of a timestamp $t_i$ and a mark $p_i$. Assume a set of marked cascades, indexed by $w \in \mathcal{W}$, with each mark belonging to a finite set that is shared across all cascades. In our application, each cascade corresponds to the appearances of an individual word or word sense, and each mark is the identity of the document in which the word or word sense appears.
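To make this data model concrete, here is a minimal Python sketch of the event and cascade structures; the class and field names (MarkedEvent, Cascade) are our own illustrative choices, not from the paper's released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MarkedEvent:
    """One use of an innovative word or sense: a timestamp and a document mark."""
    timestamp: int   # discrete time (here, publication year)
    doc_id: int      # the mark: index of the document containing the usage

@dataclass
class Cascade:
    """All appearances of a single word or word sense, in temporal order."""
    word: str
    events: List[MarkedEvent]

# Example: a cascade for a hypothetical innovation "transformer"
cascade = Cascade(
    word="transformer",
    events=[MarkedEvent(2017, doc_id=3), MarkedEvent(2018, doc_id=41)],
)
```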
Point process models define probability distributions over cascades. In an inhomogeneous point process, the distribution of the count of events between any two timestamps $(t_1, t_2)$ is governed by the integral of an intensity function $\lambda(t, w)$. A Hawkes process is a special case in which the intensity function is the sum of terms associated with previous events (Hawkes, 1971). We choose the following special form,

$$\lambda(t, w) = c_w + \sum_{i \,:\, t_i^{(w)} < t} \alpha_{p_i^{(w)}} \, \kappa\big(t - t_i^{(w)}\big), \tag{1}$$

where $\kappa$ is a time-decay kernel such as the exponential kernel $\kappa(\Delta t) = e^{-\gamma \Delta t}$ and $c_w$ is a constant. The parameter of interest is $\alpha$, which quantifies the influence exerted by the document $p_i^{(w)}$ on subsequent events.²
² In the more general multivariate Hawkes process, the intensity function can depend on the identity of the "receiver" of influence. This enables the estimation of pairwise excitation parameters $\alpha_{i,j}$, as in the work of Lukasik et al. (2016), to give an example from the NLP literature. However, it would be difficult to estimate pairwise excitation between thousands of documents, as required by our setting.
Our application focuses on research papers,
which historically have been published in a few
bursts — at conferences and in journals — rather
than continuously over time. For this reason we
simplify our setting further, discretizing the times-
tamps by year. The evidence to be explained is now
of the form $n(t, w)$, the count of word or sense $w$ in year $t$. We model this count as a Poisson random variable, and estimate the parameters $c_w$ and $\alpha$ by maximum likelihood.
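The following sketch illustrates this estimation scheme on toy data: it discretizes time by year, computes the intensity of Equation 1 with a fixed exponential-kernel decay rate γ, and minimizes the Poisson negative log-likelihood with a generic optimizer. The toy data and all names are illustrative; the paper's actual implementation (optimizer, parameterization, low-rank structure) may differ.

```python
import numpy as np
from scipy.optimize import minimize

GAMMA = 1.0  # fixed decay rate of the exponential kernel (a hyperparameter here)

def kernel(dt):
    """Exponential time-decay kernel: kappa(dt) = exp(-gamma * dt)."""
    return np.exp(-GAMMA * dt)

def intensity(t, events, c_w, alpha):
    """Equation 1: lambda(t, w) = c_w + sum over earlier events of alpha[p_i] * kappa(t - t_i)."""
    rate = c_w
    for t_i, p_i in events:
        if t_i < t:
            rate += alpha[p_i] * kernel(t - t_i)
    return rate

def neg_log_likelihood(params, cascades, counts, years):
    """Poisson NLL of the yearly counts n(t, w); params hold log c_w and log alpha."""
    n_words = len(cascades)
    c = np.exp(params[:n_words])       # per-cascade base rates (kept positive)
    alpha = np.exp(params[n_words:])   # per-document influence scores (kept positive)
    nll = 0.0
    for w, events in enumerate(cascades):
        for t in years:
            lam = intensity(t, events, c[w], alpha)
            nll += lam - counts[w][t] * np.log(lam)
    return nll

# Toy example: two cascades over three documents, observed 2010-2013.
years = [2010, 2011, 2012, 2013]
cascades = [[(2010, 0), (2011, 1)],   # (year, doc_id) events for word/sense 1
            [(2011, 2), (2012, 0)]]   # ... and for word/sense 2
counts = [{2010: 1, 2011: 3, 2012: 2, 2013: 1},
          {2010: 0, 2011: 1, 2012: 4, 2013: 2}]
x0 = np.zeros(len(cascades) + 3)      # init log-params at 0, i.e. c = alpha = 1
res = minimize(neg_log_likelihood, x0, args=(cascades, counts, years))
alpha_hat = np.exp(res.x[len(cascades):])
print("estimated per-document influence:", alpha_hat)
```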
2.2 Building event cascades
To estimate the parameters in Equation 1, we re-
quire a set of timestamped events. Ideally these
events should correspond to evidence of linguistic
innovation. We consider two sources of events: se-
mantic innovations (here focusing on words whose
meaning changes over time) and lexical innova-
tions (words whose usage rate increases dramati-
cally over time).
We now introduce some notation used in the remainder of this section. Let a document be a sequence of discrete tokens from a finite vocabulary $V$, so that document $i$ is denoted $X_i = [x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n_i)}]$, with $n_i$ indicating the length of document $i$. A corpus is similarly defined as a set of $N$ documents, $\mathcal{X} = \{X_1, X_2, \ldots, X_N\}$, with each document associated with a discrete time $t_i \in \mathcal{T}$.
2.2.1 Using contextual embeddings to
identify semantic changes
We use contextual embeddings to identify words
whose meaning changes over time, following prior
work on computational historical linguistics (e.g.,
Kutuzov and Giulianelli, 2020; see § 6 for a more comprehensive review). A contextual embedding $h_i^{(k)} \in \mathbb{R}^D$ is a vector representation of token $k$ in document $i$, computed from a model such as BERT (Devlin et al., 2019). When the distribution over $h$ for a given word changes over time, this is taken as evidence for a change in the word's meaning.
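As an illustration, per-occurrence contextual embeddings can be extracted with an off-the-shelf BERT model via the HuggingFace transformers library. This sketch is our own and assumes, for simplicity, that the target word maps to a single wordpiece in the tokenizer's vocabulary.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embeddings_for_word(sentence, word):
    """Return one contextual embedding per occurrence of `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # shape: (num_tokens, D)
    target_id = tokenizer.convert_tokens_to_ids(word)
    mask = enc["input_ids"][0] == target_id          # positions matching the word
    return hidden[mask]                              # shape: (num_matches, D)

h = embeddings_for_word("the network learns a representation", "network")
print(h.shape)  # torch.Size([1, 768])
```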
Let $m_{t^-,w}$ and $m_{t^+,w}$ be the counts of the word $w$ up to and after time $t$, respectively. Specifically,

$$m_{t^-,w} = \sum_{i \,:\, t_i \le t} \sum_k^{n_i} \mathbb{1}(x_i^{(k)} = w), \qquad m_{t^+,w} = \sum_{i \,:\, t_i > t} \sum_k^{n_i} \mathbb{1}(x_i^{(k)} = w).$$

Average representations of the word $w$ up to and after time $t$, respectively, are calculated as follows:

$$v_{t^-,w} = \frac{1}{m_{t^-,w}} \sum_{i \,:\, t_i \le t} \sum_k^{n_i} h_i^{(k)} \mathbb{1}(x_i^{(k)} = w), \qquad v_{t^+,w} = \frac{1}{m_{t^+,w}} \sum_{i \,:\, t_i > t} \sum_k^{n_i} h_i^{(k)} \mathbb{1}(x_i^{(k)} = w).$$
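These counts and averages translate directly into code. A minimal sketch, assuming each occurrence of $w$ is stored as a (timestamp, doc_id, embedding) triple:

```python
import numpy as np

def pre_post_means(occurrences, t):
    """occurrences: list of (timestamp, doc_id, embedding) triples for one word w.
    Returns the counts m and mean embeddings v on either side of the split year t."""
    pre = np.array([h for (ti, _, h) in occurrences if ti <= t])
    post = np.array([h for (ti, _, h) in occurrences if ti > t])
    m_pre, m_post = len(pre), len(post)
    v_pre = pre.mean(axis=0) if m_pre > 0 else None
    v_post = post.mean(axis=0) if m_post > 0 else None
    return m_pre, m_post, v_pre, v_post
```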
Further, the variance in the contextual embeddings of the word $w$ over the entire corpus is calculated by taking the variance of each component of the embedding,

$$s_w = \frac{1}{m_w} \sum_{i,k \,:\, x_i^{(k)} = w} \big(h_i^{(k)} - \mu_w\big)^2, \tag{2}$$

with $\mu_w$ equal to the mean contextualized embedding of word $w$.
A semantic change score for a word $w$ for a time $t$ is then the variance-weighted squared norm of the difference between its average pre-$t$ and post-$t$ contextualized embeddings (also known as the squared Mahalanobis distance):

$$r(w, t) = (v_{t^-,w} - v_{t^+,w})^\top S_w^{-1} (v_{t^-,w} - v_{t^+,w}), \tag{3}$$

with $S_w = \mathrm{Diag}(s_w)$.
Correction for frequency effects. Both the mean and variance are estimated with larger samples for timestamps in the middle of $\mathcal{T}$ in comparison to the initial and final timestamps. Consequently, the distance metric suffers from high sample variance for values of $t$ near these endpoints. The discrepancy is corrected by replacing the diagonal covariance $S_w$ in Equation 3 with an alternative covariance $\tilde{S}_w$ that reflects the additional uncertainty due to sample size. Specifically, we approximate the standard error of the mean $v_{t^-}$ as $\sqrt{S/m_{t^-}}$, and analogously for $v_{t^+}$. Then $\tilde{S}_w$ is defined as the product of these two approximate standard errors,

$$\tilde{S}_w = \sqrt{\frac{S_w}{m_{t^-,w}}} \sqrt{\frac{S_w}{m_{t^+,w}}} = \frac{S_w}{\sqrt{m_{t^-,w}\, m_{t^+,w}}}. \tag{4}$$
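Putting Equations 2-4 together, the change score can be sketched as follows, reusing pre_post_means from the sketch above; the epsilon guard against zero-variance embedding dimensions is our own addition.

```python
import numpy as np

EPS = 1e-12  # guards against zero variance in any embedding dimension

def change_score(occurrences, t):
    """r(w, t): squared Mahalanobis distance between the pre- and post-t mean
    embeddings (Eq. 3), using the frequency-corrected covariance of Eq. 4."""
    H = np.array([h for (_, _, h) in occurrences])
    s_w = H.var(axis=0) + EPS                      # per-dimension variance (Eq. 2)
    m_pre, m_post, v_pre, v_post = pre_post_means(occurrences, t)
    if m_pre == 0 or m_post == 0:
        return 0.0
    s_tilde = s_w / np.sqrt(m_pre * m_post)        # diagonal entries of Eq. 4
    diff = v_pre - v_post
    return float(diff @ (diff / s_tilde))          # diagonal inverse as division

# The transition point is the year with the highest score:
# t_star = max(candidate_years, key=lambda t: change_score(occurrences, t))
```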
Finally, $t^* = \operatorname{argmax}_t r(w, t)$ is selected as the transition point for the change in meaning of $w$. The changes are identified by sorting the words by $\max_t r(w, t)$ and applying a set of basic filters explained in § 4. To give some intuition:
• If $w$ changes in meaning at time $t$, then the difference in its representation up to $t$ and after $t$ should be high. The metric in Equation 3 captures this precisely by calculating the term $v_{t^-,w} - v_{t^+,w}$.
• The difference in average embeddings can be high for seasonal or bursty changes, seen in words such as turkey, which refers to the bird more frequently around the time of American holidays (Shoemark et al., 2019). Rescaling the difference by the inverse variance encourages detection of monotonic changes.
• For rare words, the mean embeddings will be less reliable. The $m$ terms in $\tilde{S}$ have the effect of emphasizing high-frequency words, for which changes in the mean embedding are likely to be significant.
Distinguishing old and new usages.
The previ-
ous step yields semantic innovations and their tran-
sition time. Simply identifying semantic changes
is insufficient, since at any given time a word could
be used in its old or new sense with respect to its
time of transition. To categorize every usage of a semantic innovation $w$, the contextual embeddings are passed through a logistic regression classifier that predicts whether the usage is before or after the transition time. At the end of this step a sequence of embeddings for any semantic innovation is converted to a sequence of binary labels denoting their usage. For each word $w$, the cascade $(e_1^{(w)}, e_2^{(w)}, \ldots, e_{N_w}^{(w)})$ is formed by filtering the usages to those that are classified as corresponding to the newer sense, with each event $e_i^{(w)}$ containing a timestamp $t_i^{(w)}$ and a document identifier $p_i^{(w)}$. These cascades are the evidence from which we estimate the per-document semantic influence scores $\alpha^s$, as described in § 2.1.
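A minimal sketch of this step, using scikit-learn's LogisticRegression and the same (timestamp, doc_id, embedding) triples as in the sketches above; the paper does not specify the classifier's exact configuration, so the settings here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_new_sense_cascade(occurrences, t_star):
    """occurrences: (timestamp, doc_id, embedding) triples for one innovation w.
    Trains a classifier on before/after-t_star labels, then keeps the usages
    whose embeddings are classified as the newer sense."""
    X = np.array([h for (_, _, h) in occurrences])
    y = np.array([int(ti > t_star) for (ti, _, _) in occurrences])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    is_new = clf.predict(X) == 1
    # The cascade: (timestamp, doc_id) events for usages predicted as "new".
    return [(ti, doc) for (ti, doc, _), new in zip(occurrences, is_new) if new]
```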
Why contextual embeddings?
Embeddings
provide a powerful tool for understanding language
change, offering more linguistic granularity than
measures of change in the strength or composition
of latent topics (e.g., Griffiths and Steyvers, 2004; Gerow et al., 2018). Prior work has employed diachronic non-contextual embeddings (e.g., Soni et al., 2021b). Such methods require each word
to have a single shared embedding in each time
period. During periods in which a word is used in
multiple senses, the non-contextual embedding
must average across these senses, making it harder
to detect changes in progress.
2.2.2 Identifying lexical changes
Unlike semantic changes, whose identification re-
quires representations such as contextual embed-
dings, lexical changes are identified simply by com-
paring frequency changes. Specifically, for every word in a vocabulary we vary the segmentation year, say $t$, for the word and calculate the relative frequency up to and after $t$. We then take the best relative frequency ratio across the years as the score of lexical change for that word, and aggregate to form a list of changes by sorting on this score. In contrast to semantic changes, all the usages of lexical changes are used to form cascades. These cascades are the evidence from which we estimate the per-document lexical influence scores $\alpha^\ell$, again using the methods in § 2.1.
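A sketch of this scoring procedure; the paper does not state the direction of the ratio, so we assume the post-$t$ rate is divided by the pre-$t$ rate (capturing words whose usage rate increases), and all names are illustrative.

```python
def lexical_change_score(word_counts, total_counts, candidate_years):
    """Best post/pre relative-frequency ratio across segmentation years.
    word_counts: {year: count of word w}; total_counts: {year: total tokens}."""
    best = 0.0
    for t in candidate_years:
        pre_n = sum(c for y, c in word_counts.items() if y <= t)
        pre_N = sum(c for y, c in total_counts.items() if y <= t)
        post_n = sum(c for y, c in word_counts.items() if y > t)
        post_N = sum(c for y, c in total_counts.items() if y > t)
        if pre_n > 0 and pre_N > 0 and post_N > 0:
            ratio = (post_n / post_N) / (pre_n / pre_N)
            best = max(best, ratio)
    return best
```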
2.3 Overview
To summarize the method for computing semantic
influence:
1. Compute the score $r(w, t)$ for each word $w$ and time $t$ as described in Equation 3 (with the adjusted covariance term from Equation 4), and threshold to identify semantic changes.
2. For each word selected in the previous step, classify each usage as either "old" or "new", and build a cascade from the timestamps of the new usages.
3. Aggregating over all the cascades, estimate the influence parameters $\alpha_i$ for each document in the collection.
A visual summary of the entire methodological
pipeline is given in Figure 2.
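These steps can be strung together in a short driver, reusing the illustrative helpers sketched earlier (change_score and build_new_sense_cascade); here a top-k cutoff stands in for the paper's thresholding and filtering, which are described in § 4.

```python
def semantic_influence_cascades(corpus_occurrences, candidate_years, top_k):
    """End-to-end sketch: score words, select changes, build new-sense cascades.
    corpus_occurrences: {word: [(timestamp, doc_id, embedding), ...]}."""
    scored = {}
    for word, occ in corpus_occurrences.items():
        # Step 1: best transition year and its change score (Eqs. 3-4).
        t_star = max(candidate_years, key=lambda t: change_score(occ, t))
        scored[word] = (change_score(occ, t_star), t_star)
    # Keep the top-k scoring words as the identified semantic changes.
    changes = sorted(scored, key=lambda w: scored[w][0], reverse=True)[:top_k]
    # Step 2: one cascade of new-sense usages per selected word.
    cascades = [build_new_sense_cascade(corpus_occurrences[w], scored[w][1])
                for w in changes]
    # Step 3: fit the Hawkes model of Section 2.1 to these cascades to
    # recover the per-document influence parameters alpha.
    return cascades
```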