Modeling Information Change in Science Communication with Semantically Matched Paraphrases

Dustin Wright*¹, Jiaxin Pei*², David Jurgens², Isabelle Augenstein¹
¹Dept. of Computer Science, University of Copenhagen, Denmark
²School of Information, University of Michigan, Ann Arbor, MI, USA
{dw,augenstein}@di.ku.dk
{pedropei,jurgens}@umich.edu
Abstract
Whether the media faithfully communicate scientific information has long been a core issue for the science community. Automatically identifying paraphrased scientific findings could enable large-scale tracking and analysis of information changes in the science communication process, but this requires systems to understand the similarity between scientific information across multiple domains. To this end, we present the SCIENTIFIC PARAPHRASE AND INFORMATION CHANGE DATASET (SPICED), the first paraphrase dataset of scientific findings annotated for degree of information change. SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers. We demonstrate that SPICED poses a challenging task and that models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims. Finally, we show that models trained on SPICED can reveal large-scale trends in the degrees to which people and organizations faithfully communicate new scientific findings. Data, code, and pre-trained models are available at http://www.copenlu.com/publication/2022_emnlp_wright/.
1 Introduction
Science communication disseminates scholarly information to audiences outside the research community, such as the public and policymakers (National Academies of Sciences, Engineering, and Medicine, 2017). This process usually involves translating highly technical language into non-technical, less formal language that is engaging and easily understandable for lay people (Salita, 2015). The public relies on the media to learn about new scientific findings, and media portrayals of science affect people's trust in science while at the same time influencing their future actions (Gustafson and Rice, 2019; Fischhoff, 2012; Kuru et al., 2021).
* denotes equal contribution
Figure 1 example pair:
Paper finding: "Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality."
Tweet: "The study findings suggest that increased consumption of magnesium-rich foods may have health benefits. #Magnesium saves lives https://t.co/K0M6QdjWcc https://t.co/FUtWZ8jADM"

Figure 1: We are interested in measuring the information similarity of statements about scientific findings between different sources, including scientific papers, news, and tweets, shown here with real examples. The finding in this figure comes from Fang et al. (2016) and the news quote is from this Reuters story.
However, not all scientific communication accurately conveys the original information, as shown in Figure 1. Identifying cases where scientific information has changed is a critical but challenging task due to the complex translating and paraphrasing done by effective communicators. Our work introduces a new task of measuring scientific information change, and through developing new data and models, aims to address the gap in studying faithful scientific communication.
Though efforts exist to track and flag when popular media misrepresent science,¹ the sheer volume of new studies, reporting, and online engagement makes purely manual efforts both intractable and unattractive. Existing studies in NLP to help automate the study of science communication have examined exaggeration (Wright and Augenstein, 2021b), certainty (Pei and Jurgens, 2021), and fact checking (Boissonnet et al., 2022; Wright et al., 2022), among others. However, these studies skip over the key first step needed to compare scientific texts for information change: automatically identifying content from both sources which describes the same scientific finding. In other words, to answer relevant questions about and analyze changes in scientific information at scale, one must first be able to point to which original information is being communicated in a new way.

¹ See e.g. https://www.healthnewsreview.org/ and https://sciencefeedback.co/
To enable automated analysis of science communication, this work offers the following contributions (marked by C). First, we present the SCIENTIFIC PARAPHRASE AND INFORMATION CHANGE DATASET (SPICED), a manually annotated dataset of paired scientific findings from news articles, tweets, and scientific papers (C1, §3). SPICED has the following merits: (1) existing datasets focus purely on semantic similarity, while SPICED focuses on differences in the information communicated in scientific findings; (2) scientific text datasets tend to focus solely on titles or paper abstracts, while SPICED includes sentences extracted from the full text of papers and news articles; (3) SPICED is largely multi-domain, covering the four broad scientific fields that receive the most media attention (medicine, biology, computer science, and psychology) and includes data from the whole science communication pipeline, from research articles to science news and social media discussions.
In addition to extensively benchmarking the performance of current models on SPICED (C2, §4), we demonstrate that the dataset enables multiple downstream applications. In particular, we show how models trained on SPICED improve zero-shot performance on the task of sentence-level evidence retrieval for verifying real-world claims about scientific topics (C3, §5), and perform an applied analysis on unlabelled tweets and news articles where we show that (1) media tend to exaggerate findings in the limitations sections of papers; (2) press releases and SciTech outlets tend to have less information change than general news outlets; and (3) organizations' Twitter accounts tend to discuss science more faithfully than verified users on Twitter and users with more followers (C4, §6).
2 Related Work
The analysis of scientific communication directly relates to fact checking, scientific language analysis, and semantic textual similarity. We briefly highlight our connections to these.
Fact Checking. Automatic fact checking is concerned with verifying whether or not a given claim is true, and has been studied extensively in multiple domains (Thorne et al., 2018; Augenstein et al., 2019), including science (Wadden et al., 2020; Boissonnet et al., 2022; Wright et al., 2022). Fact checking focuses on a specific type of information change, namely veracity. Additionally, the task generally assumes access to pre-existing knowledge resources, such as Wikipedia or PubMed, from which evidence can be retrieved that either supports or refutes a given claim. Our task is concerned with a more general type of information change beyond categorical falsehood, and must be completed prior to performing any kind of fact check.
Scientific Language Analysis. Automating tasks beneficial for understanding changes in scientific information between the published literature and media is a growing area of research (Wright and Augenstein, 2021b; Pei and Jurgens, 2021; Boissonnet et al., 2022; Dai et al., 2020; August et al., 2020b; Tan and Lee, 2014; Vadapalli et al., 2018; August et al., 2020a; Ginev and Miller, 2020). The three tasks most related to our work are understanding writing strategies for science communication (August et al., 2020b), detecting changes in certainty (Pei and Jurgens, 2021), and detecting changes in causal claim strength, i.e., exaggeration (Wright and Augenstein, 2021b). However, studying these requires access to paired scientific findings. Doing so at scale will require the ability to pair such findings automatically.
Semantic Similarity. The topic of semantic similarity is well studied in NLP. Several datasets exist with explicit similarity labels, many of which come from SemEval STS shared tasks (e.g., Cer et al., 2017) and paraphrasing datasets (Ganitkevitch et al., 2013). It is possible to build unlabelled datasets of semantic similarity automatically, which is the main method that has been used for scientific texts (Cohan et al., 2020; Lo et al., 2020). However, such datasets fail to capture more subtle aspects of similarity, particularly when the focus is solely on the scientific findings conveyed by a sentence (see Appendix A). As we will show, approaches based on these datasets are insufficient for the task we are concerned with in this work, motivating the need for a new resource.
3 SPICED
We introduce SPICED, a new large-scale dataset of scientific findings paired with how they are communicated in news and social media. Communicating scientific findings is known to have a broad impact on public attitudes (Weigold, 2001) and to influence behavior; e.g., the way vaccines are framed in the media has an effect on vaccine uptake (Kuru et al., 2021). Building upon prior work in NLP (Wright and Augenstein, 2021a; Pei and Jurgens, 2021; Sumner et al., 2014; Bratton et al., 2019), we define a scientific finding as a statement that describes a particular research output of a scientific study, which could be a result, conclusion, product, etc. This general definition holds across fields; for example, many findings from medicine and psychology report on effects on some dependent variable via manipulation of an independent variable, while in computer science many findings are related to new systems, algorithms, or methods. In the following, we describe how the pairs of scientific findings were selected and annotated.
3.1 Data Collection
An initial dataset of unlabelled pairs of scientific communications was collected through Altmetric (https://www.altmetric.com/), a platform tracking mentions of scientific articles online. This initial pool contains 17,668 scientific papers, 41,388 paired news articles, and 733,755 tweets; note that a single paper may be communicated about multiple times. The scientific findings were extracted in different ways for each source. Similar to Prabhakaran et al. (2016), we fine-tune a RoBERTa (Liu et al., 2019) model to classify sentences into methods, background, objective, results, and conclusions using 200K paper abstracts from PubMed that had been self-labeled with these categories (Canese and Weis, 2013). This sentence classifier attained a 0.92 F1 score on a held-out 10% sample (details in Appendix I) and was then applied to each sentence of the news stories and paper full texts. Given the domain difference between scientific abstracts and news, we additionally manually annotated a sample of 100 extracted conclusions; we find that the precision of the classifier is 0.88, suggesting that it is able to accurately identify scientific findings in news as well. We extract each sentence classified as "result" or "conclusion" and create pairs with each finding sentence from news articles written about it. This yields 45.7M potential pairs of ⟨news, paper⟩ findings. For tweets, we take full tweets as is, yielding 35.6M potential pairs of ⟨tweet, paper⟩ findings.
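To make this extraction step concrete, the following is a minimal sketch of how such a fine-tuned sentence-type classifier could be applied to candidate sentences. The model path and label strings are hypothetical placeholders for illustration, not artifacts released with the paper.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Hypothetical path to a RoBERTa model fine-tuned on PubMed abstract sentences
# self-labeled as background/objective/methods/results/conclusions.
MODEL_PATH = "path/to/pubmed-sentence-type-roberta"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
classify = pipeline("text-classification", model=model, tokenizer=tokenizer)

def extract_findings(sentences):
    """Keep only sentences predicted as results or conclusions."""
    findings = []
    for sent in sentences:
        pred = classify(sent, truncation=True)[0]  # {"label": ..., "score": ...}
        if pred["label"] in {"results", "conclusions"}:  # label names are assumed
            findings.append(sent)
    return findings
```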
3.2 Data Sampling
Pairing every finding from a news story with every finding from its matched paper results in an untenable amount of data to annotate. Additionally, it has been shown that proper data selection can reduce the need to annotate every possible sample (MacKay, 1992; Holub et al., 2008; Houlsby et al., 2011). Therefore, to obtain a sample of paired findings covering a range of similarities, we first filter our pool of unlabelled matched findings based on their semantics using Sentence-BERT (SBERT; Reimers and Gurevych, 2019), a Siamese BERT network for semantic textual similarity trained on over 1B sentence pairs (see Appendix G for further details). We use this model to score pairs of findings from news articles and papers based on their embeddings' cosine similarity and conduct a pilot study to determine which data to annotate.
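A minimal sketch of this scoring step using the sentence-transformers library is shown below; the specific checkpoint name is an assumption for illustration and may differ from the model used in the paper.

```python
from sentence_transformers import SentenceTransformer, util

# Any general-purpose SBERT checkpoint works for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

paper_findings = ["Increasing dietary magnesium intake is associated with a reduced risk of stroke."]
news_findings = ["Eating more magnesium-rich foods may have health benefits."]

paper_emb = model.encode(paper_findings, convert_to_tensor=True, normalize_embeddings=True)
news_emb = model.encode(news_findings, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between every <news, paper> finding pair.
scores = util.cos_sim(news_emb, paper_emb)  # shape: (num_news, num_paper)
print(float(scores[0, 0]))
```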
For the pilot, we sample 400 pairs evenly across 0.05-increment buckets over the range [0, 1] of similarity scores (20 per bucket). Each sample is annotated by two of the authors of this study with a binary label of "matching" vs. "not matching", yielding a Krippendorff's alpha of 0.73.² From this sample, we observed that there were no matches below 0.3 and only 2 ambiguous matches below 0.4. At the same time, the vast majority of samples from the entire dataset have a similarity score of less than 0.4. Additionally, above 0.9 we saw that each pair was essentially equivalent. Given the distribution of matched findings across the similarity scale, in order to balance the number of annotations we can acquire, the yield of positive samples, and the sample difficulty, we sampled data as follows based on their cosine similarity (a code sketch of this sampling procedure follows the list):
• Below 0.4: automatically unmatched.
• Above 0.9 with a Jaccard index above 0.5: automatically matched.
• Sample an equal number of pairs from each 0.05-increment bin between 0.4 and 0.9 for human expert annotation.
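The sketch below illustrates this thresholding and stratified sampling logic under the stated assumptions; the Jaccard helper and bin sampler are written here for illustration and are not taken from the released code.

```python
import random
from collections import defaultdict

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard index between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def route_pairs(pairs, per_bin=20, seed=0):
    """pairs: iterable of (paper_sent, news_sent, cosine_score) triples."""
    auto_unmatched, auto_matched = [], []
    bins = defaultdict(list)  # 0.05-wide bins between 0.4 and 0.9
    for paper, news, score in pairs:
        if score < 0.4:
            auto_unmatched.append((paper, news))
        elif score > 0.9 and jaccard(paper, news) > 0.5:
            auto_matched.append((paper, news))
        elif score <= 0.9:
            bins[int(score // 0.05)].append((paper, news))
        # Pairs above 0.9 with low lexical overlap are dropped here for
        # simplicity; the paper does not specify how they were handled.
    rng = random.Random(seed)
    to_annotate = [p for bucket in bins.values()
                   for p in rng.sample(bucket, min(per_bin, len(bucket)))]
    return auto_unmatched, auto_matched, to_annotate
```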
We sample 600 ⟨news, paper⟩ finding pairs from each of the four fields which receive the most media attention (medicine, biology, computer science, and psychology) using this method. This yields 2,400 pairs to be annotated. For extensive details on the pilot annotation and visualizations, see Appendix B.
² Note that many discussions about what constitutes matching vs. not matching were had in pilot work, leading to high agreement.
Pair 1 (SBERT similarity score: 0.88; IMS: 1)
Paper finding: "However, the consistency of the erythritol results in both the central adiposity and usual glycemia comparisons lends strength to the findings, and the cluster of metabolites has biological plausibility."
News finding: "Young adults who exhibited central adiposity gain over the course of 35 weeks had plasma erythritol levels 15-times higher at baseline than those with stable adiposity over the same period."

Pair 2 (SBERT similarity score: 0.38; IMS: 4.4)
Paper finding: "Our results showed that most of the official adult-onset men began their antisocial activities during early childhood."
News finding: "Beckley, who is in the department of psychology and neuroscience at Duke, said the adult-onset group had a history of anti-social behavior back to childhood, but reported committing relatively fewer crimes."

Table 1: Annotated information matching score (IMS) and the similarity score estimated by SBERT (Reimers and Gurevych, 2019) for selected finding pairs from SPICED. These examples demonstrate that simple similarity scores may not reflect whether the two sentences are covering the same scientific finding.
We follow a similar procedure to sample pairs from papers and Twitter for annotation. However, rather than use the SBERT similarity scores, we instead first obtain annotations for news pairs using the scheme described later in §3.3 in order to train an initial model on our task (CiteBERT; Wright and Augenstein, 2021a). We then use the trained model to obtain scores in the range [0, 1] for each pair and sample an equal number of pairs from bins in 0.05 increments, for a total of 1,200 pairs (300 from each field of interest).
3.3 Finding Matching Annotation
We perform our final annotation based on the sampling scheme above using the Prolific platform (https://www.prolific.co/), as it allows prescreening annotators by educational background. We require each annotator to have at least a bachelor's degree in a relevant field to work on the task. Annotators are asked to label "whether the two sentences are discussing the same scientific finding" for 50 finding pairs with a 5-point Likert schema, where each value indicates that the information in the findings is: (1) completely different, (2) mostly different, (3) somewhat similar, (4) mostly the same, or (5) completely the same. See Appendix C for details of how this rating scale was decided. We call this the INFORMATION MATCHING SCORE (IMS) of a pair of findings. Annotation was performed using POTATO (Pei et al., 2022). Full annotation instructions and details are listed in Appendix D. Notably, annotators were instructed to mark how similar the information in the findings was, as opposed to how similar the sentences are. Further, they were instructed to ignore extraneous information like "The scientists show..." and "our experiments demonstrate...".
Post-processing. To improve the reliability of the annotations, we use MACE (Hovy et al., 2013) to estimate the competence score of each annotator and remove the labels from the annotators with the lowest competence scores. We further manually examine pairs with the most diverse labels (standard deviation of ratings > 1.2) and manually replace the outliers with our expert annotations. The overall Krippendorff's α is 0.52, 0.57, 0.53, and 0.52 for CS, Medicine, Biology, and Psychology respectively, indicating that the final labels are reliable. While many annotators considered the task challenging, our quality control strategies allow us to collect reliable annotations.³ For all the annotated pairs, we average the ratings as the final similarity score. In addition to the 3,600 manually annotated pairs, we include an extra 2,400 automatically annotated pairs as determined in §3.2 (unmatched pairs get an IMS of 1, matched pairs get an IMS of 5), for a total of 6,000 pairs. Given that there can be multiple finding pairs from a single news article and paper, to avoid overlaps between training and test sets, we split the dataset 80%/10%/10% based on the paper DOI and balance across subjects. Further dataset details are in Appendix E.

³ For example, one participant commented "It was pretty hard to consider both the statements and their context then comparing them for similarities, but i enjoyed it".
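A simplified sketch of the aggregation logic described above (flagging high-disagreement pairs and averaging ratings) is given below; the MACE-based competence filtering itself is performed with the MACE tool and is not reproduced here, and this helper is an illustrative assumption rather than the released pipeline.

```python
import statistics

def aggregate_ratings(pair_ratings, disagreement_threshold=1.2):
    """pair_ratings: dict mapping pair_id -> list of 1-5 Likert ratings,
    already filtered by annotator competence."""
    final_ims, needs_review = {}, []
    for pair_id, ratings in pair_ratings.items():
        if len(ratings) > 1 and statistics.stdev(ratings) > disagreement_threshold:
            # High-disagreement pairs are re-examined and outliers replaced
            # with expert annotations before averaging.
            needs_review.append(pair_id)
        final_ims[pair_id] = sum(ratings) / len(ratings)
    return final_ims, needs_review
```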
Selected Examples. To highlight the difficulty of SPICED, we show a pair of samples from our final dataset in Table 1. The IMS is compared to the cosine similarity between embeddings produced by SBERT. For the first case, SBERT presumably picks up on similarities in the discussed topics, such as erythritol and its relationship to adiposity, but the paper finding is concerned with the consistency of results and its biological implications, while the news finding explicitly mentions a relationship between erythritol and adiposity. The second case expresses the opposite effect; the news finding contains a lot of extraneous information for context, but one of the core findings it expresses is the same as the paper finding, giving it a high rating in SPICED.

                                 STSB    SNLI    SPICED   SPICED (News)   SPICED (Tweets)
Avg. normalized edit distance    0.401   0.631   0.726    0.712           0.749

Table 2: The average normalized edit distance between matching pairs for various datasets shows that SPICED includes more pairs that are lexically dissimilar. For SPICED and STSB, pairs are considered matching if their similarity score is greater than 3. For SNLI, pairs are considered matching if the label is "entailment".
Comparison with existing datasets. To further characterize the difficulty of SPICED compared to existing datasets, we show the average normalized edit distance between matching pairs in SPICED, STSB (Cer et al., 2017), and SNLI (Bowman et al., 2015) in Table 2 (see Appendix F for the calculation). STSB is a semantic textual similarity dataset consisting of pairs of sentences scored with their semantic similarity, sourced from multiple SemEval shared tasks. SNLI is a natural language inference corpus consisting of pairs of sentences labeled for whether they entail each other, contradict each other, or are neutral. We calculated the mean normalized edit distance across all pairs of matching sentences in each dataset's training data; for SPICED and STSB, pairs are considered matching if their IMS or similarity score is greater than 3, respectively. For SNLI, pairs are considered matching if the label is "entailment".
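Appendix F gives the exact calculation; as an assumption about the metric, a plausible implementation of per-pair normalized edit distance (Levenshtein distance divided by the length of the longer string, averaged over matching pairs) is sketched below.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def avg_normalized_edit_distance(pairs):
    """pairs: iterable of (sentence_a, sentence_b) matching pairs."""
    dists = [levenshtein(a, b) / max(len(a), len(b), 1) for a, b in pairs]
    return sum(dists) / len(dists)
```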
We find that there is a much greater lexical difference between the matching pairs in SPICED (0.726) than in existing general-domain paired text datasets (0.401 for STSB and 0.631 for SNLI). This gap between STSB and SPICED also emphasizes the difference between traditional semantic textual similarity tasks and the information change task we describe here. Within SPICED, Twitter pairs had a higher distance (0.749) than news pairs (0.712), suggesting stronger domain differences. For qualitative examples showing the difference between SPICED and STSB, see Appendix A.
Relationship of SPICED to Fact Checking. The task introduced by SPICED captures information change more broadly than veracity as in automatic fact checking, as it is concerned with the degree to which two sentences describe the same scientific information; indeed, two similar sentences may describe the same information equally poorly. Our task is similar to the sentence selection stage in the fact checking pipeline, and we later demonstrate that models trained on SPICED data are useful for this task for science in §5. However, our task and annotation are agnostic to whether a pair of sentences entail one another. This is especially useful if one wants to compare how a particular finding is presented across different media. Fact-checking datasets are also explicitly constructed to contain claims which are about a single piece of information, whereas SPICED is not restricted in this way, focusing on a more general type of information change beyond categorical falsehood. Finally, we note two more unique features of SPICED: (1) SPICED contains naturally occurring sentences, while fact checking datasets like FEVER and SciFact often contain manually written claims; (2) the combination of domains in SPICED is unique: sentences are paired between (news, science) and (tweets, science), and these pairings do not currently exist in other resources.
4 Scientific Information Change Models
We now use SPICED to evaluate models for estimating the IMS of finding pairs in two settings: zero-shot transfer and supervised fine-tuning.
4.1 Experimental Setup
We use the following four models to estimate zero-shot transfer performance.

Paraphrase: RoBERTa (Liu et al., 2019) pre-trained for paraphrase detection on an adversarial paraphrasing task (Nighojkar and Licato, 2021). We convert the output probability of a pair being a paraphrase to the range [1, 5] for comparison with our labels.

Natural Language Inference (NLI): RoBERTa pre-trained on a wide range of NLI datasets (Nie et al., 2020). The final score is the model's measured probability of entailment mapped to the range [1, 5].
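As an illustration, the rescaling used for these zero-shot baselines can be written as follows. The exact linear form is an assumption (the text only states that scores in [0, 1] are converted to the range [1, 5]); this is a sketch, not code released with the paper.

```python
def to_ims_scale(score: float) -> float:
    """Map a score in [0, 1] (e.g., a paraphrase/entailment probability,
    or a cosine similarity clipped at 0) onto the 1-5 IMS range."""
    score = max(0.0, min(1.0, score))
    return 1.0 + 4.0 * score

# Examples: an entailment probability of 0.75 maps to an IMS of 4.0,
# and a clipped cosine similarity of 0.0 maps to the minimum IMS of 1.0.
assert to_ims_scale(0.75) == 4.0
assert to_ims_scale(0.0) == 1.0
```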
MiniLM: SBERT with MiniLM as the base network (Wang et al., 2020a); we obtain sentence embeddings for pairs of findings, measure the cosine similarity between these two embeddings, clip the lowest score to 0, and convert this score to the range [1, 5]. Note that this model was trained on over 1B sentence pairs, including from scientific text, using a contrastive learning approach where the embeddings of sentences known to be similar are trained to be closer than the embeddings of negatively sampled sentences. SBERT models rep-