Modeling Information Change in Science Communication with Semantically Matched Paraphrases

Dustin Wright*¹, Jiaxin Pei*², David Jurgens², Isabelle Augenstein¹
¹Dept. of Computer Science, University of Copenhagen, Denmark
²School of Information, University of Michigan, Ann Arbor, MI, USA
{dw,augenstein}@di.ku.dk
{pedropei,jurgens}@umich.edu
Abstract
Whether the media faithfully communicate scientific information has long been a core issue for the science community. Automatically identifying paraphrased scientific findings could enable large-scale tracking and analysis of information changes in the science communication process, but this requires systems to understand the similarity between scientific information across multiple domains. To this end, we present the SCIENTIFIC PARAPHRASE AND INFORMATION CHANGE DATASET (SPICED), the first paraphrase dataset of scientific findings annotated for degree of information change. SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers. We demonstrate that SPICED poses a challenging task and that models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims. Finally, we show that models trained on SPICED can reveal large-scale trends in the degrees to which people and organizations faithfully communicate new scientific findings. Data, code, and pre-trained models are available at http://www.copenlu.com/publication/2022_emnlp_wright/.
1 Introduction
Science communication disseminates scholarly information to audiences outside the research community, such as the public and policymakers (National Academies of Sciences, Engineering, and Medicine, 2017). This process usually involves translating highly technical language into non-technical, less formal language that is engaging and easily understandable for lay people (Salita, 2015). The public relies on the media to learn about new scientific findings, and media portrayals of science affect people's trust in science while at the same time influencing their future actions (Gustafson and Rice, 2019; Fischhoff, 2012; Kuru et al., 2021).
* denotes equal contribution
Figure 1 example pair:
Paper finding: "Increasing dietary magnesium intake is associated with a reduced risk of stroke, heart failure, diabetes, and all-cause mortality."
Tweet: "The study findings suggest that increased consumption of magnesium-rich foods may have health benefits. #Magnesium saves lives https://t.co/K0M6QdjWcc https://t.co/FUtWZ8jADM"

Figure 1: We are interested in measuring the information similarity of statements about scientific findings between different sources, including scientific papers, news, and tweets, shown here with real examples. The finding in this figure comes from Fang et al. (2016) and the news quote is from this Reuters story.
However, not all scientific communication accurately conveys the original information, as shown in Figure 1. Identifying cases where scientific information has changed is a critical but challenging task due to the complex translating and paraphrasing done by effective communicators. Our work introduces a new task of measuring scientific information change, and through developing new data and models, aims to address the gap in studying faithful scientific communication.
Though efforts exist to track and flag when popular media misrepresent science,¹ the sheer volume of new studies, reporting, and online engagement makes purely manual efforts both intractable and unattractive. Existing studies in NLP to help automate the study of science communication have examined exaggeration (Wright and Augenstein, 2021b), certainty (Pei and Jurgens, 2021), and fact checking (Boissonnet et al., 2022; Wright et al., 2022), among others. However, these studies skip over the key first step needed to compare scientific texts for information change: automatically identifying content from both sources which describes the same scientific finding. In other words, to answer relevant questions about and analyze changes in scientific information at scale, one must first be able to point to which original information is being communicated in a new way.

¹ See e.g. https://www.healthnewsreview.org/ and https://sciencefeedback.co/
To enable automated analysis of science communication, this work offers the following contributions (marked by C). First, we present the SCIENTIFIC PARAPHRASE AND INFORMATION CHANGE DATASET (SPICED), a manually annotated dataset of paired scientific findings from news articles, tweets, and scientific papers (C1, §3). SPICED has the following merits: (1) existing datasets focus purely on semantic similarity, while SPICED focuses on differences in the information communicated in scientific findings; (2) scientific text datasets tend to focus solely on titles or paper abstracts, while SPICED includes sentences extracted from the full text of papers and news articles; (3) SPICED is largely multi-domain, covering the four broad scientific fields that receive the most media attention (medicine, biology, computer science, and psychology) and includes data from the whole science communication pipeline, from research articles to science news and social media discussions.
In addition to extensively benchmarking the performance of current models on SPICED (C2, §4), we demonstrate that the dataset enables multiple downstream applications. In particular, we show how models trained on SPICED improve zero-shot performance on the task of sentence-level evidence retrieval for verifying real-world claims about scientific topics (C3, §5), and perform an applied analysis on unlabelled tweets and news articles where we show that (1) media tend to exaggerate findings in the limitations sections of papers; (2) press releases and SciTech outlets tend to have less information change than general news outlets; and (3) organizations' Twitter accounts tend to discuss science more faithfully than verified users on Twitter and users with more followers (C4, §6).
2 Related Work
The analysis of scientific communication directly relates to fact checking, scientific language analysis, and semantic textual similarity. We briefly highlight our connections to these.
Fact Checking. Automatic fact checking is concerned with verifying whether or not a given claim is true, and has been studied extensively in multiple domains (Thorne et al., 2018; Augenstein et al., 2019), including science (Wadden et al., 2020; Boissonnet et al., 2022; Wright et al., 2022). Fact checking focuses on a specific type of information change, namely veracity. Additionally, the task generally assumes access to pre-existing knowledge resources, such as Wikipedia or PubMed, from which evidence can be retrieved that either supports or refutes a given claim. Our task is concerned with a more general type of information change beyond categorical falsehood, and must be completed prior to performing any kind of fact check.
Scientific Language Analysis. Automating tasks beneficial for understanding changes in scientific information between the published literature and media is a growing area of research (Wright and Augenstein, 2021b; Pei and Jurgens, 2021; Boissonnet et al., 2022; Dai et al., 2020; August et al., 2020b; Tan and Lee, 2014; Vadapalli et al., 2018; August et al., 2020a; Ginev and Miller, 2020). The three tasks most related to our work are understanding writing strategies for science communication (August et al., 2020b), detecting changes in certainty (Pei and Jurgens, 2021), and detecting changes in causal claim strength, i.e., exaggeration (Wright and Augenstein, 2021b). However, studying these requires access to paired scientific findings. Doing so at scale will require the ability to pair such findings automatically.
Semantic Similarity. The topic of semantic similarity is well studied in NLP. Several datasets exist with explicit similarity labels, many of which come from SemEval STS shared tasks (e.g., Cer et al., 2017) and paraphrasing datasets (Ganitkevitch et al., 2013). It is possible to build unlabelled datasets of semantic similarity automatically, which is the main method that has been used for scientific texts (Cohan et al., 2020; Lo et al., 2020). However, such datasets fail to capture more subtle aspects of similarity, particularly when the focus is solely on the scientific findings conveyed by a sentence (see Appendix A). As we will show, approaches based on these datasets are insufficient for the task we are concerned with in this work, motivating the need for a new resource.
3 SPICED
We introduce SPICED, a new large-scale dataset of scientific findings paired with how they are communicated in news and social media. Communicating scientific findings is known to have a broad impact on public attitudes (Weigold, 2001) and to influence behavior; e.g., the way vaccines are framed in the media has an effect on vaccine uptake (Kuru et al., 2021). Building upon prior work in NLP (Wright and Augenstein, 2021a; Pei and Jurgens, 2021; Sumner et al., 2014; Bratton et al., 2019), we define a scientific finding as a statement that describes a particular research output of a scientific study, which could be a result, conclusion, product, etc. This general definition holds across fields; for example, many findings from medicine and psychology report on effects on some dependent variable via manipulation of an independent variable, while in computer science many findings are related to new systems, algorithms, or methods. In the following, we describe how the pairs of scientific findings were selected and annotated.
3.1 Data Collection
An initial dataset of unlabelled pairs of scientific communications was collected through Altmetric (https://www.altmetric.com/), a platform tracking mentions of scientific articles online. This initial pool contains 17,668 scientific papers, 41,388 paired news articles, and 733,755 tweets; note that a single paper may be communicated about multiple times. The scientific findings were extracted in different ways for each source. Similar to Prabhakaran et al. (2016), we fine-tune a RoBERTa (Liu et al., 2019) model to classify sentences into methods, background, objective, results, and conclusions using 200K paper abstracts from PubMed that had been self-labeled with these categories (Canese and Weis, 2013). This sentence classifier attained a 0.92 F1 score on a held-out 10% sample (details in Appendix I) and was then applied to each sentence of the news stories and paper full texts. Given the domain difference between scientific abstracts and news, we additionally manually annotated a sample of 100 extracted conclusions; we find that the precision of the classifier is 0.88, suggesting that it is able to accurately identify scientific findings in news as well. We extract each sentence classified as "result" or "conclusion" and create pairs with each finding sentence from news articles written about it. This yields 45.7M potential pairs of ⟨news, paper⟩ findings. For tweets, we take full tweets as is, yielding 35.6M potential pairs of ⟨tweet, paper⟩ findings.
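To make this extraction step concrete, the following is a minimal sketch of how such a fine-tuned sentence-type classifier could be applied to candidate sentences. The model path and label strings are hypothetical placeholders for illustration, not artifacts released with the paper.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Hypothetical path to a RoBERTa model fine-tuned on PubMed abstract sentences
# self-labeled as background/objective/methods/results/conclusions.
MODEL_PATH = "path/to/pubmed-sentence-type-roberta"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
classify = pipeline("text-classification", model=model, tokenizer=tokenizer)

def extract_findings(sentences):
    """Keep only sentences predicted as results or conclusions."""
    findings = []
    for sent in sentences:
        pred = classify(sent, truncation=True)[0]  # {"label": ..., "score": ...}
        if pred["label"] in {"results", "conclusions"}:  # label names are assumed
            findings.append(sent)
    return findings
```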
3.2 Data Sampling
Pairing every finding from a news story with every finding from its matched paper results in an untenable amount of data to annotate. Additionally, it has been shown that proper data selection can reduce the need to annotate every possible sample (MacKay, 1992; Holub et al., 2008; Houlsby et al., 2011). Therefore, to obtain a sample of paired findings covering a range of similarities, we first filter our pool of unlabelled matched findings based on their semantics using Sentence-BERT (SBERT; Reimers and Gurevych, 2019), a Siamese BERT network for semantic textual similarity trained on over 1B sentence pairs (see Appendix G for further details). We use this model to score pairs of findings from news articles and papers based on their embeddings' cosine similarity and conduct a pilot study to determine which data to annotate.
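A minimal sketch of this scoring step using the sentence-transformers library is shown below; the specific checkpoint name is an assumption for illustration and may differ from the model used in the paper.

```python
from sentence_transformers import SentenceTransformer, util

# Any general-purpose SBERT checkpoint works for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

paper_findings = ["Increasing dietary magnesium intake is associated with a reduced risk of stroke."]
news_findings = ["Eating more magnesium-rich foods may have health benefits."]

paper_emb = model.encode(paper_findings, convert_to_tensor=True, normalize_embeddings=True)
news_emb = model.encode(news_findings, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between every <news, paper> finding pair.
scores = util.cos_sim(news_emb, paper_emb)  # shape: (num_news, num_paper)
print(float(scores[0, 0]))
```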
For the pilot, we sample 400 pairs evenly across 0.05-increment buckets over the range [0, 1] of similarity scores (20 per bucket). Each sample is annotated by two of the authors of this study with a binary label of "matching" vs. "not matching", yielding a Krippendorff's alpha of 0.73.² From this sample, we observed that there were no matches below 0.3 and only 2 ambiguous matches below 0.4. At the same time, the vast majority of samples from the entire dataset have a similarity score of less than 0.4. Additionally, above 0.9 we saw that each pair was essentially equivalent. Given the distribution of matched findings across the similarity scale, in order to balance the number of annotations we can acquire, the yield of positive samples, and the sample difficulty, we sampled data as follows based on their cosine similarity (a code sketch of this sampling procedure follows the list):
• Below 0.4: automatically unmatched.
• Above 0.9 with a Jaccard index above 0.5: automatically matched.
• Sample an equal number of pairs from each 0.05-increment bin between 0.4 and 0.9 for human expert annotation.
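The sketch below illustrates this thresholding and stratified sampling logic under the stated assumptions; the Jaccard helper and bin sampler are written here for illustration and are not taken from the released code.

```python
import random
from collections import defaultdict

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard index between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def route_pairs(pairs, per_bin=20, seed=0):
    """pairs: iterable of (paper_sent, news_sent, cosine_score) triples."""
    auto_unmatched, auto_matched = [], []
    bins = defaultdict(list)  # 0.05-wide bins between 0.4 and 0.9
    for paper, news, score in pairs:
        if score < 0.4:
            auto_unmatched.append((paper, news))
        elif score > 0.9 and jaccard(paper, news) > 0.5:
            auto_matched.append((paper, news))
        elif score <= 0.9:
            bins[int(score // 0.05)].append((paper, news))
        # Pairs above 0.9 with low lexical overlap are dropped here for
        # simplicity; the paper does not specify how they were handled.
    rng = random.Random(seed)
    to_annotate = [p for bucket in bins.values()
                   for p in rng.sample(bucket, min(per_bin, len(bucket)))]
    return auto_unmatched, auto_matched, to_annotate
```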
We sample 600 ⟨news, paper⟩ finding pairs from each of the four fields which receive the most media attention (medicine, biology, computer science, and psychology) using this method. This yields 2,400 pairs to be annotated. For extensive details on the pilot annotation and visualizations, see Appendix B.
² Note that many discussions about what constitutes matching vs. not matching were had in pilot work, leading to high agreement.
Pair 1 (SBERT similarity score: 0.88; IMS: 1)
Paper finding: "However, the consistency of the erythritol results in both the central adiposity and usual glycemia comparisons lends strength to the findings, and the cluster of metabolites has biological plausibility."
News finding: "Young adults who exhibited central adiposity gain over the course of 35 weeks had plasma erythritol levels 15-times higher at baseline than those with stable adiposity over the same period."

Pair 2 (SBERT similarity score: 0.38; IMS: 4.4)
Paper finding: "Our results showed that most of the official adult-onset men began their antisocial activities during early childhood."
News finding: "Beckley, who is in the department of psychology and neuroscience at Duke, said the adult-onset group had a history of anti-social behavior back to childhood, but reported committing relatively fewer crimes."

Table 1: Annotated information matching score (IMS) and the similarity score estimated by SBERT (Reimers and Gurevych, 2019) for selected finding pairs from SPICED. These examples demonstrate that simple similarity scores may not reflect whether the two sentences are covering the same scientific finding.
We follow a similar procedure to sample pairs from papers and Twitter for annotation. However, rather than use the SBERT similarity scores, we instead first obtain annotations for news pairs using the scheme described later in §3.3 in order to train an initial model on our task (CiteBERT; Wright and Augenstein, 2021a). We then use the trained model to obtain scores in the range [0, 1] for each pair and sample an equal number of pairs from bins in 0.05 increments, for a total of 1,200 pairs (300 from each field of interest).
3.3 Finding Matching Annotation
We perform our final annotation based on the sampling scheme above using the Prolific platform (https://www.prolific.co/), as it allows prescreening annotators by educational background. We require each annotator to have at least a bachelor's degree in a relevant field to work on the task. Annotators are asked to label "whether the two sentences are discussing the same scientific finding" for 50 finding pairs with a 5-point Likert schema, where each value indicates that the information in the findings is: (1) completely different, (2) mostly different, (3) somewhat similar, (4) mostly the same, or (5) completely the same. See Appendix C for details of how this rating scale was decided. We call this the INFORMATION MATCHING SCORE (IMS) of a pair of findings. Annotation was performed using POTATO (Pei et al., 2022). Full annotation instructions and details are listed in Appendix D. Notably, annotators were instructed to mark how similar the information in the findings was, as opposed to how similar the sentences are. Further, they were instructed to ignore extraneous information like "The scientists show..." and "our experiments demonstrate...".
Post-processing. To improve the reliability of the annotations, we use MACE (Hovy et al., 2013) to estimate the competence score of each annotator and remove the labels from the annotators with the lowest competence scores. We further manually examine pairs with the most diverse labels (standard deviation of ratings > 1.2) and manually replace the outliers with our expert annotations. The overall Krippendorff's α is 0.52, 0.57, 0.53, and 0.52 for CS, Medicine, Biology, and Psychology respectively, indicating that the final labels are reliable. While many annotators considered the task challenging, our quality control strategies allow us to collect reliable annotations.³ For all the annotated pairs, we average the ratings as the final similarity score. In addition to the 3,600 manually annotated pairs, we include an extra 2,400 automatically annotated pairs as determined in §3.2 (unmatched pairs get an IMS of 1, matched pairs get an IMS of 5), for a total of 6,000 pairs. Given that there can be multiple finding pairs from a single news article and paper, to avoid overlaps between training and test sets, we split the dataset 80%/10%/10% based on the paper DOI and balance across subjects. Further dataset details are in Appendix E.

³ For example, one participant commented "It was pretty hard to consider both the statements and their context then comparing them for similarities, but i enjoyed it".
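A simplified sketch of the aggregation logic described above (flagging high-disagreement pairs and averaging ratings) is given below; the MACE-based competence filtering itself is performed with the MACE tool and is not reproduced here, and this helper is an illustrative assumption rather than the released pipeline.

```python
import statistics

def aggregate_ratings(pair_ratings, disagreement_threshold=1.2):
    """pair_ratings: dict mapping pair_id -> list of 1-5 Likert ratings,
    already filtered by annotator competence."""
    final_ims, needs_review = {}, []
    for pair_id, ratings in pair_ratings.items():
        if len(ratings) > 1 and statistics.stdev(ratings) > disagreement_threshold:
            # High-disagreement pairs are re-examined and outliers replaced
            # with expert annotations before averaging.
            needs_review.append(pair_id)
        final_ims[pair_id] = sum(ratings) / len(ratings)
    return final_ims, needs_review
```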
Selected Examples. To highlight the difficulty of SPICED, we show a pair of samples from our final dataset in Table 1. The IMS is compared to the cosine similarity between embeddings produced by SBERT. For the first case, SBERT presumably picks up on similarities in the discussed topics, such as erythritol and its relationship to adiposity, but the paper finding is concerned with the consistency of results and its biological implications, while the news finding explicitly mentions a relationship between erythritol and adiposity. The second case expresses the opposite effect; the news finding contains a lot of extraneous information for context, but one of the core findings it expresses is the same as the paper finding, giving it a high rating in SPICED.

                                 STSB    SNLI    SPICED   SPICED (News)   SPICED (Tweets)
Avg. normalized edit distance    0.401   0.631   0.726    0.712           0.749

Table 2: The average normalized edit distance between matching pairs for various datasets shows that SPICED includes more pairs that are lexically dissimilar. For SPICED and STSB, pairs are considered matching if their similarity score is greater than 3. For SNLI, pairs are considered matching if the label is "entailment".
Comparison with existing datasets. To further characterize the difficulty of SPICED compared to existing datasets, we show the average normalized edit distance between matching pairs in SPICED, STSB (Cer et al., 2017), and SNLI (Bowman et al., 2015) in Table 2 (see Appendix F for the calculation). STSB is a semantic textual similarity dataset consisting of pairs of sentences scored with their semantic similarity, sourced from multiple SemEval shared tasks. SNLI is a natural language inference corpus consisting of pairs of sentences labeled for whether they entail each other, contradict each other, or are neutral. We calculated the mean normalized edit distance across all pairs of matching sentences in each dataset's training data; for SPICED and STSB, pairs are considered matching if their IMS or similarity score is greater than 3, respectively. For SNLI, pairs are considered matching if the label is "entailment".
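Appendix F gives the exact calculation; as an assumption about the metric, a plausible implementation of per-pair normalized edit distance (Levenshtein distance divided by the length of the longer string, averaged over matching pairs) is sketched below.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def avg_normalized_edit_distance(pairs):
    """pairs: iterable of (sentence_a, sentence_b) matching pairs."""
    dists = [levenshtein(a, b) / max(len(a), len(b), 1) for a, b in pairs]
    return sum(dists) / len(dists)
```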
We find that there is a much greater lexical difference between the matching pairs in SPICED (0.726) than in existing general-domain paired text datasets (0.401 for STSB and 0.631 for SNLI). This gap between STSB and SPICED also emphasizes the difference between traditional semantic textual similarity tasks and the information change task we describe here. Within SPICED, Twitter pairs had a higher distance (0.749) than news pairs (0.712), suggesting stronger domain differences. For qualitative examples showing the difference between SPICED and STSB, see Appendix A.
Relationship of SPICED to Fact Checking. The task introduced by SPICED captures information change more broadly than veracity as in automatic fact checking, as it is concerned with the degree to which two sentences describe the same scientific information; indeed, two similar sentences may describe the same information equally poorly. Our task is similar to the sentence selection stage in the fact checking pipeline, and we later demonstrate that models trained on SPICED data are useful for this task for science in §5. However, our task and annotation are agnostic to whether a pair of sentences entail one another. This is especially useful if one wants to compare how a particular finding is presented across different media. Fact-checking datasets are also explicitly constructed to contain claims which are about a single piece of information, whereas SPICED is not restricted in this way, focusing on a more general type of information change beyond categorical falsehood. Finally, we note two more unique features of SPICED: (1) SPICED contains naturally occurring sentences, while fact checking datasets like FEVER and SciFact often contain manually written claims; (2) the combination of domains in SPICED is unique: sentences are paired between (news, science) and (tweets, science), and these pairings do not currently exist in other resources.
4 Scientific Information Change Models
We now use SPICED to evaluate models for estimating the IMS of finding pairs in two settings: zero-shot transfer and supervised fine-tuning.
4.1 Experimental Setup
We use the following four models to estimate zero-shot transfer performance.

Paraphrase: RoBERTa (Liu et al., 2019) pre-trained for paraphrase detection on an adversarial paraphrasing task (Nighojkar and Licato, 2021). We convert the output probability of a pair being a paraphrase to the range [1, 5] for comparison with our labels.

Natural Language Inference (NLI): RoBERTa pre-trained on a wide range of NLI datasets (Nie et al., 2020). The final score is the model's measured probability of entailment mapped to the range [1, 5].
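As an illustration, the rescaling used for these zero-shot baselines can be written as follows. The exact linear form is an assumption (the text only states that scores in [0, 1] are converted to the range [1, 5]); this is a sketch, not code released with the paper.

```python
def to_ims_scale(score: float) -> float:
    """Map a score in [0, 1] (e.g., a paraphrase/entailment probability,
    or a cosine similarity clipped at 0) onto the 1-5 IMS range."""
    score = max(0.0, min(1.0, score))
    return 1.0 + 4.0 * score

# Examples: an entailment probability of 0.75 maps to an IMS of 4.0,
# and a clipped cosine similarity of 0.0 maps to the minimum IMS of 1.0.
assert to_ims_scale(0.75) == 4.0
assert to_ims_scale(0.0) == 1.0
```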
MiniLM: SBERT with MiniLM as the base network (Wang et al., 2020a); we obtain sentence embeddings for pairs of findings, measure the cosine similarity between these two embeddings, clip the lowest score to 0, and convert this score to the range [1, 5]. Note that this model was trained on over 1B sentence pairs, including from scientific text, using a contrastive learning approach where the embeddings of sentences known to be similar are trained to be closer than the embeddings of negatively sampled sentences. SBERT models rep-