Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature

Tomas Goldsack1, Zhihao Zhang2, Chenghua Lin1∗, Carolina Scarton1
1Department of Computer Science, University of Sheffield, UK
2School of Economics and Management, Beihang University, China
{tgoldsack1, c.lin, c.scarton}@sheffield.ac.uk
zhhzhang@buaa.edu.cn
∗Corresponding author.
Abstract

Lay summarisation aims to jointly summarise and simplify a given text, thus making its content more comprehensible to non-experts. Automatic approaches for lay summarisation can provide significant value in broadening access to scientific literature, enabling a greater degree of both interdisciplinary knowledge sharing and public understanding when it comes to research findings. However, current corpora for this task are limited in their size and scope, hindering the development of broadly applicable data-driven approaches. Aiming to rectify these issues, we present two novel lay summarisation datasets, PLOS (large-scale) and eLife (medium-scale), each of which contains biomedical journal articles alongside expert-written lay summaries. We provide a thorough characterisation of our lay summaries, highlighting differing levels of readability and abstractiveness between datasets that can be leveraged to support the needs of different applications. Finally, we benchmark our datasets using mainstream summarisation approaches and perform a manual evaluation with domain experts, demonstrating their utility and casting light on the key challenges of this task. Our code and datasets are available at https://github.com/TGoldsack1/Corpora_for_Lay_Summarisation.
1 Introduction

Scientific publications contain information that is essential for the preservation and progression of our understanding across all scientific disciplines. Typically being highly technical in nature, such articles tend to assume a degree of background knowledge and make use of domain-specific language, making them difficult to comprehend for one lacking the required expertise (i.e., a lay person). These factors often limit the impact of research to only its direct community (Albert et al., 2015, 2022) and, more dangerously, can cause readers (members of the public, journalists, etc.) to misinterpret research findings (Kuehne and Olden, 2015).
Figure 1: The first few sentences of the abstract and lay summary of an eLife article, illustrating differences in the language and focus on background information.

Technical Abstract: "The virus SARS-CoV-2 can exploit biological vulnerabilities (e.g. host proteins) in susceptible hosts that predispose to the development of severe COVID-19. To identify host proteins that may contribute to the risk of severe COVID-19, we undertook proteome-wide genetic colocalisation tests, and polygenic (pan) and cis-Mendelian randomisation analyses leveraging publicly available protein and COVID-19 datasets..."

Lay Summary: "Individuals who become infected with the virus that causes COVID-19 can experience a wide variety of symptoms. These can range from no symptoms or minor symptoms to severe illness and death. Key demographic factors, such as age, gender and race, are known to affect how susceptible an individual is to infection. However, molecular factors, such as unique gene mutations and gene expression levels can also have a major impact on patient responses by affecting the levels of proteins in the body..."
This latter point is especially important for biomedical research which, in addition to having particularly dynamic and confusing terminology (Smith, 2006; Peng et al., 2021), has the potential to directly impact people's decision-making regarding health-related issues, with a pertinent example of this being the widespread misinformation seen during the COVID-19 pandemic (Islam et al., 2020). Aiming to address these challenges, some academic journals choose to publish lay summaries that clearly and concisely explain the context and significance of an article using non-specialist language. Figure 1 illustrates how simplifying jargon (e.g., "SARS-CoV-2" → "the virus that causes COVID-19") and focusing on background information allows a reader to better understand a complex scientific topic. However, in addition to placing an extra burden on authors, lay summaries are not yet ubiquitous and focus only on newly published articles.
Automatic text summarisation can provide significant value in the generation of scientific lay summaries. Although previous use of summarisation techniques for scientific articles has largely focused on generating a technical summary (e.g., the abstract), only a few works have addressed the task of lay summarisation and introduced datasets to facilitate its study (Chandrasekaran et al., 2020; Guo et al., 2021; Zaman et al., 2020). However, compared to datasets ordinarily used for training supervised summarisation models, these resources are relatively small (ranging from 572 to 6,695 articles), presenting a significant barrier to the deployment of data-driven approaches that require training on large amounts of parallel data. Furthermore, these resources are somewhat fragmented in their framing of the task, making use of article and summary formats that limit their applicability to broader biomedical literature. These factors hinder the progression of the field and the development of usable models for making scientific content accessible to a wider audience.
To help alleviate these issues, we introduce two new datasets derived from different academic journals within the biomedical domain: PLOS and eLife (§3). Both datasets use the full journal article as the source, enabling the training of models which can be broadly applied to wider literature. PLOS is significantly larger than currently available datasets and makes use of short author-written lay summaries (150-200 words), whereas eLife's summaries are approximately twice as long and written by expert editors who are well-practiced in the simplification of scientific content. Given these differences in authorship and length, we expect the lay summaries of eLife to simplify content to a greater extent, meaning our datasets are able to cater to different audiences and applications (e.g., personalised lay summarisation). We confirm this via an in-depth characterisation of the lay summaries within each dataset, quantifying ways in which they differ from the technical abstract and from each other (§4). Finally, we benchmark our datasets with popular summarisation approaches using automatic metrics and conduct an expert-based manual evaluation, highlighting the utility of our datasets and key challenges for the task of lay summarisation (§5). This paper also presents a literature review (§2), conclusions (§6), and a discussion of its limitations (§7).
2 Related Work

Past attempts to automatically summarise scientific content in layman's terms have been scarce, with the most prominent example being the LaySumm subtask of the CL-SciSumm 2020 shared task series (Chandrasekaran et al., 2020), which attracted a total of 8 submissions. Alongside the task, a training corpus of 572 articles and author-generated lay summaries from a multi-disciplinary collection of Elsevier-published scientific journals was provided, with submissions being evaluated on a blind test set of 37 articles. It was noted by the task organisers that the data provided was insufficient for training a model to produce a realistic lay summary.
Guo et al. (2021) also make use of a single publication source to retrieve lay summaries: The Cochrane Database of Systematic Reviews (CDSR). Their dataset contains the abstracts of 6,695 systematic reviews paired with their respective plain-language summaries, covering various healthcare domains. Although larger than other available datasets for lay summarisation, CDSR is constrained in that it only uses the abstracts of systematic reviews as source documents, and thus models trained using CDSR are unlikely to generalise well to inputs that are longer than an abstract or to the abstracts of other types of publication.
Alternatively, Zaman et al. (2020) introduce a dataset derived from the 'Eureka-Alert' science news website for the combined tasks of simplification and summarisation. Summaries consist of news articles (average length > 600 words) that aim to describe the content of a scientific publication to the non-expert. However, the extensive size of reference summaries is likely to present additional challenges in model training, and their news-based format limits their applicability (e.g., in automating lay summarisation for journals).
Compared to previous resources, our datasets contain articles and lay summaries of a format that we consider to be more broadly applicable to wider literature. Additionally, PLOS is significantly larger than those currently available (over 4× larger than CDSR) and eLife contains summaries written by expert editors. Furthermore, our work is the first to provide two datasets with different levels of readability, thus supporting the needs of different audiences and applications. Through each of these factors, we hope to enable the creation of more usable lay summarisation models.

Table 1: Statistics of lay summarisation datasets, with ours (PLOS, eLife) in the final two rows. Word and sentence (sents) counts are average values.

Dataset        # Docs    Doc # words    Summary # words    Summary # sents
LaySumm           572       4,426.1           82.15              3.8
Eureka-Alert    5,204       5,027.0          635.6              24.3
CDSR            6,695         576.0          338.2              16.1
PLOS           27,525       5,366.7          175.6               7.8
eLife           4,828       7,806.1          347.6              15.7
3 Our Datasets

We introduce two datasets from different biomedical journals (PLOS and eLife), each containing full scientific articles paired with manually-created lay summaries. For each data source, articles were retrieved in XML format and parsed using Python to retrieve the lay summary, abstract, and article text.1 In line with previous datasets for scientific summarisation (Cohan et al., 2018), the article text is separated into sections, and the heading of each section is also retrieved. Sentences are segmented using the PySBD rule-based parser (Sadvilkar and Neumann, 2020), which we empirically found to outperform neural alternatives. We separate our datasets into training, validation, and testing splits at a ratio of 90%/5%/5%. Statistics describing the contents of our datasets and those of past lay summarisation datasets are given in Table 1.

1 For each article, we also retrieve a number of keywords from the meta-data, providing an indication of the high-level topics covered within the article.
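To make this pipeline concrete, the following is a minimal sketch of how an article could be parsed and segmented. It assumes a simplified JATS-style XML layout; the element names, helper functions, and shuffling step are illustrative assumptions rather than the released preprocessing code (only the use of Python XML parsing, PySBD, and the 90%/5%/5% split follow the description above).

```python
import random
import xml.etree.ElementTree as ET

import pysbd  # rule-based sentence boundary detection

SEGMENTER = pysbd.Segmenter(language="en", clean=False)

def parse_article(xml_path):
    """Extract the lay summary, abstract, and sectioned body text of one article."""
    root = ET.parse(xml_path).getroot()
    article = {
        # Tag names are assumed; real journal XML is more deeply nested.
        "lay_summary": root.findtext(".//lay-summary", default=""),
        "abstract": root.findtext(".//abstract", default=""),
        "sections": [],
    }
    for sec in root.iter("sec"):
        heading = sec.findtext("title", default="")
        paragraphs = [p.text or "" for p in sec.findall("p")]
        sentences = SEGMENTER.segment(" ".join(paragraphs))
        article["sections"].append({"heading": heading, "sentences": sentences})
    return article

def split_dataset(articles, seed=42):
    """Shuffle and split articles into 90%/5%/5% train/validation/test partitions."""
    random.Random(seed).shuffle(articles)
    n = len(articles)
    n_train, n_val = int(0.9 * n), int(0.05 * n)
    return (articles[:n_train],
            articles[n_train:n_train + n_val],
            articles[n_train + n_val:])
```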
PLOS
The Public Library of Science (PLOS) is an open-access publisher that hosts influential peer-reviewed journals across all areas of science and medicine. Several of these journals require authors to submit an author summary alongside their work, defined as a 150-200 word non-technical summary aimed at making the findings of a paper accessible to a wider audience, including non-scientists.2 The journals in question focus specifically on Biology, Computational Biology, Genetics, Pathogens, and Neglected Tropical Diseases.

2 Source of PLOS author summary definition: https://journals.plos.org/plosgenetics/s/submission-guidelines
eLife
eLife is an open-access peer-reviewed journal with a specific focus on biomedical and life sciences. Of the articles published in eLife, some are selected to be the subject of a digest, a simplified summary of the work written by expert editors based on both the article itself and questions answered by its author. Similarly to PLOS, these digests aim to explain the background and significance of a scientific article in language that is accessible to non-experts (King et al., 2017).

Table 2: Mean readability scores for abstracts and lay summaries from our datasets. For all metrics, a lower score indicates greater readability.

              Abstract             Lay Summary
Metric       PLOS     eLife       PLOS     eLife
FKGL        15.04     15.57      14.76     10.92
CLI         16.39     17.68      15.90     12.51
DCRS        11.06     11.78      10.91      8.83
WordRank     9.08      9.21       8.98      8.68
4 Dataset Analysis

We carry out several analyses comparing the lay summaries of our datasets to the respective technical abstracts. Through these analyses, we seek to highlight and quantify the key differences between these two types of summary, as well as those present between the lay summaries of our two datasets. Specifically, we focus on readability (§4.1), rhetorical structure (§4.2), vocabulary sharing (§4.3), and abstractiveness (§4.4).
4.1 Readability

We assess the readability of our lay summaries and abstracts using several established metrics. Specifically, we employ Flesch-Kincaid Grade Level (FKGL), Coleman-Liau Index (CLI), Dale-Chall Readability Score (DCRS), and WordRank score.3 FKGL, CLI, and DCRS provide an approximation of the (US) grade level of education required to read a given text. The formula for FKGL is based on the total number of sentences, words, and syllables present within the text, whereas CLI is based on the number of sentences, words, and characters. Alternatively, DCRS measures readability using the average sentence length and the number of familiar words present, using a lookup table of the 3,000 most commonly used English words. Similarly, WordRank estimates the lexical complexity of a text based on how common the language is, using a frequency table derived from English Wikipedia.

3 Computed using the textstat and EASSE (Alva-Manchego et al., 2019) packages.
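As an illustration, three of these metrics can be computed directly with the textstat package mentioned in footnote 3; the sketch below uses an invented example string and omits WordRank, which is provided by the separate EASSE toolkit.

```python
import textstat

# An invented example sentence standing in for a lay summary.
lay_summary = (
    "Individuals who become infected with the virus that causes COVID-19 "
    "can experience a wide variety of symptoms."
)

scores = {
    "FKGL": textstat.flesch_kincaid_grade(lay_summary),          # sentences, words, syllables
    "CLI": textstat.coleman_liau_index(lay_summary),             # sentences, words, characters
    "DCRS": textstat.dale_chall_readability_score(lay_summary),  # familiar-word lookup table
}
for metric, value in scores.items():
    print(f"{metric}: {value:.2f}")  # lower values indicate more readable text
```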
The scores given in Table 2 show that the lay summaries of both datasets are consistently more readable than their respective abstracts across all metrics. Although these differences are small in some cases, in line with the findings of previous works (Devaraj et al., 2021), we find them all to be statistically significant by way of Mann–Whitney U tests (p < 0.05). These results indicate that lay summaries are more readable than technical abstracts in terms of both syntactic structure and lexical intelligibility. Additionally, the lay summaries from eLife obtain lower readability scores than those of PLOS across all metrics, confirming our expectation that they are suitable for less technical audiences.4

4 Manual inspection of the summaries from each dataset also supports this.
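The significance testing described above could be reproduced along the following lines; this is a minimal sketch assuming per-document readability scores have already been collected, and the arrays shown are placeholders rather than the paper's data.

```python
from scipy.stats import mannwhitneyu

# Placeholder per-document FKGL scores (one value per abstract / lay summary).
abstract_fkgl = [15.2, 14.8, 16.1, 15.5, 14.9]
lay_summary_fkgl = [11.0, 10.4, 11.8, 10.9, 11.2]

# Two-sided Mann-Whitney U test comparing the two score distributions.
statistic, p_value = mannwhitneyu(abstract_fkgl, lay_summary_fkgl, alternative="two-sided")
print(f"U = {statistic}, p = {p_value:.4f}")  # a difference is treated as significant if p < 0.05
```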
4.2 Rhetorical Structure

Rhetoric is another important factor when assessing the comprehensibility of a text. Specifically, a lay person will require a much larger focus on the background of a scientific article than an expert in order to understand the significance of its findings (King et al., 2017), thus we would expect lay summaries to focus more on such aspects.

Table 3: Mean percentage of each rhetorical label within our abstracts and lay summaries.

               Abstract             Lay Summary
Label         PLOS     eLife       PLOS     eLife
Background   35.40     41.05      58.11     55.03
Objective     0.76      1.06       0.54      0.47
Methods      10.26      6.73       6.24      6.23
Results      34.75     30.60      17.86     18.23
Conclusions  18.83     20.55      17.26     18.83
To provide further insight into the structural differences between abstracts and lay summaries, we classify all sentences within each based on their rhetorical status. To do this, we make use of PubMed RCT (Dernoncourt and Lee, 2017), a dataset containing 20,000 biomedical abstracts retrieved from PubMed, with each sentence labelled according to its rhetorical role (roles: Background, Objective, Methods, Results, Conclusions). We use PubMed RCT to train the BERT-based sequential classifier introduced by Cohan et al. (2019) due to its strong reported performance (92.9 micro F1-score), before applying this model to lay summary and abstract sentences from our datasets.
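Given per-sentence predictions from such a classifier (treated here as a black box), the per-document label distributions underlying Table 3 can be aggregated as in the minimal sketch below; the input structure and example predictions are illustrative placeholders, not outputs of the actual model.

```python
from collections import Counter

LABELS = ["Background", "Objective", "Methods", "Results", "Conclusions"]

def label_distribution(sentence_labels):
    """Percentage of each rhetorical label within a single summary or abstract."""
    counts = Counter(sentence_labels)
    return {label: 100.0 * counts[label] / len(sentence_labels) for label in LABELS}

def mean_distribution(documents):
    """Average the per-document percentages across a collection of documents."""
    per_doc = [label_distribution(labels) for labels in documents]
    return {label: sum(d[label] for d in per_doc) / len(per_doc) for label in LABELS}

# Placeholder predictions for two lay summaries (one label per sentence).
predicted = [
    ["Background", "Background", "Results", "Conclusions"],
    ["Background", "Methods", "Results", "Conclusions"],
]
print(mean_distribution(predicted))
```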
Figure 2: Barplot visualising the rhetorical class distributions in our abstracts and lay summaries.

Figure 2 provides a visualisation of how the fre-