Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature

Tomas Goldsack1, Zhihao Zhang2, Chenghua Lin1∗, Carolina Scarton1
1Department of Computer Science, University of Sheffield, UK
2School of Economics and Management, Beihang University, China
{tgoldsack1, c.lin, c.scarton}@sheffield.ac.uk
zhhzhang@buaa.edu.cn
∗Corresponding author.
Abstract

Lay summarisation aims to jointly summarise and simplify a given text, thus making its content more comprehensible to non-experts. Automatic approaches for lay summarisation can provide significant value in broadening access to scientific literature, enabling a greater degree of both interdisciplinary knowledge sharing and public understanding when it comes to research findings. However, current corpora for this task are limited in their size and scope, hindering the development of broadly applicable data-driven approaches. Aiming to rectify these issues, we present two novel lay summarisation datasets, PLOS (large-scale) and eLife (medium-scale), each of which contains biomedical journal articles alongside expert-written lay summaries. We provide a thorough characterisation of our lay summaries, highlighting differing levels of readability and abstractiveness between datasets that can be leveraged to support the needs of different applications. Finally, we benchmark our datasets using mainstream summarisation approaches and perform a manual evaluation with domain experts, demonstrating their utility and casting light on the key challenges of this task. Our code and datasets are available at https://github.com/TGoldsack1/Corpora_for_Lay_Summarisation.
1 Introduction

Scientific publications contain information that is essential for the preservation and progression of our understanding across all scientific disciplines. Typically being highly technical in nature, such articles tend to assume a degree of background knowledge and make use of domain-specific language, making them difficult to comprehend for one lacking the required expertise (i.e., a lay person). These factors often limit the impact of research to only its direct community (Albert et al., 2015, 2022) and, more dangerously, can cause readers (members of the public, journalists, etc.) to misinterpret research findings (Kuehne and Olden, 2015).
Figure 1: The first few sentences of the abstract and lay summary of an eLife article, illustrating differences in the language and focus on background information.

Technical Abstract: "The virus SARS-CoV-2 can exploit biological vulnerabilities (e.g. host proteins) in susceptible hosts that predispose to the development of severe COVID-19. To identify host proteins that may contribute to the risk of severe COVID-19, we undertook proteome-wide genetic colocalisation tests, and polygenic (pan) and cis-Mendelian randomisation analyses leveraging publicly available protein and COVID-19 datasets..."

Lay Summary: "Individuals who become infected with the virus that causes COVID-19 can experience a wide variety of symptoms. These can range from no symptoms or minor symptoms to severe illness and death. Key demographic factors, such as age, gender and race, are known to affect how susceptible an individual is to infection. However, molecular factors, such as unique gene mutations and gene expression levels can also have a major impact on patient responses by affecting the levels of proteins in the body..."
This latter point is especially important for biomedical research which, in addition to having particularly dynamic and confusing terminology (Smith, 2006; Peng et al., 2021), has the potential to directly impact people's decision-making regarding health-related issues, with a pertinent example of this being the widespread misinformation seen during the COVID-19 pandemic (Islam et al., 2020). Aiming to address these challenges, some academic journals choose to publish lay summaries that clearly and concisely explain the context and significance of an article using non-specialist language. Figure 1 illustrates how simplifying jargon (e.g., "SARS-CoV-2" → "the virus that causes COVID-19") and focusing on background information allows a reader to better understand a complex scientific topic. However, in addition to placing an extra burden on authors, lay summaries are not yet ubiquitous and focus only on newly published articles.
Automatic text summarisation can provide significant value in the generation of scientific lay summaries. Although previous use of summarisation techniques for scientific articles has largely focused on generating a technical summary (e.g., the abstract), only a few works have addressed the task of lay summarisation and introduced datasets to facilitate its study (Chandrasekaran et al., 2020; Guo et al., 2021; Zaman et al., 2020). However, compared to datasets ordinarily used for training supervised summarisation models, these resources are relatively small (ranging from 572 to 6,695 articles), presenting a significant barrier to the deployment of data-driven approaches that require training on large amounts of parallel data. Furthermore, these resources are somewhat fragmented in their framing of the task, making use of article and summary formats that limit their applicability to broader biomedical literature. These factors hinder the progression of the field and the development of usable models for making scientific content accessible to a wider audience.
To help alleviate these issues, we introduce two new datasets derived from different academic journals within the biomedical domain: PLOS and eLife (§3). Both datasets use the full journal article as the source, enabling the training of models which can be broadly applied to wider literature. PLOS is significantly larger than currently available datasets and makes use of short author-written lay summaries (150-200 words), whereas eLife's summaries are approximately twice as long and written by expert editors who are well-practiced in the simplification of scientific content. Given these differences in authorship and length, we expect the lay summaries of eLife to simplify content to a greater extent, meaning our datasets are able to cater to different audiences and applications (e.g., personalised lay summarisation). We confirm this via an in-depth characterisation of the lay summaries within each dataset, quantifying ways in which they differ from the technical abstract and from each other (§4). Finally, we benchmark our datasets with popular summarisation approaches using automatic metrics and conduct an expert-based manual evaluation, highlighting the utility of our datasets and key challenges for the task of lay summarisation (§5). This paper also presents a literature review (§2), conclusions (§6), and a discussion of its limitations (§7).
2 Related Work

Past attempts to automatically summarise scientific content in layman's terms have been scarce, with the most prominent example being the LaySumm subtask of the CL-SciSumm 2020 shared task series (Chandrasekaran et al., 2020), which attracted a total of 8 submissions. Alongside the task, a training corpus of 572 articles and author-generated lay summaries from a multi-disciplinary collection of Elsevier-published scientific journals was provided, with submissions being evaluated on a blind test set of 37 articles. It was noted by the task organisers that the data provided was insufficient for training a model to produce a realistic lay summary.
Guo et al. (2021) also make use of a single publication source to retrieve lay summaries: The Cochrane Database of Systematic Reviews (CDSR). Their dataset contains the abstracts of 6,695 systematic reviews paired with their respective plain-language summaries, covering various healthcare domains. Although larger than other available datasets for lay summarisation, CDSR is constrained in that it only uses the abstracts of systematic reviews as source documents, and thus models trained using CDSR are unlikely to generalise well to inputs that are longer than an abstract or to the abstracts of other types of publication.
Alternatively, Zaman et al. (2020) introduce a dataset derived from the 'Eureka-Alert' science news website for the combined tasks of simplification and summarisation. Summaries consist of news articles (average length > 600 words) that aim to describe the content of a scientific publication to the non-expert. However, the extensive size of reference summaries is likely to present additional challenges in model training, and their news-based format limits their applicability (e.g., in automating lay summarisation for journals).
Compared to previous resources, our datasets contain articles and lay summaries of a format that we consider to be more broadly applicable to wider literature. Additionally, PLOS is significantly larger than those currently available (over 4× larger than CDSR) and eLife contains summaries written by expert editors. Furthermore, our work is the first to provide two datasets with different levels of readability, thus supporting the needs of different audiences and applications. Through each of these factors, we hope to enable the creation of more usable lay summarisation models.

Table 1: Statistics of lay summarisation datasets, with ours (PLOS, eLife) in the final two rows. Word and sentence (sents) counts are average values.

Dataset        # Docs    Doc # words    Summary # words    Summary # sents
LaySumm           572       4,426.1           82.15              3.8
Eureka-Alert    5,204       5,027.0          635.6              24.3
CDSR            6,695         576.0          338.2              16.1
PLOS           27,525       5,366.7          175.6               7.8
eLife           4,828       7,806.1          347.6              15.7
3 Our Datasets

We introduce two datasets from different biomedical journals (PLOS and eLife), each containing full scientific articles paired with manually-created lay summaries. For each data source, articles were retrieved in XML format and parsed using Python to retrieve the lay summary, abstract, and article text.1 In line with previous datasets for scientific summarisation (Cohan et al., 2018), the article text is separated into sections, and the heading of each section is also retrieved. Sentences are segmented using the PySBD rule-based parser (Sadvilkar and Neumann, 2020), which we empirically found to outperform neural alternatives. We separate our datasets into training, validation, and testing splits at a ratio of 90%/5%/5%. Statistics describing the contents of our datasets and those of past lay summarisation datasets are given in Table 1.

1 For each article, we also retrieve a number of keywords from the meta-data, providing an indication of the high-level topics covered within the article.
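To make this pipeline concrete, the following is a minimal sketch of how an article could be parsed and segmented. It assumes a simplified JATS-style XML layout; the element names, helper functions, and shuffling step are illustrative assumptions rather than the released preprocessing code (only the use of Python XML parsing, PySBD, and the 90%/5%/5% split follow the description above).

```python
import random
import xml.etree.ElementTree as ET

import pysbd  # rule-based sentence boundary detection

SEGMENTER = pysbd.Segmenter(language="en", clean=False)

def parse_article(xml_path):
    """Extract the lay summary, abstract, and sectioned body text of one article."""
    root = ET.parse(xml_path).getroot()
    article = {
        # Tag names are assumed; real journal XML is more deeply nested.
        "lay_summary": root.findtext(".//lay-summary", default=""),
        "abstract": root.findtext(".//abstract", default=""),
        "sections": [],
    }
    for sec in root.iter("sec"):
        heading = sec.findtext("title", default="")
        paragraphs = [p.text or "" for p in sec.findall("p")]
        sentences = SEGMENTER.segment(" ".join(paragraphs))
        article["sections"].append({"heading": heading, "sentences": sentences})
    return article

def split_dataset(articles, seed=42):
    """Shuffle and split articles into 90%/5%/5% train/validation/test partitions."""
    random.Random(seed).shuffle(articles)
    n = len(articles)
    n_train, n_val = int(0.9 * n), int(0.05 * n)
    return (articles[:n_train],
            articles[n_train:n_train + n_val],
            articles[n_train + n_val:])
```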
PLOS
The Public Library of Science (PLOS) is an open-access publisher that hosts influential peer-reviewed journals across all areas of science and medicine. Several of these journals require authors to submit an author summary alongside their work, defined as a 150-200 word non-technical summary aimed at making the findings of a paper accessible to a wider audience, including non-scientists.2 The journals in question focus specifically on Biology, Computational Biology, Genetics, Pathogens, and Neglected Tropical Diseases.

2 Source of PLOS author summary definition: https://journals.plos.org/plosgenetics/s/submission-guidelines
eLife
eLife is an open-access peer-reviewed journal with a specific focus on biomedical and life sciences. Of the articles published in eLife, some are selected to be the subject of a digest, a simplified summary of the work written by expert editors based on both the article itself and questions answered by its author. Similarly to PLOS, these digests aim to explain the background and significance of a scientific article in language that is accessible to non-experts (King et al., 2017).

Table 2: Mean readability scores for abstracts and lay summaries from our datasets. For all metrics, a lower score indicates greater readability.

              Abstract             Lay Summary
Metric       PLOS     eLife       PLOS     eLife
FKGL        15.04     15.57      14.76     10.92
CLI         16.39     17.68      15.90     12.51
DCRS        11.06     11.78      10.91      8.83
WordRank     9.08      9.21       8.98      8.68
4 Dataset Analysis

We carry out several analyses comparing the lay summaries of our datasets to the respective technical abstracts. Through these analyses, we seek to highlight and quantify the key differences between these two types of summary, as well as those present between the lay summaries of our two datasets. Specifically, we focus on readability (§4.1), rhetorical structure (§4.2), vocabulary sharing (§4.3), and abstractiveness (§4.4).
4.1 Readability

We assess the readability of our lay summaries and abstracts using several established metrics. Specifically, we employ Flesch-Kincaid Grade Level (FKGL), Coleman-Liau Index (CLI), Dale-Chall Readability Score (DCRS), and WordRank score.3 FKGL, CLI, and DCRS provide an approximation of the (US) grade level of education required to read a given text. The formula for FKGL is based on the total number of sentences, words, and syllables present within the text, whereas CLI is based on the number of sentences, words, and characters. Alternatively, DCRS measures readability using the average sentence length and the number of familiar words present, using a lookup table of the 3,000 most commonly used English words. Similarly, WordRank estimates the lexical complexity of a text based on how common the language is, using a frequency table derived from English Wikipedia.

3 Computed using the textstat and EASSE (Alva-Manchego et al., 2019) packages.
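As an illustration, three of these metrics can be computed directly with the textstat package mentioned in footnote 3; the sketch below uses an invented example string and omits WordRank, which is provided by the separate EASSE toolkit.

```python
import textstat

# An invented example sentence standing in for a lay summary.
lay_summary = (
    "Individuals who become infected with the virus that causes COVID-19 "
    "can experience a wide variety of symptoms."
)

scores = {
    "FKGL": textstat.flesch_kincaid_grade(lay_summary),          # sentences, words, syllables
    "CLI": textstat.coleman_liau_index(lay_summary),             # sentences, words, characters
    "DCRS": textstat.dale_chall_readability_score(lay_summary),  # familiar-word lookup table
}
for metric, value in scores.items():
    print(f"{metric}: {value:.2f}")  # lower values indicate more readable text
```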
The scores given in Table 2 show that the lay summaries of both datasets are consistently more readable than their respective abstracts across all metrics. Although these differences are small in some cases, in line with the findings of previous works (Devaraj et al., 2021), we find them all to be statistically significant by way of Mann–Whitney U tests (p < 0.05). These results indicate that lay summaries are more readable than technical abstracts in terms of both syntactic structure and lexical intelligibility. Additionally, the lay summaries from eLife obtain lower readability scores than those of PLOS across all metrics, confirming our expectation that they are suitable for less technical audiences.4

4 Manual inspection of the summaries from each dataset also supports this.
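The significance testing described above could be reproduced along the following lines; this is a minimal sketch assuming per-document readability scores have already been collected, and the arrays shown are placeholders rather than the paper's data.

```python
from scipy.stats import mannwhitneyu

# Placeholder per-document FKGL scores (one value per abstract / lay summary).
abstract_fkgl = [15.2, 14.8, 16.1, 15.5, 14.9]
lay_summary_fkgl = [11.0, 10.4, 11.8, 10.9, 11.2]

# Two-sided Mann-Whitney U test comparing the two score distributions.
statistic, p_value = mannwhitneyu(abstract_fkgl, lay_summary_fkgl, alternative="two-sided")
print(f"U = {statistic}, p = {p_value:.4f}")  # a difference is treated as significant if p < 0.05
```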
4.2 Rhetorical Structure

Rhetoric is another important factor when assessing the comprehensibility of a text. Specifically, a lay person will require a much larger focus on the background of a scientific article than an expert in order to understand the significance of its findings (King et al., 2017), thus we would expect lay summaries to focus more on such aspects.

Table 3: Mean percentage of each rhetorical label within our abstracts and lay summaries.

               Abstract             Lay Summary
Label         PLOS     eLife       PLOS     eLife
Background   35.40     41.05      58.11     55.03
Objective     0.76      1.06       0.54      0.47
Methods      10.26      6.73       6.24      6.23
Results      34.75     30.60      17.86     18.23
Conclusions  18.83     20.55      17.26     18.83
To provide further insight into the structural differences between abstracts and lay summaries, we classify all sentences within each based on their rhetorical status. To do this, we make use of PubMed RCT (Dernoncourt and Lee, 2017), a dataset containing 20,000 biomedical abstracts retrieved from PubMed, with each sentence labelled according to its rhetorical role (roles: Background, Objective, Methods, Results, Conclusions). We use PubMed RCT to train the BERT-based sequential classifier introduced by Cohan et al. (2019) due to its strong reported performance (92.9 micro F1-score), before applying this model to lay summary and abstract sentences from our datasets.
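Given per-sentence predictions from such a classifier (treated here as a black box), the per-document label distributions underlying Table 3 can be aggregated as in the minimal sketch below; the input structure and example predictions are illustrative placeholders, not outputs of the actual model.

```python
from collections import Counter

LABELS = ["Background", "Objective", "Methods", "Results", "Conclusions"]

def label_distribution(sentence_labels):
    """Percentage of each rhetorical label within a single summary or abstract."""
    counts = Counter(sentence_labels)
    return {label: 100.0 * counts[label] / len(sentence_labels) for label in LABELS}

def mean_distribution(documents):
    """Average the per-document percentages across a collection of documents."""
    per_doc = [label_distribution(labels) for labels in documents]
    return {label: sum(d[label] for d in per_doc) / len(per_doc) for label in LABELS}

# Placeholder predictions for two lay summaries (one label per sentence).
predicted = [
    ["Background", "Background", "Results", "Conclusions"],
    ["Background", "Methods", "Results", "Conclusions"],
]
print(mean_distribution(predicted))
```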
Figure 2: Barplot visualising the rhetorical class distributions in our abstracts and lay summaries.

Figure 2 provides a visualisation of how the fre-