gon (e.g., “SARS-CoV-2” → “the virus that causes COVID-19”) and focusing on background informa-
tion allows a reader to better understand a complex
scientific topic. However, in addition to placing an extra burden on authors, lay summaries are not yet ubiquitous and are typically only available for newly published articles.
Automatic text summarisation can provide sig-
nificant value in the generation of scientific lay
summaries. Although previous applications of summarisation techniques to scientific articles have largely focused on generating a technical summary (e.g., the abstract), a few works have addressed the task of lay summarisation and introduced datasets to facilitate its study (Chandrasekaran et al., 2020; Guo et al., 2021; Zaman et al., 2020). However, compared to the datasets ordinarily used to train supervised summarisation models, these resources are relatively small (ranging from 572 to 6,695 articles), presenting a significant barrier to data-driven approaches that require training on large amounts of parallel data. Furthermore,
these resources are somewhat fragmented in terms
of their framing of the task, making use of article
and summary formats that limit their applicabil-
ity to broader biomedical literature. These factors
hinder the progress of the field and the development of practical models that can make scientific content accessible to a wider audience.
To help alleviate these issues, we introduce two
new datasets derived from different academic jour-
nals within the biomedical domain: PLOS and
eLife (§3). Both datasets use the full journal arti-
cle as the source, enabling the training of models
which can be broadly applied to wider literature.
PLOS is significantly larger than currently avail-
able datasets and makes use of short author-written
lay summaries (150-200 words), whereas eLife’s
summaries are approximately twice as long and
written by expert editors who are well-practised
in the simplification of scientific content. Given
these differences in authorship and length, we ex-
pect the lay summaries of eLife to simplify content
to a greater extent, meaning our datasets are able to
cater to different audiences and applications (e.g.,
personalised lay summarisation). We confirm this
via an in-depth characterisation of the lay sum-
maries within each dataset, quantifying ways in
which they differ from the technical abstract and
from each other (§4). Finally, we benchmark our
datasets with popular summarisation approaches
using automatic metrics and conduct an expert-
based manual evaluation, highlighting the utility
of our datasets and key challenges for the task of
lay summarisation (§5). The paper also includes a review of related work (§2), our conclusions (§6), and a discussion of limitations (§7).
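As a concrete illustration of the kind of characterisation referred to above (§4), the sketch below compares average length and Flesch-Kincaid grade level between technical abstracts and lay summaries. It is a minimal example only: the field names ("abstract", "lay_summary") and the use of the third-party textstat package are assumptions made for illustration, not a description of our actual analysis pipeline.

```python
# Minimal sketch (illustrative only): quantify how lay summaries differ from
# abstracts in length and readability. Assumes each article is a dict with
# "abstract" and "lay_summary" fields -- these names are hypothetical.
import textstat  # third-party readability library (pip install textstat)


def characterise(articles):
    """Average word count and Flesch-Kincaid grade for each text type."""
    stats = {}
    for field in ("abstract", "lay_summary"):
        texts = [article[field] for article in articles]
        stats[field] = {
            "avg_words": sum(len(t.split()) for t in texts) / len(texts),
            "avg_fk_grade": sum(textstat.flesch_kincaid_grade(t)
                                for t in texts) / len(texts),
        }
    return stats


# Toy usage with a single dummy article:
articles = [{
    "abstract": "We characterise SARS-CoV-2 spike glycoprotein binding affinity.",
    "lay_summary": "We studied how the virus that causes COVID-19 attaches to cells.",
}]
print(characterise(articles))
```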
2 Related Work
Past attempts to automatically summarise scientific
content in layman’s terms have been scarce, with
the most prominent example being the LaySumm
subtask of the CL-SciSumm 2020 shared task se-
ries (Chandrasekaran et al., 2020), which attracted a
total of 8 submissions. Alongside the task, a train-
ing corpus of 572 articles and author-generated lay
summaries from a multi-disciplinary collection of
Elsevier-published scientific journals was provided,
with submissions being evaluated on a blind test set
of 37 articles. The task organisers noted that the data provided was insufficient to train a model to produce realistic lay summaries.
Guo et al. (2021) also make use of a sin-
gle publication source to retrieve lay summaries:
The Cochrane Database of Systematic Reviews
(CDSR). Their dataset contains the abstracts of
6,695 systematic reviews paired with their respec-
tive plain-language summaries, covering various
healthcare domains. Although larger than other
available datasets for lay summarisation, CDSR
is constrained in that it only uses the abstracts of
systematic reviews as source documents, and thus
models trained on CDSR are unlikely to generalise well to inputs longer than an abstract, or to the abstracts of other types of publication.
Alternatively, Zaman et al. (2020) introduce a
dataset derived from the ‘Eureka-Alert’ science
news website for the combined tasks of simplifi-
cation and summarisation. Summaries consist of
news articles (average length > 600 words) that aim
to describe the content of a scientific publication
to the non-expert. However, the considerable length of the reference summaries is likely to present additional challenges for model training, and their news-based format limits their applicability (e.g., for automating lay summarisation for journals).
Compared to previous resources, our datasets
contain articles and lay summaries in a format that we consider more broadly applicable
to wider literature. Additionally, PLOS is signif-
icantly larger than those currently available (over 4× larger than CDSR) and eLife contains sum-