both professional and general public groups22. In addition, Pattisapu et al. have expert annotators identify highly similar pairs from scientific articles and the corresponding health blogs describing them23. Though human filtering makes the pairs in both of these datasets much closer to semantically identical, at fewer than 1,000 pairs each they are too small for training and less than ideal even for evaluation24. Sakakini et al. manually translate a somewhat larger set (4,554) of instructions for patients from clinical notes25. However, this corpus covers a very specific case within the clinical domain, which itself constitutes a separate sublanguage from biomedical literature26.
Since recent models can handle larger paragraphs, comparable corpora have also been suggested as training or benchmark
datasets for adapting biomedical text. These corpora consist of pairs of paragraphs or documents that are on the same topic
and make roughly the same points, but are not sentence-aligned. Devaraj et al. present a paragraph-level corpus derived from Cochrane review abstracts and their Plain Language Summaries, using heuristics to combine subsections with similar content across the pairs. However, these heuristics do not guarantee identical content27. This dataset is also not sentence-aligned, which limits the architectures that can take advantage of it and restricts documents to those with no more than 1,024 tokens. Other datasets include comparable corpora or are created at the paragraph level and omit relevant details from the original article27. To the best of our knowledge, no datasets provide manual, sentence-level adaptations of scientific abstracts28. Thus, there is still a need for a high-quality, sentence-level gold standard dataset for the adaptation of general biomedical text.
To address this need, we have developed the Plain Language Adaptation of Biomedical Abstracts (PLABA) dataset. PLABA
contains 750 abstracts from PubMed (10 on each of 75 topics) and expert-created adaptations at the sentence-level. Annotators
were chosen from the NLM and an external company and given abstracts within their respective expertise to adapt. Human
adaptation allows us to ensure the parallel nature of the corpus down to sentence-level granularity, while still using the surrounding context of the entire document to guide each translation. We deliberately construct this dataset so it can serve as a
gold standard on several levels:
1. Document-level simplification. Documents are simplified in their entirety, each by at least one annotator, who is instructed to carry over all content relevant for general public understanding of the professional document. This allows the corpus to be used as a gold standard for systems that operate at the document level.
2. Sentence-level simplification. Unlike automatic alignments, these pairings are guaranteed to be parallel for the purpose of simplification. Semantically, they differ only in (1) content removed from the professional register because the annotator deemed it unimportant for general public understanding, and (2) explanation or elaboration added to the general public register to aid understanding. Since annotators were instructed to keep content within sentence boundaries (or within split sentences), there are no fragments of other thoughts spilling over from neighboring sentences on either side of the pair.
3. Sentence-level operations and splitting. Though rare in translation between languages, sentence-level operations (e.g., merging, deletion, and splitting) are common in simplification29. Splitting is often used to simplify syntax and reduce sentence length. Occasionally, sentences may be dropped from the general public register altogether (deletion). For consistency and simplicity of annotation, we do not allow merging, creating a one-to-many relationship at the sentence level (see the sketch following this list).
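To make the one-to-many relationship concrete, the sketch below shows one way a sentence-aligned abstract could be represented. The field layout and example sentences are illustrative assumptions only and do not reflect the released PLABA file format.

```python
# Illustrative sketch of the sentence-level alignment described above.
# The structure and example text are hypothetical, not the released PLABA schema.
from typing import Dict, List

# Each source sentence maps to a list of adapted sentences:
#   []              -> deletion (dropped from the general public register)
#   ["..."]         -> one-to-one simplification
#   ["...", "..."]  -> splitting (one-to-many); merging is not allowed
AlignedAbstract = Dict[str, List[str]]

example: AlignedAbstract = {
    "The intervention significantly reduced HbA1c, a marker of glycemic control.": [
        "The treatment clearly lowered HbA1c.",
        "HbA1c is a blood test that shows average blood sugar over time.",
    ],
    "Trial registration: NCT00000000.": [],  # deemed unimportant for lay readers
}

# Because merging is not allowed, every adapted sentence traces back to exactly
# one source sentence, so the same alignment supports both document-level and
# sentence-level evaluation.
```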
The PLABA dataset should further enable the development of systems that automatically adapt relevant medical texts for
patients without prior medical knowledge. In addition to releasing PLABA, we have evaluated state-of-the-art deep learning
approaches on this dataset to set benchmarks for future researchers.
Methods
The PLABA dataset includes 75 health-related questions asked by MedlinePlus users, 750 PubMed abstracts from relevant
scientific articles, and corresponding human-created adaptations of the abstracts. The questions in PLABA are among the most popular topics from MedlinePlus, ranging from COVID-19 symptoms to genetic conditions like cystic fibrosis1.
To gather the PubMed abstracts in PLABA, we first filtered questions from MedlinePlus logs based on the frequency
of general public queries. Then, a medical informatics expert verified the relevance of and lack of accessible resources to
answer each question and chose 75 questions total. For each question, the expert coded its focus (COVID-19, cystic fibrosis,
compression devices, etc.) and question type (general information, treatment, prognosis, etc.) to use as keywords in a PubMed
search30. Then, the expert selected 10 abstracts from PubMed retrieval results that appropriately addressed the topic of the question, as seen in Figure 1.
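As a rough illustration of this retrieval step, the sketch below queries PubMed through the NCBI E-utilities esearch endpoint using a question's coded focus and type as keywords. In PLABA itself the search and abstract selection were performed manually by the expert; the function name and the way keywords are combined here are assumptions for illustration.

```python
# Hypothetical sketch of retrieving candidate abstracts for one coded question.
# In PLABA, abstract selection was performed manually by a medical informatics expert.
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def search_pubmed(focus: str, question_type: str, retmax: int = 50) -> list:
    """Return PubMed IDs matching the question's focus and type keywords."""
    params = {
        "db": "pubmed",
        "term": f"{focus} AND {question_type}",  # e.g. "cystic fibrosis AND treatment"
        "retmax": retmax,
        "retmode": "json",
    }
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

# The returned candidates would then be reviewed by the expert,
# who keeps the 10 abstracts that best address the question.
```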
To create the adaptations for each abstract in PLABA, medical informatics experts worked with the source abstracts separated into individual sentences, producing corresponding adaptations across all 75 questions. Adaptation guidelines