A dataset for plain language adaptation of
biomedical abstracts
Kush Attal1,*, Brian Ondov1, and Dina Demner-Fushman1
1
Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of
Health, Bethesda, MD, USA
*corresponding author: Kush Attal (Kush.Attal@nih.gov)
ABSTRACT
Though exponentially growing health-related literature has been made available to a broad audience online, the language
of scientific articles can be difficult for the general public to understand. Therefore, adapting this expert-level language into
plain language versions is necessary for the public to reliably comprehend the vast health-related literature. Deep Learning
algorithms for automatic adaptation are a possible solution; however, gold standard datasets are needed for proper evaluation.
Proposed datasets thus far consist of either pairs of comparable professional- and general public-facing documents or pairs of
semantically similar sentences mined from such documents. This leads to a trade-off between imperfect alignments and small
test sets. To address this issue, we created the Plain Language Adaptation of Biomedical Abstracts dataset. This dataset is the
first manually adapted dataset that is both document- and sentence-aligned. The dataset contains 750 adapted abstracts,
totaling 7643 sentence pairs. Along with describing the dataset, we benchmark automatic adaptation on the dataset with
state-of-the-art Deep Learning approaches, setting baselines for future research.
Background & Summary
While reliable resources for health information conveyed in a plain language format exist, such as the MedlinePlus website from
the National Library of Medicine (NLM) [1], these resources do not provide all the necessary information for every health-related
situation or rapidly changing state of knowledge arising from novel scientific investigations or global events like pandemics. In
addition, the language used in other health-related articles can be too difficult for patients and the general public to comprehend [2],
which has a major impact on health outcomes [3]. While work in simplifying text exists, the unique language of biomedical text
warrants a distinct subtask similar to machine translation, termed adaptation [4]. Adapting natural language involves creating a
simplified version that maintains the most important details from a complex source. Adaptations are a common tool for teachers
to use to improve comprehension of content for English language learners [5].
A standard internet search will return multiple scientific articles that correspond to a patient’s query; however, without
extensive clinical and/or biological knowledge, the user may not be able to comprehend the scientific language and content [6].
There are articles with verified, plain language summaries for health information, such as the plain language
summaries that the medical health organization Cochrane creates for its articles [7]. However, creating manual summaries and adaptations
for every article addressing every user’s queries is not possible. Thus, an automatic adaptation generated for material responding
to a user’s query is very relevant, especially for patients without clinical knowledge.
Though plain language thesauri and other knowledge bases have enabled rule-based systems that substitute difficult terms
for more common ones, human editing is needed to account for grammar, context, and ambiguity [8]. Deep Learning may offer
a solution for fully automated adaptation. Advances in architectures, hardware, and available data have led neural methods
to achieve state-of-the-art results in many linguistic tasks, including Machine Translation [9] and Text Simplification [10]. Neural
methods, however, require large numbers of training examples, as well as benchmark datasets to allow iterative progress [11].
Parallel datasets for Text Simplification have been assembled by searching for semantically similar sentences across
comparable document pairs, for example articles on the same subject in both Wikipedia and Simple English Wikipedia (or
Vikidia, an encyclopedia for children in several languages) [12-15]. Since Wikipedia contains some articles on biomedical topics,
it has been proposed to extract subsets of these datasets for use in this domain [16-19]. However, since these sentence pairs exist
in different contexts, they are often not semantically identical, having undergone sentence-level operations like splitting or
merging. Sentence pairs pulled out of context may also use anaphora on one side of a pair but not the other. This can confuse
models during training and demand impossible replacements during testing. Further, Simple English Wikipedia often still
contains complex medical terms on the simple side [16,20,21]. Parallel sentences have also been mined from dedicated biomedical
sources. Cao et al. have expert annotators pinpoint highly similar passages, usually consisting of one or two sentences from
each passage, from the Merck Manuals, an online website containing numerous articles on medical and health topics created for
both professional and general public groups [22]. In addition, Pattisapu et al. have expert annotators identify highly similar pairs
from scientific articles and corresponding health blogs describing them [23]. Though human filtering makes the pairs in both these
datasets much closer to being semantically identical, at less than 1,000 pairs each, they are too small for training and less
than ideal for evaluation [24]. Sakakini et al. manually translate a somewhat larger set (4,554) of instructions for patients from
clinical notes [25]. However, this corpus covers a very specific case within the clinical domain, which itself constitutes a separate
sublanguage from biomedical literature [26].
arXiv:2210.12242v1 [cs.CL] 21 Oct 2022
Since recent models can handle larger paragraphs, comparable corpora have also been suggested as training or benchmark
datasets for adapting biomedical text. These corpora consist of pairs of paragraphs or documents that are on the same topic
and make roughly the same points, but are not sentence-aligned. Devaraj et al. present a paragraph-level corpus derived from
Cochrane review abstracts and their Plain Language Summaries, using heuristics to combine subsections with similar content
across the pairs. However, these heuristics do not guarantee identical content [27]. This dataset is also not sentence-aligned,
which limits the architectures that can take advantage of it and restricts documents to those with no more than
1024 tokens. Other datasets include comparable corpora or are created at the paragraph level and omit relevant details from
the original article [27]. To the best of our knowledge, no datasets provide manual, sentence-level adaptations of scientific
abstracts [28]. Thus, there is still a need for a high-quality, sentence-level gold standard dataset for the adaptation of general
biomedical text.
To address this need, we have developed the Plain Language Adaptation of Biomedical Abstracts (PLABA) dataset. PLABA
contains 750 abstracts from PubMed (10 on each of 75 topics) and expert-created adaptations at the sentence level. Annotators
were chosen from the NLM and an external company and given abstracts within their respective expertise to adapt. Human
adaptation allows us to ensure the parallel nature of the corpus down to sentence-level granularity, while still using the
surrounding context of the entire document to guide each adaptation. We deliberately constructed this dataset so it can serve as a
gold standard on several levels:
1. Document-level simplification. Documents are simplified in total, each by at least one annotator, who is instructed to
carry over all content relevant for general public understanding of the professional document. This allows the corpus to
be used as a gold standard for systems that operate at the document level.
2. Sentence-level simplification. Unlike automatic alignments, these pairings are ensured to be parallel for the purpose
of simplification. Semantically, they will differ only in (1) content removed from the professional register because the
annotator deemed it unimportant for general public understanding, and (2) explanation or elaboration added to the general
public register to aid understanding. Since annotators were instructed to keep content within sentence boundaries (or in
split sentences), there are no issues with fragments of other thoughts spilling over from neighboring sentences on one side
of the pair.
3. Sentence-level operations and splitting. Though rare in translation between languages, sentence-level operations (e.g.
merging, deletion, and splitting) are common in simplification [29]. Splitting is often used to simplify syntax and reduce
sentence length. Occasionally sentences may be dropped from the general public register altogether (deletion). For
consistency and simplicity of annotation, we do not allow merging, creating a one-to-many relationship at the sentence
level.
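The one-to-many relationship described above can be represented as a simple mapping from each source-sentence index to the adapted sentences derived from it. A minimal Python sketch with invented example sentences (not taken from the dataset):

```python
# Each source-sentence index maps to the list of adapted sentences derived
# from it: an empty list encodes deletion, and a multi-element list encodes
# a split. Example sentences are invented for illustration.
alignment = {
    0: ["Researchers tested a new asthma drug."],           # 1-to-1
    1: ["The drug was given to 120 adults.",
        "Half of them received a placebo."],                # split (1-to-many)
    2: [],                                                  # deleted for lay readers
}

def adapted_document(alignment):
    # Reassemble the adaptation in source order, skipping deleted sentences.
    return [s for i in sorted(alignment) for s in alignment[i]]
```

Because merging is disallowed, every adapted sentence traces back to exactly one source sentence, which keeps the mapping a function of the source index.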
The PLABA dataset should further enable the development of systems that automatically adapt relevant medical texts for
patients without prior medical knowledge. In addition to releasing PLABA, we have evaluated state-of-the-art deep learning
approaches on this dataset to set benchmarks for future researchers.
Methods
The PLABA dataset includes 75 health-related questions asked by MedlinePlus users, 750 PubMed abstracts from relevant
scientific articles, and corresponding human-created adaptations of the abstracts. The questions in PLABA are among the most
popular topics from MedlinePlus, ranging from topics like COVID-19 symptoms to genetic conditions like cystic fibrosis [1].
To gather the PubMed abstracts in PLABA, we first filtered questions from MedlinePlus logs based on the frequency
of general public queries. Then, a medical informatics expert verified the relevance of, and lack of accessible resources to
answer, each question and chose 75 questions total. For each question, the expert coded its focus (COVID-19, cystic fibrosis,
compression devices, etc.) and question type (general information, treatment, prognosis, etc.) to use as keywords in a PubMed
search [30]. Then, the expert selected 10 abstracts from the PubMed retrieval results that appropriately addressed the topic of the
question, as seen in Figure 1.
To create the corresponding adaptations for each abstract in PLABA, medical informatics experts worked with source
abstracts separated into individual sentences to create adaptations across all 75 questions. Adaptation guidelines
allowed annotators to split long source sentences and ignore source sentences that were not relevant to the general public. Each
source sentence corresponds to no, one, or multiple sentences in the adaptation. Creating these adaptations involved syntactic,
lexical, and semantic simplifications, which were developed in the context of the entire abstract. Examples taken from the
dataset can be seen in Table 1. Specific examples of adaptation guidelines are demonstrated in Figure 2 and included:
- Replacing arcane words like "orthosis" with common synonyms like "brace"
- Changing sentence structure from passive voice to active voice
- Omitting or incorporating subheadings at the beginning of sentences (e.g., "Aim:", "Purpose:")
- Splitting long, complex sentences into shorter, simpler sentences
- Omitting confidence intervals and other statistical values
- Carrying over understandable sentences from the source with no changes into the adaptation
- Ignoring sentences that are not relevant to a patient’s understanding of the text
- Resolving anaphora and pronouns with specific nouns
- Explaining complex terms and abbreviations with explanatory clauses when first mentioned
Data Records
We archived the dataset with the Open Science Framework (OSF) at https://osf.io/rnpmf/. The dataset is saved in JSON format and
organized, or "keyed", by question ID. Each question ID maps to a nested JSON object that contains the question text itself,
along with the abstracts and corresponding human adaptations grouped by the PubMed ID
of each abstract. Table 2 shows statistics of the abstracts and adaptations. Additional details regarding the data structure can be
found in the README file in the OSF archive.
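A loading sketch for this kind of nesting is below. The key names ("question", "abstracts", "source", "adaptations") and the toy record are illustrative assumptions, not the archive's actual schema; the README in the OSF archive is authoritative.

```python
import json

# Toy record mirroring the description above: question ID -> question text
# plus abstracts keyed by PubMed ID. Field names are illustrative only.
raw = """
{
  "Q42": {
    "question": "What are the symptoms of cystic fibrosis?",
    "abstracts": {
      "12345678": {
        "source": ["First abstract sentence.", "Second abstract sentence."],
        "adaptations": [["First adapted sentence.", "Second adapted sentence."]]
      }
    }
  }
}
"""
plaba = json.loads(raw)  # for a real file: json.load(open("plaba.json"))

def iter_abstracts(dataset):
    """Yield (question_id, question_text, pubmed_id, record) tuples."""
    for qid, entry in dataset.items():
        for pmid, record in entry["abstracts"].items():
            yield qid, entry["question"], pmid, record
```

Iterating this way flattens the two-level nesting (question ID, then PubMed ID) into one stream of abstract records, which is convenient for building training or evaluation splits.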
Technical Validation
We measured the level of complexity of the texts, the ability to train tools on the data, and how well the main points are preserved in automatic
adaptations produced by models trained on our data. We first introduce the metrics we used to measure text complexity, followed by the metrics used to
measure text similarity and inter-annotator agreement between manually created adaptations. We use the same text similarity
metrics to also compare automatically created adaptations to both the source abstracts and the manually created adaptations.
Evaluation metrics
To measure text readability and compare the abstracts and manually created adaptations, we use the Flesch-Kincaid Grade Level
(FKGL) test [31]. FKGL uses the average number of syllables per word and the average number of words per sentence to calculate
the score. A higher FKGL score for a text indicates a higher reading comprehension level needed to understand the text.
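The FKGL formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. A minimal sketch follows, using a naive vowel-group syllable counter; published implementations use proper syllabification, so scores here are only approximate:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count runs of consecutive vowels (min. one syllable).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fkgl(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Invented examples: long, polysyllabic sentences score higher than short ones.
expert = "Pharmacological intervention demonstrated statistically significant amelioration."
plain = "The drug helped. People got better."
```

Because both terms grow with sentence length and word length, splitting sentences and swapping in shorter synonyms, two of the adaptation guidelines above, directly lower the score.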
In addition, we use BLEU [32], ROUGE [33], and SARI [4,34], commonly used text similarity and simplification metrics, to
measure inter-annotator agreement, compare abstracts to manually created adaptations, and evaluate the automatically created
adaptations. BLEU and ROUGE look at spans of contiguous words (referred to as n-grams in Natural Language Processing,
or NLP) to evaluate a candidate adaptation against a reference adaptation. For instance, BLEU-4 measures how many of
the contiguous sequences from one to four words in length in the candidate adaptation appear in the reference adaptation.
However, BLEU is a measure of precision and penalizes candidates for adding incorrect n-grams. ROUGE is a measure of
recall and penalizes candidate adaptations for missing n-grams. Since neither BLEU nor ROUGE is specifically designed for
simplification, we also use SARI, which incorporates the source sentence in order to weight the various operations involved
in simplification. While n-grams are still used, SARI balances (1) addition operations, in which n-grams of the candidate
adaptation are shared with the reference adaptation but not the source, (2) deletion operations, in which n-grams appear in
the source but neither the reference nor candidate, and (3) keep operations, in which n-grams are shared by all three. We
report BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L (which measures the longest shared sub-sequence between a candidate and
reference), and SARI. All metrics can account for multiple possible reference adaptations.
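The precision/recall contrast between BLEU and ROUGE can be illustrated with unigram overlap. This toy sketch omits BLEU's brevity penalty, ROUGE's stemming options, and multi-reference handling, and the example sentences are invented:

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of contiguous n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=1):
    # BLEU-style: fraction of candidate n-grams found in the reference
    # (with clipped counts), so added incorrect n-grams lower the score.
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(1, sum(cand.values()))

def ngram_recall(candidate, reference, n=1):
    # ROUGE-style: fraction of reference n-grams recovered by the candidate,
    # so missing n-grams lower the score.
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(ref[g], cand[g]) for g in ref)
    return overlap / max(1, sum(ref.values()))

ref = "the drug reduced pain".split()
cand = "the drug reduced pain quickly".split()
# cand adds one extra word: precision drops to 4/5, recall stays at 4/4.
```

SARI builds on the same n-gram machinery but additionally compares both sides against the source sentence to score add, delete, and keep operations separately.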
Text readability
To verify that the human-generated adaptations simplify the source abstracts, we calculated the FKGL readability scores for both
the adaptations and abstracts. FKGL scores were lower for the adaptations compared to the abstracts (p < 0.0001, Kendall’s
tau). It is important to note that FKGL does not measure similarity or content preservation, so additional metrics like BLEU,
ROUGE, and SARI are needed to address this concern.