both professional and general public groups22. In addition, Pattisapu et al. have expert annotators identify highly similar pairs from scientific articles and the corresponding health blogs describing them23. Though human filtering makes the pairs in both of these datasets much closer to semantically identical, at fewer than 1,000 pairs each they are too small for training and less than ideal even for evaluation24. Sakakini et al. manually translate a somewhat larger set (4,554) of instructions for patients from clinical notes25. However, this corpus covers a very specific case within the clinical domain, which itself constitutes a separate sublanguage from biomedical literature26.
Since recent models can handle larger paragraphs, comparable corpora have also been suggested as training or benchmark
datasets for adapting biomedical text. These corpora consist of pairs of paragraphs or documents that are on the same topic
and make roughly the same points, but are not sentence-aligned. Devaraj et al. present a paragraph-level corpus derived from Cochrane review abstracts and their Plain Language Summaries, using heuristics to combine subsections with similar content across the pairs. However, these heuristics do not guarantee identical content27. This dataset is also not sentence-aligned, which limits the architectures that can take advantage of it and restricts documents to those with no more than 1,024 tokens. Other datasets include comparable corpora or are created at the paragraph level and omit relevant details from the original article27. To the best of our knowledge, no datasets provide manual, sentence-level adaptations of scientific abstracts28. Thus, there is still a need for a high-quality, sentence-level gold standard dataset for the adaptation of general biomedical text.
To address this need, we have developed the Plain Language Adaptation of Biomedical Abstracts (PLABA) dataset. PLABA
contains 750 abstracts from PubMed (10 on each of 75 topics) and expert-created adaptations at the sentence-level. Annotators
were chosen from the NLM and an external company and given abstracts within their respective expertise to adapt. Human
adaptation allows us to ensure the parallel nature of the corpus down to sentence-level granularity, while still using the surrounding context of the entire document to guide each translation. We deliberately construct this dataset so it can serve as a
gold standard on several levels:
1. Document-level simplification. Documents are simplified in their entirety, each by at least one annotator, who is instructed to carry over all content relevant for general public understanding of the professional document. This allows the corpus to be used as a gold standard for systems that operate at the document level.
2. Sentence-level simplification. Unlike automatic alignments, these pairings are guaranteed to be parallel for the purpose of simplification. Semantically, they differ only in (1) content removed from the professional register because the annotator deemed it unimportant for general public understanding, and (2) explanation or elaboration added to the general public register to aid understanding. Since annotators were instructed to keep content within sentence boundaries (or within split sentences), there are no fragments of other thoughts spilling over from neighboring sentences on either side of the pair.
3. Sentence-level operations and splitting. Though rare in translation between languages, sentence-level operations (e.g., merging, deletion, and splitting) are common in simplification29. Splitting is often used to simplify syntax and reduce sentence length. Occasionally, sentences may be dropped from the general public register altogether (deletion). For consistency and simplicity of annotation, we do not allow merging, creating a one-to-many relationship at the sentence level (see the sketch following this list).
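To make the one-to-many relationship concrete, the sketch below shows one way a sentence-aligned abstract could be represented. The field layout and example sentences are illustrative assumptions only and do not reflect the released PLABA file format.

```python
# Illustrative sketch of the sentence-level alignment described above.
# The structure and example text are hypothetical, not the released PLABA schema.
from typing import Dict, List

# Each source sentence maps to a list of adapted sentences:
#   []              -> deletion (dropped from the general public register)
#   ["..."]         -> one-to-one simplification
#   ["...", "..."]  -> splitting (one-to-many); merging is not allowed
AlignedAbstract = Dict[str, List[str]]

example: AlignedAbstract = {
    "The intervention significantly reduced HbA1c, a marker of glycemic control.": [
        "The treatment clearly lowered HbA1c.",
        "HbA1c is a blood test that shows average blood sugar over time.",
    ],
    "Trial registration: NCT00000000.": [],  # deemed unimportant for lay readers
}

# Because merging is not allowed, every adapted sentence traces back to exactly
# one source sentence, so the same alignment supports both document-level and
# sentence-level evaluation.
```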
The PLABA dataset should further enable the development of systems that automatically adapt relevant medical texts for
patients without prior medical knowledge. In addition to releasing PLABA, we have evaluated state-of-the-art deep learning
approaches on this dataset to set benchmarks for future researchers.
Methods
The PLABA dataset includes 75 health-related questions asked by MedlinePlus users, 750 PubMed abstracts from relevant
scientific articles, and corresponding human-created adaptations of the abstracts. The questions in PLABA are among the most popular topics from MedlinePlus, ranging from COVID-19 symptoms to genetic conditions like cystic fibrosis1.
To gather the PubMed abstracts in PLABA, we first filtered questions from MedlinePlus logs based on the frequency
of general public queries. Then, a medical informatics expert verified the relevance of and lack of accessible resources to
answer each question and chose 75 questions total. For each question, the expert coded its focus (COVID-19, cystic fibrosis,
compression devices, etc.) and question type (general information, treatment, prognosis, etc.) to use as keywords in a PubMed
search30. Then, the expert selected 10 abstracts from PubMed retrieval results that appropriately addressed the topic of the question, as seen in Figure 1.
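As a rough illustration of this retrieval step, the sketch below queries PubMed through the NCBI E-utilities esearch endpoint using a question's coded focus and type as keywords. In PLABA itself the search and abstract selection were performed manually by the expert; the function name and the way keywords are combined here are assumptions for illustration.

```python
# Hypothetical sketch of retrieving candidate abstracts for one coded question.
# In PLABA, abstract selection was performed manually by a medical informatics expert.
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def search_pubmed(focus: str, question_type: str, retmax: int = 50) -> list:
    """Return PubMed IDs matching the question's focus and type keywords."""
    params = {
        "db": "pubmed",
        "term": f"{focus} AND {question_type}",  # e.g. "cystic fibrosis AND treatment"
        "retmax": retmax,
        "retmode": "json",
    }
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

# The returned candidates would then be reviewed by the expert,
# who keeps the 10 abstracts that best address the question.
```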
To create the adaptations for each abstract in PLABA, medical informatics experts worked with the source abstracts separated into individual sentences, producing corresponding adaptations across all 75 questions. Adaptation guidelines