PHEE: A Dataset for Pharmacovigilance Event Extraction from Text
Zhaoyue Sun1, Jiazheng Li1, Gabriele Pergola1, Byron C. Wallace2
Bino John3, Nigel Greene3, Joseph Kim3 and Yulan He1,4,5
1Department of Computer Science, University of Warwick
2Khoury College of Computer Sciences, Northeastern University
3AstraZeneca
4Department of Informatics, King’s College London
5The Alan Turing Institute
{Zhaoyue.Sun, Jiazheng.Li, Gabriele.Pergola.1}@warwick.ac.uk
{bino.john, nigel.greene, joseph.kim1}@astrazeneca.com
b.wallace@northeastern.edu, yulan.he@kcl.ac.uk
Abstract

The primary goal of drug safety researchers and regulators is to promptly identify adverse drug reactions. Doing so may in turn prevent or reduce harm to patients and ultimately improve public health. Evaluating and monitoring drug safety (i.e., pharmacovigilance) involves analyzing an ever-growing collection of spontaneous reports from health professionals, physicians, and pharmacists, as well as information voluntarily submitted by patients. In this scenario, automating the analysis of such reports has the potential to rapidly identify safety signals. Unfortunately, public resources for developing natural language models for this task are scant. We present PHEE, a novel dataset for pharmacovigilance comprising over 5,000 annotated events from medical case reports and biomedical literature, making it the largest such public dataset to date. We describe the hierarchical event schema designed to provide coarse- and fine-grained information about patients' demographics, treatments and (side) effects. Along with the discussion of the dataset, we present a thorough experimental evaluation of current state-of-the-art approaches for biomedical event extraction, point out their limitations, and highlight open challenges to foster future research in this area.1
1 Introduction

Pharmacovigilance is the pharmaceutical science that entails monitoring and evaluating the safety and efficiency of medicine use, which is vital for improving public health (World Health Organization, 2004). Unexpected adverse drug effects (ADEs) can lead to considerable morbidity and mortality (Lazarou et al., 1998). It has been reported that more than half of ADEs are preventable (Gurwitz et al., 2000). Pharmacovigilance is therefore important for detecting and understanding ADE-related events, as it may inform clinical practice and ultimately mitigate preventable hazards.

1 Our data and code are available at https://github.com/ZhaoyueSun/PHEE

Collecting and maintaining the clinical evidence for pharmacovigilance can be difficult because it requires time-consuming manual curation to capture emerging data about drugs (Thompson et al., 2018). Much of this information can be found in unstructured textual data, including the medical literature, notes in electronic health records (EHRs), and social media posts. Using NLP methods to discover and extract adverse drug events from unstructured text may permit efficient monitoring of such sources (Nikfarjam et al., 2015; Huynh et al., 2016; Ju et al., 2020; Wei et al., 2020).
Past work has introduced pharmacovigilance corpora to support training and evaluation of NLP approaches for ADE extraction. However, most of these datasets (e.g., the ADE corpus; Gurulingappa et al. 2012b) contain annotations only on entities (such as drugs and side effects) and the binary relations between them, as shown in Figure 1(a). This ignores contextual information relating to human subjects, treatments administered, and more complex situations such as multi-drug concomitant use. To address this problem, Thompson et al. (2018) developed the PHAEDRA corpus, which includes annotations not only of drugs and side effects, but also of subjects (humans, specific species, bacteria, and so on) and of events encoding descriptions of drug effects, which involve multiple arguments and event attributes; see Figure 1(b).

Despite these refinements, however, PHAEDRA does not provide detailed, nested annotations such as dosages, conditions, and patient demographic details. This granular information may provide critical context for clinical studies. Furthermore, PHAEDRA consists of only 600 annotated abstracts of medical case reports, making it challenging to train NLP models for pharmacovigilance event extraction, since its annotations are at the document level and the actual annotated events are sparse.

arXiv:2210.12560v1 [cs.CL] 22 Oct 2022
In this work we introduce a new annotated corpus, PHEE, for adverse and potential therapeutic effect event extraction for pharmacovigilance research. The dataset consists of nearly 5,000 sentences extracted from MEDLINE case reports, and each sentence features two levels of annotations. With respect to coarse-grained annotations, each sentence is annotated with the event trigger word/phrase, the event type, and text spans indicating the event's associated subject, treatment, and effect. In a fine-grained annotation pass, further details are marked, such as patient demographic information, contextual information about the treatments (including drug dosage levels, administration routes and frequency), and attributes relating to events. An example annotation is shown in Figure 1(c).
Using PHEE as the benchmark, we conduct thorough experiments to assess state-of-the-art NLP technologies on the pharmacovigilance-related event extraction task. We use sequence labelling and (both extractive and generative) QA-based methods as baselines, and evaluate event trigger extraction and argument extraction. The extractive QA method performs best for trigger extraction, with an exact-match F1 score of 70.09%, while the generative QA method achieves the best exact-match F1 scores of 68.60% and 76.16% for main-argument and sub-argument extraction, respectively. Further analysis shows that current models perform well on average cases but often fail on more complex examples.
Our contributions can be summarised as follows: 1) We introduce PHEE, a new pharmacovigilance dataset containing over 5,000 finely annotated events from public medical case reports. To the best of our knowledge, this is the largest and most comprehensively annotated dataset of this type to date. 2) We collect hierarchical annotations to provide granular information about patients and conditions in addition to coarse-grained event information. 3) We conduct thorough experiments to compare current state-of-the-art approaches for biomedical event extraction, demonstrating the strengths and weaknesses of current technologies, and use this to highlight challenges for future research in this area.
2 Related Work

Pharmacovigilance-Related Corpora

Prior pharmacovigilance-related corpora have mainly focused on the annotation of entities (e.g., drugs, diseases, medications) and binary relations between them, namely drug-ADE relations (Gurulingappa et al., 2012a; Patki et al., 2014; Ginn et al., 2014), disorder-treatment relations (Rosario and Hearst, 2004; Roberts et al., 2009; Uzuner et al., 2011; Van Mulligen et al., 2012), and drug-drug interactions (Segura-Bedmar et al., 2011; Boyce et al., 2012; Rubrichi and Quaglini, 2012; Herrero-Zazo et al., 2013). More recent open challenges, including the 2018 n2c2 shared task (Henry et al., 2020) and the MADE1.0 challenge (Jagannatha et al., 2019), have considered annotating additional relation types, such as drug-attribute and drug-reason relations, but these are still binary relationships.
Thompson et al. (2018) introduced the PHAEDRA corpus, extending drug-ADE annotations to pharmacovigilance events. Compared to corpora that only annotate simple drug-ADE relations (referred to as AE events in PHAEDRA), they further annotate three additional event types: the Potential Therapeutic Effect (PTE) event, which refers to the potential beneficial effects of drugs, and the Combination and Drug-Drug Interaction events, which indicate multiple-drug use and interactions between administered drugs, respectively. In addition, PHAEDRA includes the subject as a type of named entity (NE) and annotates three types of event attributes, i.e., negated, speculated and manner. However, some key informative details are still missing from PHAEDRA. As its NE annotations are usually single nouns or short noun phrases, detailed information about the subject (such as age and gender) and the medication (e.g., dosage and frequency) is not captured.

We set out to annotate a larger corpus with more detailed information to facilitate the training of pharmacovigilance event extraction models. We build on existing corpora (PHAEDRA and ADE). The ADE corpus comprises around 3,000 MEDLINE case reports and annotations on around 4,000 sentences indicating adverse effects, but its annotations only involve drugs, dosages and adverse effects, and lack sufficient event details of interest. The PHAEDRA corpus reuses 227 abstracts from ADE and integrates an additional 370 abstracts (from other corpora and some novel entries). However, the
[Figure 1 content omitted: brat-style annotation views for PMID 6414095, centred on the sentence "A 52-year-old Black woman on phenytoin therapy for post-traumatic epilepsy developed transient hemiparesis contralateral to the injury." Panel (a) shows ADE-corpus Drug/Adverse_effect entity and relation annotations; panel (b) shows PHAEDRA annotations with Subject, Pharmacological_substance, Potential_therapeutic_effect and Adverse_effect entities linked by has_subject, has_agent and affects relations; panel (c) shows PHEE annotations with Subject sub-arguments (Age, Race, Gender), Treatment sub-arguments (Drug, Treat-Disorder), an Adverse_event trigger with a severity cue, and an Effect span.]
Figure 1: Comparison of annotations from (a) the ADE corpus, (b) the PHAEDRA corpus and (c) our developed PHEE corpus.
PHAEDRA corpus is annotated at the document level, so the actual annotated events are very sparse. We collected sentences from ADE, along with those in PHAEDRA carrying AE or PTE event annotations, and enriched them using our proposed annotation scheme.
Biomedical Event Extraction

Most existing biomedical event extraction methods work as "pipelines", treating trigger extraction and argument extraction as two stages (Björne and Salakoski, 2018; Li et al., 2018, 2020a; Huang et al., 2020; Zhu and Zheng, 2020); this can lead to error propagation. Trieu et al. (2020) propose an end-to-end model that jointly extracts triggers/entities and assigns argument roles to mitigate error propagation, but in contrast to our span-based annotation, this requires full annotation of all entities. Ramponi et al. (2020) cast biomedical event extraction as a sequence labelling task, allowing them to jointly model event trigger and argument extraction via multi-task learning.
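The sequence-labelling formulation can be sketched as follows. The BIO tag scheme and role names below are illustrative assumptions for a PHEE-style sentence, not the exact scheme used by Ramponi et al. (2020) or by our baselines:

```python
# Sketch: event extraction as token-level sequence labelling.
# Each token receives one BIO tag that jointly encodes trigger
# and argument-role information (hypothetical tag inventory).

SENTENCE = ["Transient", "hemiparesis", "caused", "by", "phenytoin", "toxicity", "."]
TAGS = ["B-Effect", "I-Effect", "B-Trigger.ADE", "O", "B-Treatment.Drug", "O", "O"]

def decode_bio(tokens, tags):
    """Group contiguous B-/I- tagged tokens into labelled spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], [tok]]          # start a new span
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(tok)              # extend the open span
        else:                                   # "O" or an orphan I- tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(toks)) for label, toks in spans]

print(decode_bio(SENTENCE, TAGS))
# [('Effect', 'Transient hemiparesis'), ('Trigger.ADE', 'caused'), ('Treatment.Drug', 'phenytoin')]
```

A model trained in this formulation predicts one tag per token, so trigger and argument spans are recovered jointly from a single labelled sequence.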
In other domains, recent work has formulated event extraction as a question answering task (Du and Cardie, 2020; Li et al., 2020b; Liu et al., 2020). This paradigm transforms the extraction of event triggers and arguments into multiple rounds of questioning, obtaining an answer about a trigger or an argument in each round. Such methods can reduce the reliance on entity information for argument extraction and have proved to be data efficient. Current QA-based event extraction methods are mainly built on extractive QA, which obtains the answer to a question by predicting the position of the target span in the original text. As such, a separate question needs to be formulated for each event and argument type. We also experiment with a generative QA method, which generates the answers directly, for comparison.
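The multi-round questioning scheme can be sketched as below. The question templates are illustrative, not the exact prompts used by the baselines in this paper; any extractive or generative QA model could be plugged in to answer each question in turn:

```python
# Sketch of the multi-round QA formulation of event extraction:
# round 1 asks for the trigger, each later round asks about one
# argument role, conditioned on the event type.

def build_questions(event_type, argument_roles):
    """Return one question per extraction round (hypothetical templates)."""
    questions = [f"What is the trigger of the {event_type} event?"]
    questions += [f"What is the {role} of the {event_type} event?"
                  for role in argument_roles]
    return questions

qs = build_questions("Adverse_event", ["subject", "treatment", "effect"])
for q in qs:
    print(q)
```

In the extractive setting the model answers each question with a (start, end) span in the source sentence, while in the generative setting it produces the answer string directly.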
3 The PHEE Dataset

3.1 Task Definition and Schema

The PHEE corpus comprises sentences from the biomedical literature annotated with information relevant to pharmacovigilance. Annotations are hierarchically structured in terms of textual events. Following prior work (Thompson et al., 2018), we define two main clinical event types: Adverse Drug Effect (ADE) and Potential Therapeutic Effect (PTE), denoting potentially harmful and beneficial effects of medical therapies, respectively. Events consist of a trigger and several arguments, as defined by the ACE semantic structure (LDC, 2005). The trigger is a word or phrase that best indicates the occurrence of an event (e.g., 'induced', 'developed'), while the arguments specify the information characterizing an event, such as a patient's demographic information, treatments, and (side) effects (Figure 1(c)). We further organise arguments into two hierarchical levels, namely main arguments and sub-arguments. Main arguments are longer text spans that contain the full description of an event aspect (e.g., treatment), while sub-arguments are usually words or short phrases included within main argument spans, highlighting specific details of the argument (e.g., drug, dosage, duration, etc.).

More specifically, in PHEE, event arguments are defined as:
Subject highlights the patients involved in the medical event, with sub-arguments including the age, gender, race, number of patients (labeled as population) and preexisting conditions (labeled as subject.disorder) of the subject.

Treatment describes the therapy administered to the patients, with sub-arguments specifying the drug (and drug combinations), dosage, frequency, route, time elapsed, duration and the target disorder (labeled as treatment.disorder) of the treatment.

Effect indicates the outcome of the treatment.

We also collected annotations indicating three types of attributes, characterizing whether an event is negated or speculated, or whether its severity is indicated. See more details about the schema in Appendix A.
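One way to picture the hierarchical schema is as nested records. The rendering below is our own illustration: the field names follow the argument and sub-argument names defined above, but the container layout is an assumption, not the official release format of PHEE:

```python
# Illustrative (unofficial) rendering of PHEE's hierarchical event schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Subject:
    span: str                       # main-argument text span
    age: Optional[str] = None       # sub-arguments
    gender: Optional[str] = None
    race: Optional[str] = None
    population: Optional[str] = None
    disorder: Optional[str] = None  # subject.disorder

@dataclass
class Treatment:
    span: str
    drug: Optional[str] = None
    dosage: Optional[str] = None
    frequency: Optional[str] = None
    route: Optional[str] = None
    time_elapsed: Optional[str] = None
    duration: Optional[str] = None
    disorder: Optional[str] = None  # treatment.disorder

@dataclass
class Event:
    event_type: str                 # "ADE" or "PTE"
    trigger: str
    subject: Optional[Subject] = None
    treatment: Optional[Treatment] = None
    effect: Optional[str] = None
    negated: bool = False           # event attributes
    speculated: bool = False
    severity: Optional[str] = None

# The Figure 1(c) sentence, expressed in this layout:
ev = Event(event_type="ADE", trigger="developed",
           subject=Subject(span="A 52-year-old Black woman",
                           age="52-year-old", gender="woman", race="Black"),
           treatment=Treatment(span="phenytoin therapy for post-traumatic epilepsy",
                               drug="phenytoin",
                               disorder="post-traumatic epilepsy"),
           effect="transient hemiparesis")
```

The two annotation levels map directly onto this nesting: the coarse pass fills the main-argument spans, and the fine-grained pass fills the sub-argument fields inside them.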
3.2 Data Collection and Validation

Data Collection

To compose the PHEE corpus, we collect existing medical case report abstracts from the ADE (Gurulingappa et al., 2012b) and PHAEDRA (Thompson et al., 2018) datasets. We extract sentences from the abstracts and annotate those containing at least one adverse or potential therapeutic effect (ADE or PTE) event, for a total of over 4.8k sentences after deduplication.
Annotation Process

We hired 15 annotators in total, all PhD students in computer science or the medical domain. Before starting the annotation, we consulted on our annotation schema with pharmacovigilance researchers and biomedical NLP researchers. We conducted the corpus annotation in two stages to reduce the difficulty of dealing with medical text. In the first stage, we provided the annotators with sets of single sentences and asked them to highlight the event triggers and the text spans functioning as main arguments (i.e., subject, treatment and effect). Each annotator annotated about 330 sentences during this stage. In the second stage, we randomly assigned the annotated sentences to different annotators, who were required to verify the correctness of the previous annotations. Once confirmed, the annotations were expanded by specifying the possible sub-arguments (e.g., for subjects: age, gender, population, race, subject.disorder) and attributes (e.g., negation). To ease the cognitive demand of highlighting fine-grained sub-arguments during the second stage, the annotators were split into three groups, each specialising in just one of the three main argument types. Specifically, four annotators were allocated to subject sub-argument annotation and four to effect and attribute annotation, while seven annotators were allocated to treatment sub-argument annotation due to the task's complexity. Each annotator was responsible for around 1.4k or 700 instances during this stage. Additional notes on the annotation process can be found in Appendix B.
Data Validation

To ensure quality annotations, each stage of annotation was preceded by several rounds of annotation trials, after which we discussed frequent inconsistencies. When questions about specific instances surfaced during the annotation process, annotators flagged these sentences for review. While the main annotations of stage one were double-checked by the annotators in stage two, we also randomly duplicated 20% of the stage-two samples and assigned them to different groups to measure Inter-Annotator Agreement (IAA).

We compute the F1 score2 as a measure of agreement between annotators. We calculate F1 scores between the sets of duplicated cases by (arbitrarily) selecting one annotation set as a "reference" for the other. Specifically, we adopted the EM_F1 (span-level) and Token_F1 (token-level) metrics, which are explained in detail in Section 4.2. We report agreement scores in Table 1.
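The span-level agreement computation can be sketched as follows. This is a minimal illustration of exact-match F1 over annotated spans, with one annotator's set arbitrarily treated as the reference; the exact matching criteria of the EM_F1 metric are defined in Section 4.2:

```python
# Sketch: span-level exact-match F1 as an inter-annotator agreement measure.
from collections import Counter

def em_f1(reference_spans, predicted_spans):
    """Exact-match F1 over (type, span) pairs, counting duplicates."""
    ref, pred = Counter(reference_spans), Counter(predicted_spans)
    tp = sum((ref & pred).values())        # multiset intersection = exact matches
    if tp == 0:
        return 0.0
    precision = tp / sum(pred.values())
    recall = tp / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy example: the two annotators agree on the drug span but
# mark different boundaries for the effect span.
ref = [("drug", "phenytoin"), ("effect", "transient hemiparesis")]
pred = [("drug", "phenytoin"), ("effect", "hemiparesis")]
print(round(em_f1(ref, pred), 2))  # 0.5
```

Because the reference is chosen arbitrarily, precision and recall simply swap roles if the two annotation sets are exchanged, so the F1 score itself is symmetric.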
Consistency across trigger and argument types is over 80%, indicating the effectiveness of the two-stage approach. Agreement on sub-arguments is lower, which is expected given the higher complexity of fine-grained medical annotations. In particular, we notice difficulty in achieving consistency on the annotation of duration and time_elapsed. One common type of inconsistency involves "generalized expressions" (e.g., "chronic", "long-term", "shortly after"), which are annotated by some annotators but ignored by others. In addition, annotators easily confuse these two annotation types. For example, the phrase "48 months" in "48 months postchemotherapy" is sometimes mistakenly annotated as duration, when it is generally understood to be time_elapsed. Other sub-argument types with lower consistency include frequency and subject.disorder. For frequency, inconsistent cases include generalized expressions (e.g., "repeated", "continuous") and certain specific expressions such as "0.32mg/kg/day", where some annotators prefer to annotate "0.32mg/kg" as dosage and "/day" as frequency, while others prefer to annotate the whole span as dosage. For subject.disorder, conflicts arise over "neutral" expressions that describe the subject's health condition but are not necessarily disorders, such as "pregnant" and "nondiabetic". Apart from these difficult cases, inconsistency also occurs in the
2 Traditional Cohen's Kappa is not applicable as an IAA measure for span-level computation due to the unknown number of negative cases. We therefore follow previous work (Thompson et al., 2018; Gurulingappa et al., 2012b) in choosing the F1 score as a more relevant IAA measure.