PHEE: A Dataset for Pharmacovigilance Event Extraction from Text
Zhaoyue Sun1, Jiazheng Li1, Gabriele Pergola1, Byron C. Wallace2
Bino John3, Nigel Greene3, Joseph Kim3 and Yulan He1,4,5
1Department of Computer Science, University of Warwick
2Khoury College of Computer Sciences, Northeastern University
3AstraZeneca
4Department of Informatics, King’s College London
5The Alan Turing Institute
{Zhaoyue.Sun, Jiazheng.Li, Gabriele.Pergola.1}@warwick.ac.uk
{bino.john, nigel.greene, joseph.kim1}@astrazeneca.com
b.wallace@northeastern.edu, yulan.he@kcl.ac.uk
Abstract

The primary goal of drug safety researchers and regulators is to promptly identify adverse drug reactions. Doing so may in turn prevent or reduce harm to patients and ultimately improve public health. Evaluating and monitoring drug safety (i.e., pharmacovigilance) involves analyzing an ever-growing collection of spontaneous reports from health professionals, physicians, and pharmacists, as well as information voluntarily submitted by patients. In this scenario, automating the analysis of such reports has the potential to rapidly identify safety signals. Unfortunately, public resources for developing natural language models for this task are scant. We present PHEE, a novel dataset for pharmacovigilance comprising over 5,000 annotated events from medical case reports and biomedical literature, making it the largest such public dataset to date. We describe the hierarchical event schema designed to provide coarse- and fine-grained information about patients' demographics, treatments and (side) effects. Along with the discussion of the dataset, we present a thorough experimental evaluation of current state-of-the-art approaches for biomedical event extraction, point out their limitations, and highlight open challenges to foster future research in this area.1
1 Introduction

Pharmacovigilance is the pharmaceutical science that entails monitoring and evaluating the safety and efficiency of medicine use, which is vital for improving public health (World Health Organization, 2004). Unexpected adverse drug effects (ADEs) can lead to considerable morbidity and mortality (Lazarou et al., 1998). It has been reported that more than half of ADEs are preventable (Gurwitz et al., 2000). Pharmacovigilance is therefore important for detecting and understanding ADE-related events, as it may inform clinical practice and ultimately mitigate preventable hazards.

1 Our data and code are available at https://github.com/ZhaoyueSun/PHEE

Collecting and maintaining the clinical evidence for pharmacovigilance can be difficult because it requires time-consuming manual curation to capture emerging data about drugs (Thompson et al., 2018). Much of this information can be found in unstructured textual data, including the medical literature, notes in electronic health records (EHRs), and social media posts. Using NLP methods to discover and extract adverse drug events from unstructured text may permit efficient monitoring of such sources (Nikfarjam et al., 2015; Huynh et al., 2016; Ju et al., 2020; Wei et al., 2020).
Past work has introduced pharmacovigilance corpora to support training and evaluation of NLP approaches for ADE extraction. However, most of these datasets (e.g., the ADE corpus; Gurulingappa et al. 2012b) contain annotations only on entities (such as drugs and side effects) and the binary relations between them, as shown in Figure 1(a). This ignores contextual information relating to human subjects, treatments administered, and more complex situations such as multi-drug concomitant use. To address this problem, Thompson et al. (2018) developed the PHAEDRA corpus, which includes annotations not only of drugs and side effects, but also of subjects (humans, specific species, bacteria, and so on) and of events encoding descriptions of drug effects, which involve multiple arguments and event attributes; see Figure 1(b).

Despite these refinements, however, PHAEDRA does not provide detailed, nested annotations such as dosages, conditions, and patient demographic details. This granular information may provide critical context for clinical studies. Furthermore, PHAEDRA consists of only 600 annotated abstracts of medical case reports, making it challenging to train NLP models for pharmacovigilance event extraction, since its annotations are at the document level and the actual annotated events are sparse.

arXiv:2210.12560v1 [cs.CL] 22 Oct 2022
In this work we introduce a new annotated corpus, PHEE, for adverse and potential therapeutic effect event extraction for pharmacovigilance research. The dataset consists of nearly 5,000 sentences extracted from MEDLINE case reports, and each sentence features two levels of annotations. With respect to coarse-grained annotations, each sentence is annotated with the event trigger word/phrase, the event type, and text spans indicating the event's associated subject, treatment, and effect. In a fine-grained annotation pass, further details are marked, such as patient demographic information, contextual information about the treatments (including drug dosage levels, administration routes and frequency), and attributes relating to events. An example annotation is shown in Figure 1(c).
Using PHEE as the benchmark, we conduct thorough experiments to assess state-of-the-art NLP technologies on the pharmacovigilance-related event extraction task. We use sequence labelling and (both extractive and generative) QA-based methods as baselines, and evaluate event trigger extraction and argument extraction. The extractive QA method performs best for trigger extraction, with an exact-match F1 score of 70.09%, while the generative QA method achieves the best exact-match F1 scores of 68.60% and 76.16% for main-argument and sub-argument extraction, respectively. Further analysis shows that current models perform well on average cases but often fail on more complex examples.
Our contributions can be summarised as follows: 1) We introduce PHEE, a new pharmacovigilance dataset containing over 5,000 finely annotated events from public medical case reports. To the best of our knowledge, this is the largest and most comprehensively annotated dataset of this type to date. 2) We collect hierarchical annotations to provide granular information about patients and conditions in addition to coarse-grained event information. 3) We conduct thorough experiments to compare current state-of-the-art approaches for biomedical event extraction, demonstrating the strengths and weaknesses of current technologies, and use this to highlight challenges for future research in this area.
2 Related Work

Pharmacovigilance-Related Corpora

Prior pharmacovigilance-related corpora have mainly focused on the annotation of entities (e.g., drugs, diseases, medications) and binary relations between them, namely drug-ADE relations (Gurulingappa et al., 2012a; Patki et al., 2014; Ginn et al., 2014), disorder-treatment relations (Rosario and Hearst, 2004; Roberts et al., 2009; Uzuner et al., 2011; Van Mulligen et al., 2012), and drug-drug interactions (Segura-Bedmar et al., 2011; Boyce et al., 2012; Rubrichi and Quaglini, 2012; Herrero-Zazo et al., 2013). More recent open challenges, including the 2018 n2c2 shared task (Henry et al., 2020) and the MADE1.0 challenge (Jagannatha et al., 2019), have considered annotating additional relation types, such as drug-attribute and drug-reason relations, but these are still binary relationships.
Thompson et al. (2018) introduced the PHAEDRA corpus, extending drug-ADE annotations to pharmacovigilance events. Compared to corpora that only annotate simple drug-ADE relations (referred to as AE events in PHAEDRA), they further annotate three additional event types: the Potential Therapeutic Effect (PTE) event, which refers to the potential beneficial effects of drugs, and the Combination and Drug-Drug Interaction events, which indicate multiple-drug use and interactions between administered drugs, respectively. In addition, PHAEDRA includes the subject as a type of named entity (NE) and annotates three types of event attributes, i.e., negated, speculated and manner. However, some key informative details are still missing from PHAEDRA. As its NE annotations are usually single nouns or short noun phrases, detailed information about the subject (such as age and gender) and the medication (e.g., dosage and frequency) is not captured.

We set out to annotate a larger corpus with more detailed information to facilitate the training of pharmacovigilance event extraction models. We build on existing corpora (PHAEDRA and ADE). The ADE corpus comprises around 3,000 MEDLINE case reports and annotations on around 4,000 sentences indicating adverse effects, but its annotations only involve drugs, dosages and adverse effects, and lack sufficient event details of interest. The PHAEDRA corpus reuses 227 abstracts from ADE and integrates an additional 370 abstracts (from other corpora and some novel entries). However, the
[Figure 1 content omitted: brat-style annotation views for PMID 6414095, centred on the sentence "A 52-year-old Black woman on phenytoin therapy for post-traumatic epilepsy developed transient hemiparesis contralateral to the injury." Panel (a) shows ADE-corpus Drug/Adverse_effect entity and relation annotations; panel (b) shows PHAEDRA annotations with Subject, Pharmacological_substance, Potential_therapeutic_effect and Adverse_effect entities linked by has_subject, has_agent and affects relations; panel (c) shows PHEE annotations with Subject sub-arguments (Age, Race, Gender), Treatment sub-arguments (Drug, Treat-Disorder), an Adverse_event trigger with a severity cue, and an Effect span.]
Figure 1: Comparison of annotations from (a) the ADE corpus, (b) the PHAEDRA corpus and (c) our developed PHEE corpus.
PHAEDRA corpus is annotated at the document level, so the actual annotated events are very sparse. We collected sentences from ADE, along with those in PHAEDRA carrying AE or PTE event annotations, and enriched them using our proposed annotation scheme.
Biomedical Event Extraction

Most existing biomedical event extraction methods work as "pipelines", treating trigger extraction and argument extraction as two stages (Björne and Salakoski, 2018; Li et al., 2018, 2020a; Huang et al., 2020; Zhu and Zheng, 2020); this can lead to error propagation. Trieu et al. (2020) propose an end-to-end model that jointly extracts triggers/entities and assigns argument roles to mitigate error propagation, but in contrast to our span-based annotation, this requires full annotation of all entities. Ramponi et al. (2020) cast biomedical event extraction as a sequence labelling task, allowing them to jointly model event trigger and argument extraction via multi-task learning.
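The sequence-labelling formulation can be sketched as follows. The BIO tag scheme and role names below are illustrative assumptions for a PHEE-style sentence, not the exact scheme used by Ramponi et al. (2020) or by our baselines:

```python
# Sketch: event extraction as token-level sequence labelling.
# Each token receives one BIO tag that jointly encodes trigger
# and argument-role information (hypothetical tag inventory).

SENTENCE = ["Transient", "hemiparesis", "caused", "by", "phenytoin", "toxicity", "."]
TAGS = ["B-Effect", "I-Effect", "B-Trigger.ADE", "O", "B-Treatment.Drug", "O", "O"]

def decode_bio(tokens, tags):
    """Group contiguous B-/I- tagged tokens into labelled spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], [tok]]          # start a new span
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(tok)              # extend the open span
        else:                                   # "O" or an orphan I- tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(toks)) for label, toks in spans]

print(decode_bio(SENTENCE, TAGS))
# [('Effect', 'Transient hemiparesis'), ('Trigger.ADE', 'caused'), ('Treatment.Drug', 'phenytoin')]
```

A model trained in this formulation predicts one tag per token, so trigger and argument spans are recovered jointly from a single labelled sequence.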
In other domains, recent work has formulated event extraction as a question answering task (Du and Cardie, 2020; Li et al., 2020b; Liu et al., 2020). This paradigm transforms the extraction of event triggers and arguments into multiple rounds of questioning, obtaining an answer about a trigger or an argument in each round. Such methods can reduce the reliance on entity information for argument extraction and have proved to be data efficient. Current QA-based event extraction methods are mainly built on extractive QA, which obtains the answer to a question by predicting the position of the target span in the original text. As such, a separate question needs to be formulated for each event and argument type. We also experiment with a generative QA method, which generates the answers directly, for comparison.
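The multi-round questioning scheme can be sketched as below. The question templates are illustrative, not the exact prompts used by the baselines in this paper; any extractive or generative QA model could be plugged in to answer each question in turn:

```python
# Sketch of the multi-round QA formulation of event extraction:
# round 1 asks for the trigger, each later round asks about one
# argument role, conditioned on the event type.

def build_questions(event_type, argument_roles):
    """Return one question per extraction round (hypothetical templates)."""
    questions = [f"What is the trigger of the {event_type} event?"]
    questions += [f"What is the {role} of the {event_type} event?"
                  for role in argument_roles]
    return questions

qs = build_questions("Adverse_event", ["subject", "treatment", "effect"])
for q in qs:
    print(q)
```

In the extractive setting the model answers each question with a (start, end) span in the source sentence, while in the generative setting it produces the answer string directly.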
3 The PHEE Dataset

3.1 Task Definition and Schema

The PHEE corpus comprises sentences from the biomedical literature annotated with information relevant to pharmacovigilance. Annotations are hierarchically structured in terms of textual events. Following prior work (Thompson et al., 2018), we define two main clinical event types: Adverse Drug Effect (ADE) and Potential Therapeutic Effect (PTE), denoting potentially harmful and beneficial effects of medical therapies, respectively. Events consist of a trigger and several arguments, as defined by the ACE semantic structure (LDC, 2005). The trigger is a word or phrase that best indicates the occurrence of an event (e.g., 'induced', 'developed'), while the arguments specify the information characterizing an event, such as a patient's demographic information, treatments, and (side) effects (Figure 1(c)). We further organise arguments into two hierarchical levels, namely main arguments and sub-arguments. Main arguments are longer text spans that contain the full description of an event aspect (e.g., treatment), while sub-arguments are usually words or short phrases included within main argument spans, highlighting specific details of the argument (e.g., drug, dosage, duration, etc.).

More specifically, in PHEE, event arguments are defined as:
Subject highlights the patients involved in the medical event, with sub-arguments including the age, gender, race, number of patients (labeled as population) and preexisting conditions (labeled as subject.disorder) of the subject.

Treatment describes the therapy administered to the patients, with sub-arguments specifying the drug (and drug combinations), dosage, frequency, route, time elapsed, duration and the target disorder (labeled as treatment.disorder) of the treatment.

Effect indicates the outcome of the treatment.

We also collected annotations indicating three types of attributes, characterizing whether an event is negated or speculated, or whether its severity is indicated. See more details about the schema in Appendix A.
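One way to picture the hierarchical schema is as nested records. The rendering below is our own illustration: the field names follow the argument and sub-argument names defined above, but the container layout is an assumption, not the official release format of PHEE:

```python
# Illustrative (unofficial) rendering of PHEE's hierarchical event schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Subject:
    span: str                       # main-argument text span
    age: Optional[str] = None       # sub-arguments
    gender: Optional[str] = None
    race: Optional[str] = None
    population: Optional[str] = None
    disorder: Optional[str] = None  # subject.disorder

@dataclass
class Treatment:
    span: str
    drug: Optional[str] = None
    dosage: Optional[str] = None
    frequency: Optional[str] = None
    route: Optional[str] = None
    time_elapsed: Optional[str] = None
    duration: Optional[str] = None
    disorder: Optional[str] = None  # treatment.disorder

@dataclass
class Event:
    event_type: str                 # "ADE" or "PTE"
    trigger: str
    subject: Optional[Subject] = None
    treatment: Optional[Treatment] = None
    effect: Optional[str] = None
    negated: bool = False           # event attributes
    speculated: bool = False
    severity: Optional[str] = None

# The Figure 1(c) sentence, expressed in this layout:
ev = Event(event_type="ADE", trigger="developed",
           subject=Subject(span="A 52-year-old Black woman",
                           age="52-year-old", gender="woman", race="Black"),
           treatment=Treatment(span="phenytoin therapy for post-traumatic epilepsy",
                               drug="phenytoin",
                               disorder="post-traumatic epilepsy"),
           effect="transient hemiparesis")
```

The two annotation levels map directly onto this nesting: the coarse pass fills the main-argument spans, and the fine-grained pass fills the sub-argument fields inside them.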
3.2 Data Collection and Validation

Data Collection

To compose the PHEE corpus, we collect existing medical case report abstracts from the ADE (Gurulingappa et al., 2012b) and PHAEDRA (Thompson et al., 2018) datasets. We extract sentences from the abstracts and annotate those containing at least one adverse or potential therapeutic effect (ADE or PTE) event, for a total of over 4.8k sentences after deduplication.
Annotation Process

We hired 15 annotators in total, all PhD students in computer science or the medical domain. Before starting the annotation, we consulted on our annotation schema with pharmacovigilance researchers and biomedical NLP researchers. We conducted the corpus annotation in two stages to reduce the difficulty of dealing with medical text. In the first stage, we provided the annotators with sets of single sentences and asked them to highlight the event triggers and the text spans functioning as main arguments (i.e., subject, treatment and effect). Each annotator annotated about 330 sentences during this stage. In the second stage, we randomly assigned the annotated sentences to different annotators, who were required to verify the correctness of the previous annotations. Once confirmed, the annotations were expanded by specifying the possible sub-arguments (e.g., for subjects: age, gender, population, race, subject.disorder) and attributes (e.g., negation). To ease the cognitive demand of highlighting fine-grained sub-arguments during the second stage, the annotators were split into three groups, each specialising in just one of the three main argument types. Specifically, four annotators were allocated to subject sub-argument annotation and four to effect and attribute annotation, while seven annotators were allocated to treatment sub-argument annotation due to the task's complexity. Each annotator was responsible for around 1.4k or 700 instances during this stage. Additional notes on the annotation process can be found in Appendix B.
Data Validation

To ensure quality annotations, each stage of annotation was preceded by several rounds of annotation trials, after which we discussed frequent inconsistencies. When questions about specific instances surfaced during the annotation process, annotators flagged these sentences for review. While the main annotations of stage one were double-checked by the annotators in stage two, we also randomly duplicated 20% of the stage-two samples and assigned them to different groups to measure Inter-Annotator Agreement (IAA).

We compute the F1 score2 as a measure of agreement between annotators. We calculate F1 scores between the sets of duplicated cases by (arbitrarily) selecting one annotation set as a "reference" for the other. Specifically, we adopted the EM_F1 (span-level) and Token_F1 (token-level) metrics, which are explained in detail in Section 4.2. We report agreement scores in Table 1.
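The span-level agreement computation can be sketched as follows. This is a minimal illustration of exact-match F1 over annotated spans, with one annotator's set arbitrarily treated as the reference; the exact matching criteria of the EM_F1 metric are defined in Section 4.2:

```python
# Sketch: span-level exact-match F1 as an inter-annotator agreement measure.
from collections import Counter

def em_f1(reference_spans, predicted_spans):
    """Exact-match F1 over (type, span) pairs, counting duplicates."""
    ref, pred = Counter(reference_spans), Counter(predicted_spans)
    tp = sum((ref & pred).values())        # multiset intersection = exact matches
    if tp == 0:
        return 0.0
    precision = tp / sum(pred.values())
    recall = tp / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy example: the two annotators agree on the drug span but
# mark different boundaries for the effect span.
ref = [("drug", "phenytoin"), ("effect", "transient hemiparesis")]
pred = [("drug", "phenytoin"), ("effect", "hemiparesis")]
print(round(em_f1(ref, pred), 2))  # 0.5
```

Because the reference is chosen arbitrarily, precision and recall simply swap roles if the two annotation sets are exchanged, so the F1 score itself is symmetric.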
Consistency across trigger and argument types is over 80%, indicating the effectiveness of the two-stage approach. Agreement on sub-arguments is lower, which is expected given the higher complexity of fine-grained medical annotations. In particular, we notice difficulty in achieving consistency on the annotation of duration and time_elapsed. One common type of inconsistency involves "generalized expressions" (e.g., "chronic", "long-term", "shortly after"), which are annotated by some annotators but ignored by others. In addition, annotators easily confuse these two annotation types. For example, the phrase "48 months" in "48 months postchemotherapy" is sometimes mistakenly annotated as duration, when it is generally understood to be time_elapsed. Other sub-argument types with lower consistency include frequency and subject.disorder. For frequency, inconsistent cases include generalized expressions (e.g., "repeated", "continuous") and certain specific expressions such as "0.32mg/kg/day", where some annotators prefer to annotate "0.32mg/kg" as dosage and "/day" as frequency, while others prefer to annotate the whole span as dosage. For subject.disorder, conflicts arise over "neutral" expressions that describe the subject's health condition but are not necessarily disorders, such as "pregnant" and "nondiabetic". Apart from these difficult cases, inconsistency also occurs in the
2 Traditional Cohen's Kappa is not applicable as an IAA measure for span-level computation due to the unknown number of negative cases. We therefore follow previous work (Thompson et al., 2018; Gurulingappa et al., 2012b) in choosing the F1 score as a more relevant IAA measure.