RedHOT: A Corpus of Annotated Medical Questions, Experiences, and
Claims on Social Media
Somin Wadhwa† Vivek Khetan‡ Silvio Amir† Byron C. Wallace†
†Northeastern University ‡Accenture AI Labs
{wadhwa.s,s.amir,b.wallace}@northeastern.edu
vivek.a.khetan@accenture.com
Abstract
We present Reddit Health Online Talk (RedHOT), a corpus of 22,000 richly annotated social media posts from Reddit spanning 24 health conditions. Annotations include demarcations of spans corresponding to medical claims, personal experiences, and questions. We collect additional granular annotations on identified claims. Specifically, we mark snippets that describe patient Populations, Interventions, and Outcomes (PIO elements) within these. Using this corpus, we introduce the task of retrieving trustworthy evidence relevant to a given claim made on social media. We propose a new method to automatically derive (noisy) supervision for this task, which we use to train a dense retrieval model; this outperforms baseline models. Manual evaluation of retrieval results performed by medical doctors indicates that while our system performance is promising, there is considerable room for improvement. We release all annotations collected (and scripts to assemble the dataset), and all code necessary to reproduce the results in this paper at: https://sominw.com/redhot.
1 Introduction
Social media platforms such as Reddit provide individuals places to discuss (potentially rare) medical conditions that affect them. This allows people to communicate with others who share in their condition, exchanging information about symptom trajectories, personal experiences, and treatment options. Such communities can provide support (Biyani et al., 2014) and access to information about rare conditions which may otherwise be difficult to find (Glenn, 2015).
However, the largely unvetted nature of social media platforms makes them vulnerable to mis- and disinformation (Swire-Thompson and Lazer, 2019). An illustrative and timely example is the idea that consuming bleach might be a viable treatment for
[Figure 1: Examples of health-related Reddit posts (from r/ibs, r/Psychosis, and r/Costochondritis) annotated for populations, interventions, and outcomes.]
COVID-19,¹ which quickly gained traction on social media. All misinformation can be dangerous, but medical misinformation poses unique risks to public health, especially as individuals increasingly turn to social media to inform personal health decisions (Nobles et al., 2018; Barua et al., 2020).
In this paper, we introduce RedHOT: an annotated dataset of health-related claims, questions, and personal experiences posted to Reddit. This dataset can support development of a wide range of models for processing health-related posts from social media. Unlike existing health-related social media corpora, RedHOT: (a) Covers a broad range of health topics (e.g., not just COVID-19), and, (b) Comprises “natural” claims collected from real health-related fora (along with annotated questions and personal experiences). Furthermore, we have collected granular annotations on claims, demarcating descriptions of the Population (e.g., diabetics), Interventions, and Outcomes, i.e., the PIO elements
¹ https://www.theguardian.com/world/2020/sep/19/bleach-miracle-cure-amazon-covid
arXiv:2210.06331v3 [cs.CL] 7 Feb 2023
(Richardson et al., 1995). Such annotations may permit useful downstream processing: For example, in this work we use them to facilitate retrieval of evidence relevant to a claim.
Specifically, we develop and evaluate a pipeline to automatically identify and contextualize health-related claims on social media, as we anticipate that such a tool might be useful for moderators keen to keep their communities free of potentially harmful misinformation. With this use-case in mind, we propose methods for automatically retrieving trustworthy published scientific evidence relevant to a given claim made on social media, which may in aggregate support or debunk a particular claim.
The contributions of this work are summarized as follows. First, we introduce RedHOT: a new dataset comprising 22,000 health-related Reddit posts across 24 medical conditions annotated for claims, questions, and personal experiences. Claims are additionally annotated with PIO elements. Second, we introduce the task of identifying health-related claims on social media, extracting the associated PIO elements, and then retrieving relevant and trustworthy evidence to support or refute such claims. Third, we propose RedHOT-DER, a Dense Evidence Retriever trained with heuristically derived supervision to retrieve medical literature relevant to health-related claims made on social media. We evaluate baseline models for the first two steps on the RedHOT dataset and assess the retrieval step with relevance judgments collected from domain experts (medical doctors).
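The retrieval step of this pipeline is described above only at a high level. As a rough illustration of the query-formulation idea (a claim concatenated with its extracted PIO elements, embedded, then matched against candidate evidence by vector similarity), the sketch below substitutes a trivial hashed bag-of-words encoder for the trained dense encoder; all function names and the toy encoder are hypothetical, not the RedHOT-DER implementation.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in encoder: hashes tokens into a fixed-size bag-of-words
    # vector and L2-normalizes it. A real system would use a trained
    # dense encoder (e.g., a BERT-style model) here instead.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def retrieve(claim: str, pio: dict, abstracts: list, k: int = 2):
    # Form the query from the claim plus its PIO elements, embed it,
    # and rank the evidence pool by cosine similarity (vectors are
    # already unit-normalized, so the dot product suffices).
    query = " ".join([claim, pio.get("P", ""), pio.get("I", ""), pio.get("O", "")])
    q = embed(query)
    scores = [float(q @ embed(a)) for a in abstracts]
    ranked = sorted(range(len(abstracts)), key=lambda i: -scores[i])
    return [(abstracts[i], scores[i]) for i in ranked[:k]]
```

In practice the evidence pool would be a pre-embedded index over trusted literature (e.g., PubMed abstracts) rather than texts encoded at query time.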
The Reddit posts we have collected are public
and typically made under anonymous pseudonyms,
but nonetheless these are health-related comments
and so inherently sensitive. To respect this, we
(a) notified all users in the dataset of their (poten-
tial) inclusion in this corpus, and provided oppor-
tunity to opt-out, and, (b) we do not release the
data directly, but rather a script to download an-
notated comments, so that individuals may choose
to remove their comments in the future. Further-
more, we consulted with our Institutional Review
Board (IRB) and confirmed that the initial collec-
tion and annotation of such data does not constitute
human subjects research. However, EACL review-
ers rightly pointed out that certain uses of this data
may be sensitive. Therefore, to access the collected
dataset we require researchers to self-attest that
they have obtained prior approval from their own
IRB regarding their intended use of the corpus.
2 The RedHOT Dataset
We have collected and manually annotated health-related posts from Reddit to support development of language technologies which might, e.g., flag potentially problematic claims for moderation. Reddit
is a social media platform that allows users to cre-
ate their own communities (subreddits) focused on
specific topics. Subreddits are often about niche
topics, and this permits in-depth discussion cater-
ing to a long tail of interests and experiences. No-
tably, subreddits exist for most common (and many
rare) medical conditions; we can therefore sample
posts from such communities for annotation.
2.1 Data Annotation
We decomposed data annotation into two stages, performed in sequence. In the first, workers are asked to demarcate spans of text corresponding to a Claim, Personal Experience, or Question. We characterize these classes as follows (we provide detailed annotation instructions in Appendix A):
Claim suggests (explicitly or implicitly) a causal relationship between an Intervention and an Outcome (e.g., “I completely cured my O”). Operationally, we are interested in identifying statements that might reasonably be interpreted by the reader as implying a causal link between an intervention and outcome, as this may in turn influence their perception regarding the efficacy of an intervention for a particular condition and/or outcome (i.e., relationship between an I and O).
Question poses a direct question, e.g., “Is this normal?”; “Should I increase my dosage?”.
Personal Experience describes an individual’s experience, for instance the trajectory of their condition, or experiences with specific interventions.
This is a multi-label scheme: Spans can (and often do) belong to more than one of the above categories. For example, personal experiences can often be read as implying a causal relationship. Consider this example: “My doctor put me on I for my P, and I am no longer experiencing O”. This describes an individual treatment history, but could also be read as implying that I is a viable treatment for P (and specifically for the outcome O). Therefore, we would mark this as both a Claim and a Personal Experience. By contrast, a general statement asserting a causal relationship outside of any personal context like “I can cure O” is what
Example 1
  Reddit post: I’ve seen a bunch of posts on here from people who say that glycopyrrolate suddenly isn’t working anymore for hyperhidrosis. I’m one of those person who has been facing this for a while now. Just wondering if anyone fixed it? Can’t really ask my GP about it since he didn’t even know the meds existed. He just prescribed them for me when I asked for it
  Span labels:
    Claim: I’ve seen a bunch of posts on here from people who say that glycopyrrolate suddenly isn’t working anymore for hyperhidrosis
    Question: Just wondering if anyone fixed it?
  PIO elements: P: hyperhidrosis; I: glycopyrrolate

Example 2
  Reddit post: so i recently read that adderall can trigger a psychotic break & i was prescribed adderall years ago for my adhd but now i just have constant hallucination episodes. anyone else experience adderall induced psychosis?
  Span labels:
    Claim: so i recently read that adderall can trigger a psychotic break
    Personal Experience: i was prescribed adderall years ago for my adhd but now i just have constant hallucination episodes
    Question: anyone else experience adderall induced psychosis?
  PIO elements: P: adhd; I: adderall; O: hallucinations

Example 3
  Reddit post: I’ve had costochondritis for a while, usually comes and goes. Done all the heart/lung checks all clear. I’ve just recovered covid and what I’m left with is chest pain/pressure. I mean it could be a costo flare up which makes sense, but also been reading about myocarditis after covid and I’m worried, how can I tell which is which?
  Span labels:
    Claim: been reading about myocarditis after covid
    Personal Experience: I’m left with is chest pain/pressure
    Question: how can I tell which is which?
  PIO elements: P: costochondritis; I: covid; O: myocarditis, chest pain

Table 1: Example annotations, which include: extracted spans (phase 1), and spans describing Populations, Interventions, and Outcomes — PIO elements — within them (phase 2). We collect the latter only for claims.
we will refer to as a “pure claim”, meaning it exclusively belongs to the Claim category.
In the second stage, workers are asked to further annotate “pure claim” instances by marking spans within them that correspond to the Populations, Interventions/Comparators,² and Outcomes (the PIO elements) associated with the claim.
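The multi-label scheme described above can be made concrete with a small sketch: representing annotations as per-token label sets allows a single span to carry both Claim and Personal Experience labels, as in the treatment-history example. The function and label names below are illustrative only, not the annotation tooling used to build RedHOT.

```python
def token_labels(tokens, spans):
    # spans: list of (start_tok, end_tok_exclusive, label) triples.
    # Overlapping spans simply deposit multiple labels on the same
    # token, which is exactly what the multi-label scheme requires.
    labels = [set() for _ in tokens]
    for start, end, lab in spans:
        for i in range(start, end):
            labels[i].add(lab)
    return labels

# The "My doctor put me on I for my P" example: the whole span is
# annotated as both a Personal Experience and a Claim.
tokens = "My doctor put me on I for my P".split()
spans = [(0, len(tokens), "Experience"), (0, len(tokens), "Claim")]
labs = token_labels(tokens, spans)
```

PIO annotations from the second stage can reuse the same representation, e.g., a `(5, 6, "I")` span marking the intervention token.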
2.2 Crowdsourcing Annotations
We hired crowdworkers to perform the above annotation tasks on Amazon Mechanical Turk (AMT).³ To estimate required annotation time and determine fair pay rates, we ran an internal pilot with two PhD students (both broadly familiar with this research area) on 100 samples.⁴ To gauge quality and recruit workers from AMT, we ran two pilot experiments in which we collected sentence-level annotations on posts sampled from three medical populations (i.e., subreddits), comprising 6,000 posts in all.
We required all workers to have an overall job approval rate of 90%. Based on an initial set of AMT annotations we re-hired only workers who
² This is the standard PICO framework, but we collapse Interventions and Comparators into the Intervention category, as the distinction is arbitrary.
³ We consulted with an Institutional Review Board (IRB) to confirm that this annotation work did not constitute human subjects research.
⁴ Based on the estimate from our pilot experiments, payrate for AMT workers was fixed to US $9 per hour for stage-1 annotations and US $11 per hour for stage-2 annotations, irrespective of geographic location.
              Fleiss κ    P      R      F1
Questions       0.86     0.85   0.82   0.84
Claims          0.69     0.63   0.53   0.58
Experiences     0.71     0.78   0.69   0.73
POP             0.92     0.94   0.91   0.92
INT             0.74     0.76   0.70   0.73
OUT             0.78     0.73   0.68   0.70

Table 2: Token-wise label agreement among experts measured by Fleiss κ on a subset of data. We further compute precision, recall, and F1 scores for “aggregated” labels by evaluating them against unioned “in-house” expert labels.
reliably followed annotation instructions (details
in Appendix A), and we actively recruited the top
workers to continue on with increased pay. We
obtained annotations from at least three workers
for each post, allowing for robust inference of ref-
erence labels. Recruited workers were also paid
periodic bonuses (equivalent to two hours of pay)
based on the quality of their annotated samples.
2.3 Quality Validation
To evaluate annotation quality we calculate token-wise label agreement between annotators, and amongst ourselves. We emphasize here that token-level κ for sequences is quite strict and disagreements often reflect where annotators decide to mark
[Figure 2: Examples portraying potential use cases of our corpus. We showcase three distinct tasks, to be performed in sequence. The first (A) entails extracting spans corresponding to claims (highlighted in bold) from a given Reddit post. The second step (B) is to identify the PICO elements associated with each claim. In the final step (C), we use the outputs of the first two models with the original post to obtain a dense representation, enabling us to retrieve relevant evidence from a large dataset of trusted medical evidence (e.g., PubMed).]
span boundaries. Despite this, for the first stage agreement (Fleiss κ) on labeled questions, experiences, and claims was 0.62, and for the second stage 0.55. We consider this moderately strong agreement, in line with agreement reported for related annotation tasks in the literature (Nye et al., 2018; Deléger et al., 2012). To quantify this and further gauge the quality of collected annotations, we run a few additional analyses.
As previously stated, prior to collecting annotations on Amazon MTurk, we (the authors) annotated a subset of data (100 samples/stage) internally to assess task difficulty and to estimate the time required for annotation. As an additional quality check, we use these annotations to calculate token-wise label agreement. Table 2 reports the results; while there remains some discrepancy owing to the inherent complexity of the task, there is higher agreement between us than between workers.
Each of these samples was also annotated by three workers. We aggregate these labels using majority vote and compute token-wise precision-recall of these aggregated labels against the reference “in-house” labels (Table 2). We report the same metrics per annotator evaluated against aggregated MTurk labels in Table 9 (Appendix B). Despite moderate agreement between annotators, aggregated labels agree comparatively well with the “expert” consensus, indicating that while individual worker annotations are somewhat noisy, aggregated annotations are reasonably robust.
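The quantities used in this section (token-wise Fleiss κ, majority-vote label aggregation, and token-level precision/recall/F1 against reference labels) can be sketched for binary token labels as follows. This is a minimal illustration under the standard Fleiss κ definition, not the paper's evaluation code.

```python
def fleiss_kappa(ratings):
    # ratings: one entry per token; each entry is a list of per-category
    # rater counts, e.g. [2, 1] means 2 raters said "not-claim", 1 said
    # "claim". Assumes a fixed rater count and at least two categories
    # in use (so expected agreement P_e < 1).
    N = len(ratings)
    n = sum(ratings[0])  # raters per token
    k = len(ratings[0])  # number of categories
    p = [sum(r[j] for r in ratings) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in r) - n) / (n * (n - 1)) for r in ratings]
    P_bar = sum(P_i) / N
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

def majority_vote(annotations):
    # annotations: per-annotator binary token-label sequences; a token
    # is kept only if a strict majority of annotators marked it.
    n = len(annotations)
    return [int(sum(col) * 2 > n) for col in zip(*annotations)]

def token_prf(pred, ref):
    # Token-wise precision / recall / F1 of predicted labels against
    # a reference labeling.
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if r and not p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

In the multi-label setting, these computations would be run once per span category (e.g., separately for Claim, Question, and Experience labels), which matches the per-category rows of Table 2.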
2.4 Dataset Details
Table 1 provides illustrative samples from RedHOT and Table 8 provides some descriptive statistics along with examples of included health populations. We broadly characterize populations (conditions) as Very Common, Common, or Rare, and sought a mix of these. This was not the only attribute that informed which conditions we selected for inclusion in our dataset, however. For example, we wanted a mix of populations with respect to volume of online activity (e.g., the Diabetes subreddit has over 60k active visitors; Lupus has 8k). We also wanted to include both chronic and treatable conditions (e.g., Narcolepsy is a rare and chronic condition, while Gout is common and treatable), and mental and physical disorders (e.g., ADHD, Rheumatoid Arthritis). Another consideration was whether a condition can be self-diagnosed or requires professional assessment (e.g., Bulimia is usually self-diagnosable but can potentially be life-threatening; Gastroparesis is chronic but requires a professional medical diagnosis).
The number of claims across different categories
of health populations are far outnumbered by ques-