Multimodal Model with Text and Drug Embeddings for Adverse Drug Reaction Classification

Andrey Sakhovskiy (a, b), Elena Tutubalina (a, c, d)
(a) Kazan Federal University, 18 Kremlyovskaya street, Kazan, Russian Federation, 420008
(b) Lomonosov Moscow State University, 1 Leninskie gory, Moscow, Russian Federation, 119991
(c) Sber AI, 19 Vavilova St., Moscow, Russian Federation, 117997
(d) National Research University Higher School of Economics, 11 Pokrovsky Bulvar, Moscow, Russian Federation, 109028
Abstract
In this paper, we focus on the classification of tweets as sources of potential
signals for adverse drug effects (ADEs) or drug reactions (ADRs). Following the
intuition that text and drug structure representations are complementary, we
introduce a multimodal model with two components. These components are
state-of-the-art BERT-based models for language understanding and molecular
property prediction. Experiments were carried out on multilingual benchmarks of
the Social Media Mining for Health Research and Applications (#SMM4H)
initiative. Our models obtained state-of-the-art results of 0.61 F1-measure and
0.57 F1-measure on #SMM4H 2021 Shared Tasks 1a and 2 in English and Russian,
respectively. On the classification of French tweets from SMM4H 2020 Task 1,
our approach pushes the state of the art by an absolute gain of 8% F1. Our
experiments show that the molecular information obtained from neural networks
is more beneficial for ADE classification than traditional molecular
descriptors. The source code for our models is freely available at
https://github.com/Andoree/smm4h_2021_classification.
Keywords: natural language processing, social media, adverse drug reactions,
text representations, drug representations
Email addresses: andrey.sakhovskiy@gmail.com (Andrey Sakhovskiy),
ElVTutubalina@kpfu.ru (Elena Tutubalina)
Preprint submitted to Journal of Biomedical Informatics, October 25, 2022
arXiv:2210.13238v1 [q-bio.QM] 21 Oct 2022
1. Introduction
The popularity of social media as a source for health-related information has
increased tremendously in the past decade. One of the well-studied research
areas is pharmacovigilance from social media data that focuses on discovering
adverse drug effects (ADEs) from user-generated texts. ADEs¹ are unwanted
negative effects of a drug, in other words, harmful and undesired reactions due
to its intake.
In recent years, researchers have increasingly applied neural networks,
including Bidirectional Encoder Representations from Transformers (BERT) [1],
to ADE detection from texts [2, 3, 4, 5]. This trend is directly related to the
creation of annotated multilingual corpora and the Social Media Mining for
Health (#SMM4H) shared tasks [2, 3]. Given a tweet, participants of these
shared tasks are required to detect whether the tweet contains a mention of an
ADE using natural language processing (NLP) techniques. However, these studies
mostly share the same limitation: the models consider textual information only,
without leveraging drug structure. In particular, [6, 7] utilized
transformer-based classifiers with ensemble modeling and undersampling, ranking
first on SMM4H 2020 Tasks 1a and 1b, respectively. On the other hand, studies
from the cheminformatics and drug discovery areas have focused on predicting
the side effects of a given drug [8, 9, 10, 11]. These studies utilized
supervised models trained and evaluated on a database of marketed drugs and
ADRs, the Side Effect Resource (SIDER) [12], from the MoleculeNet benchmark
[13]. Each chemical structure is encoded with molecular descriptors or neural
representations that are usually fed to a multi-label classification model. In
addition, a number of BERT architectures have been proposed for molecular
property prediction, such as SMILES-BERT [11], MolBERT [14], and ChemBERTa
[15], with fine-tuning on the SIDER dataset.
¹ The terms adverse drug effects (ADEs) and adverse drug reactions (ADRs) are
often used interchangeably.
Inspired by multimodal studies, we propose a novel method that utilizes both
textual and molecular information for ADE classification. We study the impact
of different molecular representation approaches, including traditional
molecular descriptors calculated with Mordred [16] and the BERT-based encoders
ChemBERTa and MolBERT. We explore two strategies to fuse drug representations
and tweet representations: straightforward concatenation of the
representations, and a co-attention mechanism that integrates features of the
different modalities. Along with the textual information, the incorporation of
molecular structure can aid in understanding the relationship between different
pharmacological and chemical properties and the occurrence of ADEs.
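To make the two fusion strategies concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. All dimensions, the random stand-in embeddings, and the helper names (`fuse_concat`, `fuse_cross_attention`) are hypothetical. A simplified one-directional attention step, in which text tokens attend over drug tokens, stands in here for the co-attention mechanism.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    """Random stand-in for encoder outputs or projection weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(matrix):
    """Average the rows of a matrix into a single vector."""
    return [sum(col) / len(matrix) for col in zip(*matrix)]

def fuse_concat(text_vec, drug_vec):
    """Strategy 1: concatenate the two modality vectors end to end."""
    return list(text_vec) + list(drug_vec)

def fuse_cross_attention(text_tokens, drug_tokens, w_q, w_k, w_v):
    """Strategy 2 (simplified): text tokens attend over drug tokens."""
    q = matmul(text_tokens, w_q)                    # (n_text, d)
    k = matmul(drug_tokens, w_k)                    # (n_drug, d)
    v = matmul(drug_tokens, w_v)                    # (n_drug, d)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(c) for c in zip(*k)])  # (n_text, n_drug)
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)                   # (n_text, d)
    return mean_pool(attended), weights

# Hypothetical sizes; real encoders produce much larger vectors.
TEXT_DIM, DRUG_DIM, ATT_DIM = 8, 6, 4
text_tokens = rand_matrix(5, TEXT_DIM)   # stand-in for tweet token embeddings
drug_tokens = rand_matrix(3, DRUG_DIM)   # stand-in for drug token embeddings

fused = fuse_concat(text_tokens[0], mean_pool(drug_tokens))
pooled, weights = fuse_cross_attention(
    text_tokens, drug_tokens,
    rand_matrix(TEXT_DIM, ATT_DIM),
    rand_matrix(DRUG_DIM, ATT_DIM),
    rand_matrix(DRUG_DIM, ATT_DIM),
)
```

In either case the fused vector would be passed to a classification layer. Concatenation keeps the modalities independent until that layer, while attention lets each text token weight the drug features it finds relevant.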
A preliminary version of this work appeared in [17]. Compared to the
conference version, we have: (1) significantly extended the experimental part
of this work to assess the performance of the proposed multimodal model; in
particular, we extended the experiments to three datasets in English, French,
and Russian, and our model achieved state-of-the-art results on SMM4H 2021
Tasks 1a & 2 and SMM4H 2020 Task 1b; (2) extended the description of the
experimental datasets for ADE classification; (3) investigated model
performance on different drug groups, adding new experimental results and
conclusions; (4) performed error analysis and discussed the limitations of our
model.
Section 2 presents related work, Section 3 describes the datasets, Section 4
introduces our approach, Section 5 describes and discusses the experimental
results, and Section 7 concludes the paper.
2. Related Work
2.1. ADE classification
Although a wide range of supervised machine learning methods has been applied
to classify user-generated posts in English, relatively few recent studies have
focused on texts in other languages, such as Russian and French. In the SMM4H
2020 shared task [3], the precision of the best ADE classification systems
varies across languages. For English, precision has stayed in the range of
0.45-0.65, reaching a score of 0.64 with the winning system, while precision
for Russian and French has stayed in the range of 0.34-0.54 and 0.15-0.33,
respectively [2]. Most of these models leverage raw text to assign each text,
or each token in a text, to the ADE class. [6] pre-trained a large version of
RoBERTa [18] on an unlabeled corpus of 6 million tweets in English, where each
tweet includes a drug mention; they fine-tuned this model on SMM4H data without
any class-imbalance techniques. [19] trained a fully-connected network on
pre-computed embeddings obtained using a distilled version of multilingual
sentence BERT to classify French tweets. [7] used an ensemble of BERT-based
models and logistic regression with undersampling to classify Russian tweets.
[4] used an ensemble of ten EnRuDR-BERT models [20] pretrained on 5M
health-related user posts. Overall, the percentage of teams using Transformer
[21] architectures for ADE classification rose from 80% in SMM4H 2020 to 100%
in SMM4H 2021 [3].
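Since the systems above are compared by precision and F1-measure, a short worked example may help. The sketch below assumes that the scores are computed for the positive "ADE" class only, which is an assumption on our side rather than something stated in this section; the toy labels are invented for illustration.

```python
def precision_recall_f1(gold, predicted, positive="ADE"):
    """Precision, recall and F1 for one (positive) class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Invented toy labels: 2 true positives, 1 false positive, 1 false negative.
gold = ["ADE", "non-ADE", "non-ADE", "ADE", "non-ADE", "ADE"]
predicted = ["ADE", "ADE", "non-ADE", "non-ADE", "non-ADE", "ADE"]

precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Because correctly rejected non-ADE tweets do not enter any of the three quantities, a classifier that labels everything as non-ADE scores zero on this metric even when most tweets are non-ADE.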
Our model improves upon the state-of-the-art models in three critical ways:
(1) incorporation of drug representations, (2) extensive experiments on three
languages, and (3) analysis of model performance for various Anatomical
Therapeutic Chemical (ATC) main groups.
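As an illustration of what grouping by ATC main group involves, the sketch below relies on the fact that the first character of an ATC code denotes its anatomical main group. The helper names and the small drug-to-code table are illustrative only and are not taken from the experimental data of this work.

```python
from collections import defaultdict

# Subset of the ATC anatomical main groups, for illustration.
ATC_MAIN_GROUPS = {
    "A": "Alimentary tract and metabolism",
    "C": "Cardiovascular system",
    "N": "Nervous system",
}

def atc_main_group(code):
    """Return the main-group letter (first character) of an ATC code."""
    return code.strip().upper()[0]

def group_by_main_group(drug_to_atc):
    """Map each main-group letter to the drugs whose ATC code falls in it."""
    groups = defaultdict(list)
    for drug, code in drug_to_atc.items():
        groups[atc_main_group(code)].append(drug)
    return dict(groups)

# Illustrative drug-to-ATC table.
drug_to_atc = {
    "paracetamol": "N02BE01",
    "atorvastatin": "C10AA05",
    "metformin": "A10BA02",
    "sertraline": "N06AB06",
}

groups = group_by_main_group(drug_to_atc)
```

Per-group evaluation then amounts to computing the classification metric separately on the tweets whose mentioned drug falls into each group.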
2.2. Multimodal learning on biomedical tasks
A number of studies have proposed multimodal deep learning models that combine
information from multiple modalities. These models show promising results
compared to uni-modal models on protein-protein and drug-drug interaction
identification from scientific texts in English [22, 23], image captioning
[24], and image ad understanding [25]. [22] utilized textual sentence
representations and representations of the molecular structures of active
substances for the classification of drug-drug interactions (DDI). The authors
proposed a network with two components. The first component of the architecture
is a convolutional neural network, where the original text is converted into
embedding representations using the word2vec model, which are then combined
with the positional embeddings of the two drugs in the text. The second
component is a graph convolutional network (GCN). Experiments show a 2.39%
increase in the F-measure on the DDI problem compared to methods that do not
use graph representations of molecules as features. Similar to the DDI task,
[23] investigated multimodal networks on the BioInfer and HRPD50 datasets for
the protein-protein interaction (PPI) task. The authors used BioBERT
pre-trained on large corpora of medical texts (PubMed, PMC) to obtain text
representations. The spatial structure in PDB format or the FASTA sequence of
each protein is available according to the corresponding identifiers (PDB ID,
ensemble ID). To obtain features from FASTA sequences, the nucleotide sequences
are first one-hot encoded and serve as the input for three convolutional
layers. To obtain structural features, the coordinates of the atoms from the
PDB file are converted into an adjacency matrix, and a feature vector is
calculated for each atom. The obtained information is used as the graph
representation of the protein, which is fed to a GCN. The authors used a
transformer layer with attention to combine the three modalities; the output of
this layer serves as the input to the softmax layer for the final
classification. On both datasets, the authors achieve state-of-the-art results,
in both cases improving the previous F-measure result by about 7% in absolute
terms.
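The conversion of atom coordinates into an adjacency matrix can be sketched as follows. This is an illustrative reading and not the procedure of [23]: we assume that two atoms are connected when their Euclidean distance does not exceed a cutoff, and both the cutoff value and the toy coordinates are hypothetical.

```python
import math

def adjacency_from_coordinates(coords, cutoff=2.0):
    """Symmetric 0/1 adjacency matrix: atoms closer than `cutoff` are linked."""
    n = len(coords)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency

# Hypothetical 3D coordinates: three atoms in a chain and one isolated atom.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]

adjacency = adjacency_from_coordinates(coords)
```

The resulting matrix, together with one feature vector per atom, is the kind of input a graph convolutional network consumes.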
3. Data
The ADE classification task involves distinguishing tweets that report an
adverse effect of a medication (annotated as “ADE”) from those that do not
(annotated as “non-ADE”). The aim of the Social Media Mining for Health
Applications (#SMM4H) shared tasks is to take a community-driven approach to
addressing NLP challenges of utilizing social media data for health
informatics, including informal, colloquial expressions of clinical concepts,
noise, data sparsity, ambiguity, and multilingual posts. In 2020, the fifth
iteration of the SMM4H shared tasks included Task 2 on automatic classification
of multilingual tweets that report adverse effects of a medication. This
dataset includes tweets posted in English, French, and Russian [2].
All SMM4H datasets are manually labeled using the same annotation guide-