Multimodal Model with Text and Drug Embeddings for Adverse Drug Reaction Classification



Andrey Sakhovskiy (a, b), Elena Tutubalina (a, c, d)

(a) Kazan Federal University, 18 Kremlyovskaya street, Kazan, Russian Federation, 420008

(b) Lomonosov Moscow State University, 1 Leninskie gory, Moscow, Russian Federation, 119991

(c) Sber AI, 19 Vavilova St., Moscow, Russian Federation, 117997

(d) National Research University Higher School of Economics, 11 Pokrovsky Bulvar, Moscow, Russian Federation, 109028

Abstract

In this paper, we focus on the classification of tweets as sources of potential signals for adverse drug effects (ADEs) or drug reactions (ADRs). Following the intuition that text and drug structure representations are complementary, we introduce a multimodal model with two components. These components are state-of-the-art BERT-based models for language understanding and molecular property prediction. Experiments were carried out on multilingual benchmarks of the Social Media Mining for Health Research and Applications (#SMM4H) initiative. Our models obtained state-of-the-art results of 0.61 F1-measure and 0.57 F1-measure on #SMM4H 2021 Shared Tasks 1a and 2 in English and Russian, respectively. On the classification of French tweets from SMM4H 2020 Task 1, our approach pushes the state of the art by an absolute gain of 8% F1. Our experiments show that the molecular information obtained from neural networks is more beneficial for ADE classification than traditional molecular descriptors. The source code for our models is freely available at https://github.com/Andoree/smm4h_2021_classification.

Keywords: natural language processing, social media, adverse drug reactions, text representations, drug representations

Email addresses: andrey.sakhovskiy@gmail.com (Andrey Sakhovskiy), ElVTutubalina@kpfu.ru (Elena Tutubalina)

Preprint submitted to Journal of Biomedical Informatics October 25, 2022

arXiv:2210.13238v1 [q-bio.QM] 21 Oct 2022

1. Introduction

The popularity of social media as a source for health-related information has increased tremendously in the past decade. One of the well-studied research areas is pharmacovigilance from social media data that focuses on discovering adverse drug effects (ADEs) from user-generated texts. ADEs¹ are unwanted negative effects of a drug, in other words, harmful and undesired reactions due to its intake.

In recent years, researchers have increasingly applied neural networks, including Bidirectional Encoder Representations from Transformers (BERT) [1], to ADE detection from texts [2, 3, 4, 5]. This is directly related to the creation of annotated multilingual corpora and the Social Media Mining for Health (#SMM4H) shared tasks [2, 3]. Given a tweet, participants of this shared task are required to detect whether the tweet contains a mention of an ADE using natural language processing (NLP) techniques. However, these studies mostly share the same limitation: the models consider textual information only, without leveraging drug structure. In particular, [6, 7] utilized transformer-based classifiers with ensemble modeling and undersampling, ranking first on SMM4H 2020 Tasks 1a and 1b, respectively. On the other hand, studies from the cheminformatics and drug discovery areas have focused on the prediction of the side effects of a given drug [8, 9, 10, 11]. These studies utilized supervised models trained and evaluated on a database of marketed drugs and ADRs, the Side Effect Resource (SIDER) [12], from the MoleculeNet benchmark [13]. Each chemical structure is encoded with molecular descriptors or neural representations that are usually fed to a multi-label classification model. In addition, a number of BERT architectures have been proposed for molecular property prediction, such as SMILES-BERT [11], MolBERT [14], and ChemBERTa [15], with fine-tuning on the SIDER dataset.

¹ The terms adverse drug effects (ADEs) and adverse drug reactions (ADRs) are often used interchangeably.

Inspired by multimodal studies, we propose a novel method to utilize both textual and molecular information for ADE classification. We study the impact of using different molecular representation approaches, including traditional molecular descriptors calculated with Mordred [16] and the BERT-based encoders ChemBERTa and MolBERT. We explore two strategies to fuse drug representations and tweet representations: straightforward concatenation of representations and use of a co-attention mechanism to integrate features of different modalities. Along with the textual information, the incorporation of molecular structure can aid in understanding the relationship between different pharmacological and chemical properties and the occurrence of ADEs.
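The two fusion strategies named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`concat_fusion`, `co_attention_fusion`, `attend`, `mean_pool`) are hypothetical, the toy vectors stand in for real encoder outputs, and the co-attention shown here is a simplified scaled dot-product cross-attention that assumes both modalities already share the same embedding dimension (a real model would normally add learned projections).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def concat_fusion(text_vec, drug_vec):
    """Strategy 1: concatenate the pooled tweet vector and the drug vector."""
    return list(text_vec) + list(drug_vec)

def attend(queries, keys):
    """Each query becomes a softmax-weighted average of the other modality's vectors."""
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append([sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)])
    return attended

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def co_attention_fusion(text_tokens, drug_tokens):
    """Strategy 2 (simplified): attend in both directions, pool, then concatenate."""
    text_ctx = attend(text_tokens, drug_tokens)  # tweet tokens look at the drug
    drug_ctx = attend(drug_tokens, text_tokens)  # drug tokens look at the tweet
    return mean_pool(text_ctx) + mean_pool(drug_ctx)

# Toy inputs (assumed values, not real embeddings).
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
drug_tokens = [[0.5, 0.5], [1.0, 0.0]]

fused_concat = concat_fusion(mean_pool(text_tokens), mean_pool(drug_tokens))
fused_coatt = co_attention_fusion(text_tokens, drug_tokens)
```

In either case the fused vector would then be passed to a classification layer. Concatenation keeps the two representations independent, whereas the attention variant lets each modality reweight the other before pooling.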

A preliminary version of this work has appeared in [17]. Compared to the conference version, we have: (1) significantly extended the experimental part of this work to assess the performance of the proposed multimodal model; in particular, we extended the experiments to three datasets in English, French, and Russian, on which our model achieved state-of-the-art results on SMM4H 2021 Tasks 1a & 2 and SMM4H 2020 Task 1b; (2) extended the description of the experimental datasets for ADE classification; (3) investigated model performance on different drug groups, adding new experimental results and conclusions; (4) performed error analysis and discussed the limitations of our model.

Section 2 presents related work, Section 3 describes the datasets, Section 4 introduces our approach, Section 5 describes and discusses experimental results, and Section 7 concludes the paper.

2. Related Work

2.1. ADE classification

Although a wide range of supervised machine learning methods have been applied to classify user-generated posts in English, relatively few recent studies have focused on texts in other languages, such as Russian and French. In the SMM4H 2020 shared task [3], the precision of the best ADE classification systems varies across languages. For English, precision has stayed in the range of 0.45-0.65, reaching a score of 0.64 with the winning system, while precision for Russian and French has stayed in the range of 0.34-0.54 and 0.15-0.33, respectively [2]. Most of these models leverage raw text to assign each text, or each token in a text, to an ADE class. [6] pre-trained a large version of RoBERTa [18] on an unlabeled corpus of 6 million tweets in English, where each tweet includes a drug mention; they fine-tuned this model on SMM4H data without any techniques for handling class imbalance. [19] trained a fully-connected network on pre-computed embeddings obtained using a distilled version of multilingual sentence BERT to classify French tweets. [7] used an ensemble of BERT-based models and logistic regression with undersampling to classify Russian tweets. [4] used an ensemble of ten EnRuDR-BERT models [20] pretrained on 5M health-related user posts. Overall, the percentage of teams using Transformer [21] architectures for ADE classification rose from 80% in SMM4H 2020 to 100% in SMM4H 2021 [3].

Our model improves upon the state-of-the-art models in three critical ways: (1) incorporation of drug representations, (2) extensive experiments on three languages, and (3) analysis of model performance for various Anatomical Therapeutic Chemical (ATC) main groups.

2.2. Multimodal learning on biomedical tasks

A number of studies have proposed multimodal deep learning models to combine information from multiple modalities. These models show promising results compared to uni-modal models on protein-protein and drug-drug interaction identification from scientific texts in English [22, 23], image captioning [24], and image ad understanding [25]. [22] utilized textual sentence representations and representations of the molecular structures of active substances for classification of drug-drug interactions (DDI). The authors proposed a network with two components. The first component of the architecture is a convolutional neural network, where the original text is converted into embedding representations using the word2vec model, which are then combined with the positional embeddings of the two drugs in the text. The second component is a graph convolutional network (GCN). Experiments show a 2.39% increase in the F-measure on the DDI problem compared to methods that do not use graph representations of molecules as features. Similar to the DDI task, [23] investigated multimodal networks on the BioInfer and HPRD50 datasets for the protein-protein interaction (PPI) task. The authors used BioBERT pre-trained on large corpora of medical texts (PubMed, PMC) to obtain text representations. The spatial structure in PDB format or the FASTA sequence of each protein is available according to the corresponding identifiers (PDB ID, ensemble ID). To obtain features from FASTA sequences, the nucleotide sequences are first one-hot encoded and serve as the input for three convolutional layers. To obtain structural features, the coordinates of the atoms from the PDB file are converted into an adjacency matrix, and for each atom, its feature vector is calculated. The obtained information is used as the graph representation of the protein, which is fed to a GCN. The authors used a transformer layer with attention to combine the three modalities; the output of this layer serves as an input to the softmax layer for the final classification. On both datasets, the authors achieve state-of-the-art results, in both cases improving the previous F-measure result by about 7% in absolute terms.
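The two preprocessing steps described for [23], one-hot encoding of a sequence and conversion of atom coordinates into an adjacency matrix, can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how atoms are connected, so the distance cutoff (`cutoff`, with an arbitrary default of 2.0) is a hypothetical rule, and the function names, the alphabet, and the toy coordinates are likewise invented for the example.

```python
import math

def one_hot(sequence, alphabet):
    """Encode each symbol of a sequence as a one-hot row over the given alphabet."""
    index = {symbol: k for k, symbol in enumerate(alphabet)}
    return [
        [1 if index[ch] == k else 0 for k in range(len(alphabet))]
        for ch in sequence
    ]

def adjacency_from_coords(coords, cutoff=2.0):
    """Build a symmetric 0/1 adjacency matrix from 3D atom coordinates.

    Assumption: two atoms are linked when their Euclidean distance is at
    most `cutoff`; the actual rule used in [23] is not given in the text.
    """
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i][j] = 1
                adj[j][i] = 1
    return adj

# Toy inputs (assumed values, not taken from any PDB or FASTA file).
encoded = one_hot("ACGA", "ACGT")
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
adjacency = adjacency_from_coords(atoms)
```

The one-hot matrix would be the input to the convolutional layers, while the adjacency matrix, together with per-atom feature vectors, defines the graph that a GCN consumes.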

3. Data

The ADE classification task involves distinguishing tweets that report an adverse effect of a medication (annotated as “ADE”) from those that do not (annotated as “non-ADE”). The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to addressing NLP challenges of utilizing social media data for health informatics, including informal, colloquial expressions of clinical concepts, noise, data sparsity, ambiguity, and multilingual posts. In 2020, the fifth iteration of the SMM4H shared tasks included Task 2 on automatic classification of multilingual tweets that report adverse effects of a medication. This dataset includes tweets posted in English, French, and Russian [2].

All SMM4H datasets are manually labeled using the same annotation guide-
