InForecaster: Forecasting Influenza Hemagglutinin
Mutations Through the Lens of Anomaly Detection
Ali Garjani1, Atoosa Malemir Chegini1, Mohammadreza Salehi1, Alireza Tabibzadeh2,
Parastoo Yousefi2, Mohammad Hossein Razizadeh2, Moein Esghaei3, Maryam Esghaei2,
and Mohammad Hossein Rohban1,*
1Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
2Department of Virology, School of Medicine, Iran University of Medical Sciences, Tehran, Iran
3Cognitive Neuroscience Laboratory, German Primate Center, Leibniz Institute for Primate Research, Goettingen,
Germany
*Corresponding Author’s email: rohban@sharif.edu
ABSTRACT
The influenza virus hemagglutinin plays an important role in the attachment of the virus to host cells. The hemagglutinin proteins are among the genetic regions of the virus with the highest potential for mutations. Because predicting mutations is important for producing effective and low-cost vaccines, solutions that attempt to approach this problem have recently gained significant attention. In such solutions, a historical record of mutations has been used to train predictive models. However, the imbalance between mutated and preserved proteins is a major challenge for the development of such models and needs to be addressed. Here, we propose to tackle this challenge through anomaly detection (AD). AD is a well-established field in Machine Learning (ML) that tries to distinguish unseen anomalies from normal patterns using only normal training samples. By considering mutations as anomalous behavior, we can benefit from the rich set of solutions that have recently emerged in this field. Such methods also fit the problem setup of extreme imbalance between the numbers of unmutated and mutated training samples. Motivated by this formulation, our method tries to find a compact representation for unmutated samples while forcing anomalies to be separated from the normal ones. This helps the model learn a representation shared among normal training samples as much as possible, which improves the discernibility and detectability of mutated samples from unmutated ones at test time. We conduct a large number of experiments on four publicly available datasets, consisting of three different hemagglutinin protein datasets and one SARS-CoV-2 dataset, and show the effectiveness of our method under different standard criteria.
Introduction
Influenza virus infection mostly presents as a self-limited respiratory infection in immunocompetent people. However, influenza viruses can lead to a life-threatening infection in the elderly and other risk-group patients. Hemagglutinin is a glycoprotein located on the surface of influenza viruses; it acts as a ligand that attaches to host cells and mediates entry of the virus into the cells. To escape immune responses, the virus can alter the antigenic features of the hemagglutinin (HA) protein by point mutations. This phenomenon is known as antigenic drift. Mutations in the genes of influenza viruses can cause antigenic drift by changing the HA protein structure [1-4]. This results in a new strain of the virus that is not effectively recognized by the immune system, which allows the virus to spread easily and cause epidemics. Influenza viruses are classified in the Orthomyxoviridae family. In this family, there are three important human pathogens: Alphainfluenzavirus, Betainfluenzavirus, and Gammainfluenzavirus. Influenza A virus is the only member of Alphainfluenzavirus and is an important human pathogen due to its wide host range and its higher rate of drift and shift mutations [5-7].
HA is the main component mediating virus attachment to the host cell receptor. The globular head domain of HA, which is critical for neutralizing-antibody generation by the host immune system, is one of the genomic locations most prone to mutation. Influenza A viruses are divided into two groups based on the globular head domain: HA types 1, 2, 5, 6, 9, 11, 12, 13, 16, and 18 are placed in one group, while types 3, 4, 7, 10, 14, and 15 are considered members of the second group [8]. Cb, Ca, Sb, and Sa are four important antigenic sites in the H1 domain of HA [9-11]. Amino acid residues 143, 156, 158, 190, 193, and 197 are the most important residues for the evolutionary and antigenic features of HA1 [12]. The role of HA1 mutations in the adequacy of influenza vaccination has made the WHO Collaborating Centers and the Vaccines and Related Biological Products Advisory Committee (VRBPAC) [13] responsible for functional monitoring, reporting, and decisions on new-season vaccines. Despite this delicate process, there are shortages and some mismatches between the vaccine strains and circulating strains [14, 15].
In the current study, we also evaluated SARS-CoV-2 Spike mutations as an additional evaluation and a demonstration for future work in this field. SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is the etiological agent of the COVID-19 (coronavirus disease 2019) pandemic. SARS-CoV-2 is a Betacoronavirus and a member of the sarbecovirus sublineage [16]. The virus genome is a single-stranded RNA (ssRNA) of approximately 30 kb in length. SARS-CoV-2 contains different genes, including ORF1a/b, Spike (S), Envelope (E), Membrane (M), Nucleoprotein (N), and accessory ORFs [17]. The virus binds to cells through attachment of the S protein to the cellular receptor ACE2 (angiotensin-converting enzyme 2) [18]. The S protein is the most important antigenic part of the virus [19].
In recent years, Artificial Intelligence (AI) algorithms have achieved human or even super-human performance on different tasks such as image classification [20], text classification [21], and action recognition [22]. Anomaly Detection (AD) is a sub-domain of AI that is responsible for learning a representation space of normal data and detecting anomalous samples at test time by exploiting the learned representation. Due to various challenges in labeling anomalous samples, such as their high cost or rareness, most methods in this domain use only normal samples for training. This is called unsupervised AD. Alternatively, one may use a very limited number of labeled anomalous samples in the training process, which is called semi-supervised AD [23].
Unsupervised [24-28] and semi-supervised [29, 30] anomaly detection methods have recently achieved satisfactory results in a variety of domains such as image, text, time series, and video. Deep Semi-Supervised Anomaly Detection (DeepSAD) [29], a recently proposed semi-supervised AD method, made clear that semi-supervised anomaly detectors are significantly superior to supervised classification algorithms, specifically when the training dataset is complex and the number of normal samples is much higher than that of anomalous ones. This is because anomaly detectors attempt to find a compact representation space for the normal samples while maximizing the margin between normal and abnormal ones. This helps them learn the most general and unique features of the normal samples, rather than relying overly on the contrast between normal and anomalous samples to classify them.
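To make this mechanism concrete, below is a minimal sketch of the DeepSAD objective in PyTorch, based on our reading of the original formulation [29]; the encoder producing the latent codes, the fixed center c, and the weight eta are illustrative assumptions, not the authors' implementation.

```python
import torch

def deepsad_loss(z, y, c, eta=1.0, eps=1e-6):
    """Minimal sketch of the DeepSAD objective.

    z : (batch, d) latent codes produced by an encoder network.
    y : (batch,) labels: +1 known normal, -1 known anomaly, 0 unlabeled.
    c : (d,) fixed center of the normal region in latent space.
    """
    dist2 = torch.sum((z - c) ** 2, dim=1) + eps  # squared distance to center
    labeled = y != 0
    # Unlabeled samples are treated as normal: minimize distance to c.
    unlabeled_term = dist2[~labeled]
    # Labeled samples: exponent +1 pulls known normals toward c, while
    # exponent -1 penalizes known anomalies that lie close to c.
    labeled_term = eta * dist2[labeled] ** y[labeled].float()
    return torch.cat([unlabeled_term, labeled_term]).mean()
```

Penalizing the inverse squared distance for labeled anomalies is what creates the margin described above: known anomalies are actively pushed away from the compact normal region rather than merely separated by a decision boundary.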
Since in mutation prediction tasks the number of unmutated samples is much higher than that of mutated ones, the problem can be formulated as an anomaly detection task. In this formulation, unmutated and mutated samples are considered normal and anomalous samples, respectively. The benefits of this approach are two-fold. First, a semantically meaningful representation can be learned even with a small number of training samples, which makes generalization to unseen test-time samples possible. Second, as finding and labeling mutated viruses is an expensive and time-consuming process, anomaly detectors can work well with only a limited number of, or even no, mutated training samples [23].
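As an illustration of this formulation, the snippet below (with hypothetical helper and argument names) maps mutation annotations onto the label convention used by semi-supervised detectors such as DeepSAD: unmutated samples become normal (+1), the few known mutated ones become labeled anomalies (-1), and unannotated ones can be left unlabeled (0).

```python
def to_anomaly_labels(mutated_flags):
    """Map per-sample mutation annotations to anomaly detection labels.

    mutated_flags: iterable of True (mutated), False (unmutated),
                   or None (no annotation available).
    """
    labels = []
    for flag in mutated_flags:
        if flag is None:
            labels.append(0)   # unlabeled; treated as normal during training
        elif flag:
            labels.append(-1)  # mutated -> labeled anomaly
        else:
            labels.append(+1)  # unmutated -> normal
    return labels
```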
Motivated by this, we propose the first anomaly detection framework for predicting virus mutations. We use a Long Short-Term Memory (LSTM) [31] neural network in combination with the Deep Semi-Supervised Anomaly Detection (DeepSAD) loss [29] to not only learn long-term input dependencies, but also find a semantic representation space for the mutated and unmutated training samples. Figure 1 shows the overall architecture of the proposed method. We conduct extensive experiments showing the effectiveness of our method in improving the average recall, F1-score, precision, and Area Under the Curve (AUC) on three different publicly available influenza datasets.
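A minimal sketch of such an encoder is given below. It follows the structure of Figure 1 (an LSTM over the per-year embeddings, a temporal attention over the hidden states, and a projection to the encoded vector fed to the DeepSAD loss), but the module names and layer sizes are illustrative assumptions; in particular, the attention here scores each hidden state on its own, whereas the attention in Figure 1 also conditions on the previous cell state.

```python
import torch
import torch.nn as nn

class AttentiveLSTMEncoder(nn.Module):
    """Illustrative LSTM encoder with temporal attention (cf. Figure 1)."""

    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)    # per-step attention scores e_t
        self.proj = nn.Linear(2 * hidden_dim, latent_dim)

    def forward(self, x):                       # x: (batch, T, input_dim)
        h_all, (h_last, _) = self.lstm(x)       # h_all: (batch, T, hidden_dim)
        scores = self.attn(h_all)               # (batch, T, 1)
        weights = torch.softmax(scores, dim=1)  # attention weights w_t
        context = (weights * h_all).sum(dim=1)  # weighted sum of hidden states
        # Combine the context vector with the final hidden state to form the
        # encoded vector that the DeepSAD loss operates on.
        return self.proj(torch.cat([context, h_last[-1]], dim=1))
```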
Background
For the sake of clarity, we discuss some important prerequisites from the deep learning literature in this section. First, recurrent neural network architectures, such as LSTMs [31], are discussed. Then, a brief introduction to anomaly detection methods is presented.
Recurrent Neural Networks (RNN): RNNs are broadly used to model sequential dependencies in data, where the sequence can be formed based on temporal or spatial arrangements. Initial RNN architectures, such as the vanilla RNN, struggle to memorize long-term dependencies. To address this issue, alternative architectures, such as LSTM networks [31], bi-directional RNNs [32], and gated recurrent units (GRUs) [33], have been introduced. All these approaches attempt to summarize the previous inputs into a hidden state that is updated at each time step $t$. This stored information is regulated using learned parameters, or gates. For instance, the LSTM network consists of LSTM cells. Each cell contains a state, $h_t$, and a memory, $s_t$. These two are updated based on three gates, called the input gate, $i_t$, the forget gate, $f_t$, and the output gate, $o_t$. The input gate selects which memory dimensions to modify (Eq. 2). The forget gate decides which memory dimensions should be discarded at the next time step (Eq. 1). The output gate decides which dimensions of the memory should be transferred to the state (Eq. 3). The cell and state vectors are updated based on these gates and on activation values produced through the tanh activation (Eqs. 4, 5). Specifically, the memory consists of the previous memory dimensions that are not forgotten, plus the input activation values selected by the input gate. Finally, the state consists of the memory activation values selected by the output gate.
Note that a sigmoid activation function is used in the gates to map the gate outputs to values between zero and one, which models the selection, i.e., a gate output of 1 represents complete selection of an embedding dimension, and a value of 0 corresponds to complete non-selection. A tanh activation function is used in the cell and state update rules to produce activation values between -1 and 1.

[Figure 1: The overall architecture of our method. First, the raw data is processed and the output $(X_t^1, X_t^2, \dots, X_t^n)$ is prepared at time step $t$, where $n$ is the embedding dimension and $t$ denotes the time. After the pre-processing phase, LSTM cells are used to produce a hidden state $h_i$ for each time point. Then, the attention function takes $h_i$ and the cell state $s_{t-1}$ and outputs $e_t^i$. Next, a softmax function produces the weights $w_t^i$, and the weighted sum of the hidden states $h_i$ is computed with these weights. The output of this weighted sum, $c_t$, and the hidden state $h_t$ are then used to produce the encoded vector $\hat{h}_t$. In the last step, the DeepSAD loss function is applied to $\hat{h}_t$ to decide whether the input is in-class (normal, $y = 1$) or out-of-class (anomaly, $y = -1$).]
$f_t = \sigma(W_f [h_{t-1}; x_t] + b_f)$ (1)
$i_t = \sigma(W_i [h_{t-1}; x_t] + b_i)$ (2)
$o_t = \sigma(W_o [h_{t-1}; x_t] + b_o)$ (3)
$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s [h_{t-1}; x_t] + b_s)$ (4)
$h_t = o_t \odot \tanh(s_t)$ (5)
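As a sanity check on Eqs. (1)-(5), a single LSTM cell update can be written directly in a few lines of NumPy. The dictionaries W and b holding one weight matrix and bias per gate are an illustrative parameterization; each weight matrix acts on the concatenation $[h_{t-1}; x_t]$, and element-wise products implement the gating.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, s_prev, W, b):
    """One LSTM cell update following Eqs. (1)-(5).

    W[k] has shape (hidden_dim, hidden_dim + input_dim) and b[k] has
    shape (hidden_dim,) for k in {"f", "i", "o", "s"}.
    """
    z = np.concatenate([h_prev, x_t])                        # [h_{t-1}; x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])                       # Eq. (1): forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])                       # Eq. (2): input gate
    o_t = sigmoid(W["o"] @ z + b["o"])                       # Eq. (3): output gate
    s_t = f_t * s_prev + i_t * np.tanh(W["s"] @ z + b["s"])  # Eq. (4): memory update
    h_t = o_t * np.tanh(s_t)                                 # Eq. (5): state update
    return h_t, s_t
```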
Despite huge efforts to improve LSTM performance through different architectural variants, no architecture has yet been proposed that is generally better than the original one [34]. Therefore, our proposed method is based on LSTM networks, with some improvements to their ability to maintain long-term information and to their interpretability.
Anomaly Detection: As mentioned before, anomaly detection is a sub-branch of artificial intelligence that seeks to solve one-class classification problems. One-class methods only have access to the labels of one category of a dataset, called the normal class [35]. These methods then seek to design a classifier that can distinguish the normal class from the unseen classes, which are also referred to as anomaly classes. For instance, in mutation prediction problems, the anomaly detection method assumes access to only unmutated samples. This setup may be adopted for reasons such as the large cost of gathering data from both the mutated and unmutated classes, or even the impossibility of gathering all kinds of mutations in the training dataset. Such issues make the standard classification setup ineffective, as the classifier may become biased towards accurate prediction of only the known mutations reflected in the training set. Deep Support Vector Data Description (DSVDD) [24] is one of the basic anomaly detection methods and is trained in an unsupervised manner. It tries to find a latent space and the most compact hyper-sphere in this space that contains the normal training samples. The underlying assumption of DSVDD is that anomalous samples lie outside this hyper-sphere, in contrast to normal ones, which makes them detectable.
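For reference, the core of the unsupervised DSVDD objective can be sketched as follows; this is a simplified, soft-boundary-free variant under the assumption of a fixed, pre-computed center c, not the full procedure of the original paper [24].

```python
import torch

def dsvdd_loss(z, c):
    """Simplified Deep SVDD objective: shrink the hyper-sphere around c
    by minimizing the mean squared distance of normal latent codes to it."""
    return torch.mean(torch.sum((z - c) ** 2, dim=1))

def anomaly_score(z, c):
    """At test time, the distance to the center is the anomaly score:
    samples far from c are flagged as anomalous (outside the sphere)."""
    return torch.sum((z - c) ** 2, dim=1)
```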
Recently, Chong et al. [36] have shown the