In the current study, we also evaluated SARS-CoV-2 Spike mutations as an additional evaluation and a demonstration for future work in this field. SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is the etiological agent of the COVID-19 (coronavirus disease 2019) pandemic. SARS-CoV-2 is a Betacoronavirus and a member of the sarbecovirus sublineage16. The virus genome is a single-stranded RNA (ssRNA) of approximately 30 kb in length. SARS-CoV-2 contains different genes, including ORF1a/b, Spike (S), Envelope (E), Membrane (M), Nucleoprotein (N), and accessory ORFs17. The virus binds to cells through attachment of the S protein to the cellular receptor ACE2 (angiotensin-converting enzyme 2)18. The S protein is the most important antigenic part of the virus19.
In recent years, Artificial Intelligence (AI) algorithms have achieved human or even super-human performance on tasks such as image classification20, text classification21, and action recognition22. Anomaly Detection (AD) is a sub-domain of AI concerned with learning a representation space of normal data and exploiting the learned representation to detect anomalous samples at test time. Because labeling anomalous samples is challenging, owing to their high cost or rarity, most methods in this domain use only normal samples for training; this setting is called unsupervised AD. Alternatively, one may use a very limited number of labeled anomalous samples in the training process, which is called semi-supervised AD23.
Unsupervised24–28 and semi-supervised29,30 anomaly detection methods have recently achieved satisfactory results in a variety of domains, such as image, text, time-series, and video. Deep Semi-Supervised Anomaly Detection (DeepSAD)29, a recently proposed semi-supervised AD method, showed that semi-supervised anomaly detectors are significantly superior to supervised classification algorithms, specifically when the training dataset is complex and the number of normal samples is much higher than the number of anomalous ones. This is because anomaly detectors attempt to find a compact representation space for the normal samples while maximizing the margin between normal and abnormal ones. This helps them learn the most general and distinctive features of the normal samples, rather than relying overly on the contrast between normal and anomalous samples to classify them.
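For reference, the published DeepSAD objective29 takes roughly the following form (a sketch of its general shape rather than the exact notation used later in this paper), where $\phi(\cdot;\mathcal{W})$ is the network, $c$ is the hypersphere center, the first sum runs over the $n$ unlabeled (assumed mostly normal) samples, the second over the $m$ labeled samples $(\tilde{x}_j,\tilde{y}_j)$ with $\tilde{y}_j \in \{-1,+1\}$, and $\eta$, $\lambda$ are weighting hyperparameters:

$$\min_{\mathcal{W}} \; \frac{1}{n+m}\sum_{i=1}^{n}\big\lVert \phi(x_i;\mathcal{W})-c\big\rVert^{2} \;+\; \frac{\eta}{n+m}\sum_{j=1}^{m}\Big(\big\lVert \phi(\tilde{x}_j;\mathcal{W})-c\big\rVert^{2}\Big)^{\tilde{y}_j} \;+\; \frac{\lambda}{2}\sum_{\ell=1}^{L}\big\lVert \mathcal{W}^{\ell}\big\rVert_{F}^{2}$$

For $\tilde{y}_j = +1$ a labeled sample is pulled toward $c$ exactly like an unlabeled one, while for $\tilde{y}_j = -1$ the inverse squared distance is minimized, pushing labeled anomalies away from the center.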
Since in mutation prediction tasks the number of unmutated samples is much higher than the number of mutated ones, the problem can be formulated as an anomaly detection task. In this formulation, unmutated and mutated samples are treated as normal and anomalous samples, respectively. The benefits of this approach are two-fold. First, a semantically meaningful representation can be learned even with a small number of training samples, which makes generalization to unseen test-time samples possible. Second, as finding and labeling mutated viruses is an expensive and time-consuming process, anomaly detectors can work well with only a limited number of labeled anomalous (mutated) training samples, or with none at all23.
Motivated by this, we propose the first anomaly detection framework for predicting virus mutations. We use the Long Short-Term Memory (LSTM)31 neural network in combination with the Deep Semi-Supervised Anomaly Detection (DeepSAD) loss29 to not only learn long-term input dependencies, but also find a semantic representation space for the mutated and unmutated training samples. Figure 1 shows the overall architecture of the proposed method. We conduct extensive experiments to show the effectiveness of our method in improving the average recall, F1-score, precision, and Area Under the Curve (AUC) on three different publicly available Influenza datasets.
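To make this training setup concrete, the following is a minimal PyTorch sketch, not the authors' released code, of an LSTM encoder trained with a DeepSAD-style loss. The module names, dimensions, one-hot encoding over 20 amino acids, and hyperparameters (hidden_dim, eta, the toy labels) are all illustrative assumptions.

```python
# Minimal sketch: LSTM encoder + DeepSAD-style loss (illustrative, not the paper's code).
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    """Encodes a protein-sequence window (one-hot, 20 amino acids) into a fixed-size embedding."""
    def __init__(self, input_dim=20, hidden_dim=64, embed_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        # Bias-free projection, following the Deep SVDD/DeepSAD convention.
        self.proj = nn.Linear(hidden_dim, embed_dim, bias=False)

    def forward(self, x):            # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.lstm(x)   # h_n: (1, batch, hidden_dim), final hidden state
        return self.proj(h_n.squeeze(0))

def deepsad_loss(z, c, y, eta=1.0, eps=1e-6):
    """DeepSAD-style objective: pull unlabeled/normal embeddings toward the center c,
    push labeled anomalies away via the inverse squared distance.
    y: 0 = unlabeled, +1 = labeled normal (unmutated), -1 = labeled anomalous (mutated)."""
    dist2 = torch.sum((z - c) ** 2, dim=1)
    loss = dist2[y >= 0].sum()                   # unlabeled + labeled normal
    anomalies = dist2[y == -1]
    if anomalies.numel() > 0:
        loss = loss + eta * (1.0 / (anomalies + eps)).sum()
    return loss / z.size(0)

# Toy usage: batch of 8 random "sequences" of length 50 over 20 amino acids.
encoder = LSTMEncoder()
x = torch.randn(8, 50, 20)
y = torch.tensor([0, 0, 0, 0, 0, 0, -1, 1])      # mostly unlabeled samples
with torch.no_grad():
    c = encoder(x).mean(dim=0)                   # center initialized as the mean embedding
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
opt.zero_grad()
loss = deepsad_loss(encoder(x), c, y)
loss.backward()
opt.step()
```

At test time, the squared distance of an embedding to $c$ would serve as the anomaly (mutation) score.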
Background
For the sake of clarity, we discuss some important prerequisites from the deep learning literature in this section. First, Recurrent Neural Network architectures, such as LSTMs31, are discussed. Then, a brief introduction to anomaly detection methods is presented.
Recurrent Neural Networks (RNN): RNNs are broadly used to model sequential dependencies in data, where the sequence may be formed by temporal or spatial arrangements. Early RNN architectures, such as the vanilla RNN, struggle to memorize long-term dependencies. To address this issue, alternative architectures, such as LSTM31 networks, bi-directional RNNs32, and Gated Recurrent Units (GRUs)33, have been introduced. All these approaches attempt to summarize previous inputs into a hidden state that is updated at each time step $t$. The retained information is regulated by learned parameters, or gates. For instance, the LSTM network consists of LSTM cells. Each cell contains a state, $h_t$, and a memory, $s_t$. These are updated based on three gates: the input gate, $i_t$, the forget gate, $f_t$, and the output gate, $o_t$. The input gate selects which memory dimensions to modify (Eq. 2). The forget gate decides which memory dimensions should be discarded at the next time step (Eq. 1). The output gate decides which dimensions of the memory should be transferred to the state (Eq. 3). The memory and state vectors are updated based on these gates and the activation values produced through the tanh activation (Eqs. 4, 5). Specifically, the new memory consists of the previous memory dimensions that are not forgotten, plus the input activation values selected by the input gate. Finally, the state consists of the memory activation values that are selected by the output gate.
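Concretely, a standard LSTM cell, which the referenced Eqs. (1)–(5) presumably follow, performs the updates below, where $x_t$ is the input at time $t$, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and the $W$, $U$, and $b$ terms are learned parameters:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad (1)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad (2)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad (3)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s x_t + U_s h_{t-1} + b_s) \quad (4)$$
$$h_t = o_t \odot \tanh(s_t) \quad (5)$$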
Note that a sigmoid activation function is used in the gates to map the gate outputs to values between zero and one, which models the selection: a gate output of 1 represents complete selection of an embedding dimension, and a value of 0 corresponds to completely discarding it.