classification, and a video classification BCE loss is used.
The work most relevant to ours is [14]. While we study an explicit video classification using BERT [4], [9], they use a transformer to model video classification with a BCE loss. In addition to this video classification, their transformer is also used to refine the CNN features. They propose multiple sequence learning (MSL), which finds consecutive snippets to improve training and is claimed as their main contribution. In our work, however, we find that a BERT or a transformer does not necessarily fulfill both tasks, video classification and feature refinement, at the same time. Since it does not help feature refinement, we study its role solely in video classification. With this simple single change, without MSL or RTFM, we achieve superior performance on the UCF-Crime [21] and ShanghaiTech [16] datasets.
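To make the setup concrete, below is a minimal PyTorch sketch of video classification with a BERT-style transformer encoder over snippet CNN features, trained with a video-level BCE loss. All names, layer sizes, and hyperparameters are our own assumptions for illustration, not the authors' implementation; note that only the classification-token output is used, and the per-snippet outputs are not fed back as refined features.

```python
import torch
import torch.nn as nn

class BertVideoClassifier(nn.Module):
    """Hypothetical sketch: a BERT-style transformer encoder over snippet
    CNN features. Only the classification-token output is used, for the
    video-level label; per-snippet outputs are not fed back as refined
    features."""

    def __init__(self, feat_dim=2048, n_layers=2, n_heads=8, max_snippets=32):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_snippets + 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 1)  # video-level logit

    def forward(self, snippet_feats):  # (B, T, feat_dim)
        b, t, _ = snippet_feats.shape
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, snippet_feats], dim=1) + self.pos_embed[:, :t + 1]
        x = self.encoder(x)
        return self.head(x[:, 0]).squeeze(-1)  # logit from the [CLS] position

# Video-level BCE loss: label 1 for an anomalous video, 0 for a normal one.
model = BertVideoClassifier()
feats = torch.randn(4, 32, 2048)  # e.g. I3D snippet features
labels = torch.tensor([1., 0., 1., 0.])
loss = nn.BCEWithLogitsLoss()(model(feats), labels)
```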
We go further and use this BERT video classification on top of RTFM. We combine their BCE loss and our proposed BERT-enabled BCE loss, and achieve remarkable performance on the XD-Violence dataset. Based on these results, we demonstrate the power of video classification supervision in anomaly detection: it can work alone or be combined with other techniques such as RTFM to boost anomaly detection performance.
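The following is a minimal sketch of how the two objectives might be combined; the function name and the weight `lambda_bce` are illustrative placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def combined_loss(rtfm_loss, video_logits, video_labels, lambda_bce=1.0):
    """Total objective: RTFM's own loss [22] plus the BERT-enabled
    video-level BCE term; lambda_bce is an assumed weight, not a value
    taken from the paper."""
    return rtfm_loss + lambda_bce * bce(video_logits, video_labels)

# Dummy example: RTFM branch loss plus video-level logits for a batch of 4.
loss = combined_loss(torch.tensor(0.7), torch.randn(4),
                     torch.tensor([1., 0., 1., 0.]))
```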
Our contributions are summarized as follows:
• We explicitly study the power of video classification supervision in weakly supervised video anomaly detection. This video classification is achieved with a BERT on snippet CNN features. We find that the BERT should be used only for video classification, not for feature refinement. We emphasize that the combination of two existing ideas (BERT and MIL) should not be taken as our main contribution; rather, our key contribution is the finding that the power of video classification supervision has previously been overlooked, a gap this work fills. As an ablation study, we implement a simpler LSTM-based video classifier (see the sketch after this list); even though its complexity is much lower, its performance is almost the same as that of the BERT.
• The proposed scheme has two inference modes. The second, online mode offers a very attractive low-complexity option, even though it gains only part of the performance improvement from the video classification supervision.
• We study this algorithm alone in the standard MIL framework on the UCF-Crime and ShanghaiTech datasets, testing the RGB, Flow, and RGB+Flow modalities. This simple introduction of video classification into anomaly detection brings a substantial performance improvement on every modality. On the RGB+Flow modality, we achieve the best ROC-AUC performance, exceeding the SOTA by 1.5%.
• We study this algorithm on top of RTFM [22] on the UCF-Crime and XD-Violence datasets, testing the RGB modality only. While it achieves only a marginal ROC-AUC improvement on the UCF-Crime dataset, it achieves a 3.51% AP improvement on the XD-Violence dataset. This improvement demonstrates that our proposed explicit video classification can be combined with other video anomaly detection algorithms in which an explicit video classification is not used.
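As referenced in the first contribution, a minimal sketch of the simpler LSTM-based ablation classifier is given below; the class name and layer sizes are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class LstmVideoClassifier(nn.Module):
    """Hypothetical sketch of the lighter LSTM ablation: the last hidden
    state summarizes the snippet sequence for the video-level logit."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, snippet_feats):  # (B, T, feat_dim)
        _, (h_n, _) = self.lstm(snippet_feats)
        return self.head(h_n[-1]).squeeze(-1)  # video-level logit

logits = LstmVideoClassifier()(torch.randn(4, 32, 2048))  # shape (4,)
```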
2. Related Work
Unsupervised anomaly detection assumes that only normal training data are available and solves the problem as one-class classification using hand-crafted or deep-learning features. Typical approaches use pre-trained CNNs, apply constraints on the latent space of the normal manifold to learn a normality representation, or use the data reconstruction error of generative models. There are very few works on supervised anomaly detection, since frame-level annotation is very hard to obtain; two examples are [15] and [13]. For a review of video anomaly detection, readers are referred to [10] and [18].
Weakly supervised anomaly detection has shown substantially improved performance over self-supervised approaches by leveraging the available video-level annotations. These annotations give only a binary abnormal-or-normal label for each video. Sultani et al. [21] propose the MIL framework using only video-level labels and introduce the large-scale anomaly detection dataset UCF-Crime. This work has inspired quite a few follow-up studies [30], [17], [26], [28], [27], [7], [22], [14].
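As a concrete reference point, the core of the MIL objective of [21] can be sketched as below; this minimal version keeps only the ranking term, omits the smoothness and sparsity regularizers of [21], and uses our own names throughout.

```python
import torch

def mil_ranking_loss(scores_abnormal, scores_normal, margin=1.0):
    """Minimal sketch in the spirit of the MIL objective of [21]: the top
    snippet score of an abnormal video should exceed the top snippet
    score of a normal video by a margin."""
    top_abn = scores_abnormal.max(dim=1).values  # (B,) top score per abnormal video
    top_nrm = scores_normal.max(dim=1).values    # (B,) top score per normal video
    return torch.clamp(margin - top_abn + top_nrm, min=0).mean()

# Snippet scores in [0, 1] for batches of abnormal and normal videos.
loss = mil_ranking_loss(torch.rand(4, 32), torch.rand(4, 32))
```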
However, in MIL-based methods, abnormal video labels are not easy to use effectively. Typically, the classification score is used to tell whether a snippet is abnormal or normal. This score is noisy in the positive bag, where a normal snippet can be mistakenly taken as the top abnormal event in an anomalous video. To deal with this problem, Zhong et al. [30] treat it as binary classification under noisy labels and use a graph convolutional network (GCN) to clean the label noise. In [7], a multiple instance self-training framework (MIST) is proposed to efficiently refine task-specific discriminative representations with a multiple-instance pseudo-label generator and a self-guided attention-boosted feature encoder. In [28], a weakly supervised spatio-temporal anomaly detection method is proposed to localize a spatio-temporal tube that encloses the abnormal event. In [28], causal temporal cues and feature discrimination are explored. In [17], a high-order con-