Overlooked Video Classification in Weakly Supervised Video Anomaly Detection
Weijun Tan
LinkSprite Technologies, USA
and Jovision-Deepcam Research, Shenzhen, China
weijun.tan@linksprite.com
sz.twj@jovision.com
Qi Yao, and Jingfeng Liu
Jovision-Deepcam Research
Shenzhen, China
{sz.yaoqi,sz.ljf}@jovision.com
{qi.yao,jingfeng.liu}@deepcam.com
Abstract
Current weakly supervised video anomaly detection algorithms mostly use multiple instance learning (MIL) or one of its variants. Almost all recent approaches focus on how to select the correct snippets for training to improve performance. They overlook or do not realize the power of video classification in boosting the performance of anomaly detection. In this paper, we study video classification supervision explicitly, using a BERT or an LSTM. With this BERT or LSTM, the CNN features of all snippets of a video can be aggregated into a single feature, which is then used for video classification. This simple yet powerful video classification supervision, combined with the MIL or RTFM framework, brings substantial performance improvement on all three major video anomaly detection datasets. In particular, it improves the mean average precision (mAP) on XD-Violence from the previous SOTA of 78.84% to 82.10%. These results demonstrate that this video classification supervision can be combined with other anomaly detection algorithms to achieve better performance. The code is publicly available at xxx.
1. Introduction
Surveillance cameras are widely used in public places for safety purposes. Empowered by machine learning and artificial intelligence, surveillance cameras are becoming smarter through automatic object or event detection and recognition. Video anomaly detection aims to identify when and where abnormal objects or events occur in videos. Examples include industrial anomaly detection, security anomaly detection, and more.
Depending on the annotation of the training data and the algorithm, anomaly detection is categorized into three types: unsupervised, supervised, and weakly supervised. The unsupervised type learns only on normal videos, assuming that unseen anomalous videos will have high reconstruction errors. Its performance is usually poor because it lacks knowledge of the abnormality in anomalous videos and cannot fully learn the normal patterns in normal videos. The supervised type is expected to have the best performance; however, because frame-level annotation is very time consuming to obtain and prone to human mistakes, it is less studied. The weakly supervised type needs only a video-level annotation of whether a video contains an anomaly, so its datasets are much easier to collect and more robust to human mistakes. It therefore draws the most attention in the video anomaly detection area.
In weakly supervised anomaly detection, multiple instance learning (MIL) or one of its variants is typically used [21]. From a pair of abnormal and normal videos, a positive bag of instances is formed from the abnormal video, and a negative bag of instances from the normal video. A pretrained CNN is used to extract a feature from each snippet of video frames. A classification network is trained on all the instances of these two bags. The instance with the maximum classification score in a bag is chosen to represent the bag. The MIL objective maximizes the separation between the maximum scores of the positive bag and the negative bag, as sketched below.
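As a concrete illustration, here is a minimal PyTorch sketch of the MIL ranking objective in the spirit of [21]; the tensor shapes, the margin value, and the function name are our assumptions, and the smoothness and sparsity terms of the original formulation are omitted.

```python
import torch

def mil_ranking_loss(pos_scores: torch.Tensor,
                     neg_scores: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """Hinge ranking loss between the top instances of a positive
    (abnormal) bag and a negative (normal) bag.

    pos_scores: (num_snippets,) anomaly scores from the abnormal video.
    neg_scores: (num_snippets,) anomaly scores from the normal video.
    Note: the smoothness and sparsity terms of [21] are omitted here.
    """
    # Each bag is represented by its maximum-scoring snippet.
    max_pos = pos_scores.max()
    max_neg = neg_scores.max()
    # Push the abnormal bag's top score above the normal bag's top
    # score by at least `margin`; zero loss once the margin is met.
    return torch.relu(margin - max_pos + max_neg)
```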
In almost all follow-up studies, different approaches are proposed for selecting the best-quality snippets to train the model. Some choose multiple snippets instead of one per video [22], while others choose a sequence of consecutive snippets [14], [7]. Some use the snippet classification score to choose snippets; others use other metrics such as the feature magnitude [22]. Some use a GCN to improve the quality of the chosen snippets [30].
However, almost all of them overlook or do not fully realize the power of video classification and its impact on anomaly detection performance. In anomaly detection, videos are classified as anomalous or normal. This strong supervision signal has been overlooked except in RTFM [22], [14], and [29]. In RTFM, the top-k snippets with the largest feature magnitudes are chosen per video, and the mean of their classification scores is used as a video classification score in a binary cross-entropy (BCE) loss, even though the authors do not call it that; a sketch of this selection is given below.
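Here is a minimal sketch of that top-k magnitude-based selection as we read it from [22]; the function name, the shapes, and the default k are our assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def rtfm_video_bce(features: torch.Tensor,
                   scores: torch.Tensor,
                   video_label: torch.Tensor,
                   k: int = 3) -> torch.Tensor:
    """features: (num_snippets, dim) snippet features of one video.
    scores: (num_snippets,) per-snippet anomaly scores in [0, 1].
    video_label: 0-dim tensor, 1.0 for an anomalous video, 0.0 for normal.
    """
    k = min(k, scores.numel())
    # Rank snippets by the L2 magnitude of their features.
    magnitudes = features.norm(p=2, dim=1)      # (num_snippets,)
    topk_idx = magnitudes.topk(k).indices       # top-k snippet indices
    # The mean score of the top-k snippets serves as a video-level
    # classification score for the BCE loss.
    video_score = scores[topk_idx].mean()
    return F.binary_cross_entropy(video_score, video_label)
```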
In [29], a GCN is used to approximately model the video classification, and a video classification BCE loss is used.
The work most relevant to ours is [14]. While studying an explicit video classification using BERT [4], [9], we found that they use a transformer to model the video classification with a BCE loss. In addition to this video classification, the transformer is also used to refine the CNN features. They propose multiple sequence learning (MSL), which finds consecutive snippets to improve the training and is claimed as their main contribution. However, in our work we find that a BERT or a transformer does not necessarily fulfill both tasks of video classification and feature refinement at the same time. We find that it does not help feature refinement, so we study its role solely in video classification. With this simple single change, without MSL or RTFM, we achieve superior performance on the UCF-Crime [21] and ShanghaiTech [16] datasets. A sketch of this BERT-style aggregation is given below.
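To make the idea concrete, the following is a minimal sketch of aggregating snippet CNN features into a single video-level feature with a BERT-style transformer encoder and a learned classification token; the layer sizes, the use of a [CLS] token, and the omission of positional encodings are our simplifying assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    """Aggregate snippet CNN features into one video-level feature with
    a transformer encoder and classify the video as anomalous or normal."""

    def __init__(self, dim: int = 2048, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        # Learned classification token prepended to the snippet sequence,
        # in the style of BERT's [CLS] token.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 1)  # video-level logit

    def forward(self, snippet_feats: torch.Tensor) -> torch.Tensor:
        # snippet_feats: (batch, num_snippets, dim) CNN features.
        # Positional encodings are omitted here for brevity.
        batch = snippet_feats.size(0)
        cls = self.cls_token.expand(batch, -1, -1)
        x = torch.cat([cls, snippet_feats], dim=1)
        x = self.encoder(x)
        # The encoded [CLS] position is the aggregated video feature.
        return self.head(x[:, 0]).squeeze(-1)
```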
We go further and use this BERT video classification on top of RTFM. We combine their BCE loss with our proposed BERT-enabled BCE loss, and achieve extraordinary performance on the XD-Violence dataset. Based on these results, we demonstrate the power of video classification supervision in anomaly detection. It can work alone or be combined with other techniques such as RTFM to boost anomaly detection performance.
Our contributions are summarized as follows:

• We explicitly study the power of video classification supervision in weakly supervised video anomaly detection. This video classification is achieved with a BERT on snippet CNN features. We find that the BERT should be used only for video classification, not for feature refinement. We emphasize that the combination of two existing ideas (BERT and MIL) should not be taken as our main contribution; our key contribution is the finding that the power of video classification has previously been overlooked, a gap this work fills. As an ablation study, we implement a simpler LSTM-based video classifier (sketched after this list). Even though its complexity is much lower, its performance is almost the same as the BERT's.

• There are two inference modes for the proposed scheme. The second, online mode offers a very attractive low-complexity option, even though it obtains only part of the performance improvement from the video classification supervision.

• We study this algorithm alone in the standard MIL framework on the UCF-Crime and ShanghaiTech datasets. We test the RGB, Flow, and RGB+Flow modalities. This simple introduction of video classification into anomaly detection brings superior performance improvement on every modality. On the RGB+Flow modality, we achieve the best ROC-AUC performance, exceeding the SOTA by 1.5%.

• We study this algorithm on top of RTFM [22] on the UCF-Crime and XD-Violence datasets. We test the RGB modality only. While our algorithm achieves only a marginal ROC-AUC improvement on the UCF-Crime dataset, it achieves nearly a 3.51% AP improvement on the XD-Violence dataset. This demonstrates that our proposed explicit video classification can be combined with many other video anomaly detection algorithms in which an explicit video classification is not used.
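As referenced in the first contribution above, a minimal sketch of an LSTM-based video classifier for the ablation could look as follows; the hidden size and the use of the final hidden state are our assumptions.

```python
import torch
import torch.nn as nn

class LSTMVideoClassifier(nn.Module):
    """Ablation alternative: aggregate snippet features with an LSTM
    and classify the video from the final hidden state."""

    def __init__(self, dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.lstm = nn.LSTM(input_size=dim, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, snippet_feats: torch.Tensor) -> torch.Tensor:
        # snippet_feats: (batch, num_snippets, dim) CNN features.
        _, (h_n, _) = self.lstm(snippet_feats)
        # h_n: (num_layers, batch, hidden); take the last layer's state
        # as the aggregated video feature.
        return self.head(h_n[-1]).squeeze(-1)
```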
2. Related Work
Unsupervised anomaly detection assumes that only normal training data is available and solves the problem with one-class classification using hand-crafted or deep learning features. Typical approaches use pretrained CNNs, apply constraints on the latent space of the normal manifold to learn a normality representation, or use the data reconstruction error of generative models. There are very few works on supervised learning for anomaly detection, since frame-level annotation is very hard to obtain; two examples are [15] and [13]. For a review of video anomaly detection, readers are referred to [10] and [18].
Weakly supervised anomaly detection has shown substantially improved performance over self-supervised approaches by leveraging the available video-level annotations. These annotations only give a binary label of abnormal or normal for a video. Sultani et al. [21] propose the MIL framework using only video-level labels and introduce the large-scale anomaly detection dataset UCF-Crime. This work inspired quite a few follow-up studies [30], [17], [26], [28], [27], [7], [22], [14].

However, in the MIL-based methods, abnormal video labels are not easy to use effectively. Typically, the classification score is used to tell whether a snippet is abnormal or normal. This score is noisy in the positive bag, where a normal snippet can be mistakenly taken as the top abnormal event in an anomalous video. To deal with this problem, Zhong et al. [30] treat it as a binary classification under noisy labels and use a graph convolutional network (GCN) to clean the label noise. In [7], a multiple instance self-training framework (MIST) is proposed to efficiently refine task-specific discriminative representations with a multiple instance pseudo-label generator and a self-guided attention-boosted feature encoder. In [28], a weakly supervised spatio-temporal anomaly detection method is proposed to localize a spatio-temporal tube that encloses the abnormal event. In [28], causal temporal cues and feature discrimination are explored. In [17], a high-order con-