classification, and a video classification BCE loss is used.
The work most relevant to ours is [14]. While we study an explicit video classification using BERT [4], [9], they use a transformer to model video classification with a BCE loss. In addition to this video classification, their transformer is also used to refine the CNN features. They propose multiple sequence learning (MSL), which finds consecutive snippets to improve training and is claimed as their main contribution. In our work, however, we find that a BERT or a transformer does not necessarily fulfill both tasks, video classification and feature refinement, at the same time. Since it does not help feature refinement, we study its role solely in video classification. With this simple single change, without MSL or RTFM, we achieve superior performance on the UCF-Crime [21] and ShanghaiTech [16] datasets.
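To make the setup concrete, below is a minimal PyTorch sketch of video classification with a BERT-style transformer encoder over snippet CNN features, trained with a video-level BCE loss. All names, layer sizes, and hyperparameters are our own assumptions for illustration, not the authors' implementation; note that only the classification-token output is used, and the per-snippet outputs are not fed back as refined features.

```python
import torch
import torch.nn as nn

class BertVideoClassifier(nn.Module):
    """Hypothetical sketch: a BERT-style transformer encoder over snippet
    CNN features. Only the classification-token output is used, for the
    video-level label; per-snippet outputs are not fed back as refined
    features."""

    def __init__(self, feat_dim=2048, n_layers=2, n_heads=8, max_snippets=32):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_snippets + 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 1)  # video-level logit

    def forward(self, snippet_feats):  # (B, T, feat_dim)
        b, t, _ = snippet_feats.shape
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, snippet_feats], dim=1) + self.pos_embed[:, :t + 1]
        x = self.encoder(x)
        return self.head(x[:, 0]).squeeze(-1)  # logit from the [CLS] position

# Video-level BCE loss: label 1 for an anomalous video, 0 for a normal one.
model = BertVideoClassifier()
feats = torch.randn(4, 32, 2048)  # e.g. I3D snippet features
labels = torch.tensor([1., 0., 1., 0.])
loss = nn.BCEWithLogitsLoss()(model(feats), labels)
```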
We go further and use this BERT video classification on top of RTFM. We combine their BCE loss and our proposed BERT-enabled BCE loss, and achieve remarkable performance on the XD-Violence dataset. Based on these results, we demonstrate the power of video classification supervision in anomaly detection: it can work alone or be combined with other techniques such as RTFM to boost anomaly detection performance.
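The following is a minimal sketch of how the two objectives might be combined; the function name and the weight `lambda_bce` are illustrative placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def combined_loss(rtfm_loss, video_logits, video_labels, lambda_bce=1.0):
    """Total objective: RTFM's own loss [22] plus the BERT-enabled
    video-level BCE term; lambda_bce is an assumed weight, not a value
    taken from the paper."""
    return rtfm_loss + lambda_bce * bce(video_logits, video_labels)

# Dummy example: RTFM branch loss plus video-level logits for a batch of 4.
loss = combined_loss(torch.tensor(0.7), torch.randn(4),
                     torch.tensor([1., 0., 1., 0.]))
```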
Our contributions are summarized as follows:
• We explicitly study the power of video classification supervision in weakly supervised video anomaly detection. This video classification is achieved with a BERT on snippet CNN features. We find that the BERT should be used only for video classification, not for feature refinement. We emphasize that the combination of two existing ideas (BERT and MIL) should not be taken as our main contribution; rather, our key contribution is the finding that the power of video classification supervision has previously been overlooked, a gap this work fills. As an ablation study, we implement a simpler LSTM-based video classifier (see the sketch after this list); even though its complexity is much lower, its performance is almost the same as that of the BERT.
• The proposed scheme has two inference modes. The second, online mode offers a very attractive low-complexity option, even though it gains only part of the performance improvement from the video classification supervision.
• We study this algorithm alone in the standard MIL framework on the UCF-Crime and ShanghaiTech datasets, testing the RGB, Flow, and RGB+Flow modalities. This simple introduction of video classification into anomaly detection brings a substantial performance improvement on every modality. On the RGB+Flow modality, we achieve the best ROC-AUC performance, exceeding the SOTA by 1.5%.
• We study this algorithm on top of RTFM [22] on the UCF-Crime and XD-Violence datasets, testing the RGB modality only. While it achieves only a marginal ROC-AUC improvement on the UCF-Crime dataset, it achieves a 3.51% AP improvement on the XD-Violence dataset. This improvement demonstrates that our proposed explicit video classification can be combined with other video anomaly detection algorithms in which an explicit video classification is not used.
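As referenced in the first contribution, a minimal sketch of the simpler LSTM-based ablation classifier is given below; the class name and layer sizes are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class LstmVideoClassifier(nn.Module):
    """Hypothetical sketch of the lighter LSTM ablation: the last hidden
    state summarizes the snippet sequence for the video-level logit."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, snippet_feats):  # (B, T, feat_dim)
        _, (h_n, _) = self.lstm(snippet_feats)
        return self.head(h_n[-1]).squeeze(-1)  # video-level logit

logits = LstmVideoClassifier()(torch.randn(4, 32, 2048))  # shape (4,)
```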
2. Related Work
Unsupervised anomaly detection assumes that only normal training data are available and solves the problem as one-class classification using hand-crafted or deep-learning features. Typical approaches use pre-trained CNNs, apply constraints on the latent space of the normal manifold to learn a normality representation, or use the data reconstruction error of generative models. There are very few works on supervised anomaly detection, since frame-level annotation is very hard to obtain; two examples are [15] and [13]. For a review of video anomaly detection, readers are referred to [10] and [18].
Weakly supervised anomaly detection has shown substantially improved performance over self-supervised approaches by leveraging the available video-level annotations. These annotations give only a binary abnormal-or-normal label for each video. Sultani et al. [21] propose the MIL framework using only video-level labels and introduce the large-scale anomaly detection dataset UCF-Crime. This work has inspired quite a few follow-up studies [30], [17], [26], [28], [27], [7], [22], [14].
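As a concrete reference point, the core of the MIL objective of [21] can be sketched as below; this minimal version keeps only the ranking term, omits the smoothness and sparsity regularizers of [21], and uses our own names throughout.

```python
import torch

def mil_ranking_loss(scores_abnormal, scores_normal, margin=1.0):
    """Minimal sketch in the spirit of the MIL objective of [21]: the top
    snippet score of an abnormal video should exceed the top snippet
    score of a normal video by a margin."""
    top_abn = scores_abnormal.max(dim=1).values  # (B,) top score per abnormal video
    top_nrm = scores_normal.max(dim=1).values    # (B,) top score per normal video
    return torch.clamp(margin - top_abn + top_nrm, min=0).mean()

# Snippet scores in [0, 1] for batches of abnormal and normal videos.
loss = mil_ranking_loss(torch.rand(4, 32), torch.rand(4, 32))
```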
However, in MIL-based methods, abnormal video labels are not easy to use effectively. Typically, the classification score is used to tell whether a snippet is abnormal or normal. This score is noisy in the positive bag, where a normal snippet can be mistakenly taken as the top abnormal event in an anomalous video. To deal with this problem, Zhong et al. [30] treat it as binary classification under noisy labels and use a graph convolutional network (GCN) to clean the label noise. In [7], a multiple instance self-training framework (MIST) is proposed to efficiently refine task-specific discriminative representations with a multiple-instance pseudo-label generator and a self-guided attention-boosted feature encoder. In [28], a weakly supervised spatio-temporal anomaly detection method is proposed to localize a spatio-temporal tube that encloses the abnormal event. In [28], causal temporal cues and feature discrimination are explored. In [17], a high-order con-