work in which the model is trained incrementally as data arrive, without forgetting the previously learnt (past) information. Such a framework is well suited to visual surveillance, where video data keep streaming into the monitoring systems. However, all these approaches suffer from a few limitations: (1) in continual learning, a separate mechanism must be designed to avoid catastrophic forgetting [8]; (2) GANs and AEs are highly vulnerable to unstable training, i.e., a subtle change in the data imposes large changes in the labels, thus affecting the learnt normal distribution; (3) most state-of-the-art VAD methods depend heavily on labeled normal/abnormal data; and (4) VAD approaches utilize either appearance-based features or deep features.
To address these limitations, we adopt an iterative learning [44] mechanism in which the models are repeatedly tuned with progressively refined data during each pass. Moreover, we aim to combine the technical advantages of continual learning and AEs. Our proposed framework combines the power of DNNs with well-justified handcrafted motion features. These spatio-temporal features, together with low-level motion features, help to detect a wide range of anomalies. The framework can also be retrained in an end-to-end fashion as new input data arrive. An overview of the proposed framework is depicted in Fig. 1. It is divided into three stages: i) pseudo-label assignment, ii) regressor training, and iii) refinement of labels using the optimized regressors. To enable the regressors to capture subtle anomalies, we compute a motion feature, namely a dynamicity score, from optical flow.
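The dynamicity score can be obtained in several ways; the following is only a minimal sketch, assuming dense Farneback optical flow from OpenCV and the mean flow magnitude per frame as the score. The function name and parameter values are illustrative, not the exact implementation used in this work.

```python
import cv2
import numpy as np

def dynamicity_scores(video_path):
    """Per-frame dynamicity score as the mean optical-flow magnitude
    (illustrative definition; the exact aggregation may differ)."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return np.empty(0)
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    scores = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense Farneback optical flow between consecutive frames.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        mag = np.linalg.norm(flow, axis=2)   # per-pixel motion magnitude
        scores.append(float(mag.mean()))     # frame-level dynamicity
        prev_gray = gray
    cap.release()
    return np.asarray(scores)
```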
In the first stage, the actual labels are unknown; hence, we obtain intermediate low-confidence anomaly labels using OneClassSVM and iForest [19], and we obtain the dynamicity labels from the dynamicity scores.
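As an illustration of this pseudo-labeling step, a minimal sketch using scikit-learn's OneClassSVM and IsolationForest on per-segment feature vectors is given below; the feature matrix, the agreement rule, the dynamicity threshold, and the hyper-parameter values are assumptions for the sketch rather than the settings used in this work.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

def pseudo_labels(features, dynamicity, dyn_threshold=None):
    """Stage 1: intermediate, low-confidence labels.

    features   : (n_segments, d) feature vectors (assumed given).
    dynamicity : (n_segments,) dynamicity scores from optical flow.
    Returns anomaly labels (1 = anomalous) and dynamicity labels.
    """
    # Unsupervised outlier detectors; hyper-parameters are illustrative.
    ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(features)
    iforest = IsolationForest(contamination=0.1, random_state=0).fit(features)

    # Both detectors return +1 for inliers and -1 for outliers.
    svm_out = ocsvm.predict(features) == -1
    if_out = iforest.predict(features) == -1
    anomaly_labels = (svm_out & if_out).astype(int)  # agreement -> anomalous

    # Dynamicity labels: segments with unusually high motion (assumed rule).
    if dyn_threshold is None:
        dyn_threshold = dynamicity.mean() + 2 * dynamicity.std()
    dynamicity_labels = (dynamicity > dyn_threshold).astype(int)
    return anomaly_labels, dynamicity_labels
```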
In the second stage, we train two regressor networks using the labels generated in the first stage. This is an iterative process that improves the confidence scores: both regressors are trained over progressively refined labels and thereby learn discriminative features. The iterative learning approach also ensures that both regressors learn new distinguishing patterns without losing past information. We have experimentally found that, for the first few iterations, both regressors gradually learn the internal patterns and then stabilize. The two regressors are trained independently, in parallel. Precisely, in iterative learning, the model is retrained with refined labels in each iteration.
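To make the refinement loop concrete, the following is a minimal, framework-agnostic sketch under the assumptions above; MLPRegressor stands in for the actual regressor networks, and the thresholding-based refinement rule and fixed iteration count are hypothetical (in practice the loop may instead terminate once the regressors stabilize, as observed experimentally).

```python
from sklearn.neural_network import MLPRegressor

def refine_labels(scores, threshold=0.5):
    # Placeholder refinement: threshold the regressor's scores to obtain
    # sharper labels for the next iteration (assumed rule).
    return (scores > threshold).astype(float)

def iterative_training(features, anomaly_labels, dynamicity_labels,
                       num_iterations=5):
    """Stages 2-3 sketch: train two regressors, then refine their labels."""
    anomaly_reg = dynamicity_reg = None
    for _ in range(num_iterations):
        # Train each regressor independently (in parallel in practice)
        # on the current, progressively refined labels.
        anomaly_reg = MLPRegressor(hidden_layer_sizes=(128, 32),
                                   max_iter=500).fit(features, anomaly_labels)
        dynamicity_reg = MLPRegressor(hidden_layer_sizes=(128, 32),
                                      max_iter=500).fit(features,
                                                        dynamicity_labels)

        # Re-score all segments with the optimized regressors and use the
        # refined predictions as labels for the next iteration.
        anomaly_labels = refine_labels(anomaly_reg.predict(features))
        dynamicity_labels = refine_labels(dynamicity_reg.predict(features))
    return anomaly_reg, dynamicity_reg
```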
In this way, the proposed approach does not need any level of supervision. However, some form of supervision is mandatory for continual learning [8] or weakly-supervised methods [27, 38, 48]. These methods consider a video anomalous even if only a small segment contains an anomaly. In contrast, we identify anomalous segments using dynamicity and anomaly scores estimated in an unsupervised manner, thus eliminating the requirement for supervision.
To achieve this, we have made the following contributions:
• design an unsupervised end-to-end video anomaly detection framework that uses iterative learning to tune the model with refined labels in each iteration;
• propose a novel technique to assign intermediate labels in unsupervised scenarios by combining deep features with well-justified motion features; and
• conduct extensive experiments to assess the effectiveness of the proposed framework with respect to other state-of-the-art methods.
The rest of the paper is organized as follows. In the next
section, we present the related work. In Sec. 3, we present the
proposed framework. Experiments and results are presented
in Sec. 4. Conclusions and future work are presented in Sec. 5.
2. Related Work
Existing work in the Video Anomaly Detection (VAD) domain largely draws motivation from activity recognition and scene understanding [38]. These methods differ in the types of video features they use, in their training procedures, or in both. In this section, we briefly discuss the main categories followed in very recent VAD approaches.
2.1. Reconstruction-based Approaches
Several VAD approaches [1, 10, 22, 27, 29, 30, 39, 46] employ Autoencoders (AEs), Generative Adversarial Networks (GANs) and their variants under the assumption that models explicitly trained on normal data will fail to reconstruct abnormal events, as such samples are usually absent from the training set. Park et al. [29] have used an AE to generate cuboids within normal frames using spatial and temporal transformations. Zaheer et al. [46] have generated good-quality reconstructions using the current generator and used the previous state of the generator to obtain bad-quality examples; in this way, the new discriminator learns to detect even small distortions in abnormal inputs. Gong et al. [10] have introduced a memory module into the AE and constructed MemAE, an improved version of the conventional AE. Szymanowicz et al. [39] have trained an AE to obtain saliency maps using five consecutive frames and the per-pixel prediction error. Ravanbakhsh et al. [36] have employed classic adversarial training with GANs to detect anomalous activity. However, the effectiveness of these approaches is highly dependent on the reconstruction capability of the model; when the reconstruction is poor, detection performance degrades significantly.
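As a generic illustration of the reconstruction-error principle shared by these methods (not the implementation of any specific work cited above), a small convolutional AE can be trained on normal frames and its per-frame reconstruction error used as an anomaly score; the architecture and the way the score would be thresholded are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Tiny convolutional autoencoder for grayscale frames (illustrative)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_scores(model, frames):
    """Per-frame reconstruction error (MSE); higher means more anomalous.

    frames: tensor of shape (N, 1, H, W), values in [0, 1], H and W
    divisible by 4; a threshold chosen on normal data flags anomalies.
    """
    model.eval()
    with torch.no_grad():
        recon = model(frames)
        return ((recon - frames) ** 2).mean(dim=(1, 2, 3))
```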
2.2. Features-based Approaches
Primarily, features-based VAD approaches can be categorized according to whether anomaly detection is performed using handcrafted or