
that uses concatenated average features around the neighbourhood of a candidate frame. However, since GEBD is a new task, most current methods are extensions of state-of-the-art approaches to other video understanding tasks, which overlook the subtle differentiating characteristics of GEBD. Hence, there is a need for GEBD-specialized solutions.
DDM-Net [72] applied progressive attention on multi-level dense difference maps (DDMs) to characterize motion patterns and jointly learn motion and appearance cues in a supervised setting; a minimal sketch of a dense difference map is given below. In contrast, we learn generic motion features by augmenting the encoder with an MS module in a self-supervised setting.
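For intuition, a dense difference map can be approximated as the element-wise difference between a frame's features and those of its temporal neighbours. The Python sketch below uses multiple temporal strides as a stand-in for DDM-Net's multi-level construction; the tensor shapes and stride set are our own illustrative assumptions, not the exact formulation of [72].

    import torch

    def dense_difference_maps(feats, strides=(1, 2, 4)):
        """Illustrative dense difference maps (not DDM-Net's exact design).

        feats: (T, C, H, W) per-frame feature maps from a 2D backbone.
        Returns one (T, C, H, W) difference map per temporal stride,
        with frames shifted past the sequence edge zero-padded.
        """
        maps = []
        for s in strides:
            shifted = torch.zeros_like(feats)
            shifted[:-s] = feats[s:]          # features s steps ahead in time
            maps.append(shifted - feats)      # motion cue at this stride
        return maps                           # list of (T, C, H, W) tensors

    # Example: 16 frames of 256-channel 14x14 feature maps.
    diffs = dense_difference_maps(torch.randn(16, 256, 14, 14))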
Hong et al. [29] used a cascaded temporal attention network for GEBD, while Rai et al. [64] explored the use of spatio-temporal features using two-stream networks. Li et al. [49] designed an end-to-end spatial-channel compressed encoder and temporal contrastive module to determine event boundaries. Recently, SC-Transformer [48] introduced a structured partition of sequences (SPoS) mechanism to learn structured context using a transformer-based architecture for GEBD, and augmented it with the computation of group similarity to learn distinctive features for boundary detection. One advantage of SC-Transformer is that it is independent of video length and predicts all boundaries in a single forward pass by feeding in 100 frames; however, it requires substantial memory and computational resources.
Regarding unsupervised GEBD approaches, a shot detector library (PySceneDetect, https://github.com/Breakthrough/PySceneDetect) and PredictAbility (PA) have been investigated in [68]. The authors of UBoCo [36, 35] proposed a novel supervised/unsupervised method that applies contrastive learning to an intermediary representation of videos based on the Temporal Self-Similarity Matrix (TSM) to learn discriminatory boundary features. UBoCo's recursive TSM parsing algorithm exploits generic patterns and detects very precise boundaries (a minimal sketch of a TSM is given below). However, UBoCo pre-processes all videos in the dataset to a uniform rate of 24 frames per second (fps), which adds computational overhead. Furthermore, like the SC-Transformer, UBoCo inputs the frames representing the whole video at once, whereas in our work we use raw video signals for pre-training and only the context around the candidate boundary as input to the GEBD task.
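For concreteness, a Temporal Self-Similarity Matrix can be sketched as the pairwise cosine similarity between per-frame embeddings, in which event boundaries tend to appear as block boundaries. The sketch below is a minimal illustration, not UBoCo's exact pipeline; the source of the embeddings is an assumption on our part.

    import torch
    import torch.nn.functional as F

    def self_similarity_matrix(frame_feats):
        """T x T cosine-similarity matrix over per-frame embeddings.

        frame_feats: (T, D) tensor, one embedding per frame.
        Event boundaries tend to show up as edges between blocks.
        """
        z = F.normalize(frame_feats, dim=1)  # unit-norm rows
        return z @ z.t()                     # (T, T), values in [-1, 1]

    # Example: 100 frames with 512-dim embeddings.
    tsm = self_similarity_matrix(torch.randn(100, 512))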
TeG [63] proposed a generic self-supervised model for video understanding that learns persistent and fine-grained features, and evaluated it on the GEBD task. The main difference between TeG and our work is that TeG uses a 3D-ResNet-50 encoder as its backbone, which makes training computationally expensive, whereas we use a 2D-ResNet-50 model and modify it by adding the Temporal Shift Module (TSM) [51] to achieve a similar effect to 3D convolution while keeping the complexity of a 2D CNN.
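To illustrate why the shift adds temporal modeling at negligible extra cost, the sketch below moves a fraction of channels one step backward and forward along the time axis, leaving the spatial convolutions untouched; the 1/8 shift fraction follows the default in [51], while the surrounding tensor shapes are our own assumptions.

    import torch

    def temporal_shift(x, shift_div=8):
        """Shift a fraction of channels along time (cf. TSM [51]).

        x: (N, T, C, H, W) activations from a 2D CNN stage.
        1/shift_div of channels move one step back in time, another
        1/shift_div move one step forward; zeros pad the sequence ends.
        """
        n, t, c, h, w = x.shape
        fold = c // shift_div
        out = torch.zeros_like(x)
        out[:, :-1, :fold] = x[:, 1:, :fold]                  # future -> present
        out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # past -> present
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # rest untouched
        return out  # same shape, so any 2D conv can follow unchanged

    # Example: batch of 2 clips, 8 frames, 64 channels, 56x56 maps.
    y = temporal_shift(torch.randn(2, 8, 64, 56, 56))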
GEBD can be used as a preliminary step in a larger downstream application, e.g. video summarization, video captioning [76], or ad cue-point detection [8]. It is, therefore, important that the GEBD model not add excessive computational overhead to the overall pipeline, unlike many of the examples of related work presented here.
2.2. SSL for video representation learning.
Self-supervision has become the new norm for learning representations given its ability to exploit unlabelled data [59, 23, 15, 2, 5, 81, 4, 9, 60, 39, 14]. Recent approaches devised for video understanding can be divided into two categories based on the SSL objective, namely pretext task based and contrastive learning based.
Pretext task based.
The key idea here is to design a pretext task for which labels, referred to as pseudo labels, are generated in an online fashion without any human annotation. Examples include predicting correct temporal order [58], Video Rot-Net [34] for video rotation prediction, clip order prediction [78], odd-one-out networks [20], sorting sequences [45], and pace prediction [75] (which also leverages contrastive learning as an additional objective). All these approaches exploit raw spatio-temporal signals from videos in different ways based on pretext tasks and consequently learn representations suitable for varied downstream tasks. A minimal sketch of online pseudo-label generation is given below.
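To illustrate how such pseudo labels are produced online without human annotation, the sketch below builds a single clip-order-prediction sample in the spirit of [78]; the clip length, number of clips, and permutation set are our own illustrative choices.

    import random
    import torch

    def make_order_sample(video, clip_len=8):
        """Build one clip-order pretext sample from raw frames.

        video: (T, C, H, W) tensor of frames, T >= 3 * clip_len.
        Returns three shuffled clips and the index of the permutation
        used, which serves as the pseudo label for a classifier.
        """
        perms = [(0, 1, 2), (0, 2, 1), (1, 0, 2),
                 (1, 2, 0), (2, 0, 1), (2, 1, 0)]
        clips = [video[i * clip_len:(i + 1) * clip_len] for i in range(3)]
        label = random.randrange(len(perms))    # pseudo label, made online
        shuffled = torch.stack([clips[j] for j in perms[label]])
        return shuffled, label                  # (3, clip_len, C, H, W), int

    # Example: a 24-frame RGB video at 112x112 resolution.
    x, y = make_order_sample(torch.randn(24, 3, 112, 112))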
Contrastive learning based.
Contrastive learning approaches bring semantically similar objects, clips, etc., close together in the embedding space while contrasting them with negative samples, using objectives based on some variant of Noise Contrastive Estimation (NCE) [24]; a minimal sketch of such an objective is given below.
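As a point of reference, the sketch below implements a generic InfoNCE-style objective over a batch of positive pairs; it represents the family of losses these works build on rather than any one paper's exact formulation, and the temperature value is our own illustrative choice.

    import torch
    import torch.nn.functional as F

    def info_nce(q, k, temperature=0.1):
        """Generic InfoNCE loss over a batch of positive pairs.

        q, k: (N, D) embeddings; (q[i], k[i]) is a positive pair and
        every k[j] with j != i acts as a negative for q[i].
        """
        q = F.normalize(q, dim=1)
        k = F.normalize(k, dim=1)
        logits = q @ k.t() / temperature       # (N, N) similarity matrix
        labels = torch.arange(q.size(0))       # positives on the diagonal
        return F.cross_entropy(logits, labels)

    # Example: 32 positive pairs of 128-dim embeddings.
    loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))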
The Contrastive Predictive Coding (CPC) approach [60] for images was extended to videos in DPC [26] and MemDPC [27], which augments DPC with the notion of compressed memory. Li et al. [73] extend the contrastive multi-view framework for inter-intra style video representation, while Kong et al. [38] combine ideas from cycle-consistency with contrastive learning to propose cycle-contrast. Likewise, Yang et al. [79] exploit visual tempo in a contrastive framework to learn spatio-temporal features. Similarly, [12, 3] use temporal cues with contrastive learning. VCLR [40] formulates a video-level contrastive objective to capture global context. In the work presented here, we exploit VCLR as our backbone objective. However, unlike the pretext tasks in VCLR, which operate only at the frame level, we modify those pretext tasks to operate at both the frame level and the clip level, thereby leading to better modeling of the spatio-temporal features in videos. See [66] for a more extensive review of SSL methods for video understanding.
2.3. Motion estimation and learning visual correspondences for video understanding.
Motion estimation.
Two-stream architectures [19, 69] have exhibited promising performance on the action recognition task by using pre-computed optical flow, although