Motion Aware Self-Supervision for Generic Event Boundary Detection
Ayush K. Rai, Tarun Krishna, Julia Dietlmeier, Kevin McGuinness, Alan F. Smeaton, Noel E. O’Connor
Insight SFI Centre for Data Analytics, Dublin City University (DCU)
ayush.rai3@mail.dcu.ie
*Equal supervision
Abstract
The task of Generic Event Boundary Detection (GEBD) aims to detect moments in videos that are naturally perceived by humans as generic and taxonomy-free event boundaries. Modeling the dynamically evolving temporal and spatial changes in a video makes GEBD a difficult problem to solve. Existing approaches involve very complex and sophisticated pipelines in terms of architectural design choices, hence creating a need for more straightforward and simplified approaches. In this work, we address this issue by revisiting a simple and effective self-supervised method and augmenting it with a differentiable motion feature learning module to tackle the spatial and temporal diversities in the GEBD task. We perform extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets to demonstrate the efficacy of the proposed approach compared to other self-supervised state-of-the-art methods. We also show that this simple self-supervised approach learns motion features without any explicit motion-specific pretext task. Our results can be reproduced on GitHub.
1. Introduction
Modeling videos using deep learning methods in order to learn effective global and local video representations is an extremely challenging task. Current state-of-the-art video models [18] are built upon a limited set of predefined action classes and usually process short clips followed by a pooling operation to generate global video-level predictions. Other mainstream computer vision tasks for video processing have mainly focused on action anticipation [56, 1], temporal action detection [7, 22], temporal action segmentation [43, 41], and temporal action parsing [62, 67]. However, only limited attention has been given to understanding long-form videos. Cognitive scientists [74] have observed that humans perceive videos by breaking them down into shorter temporal units, each carrying a semantic meaning, and can also reason about them. This creates a need to investigate research problems around detecting temporal boundaries in videos in a way that is consistent with their semantic validity and interpretability from a cognitive point of view.
To this end, the GEBD task was recently introduced in [68]¹ with the objective of studying the long-form video understanding problem through the lens of the human perception mechanism. GEBD aims at identifying changes in content, independent of changes in action, brightness, object, etc., i.e. generic event boundaries, making it different from tasks such as video localization [77]. Video events could indicate the completion of goals or sub-goals, or occasions where it becomes difficult for humans to predict what will happen next. The recently released Kinetics-GEBD dataset [68] is the first dataset specific to the GEBD task. It is annotated by 5 different event boundary annotators, thereby capturing the subtlety involved in human perception and making it the dataset with the greatest number of temporal boundaries (8× EPIC-Kitchens-100 [11] and 32× ActivityNet [16]).
The primary challenge in the GEBD task is to effectively model generic spatial and temporal diversity, as described in DDM-Net [72]. Spatial diversity is primarily the result of both low-level changes, e.g. changes in brightness or appearance, and high-level changes, e.g. changes in camera angle, or the appearance and disappearance of the dominant subject. Temporal diversity, on the other hand, can be attributed to changes in action, or changes caused by the object of interaction, which occur at different speeds and durations depending on the subject. These spatio-temporal diversities make GEBD a difficult problem to address.
¹ LOVEU@CVPR2021, LOVEU@CVPR2022

Figure 1. The overall architecture consists of two stages: a) Stage 1 involves pre-training the modified ResNet50 encoder (augmented with a MotionSqueeze layer) with four pretext tasks using a contrastive learning based objective; b) Stage 2 consists of fine-tuning the encoder on the downstream GEBD task. Refer to Table 4 in the Supplementary material for encoder details.

In this work, to address both the biased nature of video models trained over predefined classes in a supervised setting and the spatial diversity in GEBD, we leverage the power of self-supervised models. Self-supervised techniques like TCLR [12] and CCL [38] have achieved breakthrough results on various downstream tasks for video understanding. The representations learned using self-supervised learning (SSL) methods are not biased towards any predefined action class, making SSL methods an ideal candidate for the GEBD task. In addition, in order to characterize temporal diversity in GEBD, learning motion information is essential to capture the fine-grained temporal variations that occur during change-of-action scenarios. Previous methods in video modeling learn temporal motion cues by pre-computing the optical flow [53, 52, 54] between consecutive frames, which is done externally and requires substantial computation. Alternatively, methods such as those described in [32, 21] estimate optical flow internally by learning visual correspondences between images. The motion features learnt on-the-fly can also be used for downstream applications such as action recognition, as illustrated in [82, 42].
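To make this concrete, the following is a minimal sketch of how such a correspondence-based motion layer can operate, in the spirit of the MS module [42]: compute a local correlation volume between feature maps of consecutive frames, then recover a dense displacement map with a soft-argmax. The function names, window size, and tensor shapes are our own illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def local_correlation(feat_t, feat_t1, max_disp=3):
    """Correlation volume between feature maps of two consecutive frames,
    restricted to a (2*max_disp+1)^2 local search window."""
    B, C, H, W = feat_t.shape
    feat_t1 = F.pad(feat_t1, (max_disp,) * 4)  # pad H and W by max_disp
    corrs = []
    for dy in range(2 * max_disp + 1):          # row-major scan of the window
        for dx in range(2 * max_disp + 1):
            shifted = feat_t1[:, :, dy:dy + H, dx:dx + W]
            corrs.append((feat_t * shifted).sum(dim=1))  # dot product over channels
    return torch.stack(corrs, dim=1)  # (B, (2*max_disp+1)^2, H, W)

def soft_displacement(corr, max_disp=3):
    """Soft-argmax over the correlation volume -> dense 2-channel flow map."""
    B, K, H, W = corr.shape
    prob = corr.softmax(dim=1)
    d = torch.arange(-max_disp, max_disp + 1, dtype=corr.dtype, device=corr.device)
    dy = d.repeat_interleave(2 * max_disp + 1).view(1, K, 1, 1)  # matches outer loop
    dx = d.repeat(2 * max_disp + 1).view(1, K, 1, 1)             # matches inner loop
    return torch.stack([(prob * dx).sum(1), (prob * dy).sum(1)], dim=1)  # (B, 2, H, W)
```

Because every step here is differentiable, such a layer can sit inside the encoder and be trained end-to-end, avoiding externally pre-computed optical flow.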
This presents an interesting research question: how can we develop an SSL framework for video understanding that accounts for both appearance and motion features? Do we need an explicit motion-specific training objective, or can this be achieved implicitly? We answer these questions by rethinking SSL: we reformulate the training objective proposed in VCLR [40] at the clip level and further integrate it with differentiable motion estimation layers using the MotionSqueeze (MS) module introduced in [42] to jointly learn appearance and motion features for videos. To summarise, the main contributions of our work are as follows:
• We revisit a simple self-supervised method, VCLR [40], with a noticeable change: we modify its pretext tasks by splitting them into frame-level and clip-level tasks to learn effective video representations (cVCLR); a rough sketch of this split follows this list. We further augment the encoder with a differentiable motion feature learning module for GEBD.
• We conduct an exhaustive evaluation on the Kinetics-GEBD and TAPOS datasets and show that our approach achieves performance comparable to self-supervised state-of-the-art methods without using enhancements like model ensembles, pseudo-labeling, or features from other modalities (e.g. audio).
• We show that the model can learn motion features under self-supervision even without any explicit motion-specific pretext task.
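As a rough illustration of the frame-level/clip-level split, the sketch below samples both kinds of positive pairs from a single video. The segment count, clip length, and pairing rules are hypothetical simplifications for illustration, not the precise cVCLR sampling scheme.

```python
import random

def sample_positive_pairs(video_frames, num_segments=4, clip_len=2):
    """Illustrative two-granularity positive sampling from one video.

    video_frames: list of frame indices (stand-in for decoded frames);
    assumes len(video_frames) >= num_segments * 2 * clip_len.
    Returns (frame_pair, clip_pair):
      - frame_pair: two frames from different segments, treated as
        frame-level positives (global appearance should agree);
      - clip_pair: two temporally close clips from one segment, treated
        as clip-level positives (local motion context should agree).
    """
    T = len(video_frames)
    seg_len = T // num_segments
    segments = [video_frames[i * seg_len:(i + 1) * seg_len]
                for i in range(num_segments)]

    # Frame-level positives: one frame from each of two distinct segments.
    s1, s2 = random.sample(range(num_segments), 2)
    frame_pair = (random.choice(segments[s1]), random.choice(segments[s2]))

    # Clip-level positives: two adjacent short clips from one segment.
    s = random.randrange(num_segments)
    start = random.randrange(max(1, seg_len - 2 * clip_len))
    clip_a = segments[s][start:start + clip_len]
    clip_b = segments[s][start + clip_len:start + 2 * clip_len]
    return frame_pair, (clip_a, clip_b)
```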
2. Related Work
2.1. Generic Event Boundary Detection.
The task of GEBD [68] is similar in nature to the Temporal Action Localization (TAL) task, where the goal is to localize the start and end points of an action occurrence along with the action category. Initial attempts to address GEBD were inspired by popular TAL solvers, including boundary matching networks (BMN) [52] and BMN-StartEnd [68], which generate proposals with precise temporal boundaries along with reliable confidence scores. Shou et al. [68] introduced a supervised baseline, the Pairwise Classifier (PC), which treats GEBD as a framewise binary classification problem (boundary or not) using a simple linear classifier over concatenated average features from the neighbourhood of a candidate frame. However, since GEBD is a new task, most current methods are extensions of state-of-the-art approaches to other video understanding tasks, and overlook the subtle differentiating characteristics of GEBD. Hence there is a necessity for GEBD-specialized solutions.
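For reference, the PC baseline reduces to a very small model. The sketch below captures the idea under assumed feature dimensions and window size (both illustrative, not the values used in [68]).

```python
import torch
import torch.nn as nn

class PairwiseClassifier(nn.Module):
    """PC-style baseline sketch: average per-frame features on each side
    of a candidate frame, concatenate them, and classify boundary vs.
    non-boundary with a single linear layer."""

    def __init__(self, feat_dim=2048):
        super().__init__()
        self.fc = nn.Linear(2 * feat_dim, 2)  # boundary / non-boundary logits

    def forward(self, feats, t, k=5):
        # feats: (T, D) per-frame features; t: candidate frame index;
        # k: half-window size. Assumes k <= t <= T - k - 1 so that
        # both windows are non-empty.
        before = feats[t - k:t].mean(dim=0)
        after = feats[t + 1:t + 1 + k].mean(dim=0)
        return self.fc(torch.cat([before, after], dim=-1))
```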
DDM-Net [72] applied progressive attention on multi-level dense difference maps (DDM) to characterize motion patterns and jointly learn motion with appearance cues in a supervised setting. In contrast, we learn generic motion features by augmenting the encoder with an MS module in a self-supervised setting. Hong et al. [29] used a cascaded temporal attention network for GEBD, while Rai et al. [64] explored the use of spatio-temporal features using two-stream networks. Li et al. [49] designed an end-to-end spatial-channel compressed encoder and a temporal contrastive module to determine event boundaries. Recently, SC-Transformer [48] introduced a structured partition of sequences (SPoS) mechanism to learn structured context using a transformer-based architecture for GEBD, and augmented it with the computation of group similarity to learn distinctive features for boundary detection. One advantage of SC-Transformer is that it is independent of video length and predicts all boundaries in a single forward pass by feeding in 100 frames; however, it requires substantial memory and computational resources.
Regarding unsupervised GEBD approaches, a shot detector library² and PredictAbility (PA) have been investigated in [68]. The authors of UBoCo [36, 35] proposed a novel supervised/unsupervised method that applies contrastive learning to a TSM³-based intermediary representation of videos to learn discriminative boundary features. UBoCo's recursive TSM parsing algorithm exploits generic patterns and detects very precise boundaries. However, they pre-process all the videos in the dataset to have the same frames-per-second (fps) value of 24, which adds a computational overhead. Furthermore, like the SC-Transformer, UBoCo inputs the frames representing the whole video at once, whereas in our work we use raw video signals for pre-training and only the context around the candidate boundary as input to the GEBD task. TeG [63] proposed a generic self-supervised model for video understanding that learns persistent and more fine-grained features, and evaluated it on the GEBD task. The main difference between TeG and our work is that TeG uses a 3D-ResNet-50 encoder as its backbone, which makes training computationally expensive, whereas we use a 2D-ResNet-50 model and modify it by adding the temporal shift module (TSM⁴) [51] to achieve the same effect as 3D convolution while keeping the complexity of a 2D CNN.

² https://github.com/Breakthrough/PySceneDetect
³ Temporal Self-Similarity Matrix
⁴ Temporal Shift Module
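The shift operation at the core of TSM is simple enough to state compactly. The sketch below follows the description in [51]: a fraction of channels is shifted one step forward in time, another fraction one step backward, and the rest are left untouched, letting the following 2D convolution mix information across neighbouring frames at no extra FLOP cost. The shift fraction and the residual placement of the module in the network are design choices of the full TSM.

```python
import torch

def temporal_shift(x, shift_div=8):
    """Shift part of the channels along the time axis.

    x: (N, T, C, H, W) activations for N videos of T frames.
    """
    N, T, C, H, W = x.shape
    fold = C // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # keep the rest in place
    return out
```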
GEBD can be used as a preliminary step in a larger downstream application, e.g. video summarization, video captioning [76], or ad cue-point detection [8]. It is, therefore, important that the GEBD model not add excessive computational overhead to the overall pipeline, unlike many of the examples of related work presented here.
2.2. SSL for video representation learning.
Self-supervision has become the new norm for learning representations, given its ability to exploit unlabelled data [59, 23, 15, 2, 5, 81, 4, 9, 60, 39, 14]. Recent approaches devised for video understanding can be divided into two categories based on the SSL objective, namely pretext task based and contrastive learning based.
Pretext task based.
The key idea here is to design a pretext task for which labels, referred to as pseudo-labels, are generated in an online fashion without any human annotation. Examples include predicting the correct temporal order [58], Video Rot-Net [34] for video rotation prediction, clip order prediction [78], odd-one-out networks [20], sorting sequences [45], and pace prediction [75]⁵. All these approaches exploit raw spatio-temporal signals from videos in different ways based on pretext tasks and consequently learn representations suitable for varied downstream tasks.
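To illustrate what "generated in an online fashion" means, the sketch below builds a training sample for a clip-order pretext task in the spirit of [78]: the permutation applied to the clips is itself the pseudo-label, so no human annotation is needed. The clip count and label encoding are illustrative assumptions.

```python
import itertools
import random
import torch

# All orderings of 3 clips; the permutation index is the class label.
PERMS = list(itertools.permutations(range(3)))  # 6 classes

def make_order_sample(clips):
    """clips: list of 3 tensors, each (C, L, H, W), cut from one video.
    Returns the shuffled clips and the pseudo-label identifying the
    permutation the network must recover."""
    label = random.randrange(len(PERMS))
    shuffled = [clips[i] for i in PERMS[label]]
    return torch.stack(shuffled), torch.tensor(label)
```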
Contrastive learning based.
Contrastive learning approaches bring semantically similar objects, clips, etc., close together in the embedding space while contrasting them with negative samples, using objectives based on some variant of Noise Contrastive Estimation (NCE) [24]. The Contrastive Predictive Coding (CPC) approach [60] for images was extended to videos in DPC [26] and MemDPC [27], which augments DPC with the notion of a compressed memory. Li et al. [73] extend the contrastive multi-view framework to inter-intra style video representation, while Kong et al. [38] combine ideas from cycle-consistency with contrastive learning to propose cycle-contrast. Likewise, Yang et al. [79] exploit visual tempo in a contrastive framework to learn spatio-temporal features. Similarly, [12, 3] use temporal cues with contrastive learning. VCLR [40] formulates a video-level contrastive objective to capture global context. In the work presented here, we exploit VCLR as our backbone objective. However, different from the pretext tasks in VCLR, which perform computation only at the frame level, we modify those pretext tasks to operate not only at the frame level but also at the clip level, thereby leading to better modeling of the spatio-temporal features in videos. See [66] for a more extensive review of SSL methods for video understanding.
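For concreteness, the NCE-style objective underlying most of these contrastive methods can be written in a few lines. The sketch below is a generic InfoNCE formulation, not the exact loss of any particular paper cited here; the batch layout and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE: pull each query towards its positive and away
    from its negatives.

    query, positive: (B, D) embeddings; negatives: (B, K, D).
    All are assumed to come from the same projection head and are
    L2-normalised here so that dot products are cosine similarities."""
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    l_pos = (q * pos).sum(dim=-1, keepdim=True)        # (B, 1)
    l_neg = torch.einsum('bd,bkd->bk', q, neg)         # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive similarity sits at index 0 of every row.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```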
2.3. Motion estimation and learning visual correspondences for video understanding.
Motion estimation.
Two-stream architectures [19, 69] have exhibited promising performance on the action recognition task by using pre-computed optical flow, although

⁵ Also leverages contrastive learning as an additional objective.