
that uses concatenated average features around the neighbourhood of a candidate frame. However, since GEBD is a new task, most current methods are extensions of state-of-the-art approaches to other video understanding tasks, which overlook the subtle differentiating characteristics of GEBD. Hence, there is a need for GEBD-specialized solutions.
DDM-Net [72] applied progressive attention on multi-level dense difference maps (DDMs) to characterize motion patterns and jointly learn motion and appearance cues in a supervised setting; a minimal sketch of a dense difference map is given below. In contrast, we learn generic motion features by augmenting the encoder with an MS module in a self-supervised setting.
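For intuition, a dense difference map can be approximated as the element-wise difference between a frame's features and those of its temporal neighbours. The Python sketch below uses multiple temporal strides as a stand-in for DDM-Net's multi-level construction; the tensor shapes and stride set are our own illustrative assumptions, not the exact formulation of [72].

    import torch

    def dense_difference_maps(feats, strides=(1, 2, 4)):
        """Illustrative dense difference maps (not DDM-Net's exact design).

        feats: (T, C, H, W) per-frame feature maps from a 2D backbone.
        Returns one (T, C, H, W) difference map per temporal stride,
        with frames shifted past the sequence edge zero-padded.
        """
        maps = []
        for s in strides:
            shifted = torch.zeros_like(feats)
            shifted[:-s] = feats[s:]          # features s steps ahead in time
            maps.append(shifted - feats)      # motion cue at this stride
        return maps                           # list of (T, C, H, W) tensors

    # Example: 16 frames of 256-channel 14x14 feature maps.
    diffs = dense_difference_maps(torch.randn(16, 256, 14, 14))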
Hong et al. [29] used a cascaded temporal attention network for GEBD, while Rai et al. [64] explored the use of spatio-temporal features using two-stream networks. Li et al. [49] designed an end-to-end spatial-channel compressed encoder and temporal contrastive module to determine event boundaries. Recently, SC-Transformer [48] introduced a structured partition of sequences (SPoS) mechanism to learn structured context using a transformer-based architecture for GEBD, and augmented it with the computation of group similarity to learn distinctive features for boundary detection. One advantage of SC-Transformer is that it is independent of video length and predicts all boundaries in a single forward pass by feeding in 100 frames; however, it requires substantial memory and computational resources.
Regarding unsupervised GEBD approaches, a shot detector library (PySceneDetect, https://github.com/Breakthrough/PySceneDetect) and PredictAbility (PA) have been investigated in [68]. The authors of UBoCo [36, 35] proposed a novel supervised/unsupervised method that applies contrastive learning to an intermediary representation of videos based on the Temporal Self-Similarity Matrix (TSM) to learn discriminatory boundary features. UBoCo's recursive TSM parsing algorithm exploits generic patterns and detects very precise boundaries (a minimal sketch of a TSM is given below). However, UBoCo pre-processes all videos in the dataset to a uniform rate of 24 frames per second (fps), which adds computational overhead. Furthermore, like the SC-Transformer, UBoCo inputs the frames representing the whole video at once, whereas in our work we use raw video signals for pre-training and only the context around the candidate boundary as input to the GEBD task.
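For concreteness, a Temporal Self-Similarity Matrix can be sketched as the pairwise cosine similarity between per-frame embeddings, in which event boundaries tend to appear as block boundaries. The sketch below is a minimal illustration, not UBoCo's exact pipeline; the source of the embeddings is an assumption on our part.

    import torch
    import torch.nn.functional as F

    def self_similarity_matrix(frame_feats):
        """T x T cosine-similarity matrix over per-frame embeddings.

        frame_feats: (T, D) tensor, one embedding per frame.
        Event boundaries tend to show up as edges between blocks.
        """
        z = F.normalize(frame_feats, dim=1)  # unit-norm rows
        return z @ z.t()                     # (T, T), values in [-1, 1]

    # Example: 100 frames with 512-dim embeddings.
    tsm = self_similarity_matrix(torch.randn(100, 512))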
TeG [63] proposed a generic self-supervised model for video understanding that learns persistent and fine-grained features, and evaluated it on the GEBD task. The main difference between TeG and our work is that TeG uses a 3D-ResNet-50 encoder as its backbone, which makes training computationally expensive, whereas we use a 2D-ResNet-50 model and modify it by adding the Temporal Shift Module (TSM) [51] to achieve a similar effect to 3D convolution while keeping the complexity of a 2D CNN.
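To illustrate why the shift adds temporal modeling at negligible extra cost, the sketch below moves a fraction of channels one step backward and forward along the time axis, leaving the spatial convolutions untouched; the 1/8 shift fraction follows the default in [51], while the surrounding tensor shapes are our own assumptions.

    import torch

    def temporal_shift(x, shift_div=8):
        """Shift a fraction of channels along time (cf. TSM [51]).

        x: (N, T, C, H, W) activations from a 2D CNN stage.
        1/shift_div of channels move one step back in time, another
        1/shift_div move one step forward; zeros pad the sequence ends.
        """
        n, t, c, h, w = x.shape
        fold = c // shift_div
        out = torch.zeros_like(x)
        out[:, :-1, :fold] = x[:, 1:, :fold]                  # future -> present
        out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # past -> present
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # rest untouched
        return out  # same shape, so any 2D conv can follow unchanged

    # Example: batch of 2 clips, 8 frames, 64 channels, 56x56 maps.
    y = temporal_shift(torch.randn(2, 8, 64, 56, 56))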
GEBD can be used as a preliminary step in a larger downstream application, e.g. video summarization, video captioning [76], or ad cue-point detection [8]. It is, therefore, important that the GEBD model not add excessive computational overhead to the overall pipeline, unlike many of the examples of related work presented here.
2.2. SSL for video representation learning.
Self-supervision has become the new norm for learning representations given its ability to exploit unlabelled data [59, 23, 15, 2, 5, 81, 4, 9, 60, 39, 14]. Recent approaches devised for video understanding can be divided into two categories based on the SSL objective, namely pretext task based and contrastive learning based.
Pretext task based.
The key idea here is to design a pretext task for which labels, referred to as pseudo labels, are generated in an online fashion without any human annotation. Examples include predicting correct temporal order [58], Video Rot-Net [34] for video rotation prediction, clip order prediction [78], odd-one-out networks [20], sorting sequences [45], and pace prediction [75] (which also leverages contrastive learning as an additional objective). All these approaches exploit raw spatio-temporal signals from videos in different ways based on pretext tasks and consequently learn representations suitable for varied downstream tasks. A minimal sketch of online pseudo-label generation is given below.
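To illustrate how such pseudo labels are produced online without human annotation, the sketch below builds a single clip-order-prediction sample in the spirit of [78]; the clip length, number of clips, and permutation set are our own illustrative choices.

    import random
    import torch

    def make_order_sample(video, clip_len=8):
        """Build one clip-order pretext sample from raw frames.

        video: (T, C, H, W) tensor of frames, T >= 3 * clip_len.
        Returns three shuffled clips and the index of the permutation
        used, which serves as the pseudo label for a classifier.
        """
        perms = [(0, 1, 2), (0, 2, 1), (1, 0, 2),
                 (1, 2, 0), (2, 0, 1), (2, 1, 0)]
        clips = [video[i * clip_len:(i + 1) * clip_len] for i in range(3)]
        label = random.randrange(len(perms))    # pseudo label, made online
        shuffled = torch.stack([clips[j] for j in perms[label]])
        return shuffled, label                  # (3, clip_len, C, H, W), int

    # Example: a 24-frame RGB video at 112x112 resolution.
    x, y = make_order_sample(torch.randn(24, 3, 112, 112))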
Contrastive learning based.
Contrastive learning approaches bring semantically similar objects, clips, etc., close together in the embedding space while contrasting them with negative samples, using objectives based on some variant of Noise Contrastive Estimation (NCE) [24]; a minimal sketch of such an objective is given below.
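As a point of reference, the sketch below implements a generic InfoNCE-style objective over a batch of positive pairs; it represents the family of losses these works build on rather than any one paper's exact formulation, and the temperature value is our own illustrative choice.

    import torch
    import torch.nn.functional as F

    def info_nce(q, k, temperature=0.1):
        """Generic InfoNCE loss over a batch of positive pairs.

        q, k: (N, D) embeddings; (q[i], k[i]) is a positive pair and
        every k[j] with j != i acts as a negative for q[i].
        """
        q = F.normalize(q, dim=1)
        k = F.normalize(k, dim=1)
        logits = q @ k.t() / temperature       # (N, N) similarity matrix
        labels = torch.arange(q.size(0))       # positives on the diagonal
        return F.cross_entropy(logits, labels)

    # Example: 32 positive pairs of 128-dim embeddings.
    loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))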
The Contrastive Predictive Coding (CPC) approach [60] for images was extended to videos in DPC [26] and MemDPC [27], which augments DPC with the notion of compressed memory. Li et al. [73] extend the contrastive multi-view framework for inter-intra style video representation, while Kong et al. [38] combine ideas from cycle-consistency with contrastive learning to propose cycle-contrast. Likewise, Yang et al. [79] exploit visual tempo in a contrastive framework to learn spatio-temporal features. Similarly, [12, 3] use temporal cues with contrastive learning. VCLR [40] formulates a video-level contrastive objective to capture global context. In the work presented here, we exploit VCLR as our backbone objective. However, unlike the pretext tasks in VCLR, which operate only at the frame level, we modify those pretext tasks to operate at both the frame level and the clip level, thereby leading to better modeling of the spatio-temporal features in videos. See [66] for a more extensive review of SSL methods for video understanding.
2.3. Motion estimation and learning visual correspondences for video understanding.
Motion estimation.
Two-stream architectures [19, 69] have exhibited promising performance on the action recognition task by using pre-computed optical flow, although