confidence from the relative distance between the estimated
frame and the ground truth frame, for both the start and
end boundaries of the action. This confidence information
is leveraged to refine the boundaries of proposals, which
leads to state-of-the-art action detection results on EPIC-
KITCHENS-100 [9] and THUMOS14 [16], and an improvement
on ActivityNet-1.3 [4].
In summary, we introduce a boundary head for one-
stage anchor-free action detection which estimates bound-
ary confidence scores based on relative distances. We ob-
tain state-of-the-art results on EPIC-KITCHENS-100 and
THUMOS14 action detection, using the same backbone as
the current state-of-the-art. Notably, a significant improvement
is achieved on EPIC-KITCHENS-100, which indicates
that our method performs well on complex actions
of various lengths. Further, we provide detailed ablations,
including an analysis of the confidence scores and of the
effect of action length.
2. Related Work
Action detection methods can be grouped into two cate-
gories: two-stage and one-stage.
Two-stage action detection: Two-stage methods first gen-
erate a set of candidate proposals and then classify each
proposal. They typically generate proposals from pre-defined
sliding windows, or by grouping temporal locations with a high
probability of lying within an action [51] or close to a
boundary [22,2]. Action and boundary combinations can
be selected based on high boundary confidence [20,7], or
a combination of separately calculated boundary and ac-
tion scores [37]. This generation process can struggle when
presented with sequences containing many dense actions of
varying lengths, such as EPIC-KITCHENS-100 [9].
One-stage action detection: One-stage methods improve
detection efficiency by simultaneously predicting action
proposals and their associated classes. One approach is to
generate candidate boundaries by modelling temporal rela-
tionships [13,21,29,43,24]. However, these methods rely
on pre-defined anchors, causing them to struggle when pre-
sented with a wide range of action durations. Inspired by
the DETR framework [5] for object detection, some works
use learned action [27] or graph [30] queries as input to a
transformer decoder. Whilst a promising direction, these
methods are not well suited to long videos, as the cost of
attention grows quadratically with sequence length. Anchor-free
methods [47,19,48] simultaneously predict classification scores
and a pair of relative distances to boundaries for each timestep.
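At each timestep, such anchor-free predictions can be decoded into a candidate segment from the predicted pair of relative distances. The following minimal sketch illustrates the parameterisation; the function name and the grid-unit convention are our own assumptions, not code from any cited method:

```python
def decode_proposal(t, stride, d_start, d_end):
    """Decode an anchor-free prediction at feature-grid index t.

    The timestep maps to frame t * stride; the predicted relative
    distances d_start / d_end (in grid units) are scaled back to
    frames to give the (start, end) of the candidate segment.
    """
    center = t * stride
    return (center - d_start * stride, center + d_end * stride)
```

For example, a timestep at index 10 of a stride-4 pyramid level with predicted distances (2, 3) decodes to the segment spanning frames 32 to 52.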
Recently, ActionFormer [48] produced these predictions
using a multi-scale transformer encoder that models both short-
and long-range temporal dependencies, together with simple
classification and boundary regression heads, achieving
state-of-the-art results on a number of benchmarks. In this work,
we adopt the same multi-scale transformer encoder and
pipeline as ActionFormer [48], but incorporate the ability
to estimate boundary confidences.
3. Method
We first briefly review ActionFormer [48], and then in-
troduce our novel boundary head, which is incorporated into
ActionFormer to achieve better performance.
3.1. Overview of ActionFormer
ActionFormer first extracts a feature pyramid based on
local self-attention, and then uses light-weight heads to si-
multaneously predict classification scores and a pair of rel-
ative distances to boundaries for each timestep.
Transformer-based feature pyramid: ActionFormer ex-
tracts features from an untrimmed sequence and passes
them to a multi-scale transformer encoder [48] to construct
a feature pyramid sequence, which represents each timestep
at multiple resolutions and thus allows a single timestep to
detect both short and long actions.
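As a toy illustration of the multi-resolution idea, repeated stride-2 pooling stands in here for the multi-scale transformer encoder, which this sketch does not implement:

```python
def build_pyramid(features, num_levels=3):
    """Build a multi-resolution pyramid by repeated stride-2 max pooling.

    Each level halves the temporal resolution, so later levels cover
    longer temporal extents per position; a toy stand-in for the
    transformer-based pyramid, not the actual encoder.
    """
    pyramid = [features]
    for _ in range(num_levels - 1):
        prev = pyramid[-1]
        pyramid.append([max(prev[i:i + 2]) for i in range(0, len(prev) - 1, 2)])
    return pyramid
```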
Prediction heads: A classification head uses the feature
pyramid sequence to predict action labels and classification
scores for each timestep in multiple resolutions, and simi-
larly, a regression head predicts relative distances to the pre-
dicted start and end boundary locations, for every timestep
in the feature pyramid.
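A minimal sketch of such lightweight heads, using plain linear projections in place of the small convolutional stacks used in practice (the weights and shapes here are assumptions for illustration):

```python
import numpy as np

def prediction_heads(feats, w_cls, w_reg):
    """Apply per-timestep prediction heads to pyramid features.

    feats: (T, D) features; w_cls: (D, C) classification weights;
    w_reg: (D, 2) regression weights. Returns per-timestep class
    probabilities and non-negative start/end distances.
    """
    cls_scores = 1.0 / (1.0 + np.exp(-(feats @ w_cls)))  # sigmoid over classes
    reg_dists = np.maximum(feats @ w_reg, 0.0)           # clamp distances at zero
    return cls_scores, reg_dists
```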
Training and Inference: The network is trained by minimising
a multi-part loss over the classification and regression heads.
The classification head uses a focal loss [23] to balance the
contributions of easy and hard examples, while the regression
head minimises the discrepancy between the ground truth and
predicted boundaries using the GIoU loss [35]. At inference,
the network predicts a pair of relative distances to boundaries
and a classification score, giving one proposal per timestep
across all pyramid levels. These candidate proposals are
ranked by classification score and further filtered to obtain
the final action detections.
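The ranking-and-filtering step is typically a form of temporal non-maximum suppression; the following is a simple greedy 1D NMS sketch, where the threshold values and names are our own choices rather than the pipeline's exact settings:

```python
def rank_and_filter(proposals, scores, iou_thresh=0.5, top_k=100):
    """Keep high-scoring proposals, suppressing heavy temporal overlaps.

    proposals: list of (start, end) segments; scores: matching
    classification scores. Greedy 1D non-maximum suppression.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        s1, e1 = proposals[i]
        suppressed = False
        for j in keep:
            s2, e2 = proposals[j]
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
        if len(keep) == top_k:
            break
    return [proposals[i] for i in keep]
```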
3.2. Boundary Head
In ActionFormer, the regression head nominates where
the boundaries are, without providing any confidence that
the predicted locations are boundaries. To address this, we
compute the boundary confidence at the same time as the
boundary location prediction. One approach would be to use
a separate branch to directly predict boundary confidence.
However, this may lead to learning conflicts in the anchor-free
pipeline, where the original network must learn the relative
distances between the current temporal location and the ground
truth boundaries, rather than the confidence that the current
location is a boundary (demonstrated in Section 4.3).
We design a simple but effective boundary head, which
computes boundary probabilities via a confidence scaling,