Refining Action Boundaries for One-stage Detection
Hanyuan Wang Majid Mirmehdi Dima Damen
Toby Perrett
Department of Computer Science, University of Bristol, Bristol, U.K.
{hanyuan.wang, dima.damen, toby.perrett}@bristol.ac.uk, majid@cs.bris.ac.uk
Abstract
Current one-stage action detection methods, which simultaneously predict action boundaries and the corresponding class, do not estimate or use a measure of confidence in their boundary predictions, which can lead to inaccurate boundaries. We incorporate the estimation of boundary confidence into one-stage anchor-free detection, through an additional prediction head that predicts the refined boundaries with higher confidence. We obtain state-of-the-art performance on the challenging EPIC-KITCHENS-100 action detection as well as the standard THUMOS14 action detection benchmarks, and achieve improvement on the ActivityNet-1.3 benchmark.
1. Introduction
Current video understanding approaches [12,6,40] recognise actions on short, trimmed videos. These assume the boundaries of actions are already given, and thus focus solely on the class prediction problem. However, most real-life videos are untrimmed and contain irrelevant visual content. Temporal action detection aims to temporally locate the boundaries of actions and classify them in longer, unscripted and untrimmed videos [9,14,16,4], which is crucial for video analysis.

Two-stage action detection approaches, such as [51,22,2,20,7,37], were built on top of successful recognition models [12,11,45] and widely used as reference methods on simple action detection baselines [16,4]. They first generate candidate proposals based on pre-defined sliding windows or matching locations with high probability scores, and then classify them to obtain the final predictions. However, such two-stage methods are inefficient for the wider variety of actions, action lengths and action/background densities found in longer untrimmed videos, since a large number of redundant candidate proposals are produced by
sliding windows and location matching.

978-1-6654-6382-9/22/$31.00 ©2022 IEEE

Figure 1. An illustration of the misalignment between the value of tIoU and classification scores of predicted proposals, caused by the absence of boundary confidences. Green denotes the ground truth; blue and orange denote predictions produced by ActionFormer [48]. Specifically, when the boundary confidence is not considered as the ranking metric, the prediction with a higher classification score but poor boundaries (blue) is chosen, rather than the prediction with better boundaries (orange).
More recently, one-stage methods have been proposed, where the network simultaneously predicts the current action for each timestep and its associated boundaries [47,19,48]. In this paper, we show that these methods are missing the boundary confidence in proposal regression and evaluation. This can lead to imprecise localisation due to insufficient boundary information, especially in the case of actions of various lengths found in egocentric data, such as EPIC-KITCHENS [9]. An example of the action ‘rinse cloth’ from [9] is shown in Figure 1, where the prediction with a higher classification score has a lower overlap between boundaries and ground truth (blue), while the prediction with better boundaries has a lower classification score (orange). This is due to the absence of boundary confidences, resulting in poor regression and unreliable scores.
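The mismatch in Figure 1 can be made concrete in a few lines of Python. The segments and scores below are hypothetical stand-ins for the ‘rinse cloth’ example, not values from the paper:

```python
def temporal_iou(pred, gt):
    """tIoU between two temporal segments given as (start, end) tuples."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical numbers: the proposal with the higher classification
# score overlaps the ground truth less than the lower-scored one.
gt = (10.0, 18.0)
blue = {"segment": (6.0, 14.0), "cls_score": 0.9}     # high score, poor boundaries
orange = {"segment": (10.5, 17.5), "cls_score": 0.6}  # low score, good boundaries

# Ranking by classification score alone picks the proposal with lower tIoU.
ranked = sorted([blue, orange], key=lambda p: p["cls_score"], reverse=True)
assert temporal_iou(ranked[0]["segment"], gt) < temporal_iou(ranked[1]["segment"], gt)
```

Here the blue proposal achieves a tIoU of only 1/3 against a tIoU of 0.875 for the orange one, yet wins the score-based ranking.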
arXiv:2210.14284v1 [cs.CV] 25 Oct 2022

In this paper, we consider the extent of an action proposal and jointly estimate the confidence of the start and end frames of the action segment. We supervise this confidence using the relative distance between the estimated frame and the ground-truth frame, for both the start and end boundaries of the action. This confidence information is leveraged to refine the boundaries of proposals, which leads to state-of-the-art action detection results on EPIC-KITCHENS-100 [9] and THUMOS14 [16], and improvement on ActivityNet-1.3 [4].
In summary, we introduce a boundary head for one-stage anchor-free action detection which estimates boundary confidence scores based on relative distances. We obtain state-of-the-art results on EPIC-KITCHENS-100 and THUMOS14 action detection, using the same backbone as the current state-of-the-art. Notably, significant improvement is achieved on EPIC-KITCHENS-100, which indicates that our method performs well on complex actions of various lengths. Further, we provide detailed ablations, including investigating confidence scores and the effect of action lengths.
2. Related Work
Action detection methods can be grouped into two categories: two-stage and one-stage.
Two-stage action detection: Two-stage methods first generate a set of candidate proposals and then classify each proposal. They typically generate proposals from pre-defined sliding windows, or by grouping temporal locations with high probabilities of being within an action [51] or close to a boundary [22,2]. Action and boundary combinations can be selected based on high boundary confidence [20,7], or a combination of separately calculated boundary and action scores [37]. This generation process can struggle when presented with sequences containing many dense actions of varying lengths, such as EPIC-KITCHENS-100 [9].
One-stage action detection: One-stage methods improve detection efficiency by simultaneously predicting action proposals and their associated classes. One approach is to generate candidate boundaries by modelling temporal relationships [13,21,29,43,24]. However, these methods rely on pre-defined anchors, causing them to struggle when presented with a wide range of action durations. Inspired by the DETR framework [5] for object detection, some works use learned action [27] or graph [30] queries as input to a transformer decoder. Whilst a promising direction, these methods are not suitable for long videos due to attention scaling issues. Anchor-free methods [47,19,48] simultaneously predict classification scores and a pair of relative distances to boundaries for each timestep.
Recently, ActionFormer [48] generated these predictions with a multi-scale transformer encoder to model both short- and long-range temporal dependencies, with simple classification and boundary regression heads, and achieved state-of-the-art results on a number of benchmarks. In this work, we adopt the same multi-scale transformer encoder and pipeline as ActionFormer [48], but incorporate the ability to estimate boundary confidences.
3. Method
We first briefly review ActionFormer [48], and then introduce our novel boundary head, which is incorporated into ActionFormer to achieve better performance.
3.1. Overview of ActionFormer
ActionFormer first extracts a feature pyramid based on local self-attention, and then uses lightweight heads to simultaneously predict classification scores and a pair of relative distances to boundaries for each timestep.
Transformer-based feature pyramid: ActionFormer extracts features from an untrimmed sequence and passes them to a multi-scale transformer encoder [48] to construct a feature pyramid sequence. The feature pyramid sequence contains multiple resolutions for each timestep, which allows a single timestep to detect short and long actions.
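As a rough illustration of the multi-resolution idea only (not ActionFormer's actual encoder, which uses local self-attention), a pyramid can be built by repeatedly halving the temporal resolution, so that deeper levels cover longer temporal extents:

```python
def build_pyramid(features, num_levels=3):
    """Toy feature pyramid: each level halves the temporal resolution
    via stride-2 max-pooling over neighbouring timesteps.
    `features` is a list of per-timestep feature vectors (lists)."""
    levels = [features]
    for _ in range(num_levels - 1):
        prev = levels[-1]
        pooled = [
            [max(a, b) for a, b in zip(prev[i], prev[min(i + 1, len(prev) - 1)])]
            for i in range(0, len(prev), 2)
        ]
        levels.append(pooled)
    return levels

seq = [[float(t)] for t in range(8)]  # 8 timesteps, 1-d features
pyramid = build_pyramid(seq)
assert [len(level) for level in pyramid] == [8, 4, 2]
```

A timestep at the coarsest level spans four input timesteps here, which is why deeper levels are suited to detecting longer actions.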
Prediction heads: A classification head uses the feature pyramid sequence to predict action labels and classification scores for each timestep at multiple resolutions, and similarly, a regression head predicts relative distances to the predicted start and end boundary locations for every timestep in the feature pyramid.
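The anchor-free decoding these heads imply can be sketched as follows: a timestep at position t predicting distances (d_start, d_end) yields the segment (t − d_start, t + d_end). The `stride` parameter mapping timestep indices to seconds is illustrative, not taken from the paper:

```python
def decode_proposals(offsets, stride=1.0):
    """Turn per-timestep relative distances (d_start, d_end) into
    absolute segments: start = t - d_start, end = t + d_end."""
    proposals = []
    for t, (d_start, d_end) in enumerate(offsets):
        centre = t * stride
        proposals.append((centre - d_start, centre + d_end))
    return proposals

# Timestep 5 predicts it lies 2.0s after the action start and 3.0s before its end.
segments = decode_proposals([(0.0, 0.0)] * 5 + [(2.0, 3.0)])
assert segments[5] == (3.0, 8.0)
```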
Training and Inference: The network is trained by minimising the multi-part losses of the classification head and the regression head. For the classification head, a focal loss [23] is used to balance loss weights between easy and hard examples. The regression head minimises the distance between the ground-truth boundaries and the predicted boundaries using the GIoU loss [35]. At inference, the network predicts a pair of relative distances to boundaries and a classification score, giving a proposal for each timestep across all pyramid levels. These candidate proposals are ranked by classification score and further filtered to obtain the final action outputs.
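For reference, the GIoU loss [35] specialised to 1-D temporal segments can be written as follows. This is our own sketch following the original GIoU definition, not code from the paper:

```python
def giou_loss_1d(pred, gt):
    """GIoU loss for temporal segments (start, end):
    0 when the segments coincide, approaching 2 when far apart."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest interval enclosing both segments (the "convex hull").
    hull = max(pred[1], gt[1]) - min(pred[0], gt[0])
    giou = iou - (hull - union) / hull if hull > 0 else iou
    return 1.0 - giou

assert giou_loss_1d((2.0, 6.0), (2.0, 6.0)) == 0.0
assert giou_loss_1d((0.0, 1.0), (9.0, 10.0)) > 1.0  # disjoint segments are penalised
```

Unlike a plain IoU loss, the hull term still provides a gradient when prediction and ground truth do not overlap at all, which matters for poorly initialised boundaries.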
3.2. Boundary Head
In ActionFormer, the regression head nominates where the boundaries are, without providing any confidence in those locations being boundaries. To address this, we compute the boundary confidence at the same time as the boundary location prediction. One approach could be to use a separate branch to directly predict boundary confidence. However, this may lead to learning conflicts in the anchor-free pipeline, where the original network must learn the relative distances between the current temporal location and the ground-truth boundaries, rather than the confidence that the current location is a boundary (demonstrated in Section 4.3).
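One plausible way to derive such a confidence target from relative distances is sketched below. The linear decay and the normalisation by action length are assumptions for illustration; the exact form used is not given in this excerpt:

```python
def boundary_confidence_target(pred_t, gt_t, gt_len):
    """Illustrative confidence target: 1.0 when the predicted boundary
    matches the ground truth, decaying linearly with the distance error
    normalised by the action length (assumed form, not from the paper)."""
    err = abs(pred_t - gt_t) / max(gt_len, 1e-6)
    return max(0.0, 1.0 - err)

# A start boundary predicted 0.5s off on a 5s action gets target 0.9.
target = boundary_confidence_target(10.5, 10.0, 5.0)
assert abs(target - 0.9) < 1e-9
```

Normalising by action length makes the target scale-invariant, so a 0.5s error is penalised lightly for a long action but heavily for a short one.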
We design a simple but effective boundary head, which computes boundary probabilities via a confidence scaling,