confidence from the relative distance between the estimated
frame and the ground truth frame, for both the start and
end boundaries of the action. This confidence information
is leveraged to refine the boundaries of proposals, which
leads to state-of-the-art action detection results on EPIC-
KITCHENS-100 [9] and THUMOS14 [16], and an improvement
on ActivityNet-1.3 [4].
In summary, we introduce a boundary head for one-
stage anchor-free action detection which estimates bound-
ary confidence scores based on relative distances. We ob-
tain state-of-the-art results on EPIC-KITCHENS-100 and
THUMOS14 action detection, using the same backbone as
the current state-of-the-art. Notably, a significant improvement
is achieved on EPIC-KITCHENS-100, which indicates
that our method performs well on complex actions
of various lengths. Further, we provide detailed ablations,
including an analysis of the confidence scores and of the
effect of action length.
2. Related Work
Action detection methods can be grouped into two cate-
gories: two-stage and one-stage.
Two-stage action detection: Two-stage methods first gen-
erate a set of candidate proposals and then classify each
proposal. They typically generate proposals from pre-defined
sliding windows, or by grouping temporal locations with a high
probability of lying within an action [51] or close to a
boundary [22,2]. Action and boundary combinations can
be selected based on high boundary confidence [20,7], or
a combination of separately calculated boundary and ac-
tion scores [37]. This generation process can struggle when
presented with sequences containing many dense actions of
varying lengths, such as EPIC-KITCHENS-100 [9].
One-stage action detection: One-stage methods improve
detection efficiency by simultaneously predicting action
proposals and their associated classes. One approach is to
generate candidate boundaries by modelling temporal rela-
tionships [13,21,29,43,24]. However, these methods rely
on pre-defined anchors, causing them to struggle when pre-
sented with a wide range of action durations. Inspired by
the DETR framework [5] for object detection, some works
use learned action [27] or graph [30] queries as input to a
transformer decoder. Whilst a promising direction, these
methods are not well suited to long videos, as the cost of
attention grows quadratically with sequence length. Anchor-free
methods [47,19,48] simultaneously predict classification scores
and a pair of relative distances to boundaries for each timestep.
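At each timestep, such anchor-free predictions can be decoded into a candidate segment from the predicted pair of relative distances. The following minimal sketch illustrates the parameterisation; the function name and the grid-unit convention are our own assumptions, not code from any cited method:

```python
def decode_proposal(t, stride, d_start, d_end):
    """Decode an anchor-free prediction at feature-grid index t.

    The timestep maps to frame t * stride; the predicted relative
    distances d_start / d_end (in grid units) are scaled back to
    frames to give the (start, end) of the candidate segment.
    """
    center = t * stride
    return (center - d_start * stride, center + d_end * stride)
```

For example, a timestep at index 10 of a stride-4 pyramid level with predicted distances (2, 3) decodes to the segment spanning frames 32 to 52.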
Recently, ActionFormer [48] produced these predictions
using a multi-scale transformer encoder that models both short-
and long-range temporal dependencies, together with simple
classification and boundary regression heads, achieving
state-of-the-art results on a number of benchmarks. In this work,
we adopt the same multi-scale transformer encoder and
pipeline as ActionFormer [48], but incorporate the ability
to estimate boundary confidences.
3. Method
We first briefly review ActionFormer [48], and then in-
troduce our novel boundary head, which is incorporated into
ActionFormer to achieve better performance.
3.1. Overview of ActionFormer
ActionFormer first extracts a feature pyramid based on
local self-attention, and then uses light-weight heads to si-
multaneously predict classification scores and a pair of rel-
ative distances to boundaries for each timestep.
Transformer-based feature pyramid: ActionFormer ex-
tracts features from an untrimmed sequence and passes
them to a multi-scale transformer encoder [48] to construct
a feature pyramid sequence, which represents each timestep
at multiple resolutions and thus allows a single timestep to
detect both short and long actions.
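As a toy illustration of the multi-resolution idea, repeated stride-2 pooling stands in here for the multi-scale transformer encoder, which this sketch does not implement:

```python
def build_pyramid(features, num_levels=3):
    """Build a multi-resolution pyramid by repeated stride-2 max pooling.

    Each level halves the temporal resolution, so later levels cover
    longer temporal extents per position; a toy stand-in for the
    transformer-based pyramid, not the actual encoder.
    """
    pyramid = [features]
    for _ in range(num_levels - 1):
        prev = pyramid[-1]
        pyramid.append([max(prev[i:i + 2]) for i in range(0, len(prev) - 1, 2)])
    return pyramid
```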
Prediction heads: A classification head uses the feature
pyramid sequence to predict action labels and classification
scores for each timestep in multiple resolutions, and simi-
larly, a regression head predicts relative distances to the pre-
dicted start and end boundary locations, for every timestep
in the feature pyramid.
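A minimal sketch of such lightweight heads, using plain linear projections in place of the small convolutional stacks used in practice (the weights and shapes here are assumptions for illustration):

```python
import numpy as np

def prediction_heads(feats, w_cls, w_reg):
    """Apply per-timestep prediction heads to pyramid features.

    feats: (T, D) features; w_cls: (D, C) classification weights;
    w_reg: (D, 2) regression weights. Returns per-timestep class
    probabilities and non-negative start/end distances.
    """
    cls_scores = 1.0 / (1.0 + np.exp(-(feats @ w_cls)))  # sigmoid over classes
    reg_dists = np.maximum(feats @ w_reg, 0.0)           # clamp distances at zero
    return cls_scores, reg_dists
```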
Training and Inference: The network is trained by minimising
a multi-part loss over the classification and regression heads.
The classification head uses a focal loss [23] to balance the
contributions of easy and hard examples, while the regression
head minimises the discrepancy between the ground truth and
predicted boundaries using the GIoU loss [35]. At inference,
the network predicts a pair of relative distances to boundaries
and a classification score, giving one proposal per timestep
across all pyramid levels. These candidate proposals are
ranked by classification score and further filtered to obtain
the final action detections.
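The ranking-and-filtering step is typically a form of temporal non-maximum suppression; the following is a simple greedy 1D NMS sketch, where the threshold values and names are our own choices rather than the pipeline's exact settings:

```python
def rank_and_filter(proposals, scores, iou_thresh=0.5, top_k=100):
    """Keep high-scoring proposals, suppressing heavy temporal overlaps.

    proposals: list of (start, end) segments; scores: matching
    classification scores. Greedy 1D non-maximum suppression.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        s1, e1 = proposals[i]
        suppressed = False
        for j in keep:
            s2, e2 = proposals[j]
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
        if len(keep) == top_k:
            break
    return [proposals[i] for i in keep]
```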
3.2. Boundary Head
In ActionFormer, the regression head nominates where
the boundaries are, without providing any confidence that
the predicted locations are boundaries. To address this, we
compute the boundary confidence at the same time as the
boundary location prediction. One approach would be to use
a separate branch to directly predict boundary confidence.
However, this may lead to learning conflicts in the anchor-free
pipeline, where the original network must learn the relative
distances between the current temporal location and the ground
truth boundaries, rather than the confidence that the current
location is a boundary (demonstrated in Section 4.3).
We design a simple but effective boundary head, which
computes boundary probabilities via a confidence scaling,