Segment-based representation.
Following the prevailing practice of bounding boxes [20, 13, 32, 1] in object detection, existing temporal action detectors have incorporated action segments heavily in three ways: as anchors, as intermediate proposals, and as final predictions. Segments as anchors are explored mainly in anchor-based frameworks. These methods [28, 26, 49, 31] used sliding windows or pre-computed proposals as anchors. Most TAD methods [49, 19, 45, 4, 52, 51] use segments as intermediate proposals. Uniform sampling or pooling is commonly used to extract features from these segments. P-GCN [49] applied max-pooling within local segments to obtain proposal features. G-TAD [45] uniformly divided segments into bins and average-pooled each bin to obtain proposal features. AFSD [19] proposed boundary pooling in the boundary region to refine action features. Segments as final predictions are employed across all TAD frameworks, because segments facilitate the computation of action overlaps and loss functions. In contrast, in this paper, we do not need segments as anchors and directly employ learnable query points as intermediate proposals with iterative refinement. The learnable query points represent the important frames within an action, and action features are extracted only from these keyframes rather than via RoI pooling.
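To make the contrast concrete, the following minimal sketch compares segment-based proposal features obtained by uniform-bin average pooling (in the spirit of G-TAD [45]) with features sampled only at a sparse set of keyframe points via linear interpolation; the function names, the number of bins, and the tensor shapes are illustrative assumptions rather than the exact implementations of the cited methods.

```python
import torch

def segment_bin_pooling(features, start, end, num_bins=8):
    """Average-pool a [T, D] feature sequence over uniform bins of one segment.

    Mirrors the common segment-as-proposal scheme (uniform bins + average
    pooling); `num_bins` is an illustrative choice.
    """
    edges = torch.linspace(start, end, num_bins + 1)   # bin boundaries in frame units
    bins = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        lo_i = int(lo.floor())
        hi_i = max(int(hi.ceil()), lo_i + 1)           # guarantee a non-empty bin
        bins.append(features[lo_i:hi_i].mean(dim=0))
    return torch.stack(bins)                           # [num_bins, D]

def keyframe_point_sampling(features, points):
    """Sample a [T, D] feature sequence at fractional keyframe locations via
    linear interpolation, instead of pooling over a dense segment."""
    T, _ = features.shape
    lo = points.floor().long().clamp(0, T - 1)
    hi = (lo + 1).clamp(max=T - 1)
    w = (points - lo.float()).unsqueeze(-1)            # interpolation weight per point
    return (1 - w) * features[lo] + w * features[hi]   # [num_points, D]

feats = torch.randn(96, 256)                           # T=96 frames, D=256 channels (illustrative)
seg_feat = segment_bin_pooling(feats, start=10.0, end=42.0)
pt_feat = keyframe_point_sampling(feats, torch.tensor([10.0, 17.5, 30.2, 41.8]))
```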
Point-based representation.
Several existing works have used point representations to describe keyframes [12, 38], objects [47, 11], tracks [53], and actions [18]. [12, 38] tackled keyframe selection by applying a greedy algorithm to spatial SIFT keypoints [12] or clustering local extremes of image color/intensity [38]. These methods followed a bottom-up strategy that chooses keyframes based on local cues. In contrast, PointTAD represents an action as a set of temporal points (keyframes). We follow RepPoints [47] in handling the important frames of actions with point representations and refining these points iteratively with action features. Our method directly regresses keyframes from query vectors in a top-down manner for more flexible temporal action detection. Note that PointTAD tackles a different task from RepPoints [47]. We also build PointTAD upon a query-based detector, where a small set of action queries sparsely attends to the frame sequence for potential actions, resulting in an efficient detection framework.
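A minimal sketch of this top-down direction, assuming a single linear head that maps each query vector to a set of normalized temporal point locations; the head design, the dimensions, and the number of points per query are illustrative, not the exact PointTAD head.

```python
import torch
import torch.nn as nn

class QueryPointHead(nn.Module):
    """Regress a set of temporal point locations (in [0, 1]) from each query
    vector: points come top-down from the learned query embedding rather than
    from bottom-up local cues such as keypoints or color extremes."""
    def __init__(self, dim=256, num_points=21):
        super().__init__()
        self.to_points = nn.Linear(dim, num_points)

    def forward(self, query_vectors):                       # [Nq, dim]
        return self.to_points(query_vectors).sigmoid()      # [Nq, num_points] normalized times

head = QueryPointHead()
q = torch.randn(48, 256)        # Nq = 48 action queries (illustrative)
points = head(q)                # fractional keyframe locations per query
```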
Temporal context in videos.
Context aggregation at different levels of semantics is crucial for temporal action modeling [41] and has been discussed in previous TAD methods. G-TAD [45] treated each input snippet as a graph node and applied graph convolutional networks to enhance snippet-level features with global context. ContextLoc [56] handled action semantics in a hierarchy: it updated snippet features with global context, obtained proposal features with frame-wise dynamic modeling within each proposal, and modeled inter-proposal relations with GCNs. Although we consider the same levels of semantic modeling, our method differs from ContextLoc. PointTAD focuses on aggregating temporal cues at multiple levels, with deformable convolution at the point level as well as frame and channel attention at the intra-proposal level. We also apply multi-head self-attention for inter-proposal relation modeling.
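To make the three levels concrete, the sketch below composes a depthwise temporal convolution over the sampled point features (a plain stand-in for point-level deformable convolution), frame and channel attention generated from the query vector at the intra-proposal level, and multi-head self-attention across proposals; all shapes and the attention parameterization are illustrative assumptions rather than the exact PointTAD design.

```python
import torch
import torch.nn as nn

class MultiLevelContextSketch(nn.Module):
    """Illustrative composition of the three context levels. The point-level
    branch uses a plain depthwise temporal convolution in place of deformable
    convolution; the intra-proposal branch mixes the Ns sampled frames with a
    softmax frame attention and a sigmoid channel gate derived from the query
    vector; the inter-proposal branch is multi-head self-attention."""
    def __init__(self, dim=256, num_points=21, num_heads=8):
        super().__init__()
        self.point_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.frame_attn = nn.Linear(dim, num_points)   # per-frame weights from the query vector
        self.chan_attn = nn.Linear(dim, dim)           # per-channel gate from the query vector
        self.inter_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, sampled, q):
        # sampled: [Nq, Ns, D] features at the query points; q: [Nq, D] query vectors
        # Point level: temporal convolution over each proposal's point sequence.
        x = self.point_conv(sampled.transpose(1, 2)).transpose(1, 2)   # [Nq, Ns, D]
        # Intra-proposal level: frame attention, then channel gating.
        fw = self.frame_attn(q).softmax(dim=-1)                        # [Nq, Ns]
        pooled = (fw.unsqueeze(-1) * x).sum(dim=1)                     # [Nq, D]
        q = q + pooled * self.chan_attn(q).sigmoid()
        # Inter-proposal level: self-attention across the Nq query vectors.
        attended, _ = self.inter_attn(q[None], q[None], q[None])
        return q + attended[0]                                         # [Nq, D]

sampled = torch.randn(48, 21, 256)    # Nq=48 proposals, Ns=21 points, D=256 (illustrative)
q = torch.randn(48, 256)
q = MultiLevelContextSketch()(sampled, q)
```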
3 PointTAD
We formulate the task of multi-label temporal action detection (TAD) as a set prediction problem. Formally, given a video clip with $T$ consecutive frames, we predict a set of action instances $\Psi = \{\psi_n = (t^s_n, t^e_n, c_n)\}_{n=1}^{N_q}$, where $N_q$ is the number of learnable queries, $t^s_n, t^e_n$ are the starting and ending timestamps of the $n$-th detected instance, and $c_n$ is its action category. The ground-truth action set to detect is denoted $\hat{\Psi} = \{\hat{\psi}_n = (\hat{t}^s_n, \hat{t}^e_n, \hat{c}_n)\}_{n=1}^{N_g}$, where $\hat{t}^s_n, \hat{t}^e_n$ are the starting and ending timestamps of the $n$-th action, $\hat{c}_n$ is the ground-truth action category, and $N_g$ is the number of ground-truth actions.
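For reference, a small sketch of the prediction and ground-truth structures together with the temporal IoU that action-overlap computation and set matching rely on; the container types and example values are illustrative, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class ActionInstance:
    t_start: float   # starting timestamp t^s_n
    t_end: float     # ending timestamp t^e_n
    label: int       # action category c_n

def temporal_iou(a: ActionInstance, b: ActionInstance) -> float:
    """Temporal IoU between two action instances, as used when matching the
    Nq predictions against the Ng ground-truth actions."""
    inter = max(0.0, min(a.t_end, b.t_end) - max(a.t_start, b.t_start))
    union = (a.t_end - a.t_start) + (b.t_end - b.t_start) - inter
    return inter / union if union > 0 else 0.0

pred = ActionInstance(t_start=3.2, t_end=7.9, label=5)   # one of the Nq predictions
gt = ActionInstance(t_start=3.0, t_end=8.0, label=5)     # one of the Ng ground truths
print(temporal_iou(pred, gt))
```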
The overall architecture of PointTAD is depicted in Fig. 2. PointTAD consists of a video encoder and an action decoder. The model takes three inputs for each sample: an RGB frame sequence of length $T$, a set of learnable query points $P = \{P_i\}_{i=1}^{N_q}$, and query vectors $q \in \mathbb{R}^{N_q \times D}$. The learnable query points explicitly describe the action locations by positioning themselves around action boundaries and semantic keyframes, and the query vectors decode action semantics and locations from the sampled features. In the model, the video encoder extracts video features $X \in \mathbb{R}^{T \times D}$ from the RGB frames. The action decoder contains $L$ stacked decoder layers and takes the query points $P$, query vectors $q$, and video features $X$ as input. Each decoder layer contains two parts: 1) the multi-head self-attention block models the pair-wise relationships of query vectors and establishes inter-proposal modeling for action detection; 2) the Multi-level Interactive Module models the point-level and instance-level semantics with dynamic weights based on the query vectors. Overall, the action decoder aggregates temporal context at the point level, intra-proposal level, and inter-proposal level. Finally, we use two