PointTAD: Multi-Label Temporal Action Detection
with Learnable Query Points
Jing Tan1, Xiaotong Zhao2, Xintian Shi2, Bin Kang2, Limin Wang1,3†
1State Key Laboratory for Novel Software Technology, Nanjing University
2Platform and Content Group (PCG), Tencent 3Shanghai AI Lab
jtan@smail.nju.edu.cn, {davidxtzhao,tinaxtshi,binkang}@tencent.com, lmwang@nju.edu.cn
Abstract
Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS). However, this setting might be unrealistic as different classes of actions often co-occur in practice. In this paper, we focus on the task of multi-label temporal action detection that aims to localize all action instances from a multi-label untrimmed video. Multi-label TAD is more challenging as it requires fine-grained class discrimination within a single video and precise localization of the co-occurring instances. To mitigate this issue, we extend the sparse query-based detection paradigm from traditional TAD and propose the multi-label TAD framework of PointTAD. Specifically, our PointTAD introduces a small set of learnable query points to represent the important frames of each action instance. This point-based representation provides a flexible mechanism to localize the discriminative frames at boundaries as well as the important frames inside the action. Moreover, we perform the action decoding process with the Multi-level Interactive Module to capture both point-level and instance-level action semantics. Finally, our PointTAD employs an end-to-end trainable framework simply based on RGB input for easy deployment. We evaluate our proposed method on two popular benchmarks and introduce the new metric of detection-mAP for multi-label TAD. Our model outperforms all previous methods by a large margin under the detection-mAP metric, and also achieves promising results under the segmentation-mAP metric. Code is available at https://github.com/MCG-NJU/PointTAD.
1 Introduction
With the increasing amount of video resources on the Internet, video understanding is becoming one of the most important topics in computer vision. Temporal action detection (TAD) [52, 23, 21, 4, 10, 19, 37] has been formally studied on traditional benchmarks such as THUMOS [15], ActivityNet [14], and HACS [50]. However, this setting seems impractical because these videos mostly contain non-overlapping actions from a single category: 85% of the videos in THUMOS are annotated with a single action category. As a result, most TAD methods [23, 21, 45, 5, 35] simply cast the TAD problem into the sub-problems of action proposal generation and global video classification [42]. In this paper, we shift our playground to the more complex setup of multi-label temporal action detection, which aims to detect all action instances from multi-labeled untrimmed videos. Existing works [9, 16, 40, 8] in this field formulate the problem as a dense prediction task and perform multi-label classification in a frame-wise manner. Consequently, these methods are weak in localization and fail to provide instance-level detection results (i.e., the starting time and ending time of each instance). In analogy to image instance segmentation [22], we argue that it is necessary to redefine multi-label TAD as an instance-level detection problem rather than a frame-wise segmentation task. In this sense, multi-label TAD results not only provide the action labels, but also the exact temporal extent of each instance.
Work done during an internship at Tencent PCG. †Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: Illustration of action predictions by segment-based action detectors in multi-label TAD. The groundtruth contains temporally overlapping instances of several classes (Run, Sit, Jump, BodyContact, Walk), and the predicted segments exhibit two typical mistakes, S1 and S2, in addition to other errors.
Direct adaptation of existing action detectors is insufficient to deal with the challenges of concurrent instances and complex action relations in multi-label TAD. The convention of extracting action features from action segments [4, 31, 49, 43] lacks the flexibility to handle both the important semantic frames inside the instance and the discriminative boundary frames. Consider the groundtruth action of "Jump" in Fig. 1: segment-based action detectors mainly produce two kinds of erroneous predictions, type S1 and type S2. S1 predicts the correct action category but covers only an incomplete segment of action highlights, whereas S2 does a better job of locating the boundaries yet gets misclassified as "Sit" due to the inclusion of confusing frames. In addition, most action detectors [49, 31, 25] are inadequate in processing sampled frames and classifying fine-grained action classes. They often exploit temporal context modeling at a single level and ignore the exploration of channel semantics.
To address the above issues, we present PointTAD, a sparse query-based action detector that leverages learnable query points to flexibly attend to the important frames of action instances. Inspired by RepPoints [47], the query points are directly supervised by a regression loss. Given specific regression targets, the query points learn to locate discriminative frames at action boundaries as well as semantic key frames within actions. Hence, concurrent actions of different categories can yield distinctive features through the specific arrangement of query points. Moreover, we improve action localization and semantics decoding by proposing the Multi-level Interactive Module with dynamic kernels based on the query vector. Direct interpolation or pooling at each query point lacks temporal reasoning over consecutive frames; following deformable DETR [55], we extract point-level features with deformable convolution [6, 54] from a local snippet to capture the temporal cues of action change or important movement. At the instance level, both temporal context and channel semantics are captured with frame-wise and channel-wise dynamic mixing [36, 11] to further decode the distinctive features of simultaneous actions.
PointTAD streamlines end-to-end TAD with joint optimization of the backbone network and action decoder, without any post-processing technique. We validate our model on two challenging multi-label TAD benchmarks. Our model achieves state-of-the-art detection-mAP performance and segmentation-mAP performance competitive with previous methods using only RGB input.
2 Related Work
Multi-label temporal action detection. Multi-label temporal action detection has been studied as a multi-label frame-wise classification problem in the previous literature. Early methods [29, 30] paid a lot of attention to modeling the temporal relations between frames with the help of Gaussian filters along the temporal dimension. Other works integrated features at different temporal scales with dilated temporal kernels [9] or iterative convolution-attention pairs [8]. Recently, attention has shifted beyond temporal modeling. Coarse-Fine [16] handled different temporal resolutions in the slow-fast fashion and performed spatial-temporal attention during fusion. MLAD [40] used multi-head self-attention blocks along both the spatial and class dimensions to model class relations at each timestamp. In our proposed method, we view the task as an instance-level detection problem and employ a query-based framework with sparse temporal points for accurate action detection. In addition, we study the temporal context at different semantic levels, including inter-proposal, intra-proposal, and point-level modeling.
Segment-based representation. Following the prevailing practice of bounding boxes [20, 13, 32, 1] in object detection, existing temporal action detectors rely heavily on action segments, with three kinds of usage: as anchors, as intermediate proposals, and as final predictions. Segments as anchors are explored mainly in anchor-based frameworks; these methods [28, 26, 49, 31] used sliding windows or pre-computed proposals as anchors. Most TAD methods [49, 19, 45, 4, 52, 51] use segments as intermediate proposals, where uniform sampling or pooling is commonly used to extract features from these segments. P-GCN [49] applied max-pooling within local segments for proposal features. G-TAD [45] uniformly divided segments into bins and average-pooled each bin to obtain proposal features. AFSD [19] proposed boundary pooling in the boundary region to refine action features. Segments as final predictions are employed in all TAD frameworks, because segments generally facilitate the computation of action overlaps and loss functions. In contrast, in this paper, we do not need segments as anchors and directly employ learnable query points as intermediate proposals with iterative refinement. The learnable query points represent the important frames within an action, and action features are extracted only from these keyframes rather than through RoI pooling.
Point-based representation. Several existing works have used point representations to describe keyframes [12, 38], objects [47, 11], tracks [53], and actions [18]. [12, 38] tackled keyframe selection by running a greedy algorithm on spatial SIFT keypoints [12] or clustering on local extremes of image color/intensity [38]. These methods followed a bottom-up strategy to choose keyframes based on local cues. In contrast, PointTAD represents an action as a set of temporal points (keyframes). We follow RepPoints [47] to handle the important frames of actions with point representations and refine these points iteratively with action features. Our method directly regresses keyframes from query vectors in a top-down manner for more flexible temporal action detection. Note that PointTAD tackles a different task from RepPoints [47]. We also build PointTAD upon a query-based detector, where a small set of action queries is employed to sparsely attend to the frame sequence for potential actions, resulting in an efficient detection framework.
Temporal context in videos. Context aggregation at different levels of semantics is crucial for temporal action modeling [41] and has been discussed in previous TAD methods. G-TAD [45] treated each snippet input as a graph node and applied graph convolutional networks to enhance snippet-level features with global context. ContextLoc [56] handled action semantics in a hierarchy: it updated snippet features with global context, obtained proposal features with frame-wise dynamic modeling within each proposal, and modeled the inter-proposal relations with GCNs. Although we consider the same levels of semantic modeling, our method is different from ContextLoc. PointTAD focuses on aggregating temporal cues at multiple levels, with deformable convolution at the point level as well as frame and channel attention at the intra-proposal level. We also apply multi-head self-attention for inter-proposal relation modeling.
3 PointTAD
We formulate the task of multi-label temporal action detection (TAD) as a set prediction problem. Formally, given a video clip with $T$ consecutive frames, we predict a set of action instances $\Psi = \{\psi_n = (t^s_n, t^e_n, c_n)\}_{n=1}^{N_q}$, where $N_q$ is the number of learnable queries, $t^s_n, t^e_n$ are the starting and ending timestamps of the $n$-th detected instance, and $c_n$ is its action category. The groundtruth action set to detect is denoted $\hat{\Psi} = \{\hat{\psi}_n = (\hat{t}^s_n, \hat{t}^e_n, \hat{c}_n)\}_{n=1}^{N_g}$, where $\hat{t}^s_n, \hat{t}^e_n$ are the starting and ending timestamps of the $n$-th groundtruth action, $\hat{c}_n$ is its action category, and $N_g$ is the number of groundtruth actions.
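To make the set-prediction formulation concrete, a minimal sketch of the corresponding data structures is given below. It is purely illustrative; the type names are assumptions and are not taken from the released code.

```python
from typing import List, NamedTuple


class ActionInstance(NamedTuple):
    t_start: float  # t^s_n: starting timestamp
    t_end: float    # t^e_n: ending timestamp
    label: int      # c_n: action category


# A prediction set for one clip has exactly N_q entries (one per learnable
# query); the groundtruth set has N_g entries. The two sets are matched
# against each other during training.
Predictions = List[ActionInstance]  # len == N_q
Groundtruth = List[ActionInstance]  # len == N_g
```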
The overall architecture of PointTAD is depicted in Fig. 2. PointTAD consists of a video encoder and an action decoder. The model takes three inputs for each sample: an RGB frame sequence of length $T$, a set of learnable query points $P = \{P_i\}_{i=1}^{N_q}$, and query vectors $\mathbf{q} \in \mathbb{R}^{N_q \times D}$. The learnable query points explicitly describe the action locations by positioning themselves around action boundaries and semantic key frames, and the query vectors decode action semantics and locations from the sampled features. In the model, the video encoder extracts video features $X \in \mathbb{R}^{T \times D}$ from the RGB frames. The action decoder contains $L$ stacked decoder layers and takes the query points $P$, query vectors $\mathbf{q}$ and video features $X$ as input. Each decoder layer contains two parts: 1) the multi-head self-attention block models the pair-wise relationships between query vectors and establishes inter-proposal modeling for action detection; 2) the Multi-level Interactive Module models the point-level and instance-level semantics with dynamic weights based on the query vector. Overall, the action decoder aggregates the temporal context at the point level, intra-proposal level and inter-proposal level.
Figure 2: Pipeline of PointTAD. It consists of a backbone network that extracts video features from consecutive RGB frames and an action decoder of $L$ layers that directly decodes actions from the video features. PointTAD enables end-to-end training of the backbone and action decoder without any post-processing of predictions.
Finally, we use two linear projection heads to decode action labels from the query vectors and to transform the query points into detection outputs.
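The sketch below illustrates the structure of one decoder layer described above: multi-head self-attention over the query vectors for inter-proposal modeling, a stand-in for the Multi-level Interactive Module, and the two prediction heads. It is a simplified sketch under assumptions: the placeholder MLP does not sample video features at the query points (which the real module does), and the head designs are not taken from the released code.

```python
import torch
from torch import nn


class DecoderLayerSketch(nn.Module):
    def __init__(self, dim: int, num_points: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        # Inter-proposal modeling: pair-wise relations between query vectors.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Placeholder for the Multi-level Interactive Module (point-level and
        # intra-proposal modeling with dynamic weights in the paper).
        self.interactive = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.offset_head = nn.Linear(dim, num_points)  # Ns point offsets per query
        self.class_head = nn.Linear(dim, num_classes)  # C class logits per query

    def forward(self, query_vec, query_points, video_feat):
        # query_vec: (B, Nq, D), query_points: (B, Nq, Ns), video_feat: (B, T, D)
        attn_out, _ = self.self_attn(query_vec, query_vec, query_vec)
        query_vec = query_vec + attn_out        # residual inter-proposal update
        query_vec = self.interactive(query_vec)  # placeholder: real module uses video_feat at query_points
        offsets = self.offset_head(query_vec)    # used to refine query_points (Sec. 3.2)
        logits = self.class_head(query_vec)      # per-query action scores
        return query_vec, offsets, logits
```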
3.1 Video Encoder
We use the I3D backbone [3] as the video encoder in our framework. The video encoder is trained end-to-end with the action decoder and optimized by the action detection loss to bridge the domain gap between action recognition and detection. For easy deployment of our framework in practice, we avoid the usage of optical flow due to its cumbersome preparation procedure. In order to achieve good performance on par with two-stream features by only using RGB input, we follow [46] to remove the temporal pooling at Mixed_5c and fuse the features from Mixed_5c with the features from Mixed_4f as in [24]. As a result, the temporal stride of the encoded video features is 4. Spatial average pooling is performed to squeeze the spatiotemporal representations from the backbone into temporal features.
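A rough sketch of this RGB-only feature extraction is given below: drop the temporal pooling after Mixed_5c, fuse Mixed_5c with Mixed_4f, and spatially average-pool to a stride-4 temporal feature map. The backbone interface, the channel counts, and the 1x1-conv-and-add fusion are assumptions standing in for the actual implementation, not the paper's exact scheme.

```python
import torch
from torch import nn
import torch.nn.functional as F


class VideoEncoderSketch(nn.Module):
    def __init__(self, i3d_backbone: nn.Module, dim: int = 512):
        super().__init__()
        # Assumed to return the Mixed_4f and Mixed_5c feature maps.
        self.backbone = i3d_backbone
        self.proj_4f = nn.Conv3d(832, dim, kernel_size=1)   # I3D Mixed_4f: 832 channels
        self.proj_5c = nn.Conv3d(1024, dim, kernel_size=1)  # I3D Mixed_5c: 1024 channels

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, T, H, W) RGB clip
        feat_4f, feat_5c = self.backbone(frames)
        p4 = self.proj_4f(feat_4f)
        # With the temporal pooling at Mixed_5c removed, both maps are at
        # temporal stride 4; interpolate as a guard against residual mismatch.
        p5 = F.interpolate(self.proj_5c(feat_5c), size=p4.shape[2:])
        fused = p4 + p5
        # Spatial average pooling -> (B, D, T/4), then (B, T/4, D) temporal features.
        return fused.mean(dim=[-1, -2]).transpose(1, 2)
```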
3.2 Learnable Query Points
Segment-based action representation (i.e., representing each action instance simply with a starting and ending time) is limited in describing its boundary and content at the same time. To increase the representation flexibility, we present a novel point-based representation to automatically learn the positions of the action boundary as well as the semantic key frames inside the instance. Specifically, the point-based representation is denoted by $P = \{t_j\}_{j=1}^{N_s}$ for each query, where $t_j$ is the temporal location of the $j$-th query point, and the number of points per query $N_s$ is set to 21 empirically. We explain the updating strategy and the learning of query points below.
Iterative point refinement. During training, the query points are initially placed at the midpoint of the input video clip. They are then refined by the query vectors $\mathbf{q}$ through iterations of decoder layers to reach their final positions. To be specific, at each decoder layer, the query point offsets are predicted from the updated query vector (see Sec. 3.3) by linear projection. We design a self-paced updating strategy with adaptive scaling for each query at each layer to stabilize the training process. At decoder layer $l$, the query points for one query are represented by $P^l = \{t^l_j\}_{j=1}^{N_s}$, and the $N_s$ offsets are denoted $\{\Delta t^l_j\}_{j=1}^{N_s}$. The refinement can be summarized as:
$$P^{l+1} = \{t^l_j + \Delta t^l_j \cdot s^l \cdot 0.5\}_{j=1}^{N_s}, \quad (1)$$
where $s^l = \max(\{t^l_j\}) - \min(\{t^l_j\})$ is the scaling parameter that describes the span of the query points at layer $l$. As a result, the update step size becomes smaller for shorter actions, which helps with the localization of short actions. The updated query points from the previous layer are inputs to the next layer.
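A minimal sketch of the self-paced update in Eq. (1) is shown below, assuming the points and predicted offsets are stored as (B, Nq, Ns) tensors. Note that at initialization all points coincide at the clip midpoint, so the span is zero; how that degenerate first step is handled is an implementation detail not covered by Eq. (1) or by this sketch.

```python
import torch


def refine_query_points(points: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Apply Eq. (1) to one decoder layer's query points.

    points, offsets: (B, Nq, Ns) temporal locations t^l_j and offsets Δt^l_j.
    """
    # s^l: span of each query's points, used as an adaptive step scale.
    span = points.max(dim=-1, keepdim=True).values - points.min(dim=-1, keepdim=True).values
    return points + offsets * span * 0.5  # t^{l+1}_j = t^l_j + Δt^l_j · s^l · 0.5
```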
Learning query points. The training of query points is directly supervised by a regression loss at both intermediate and final stages. We follow [47] to transform the query points into pseudo segments for the regression loss calculation. The resulting pseudo segments participate in the calculation of the L1 loss and tIoU loss against the groundtruth action segments, in both label assignment and loss computation. The transformation function is denoted by $\mathcal{T}: P \rightarrow S = (t^s, t^e)$. We experiment with two kinds of functions: Min-max $\mathcal{T}_1$ and Partial min-max $\mathcal{T}_2$. Min-max takes the minimum and maximum over the query points as the starting and ending boundaries of the pseudo segment.