Segment-based representation.
Following the prevailing practice of bounding boxes [20, 13, 32, 1] in object detection, existing temporal action detectors have incorporated action segments heavily in three ways: as anchors, as intermediate proposals, and as final predictions. Segments as anchors are explored mainly in anchor-based frameworks. These methods [28, 26, 49, 31] used sliding windows or pre-computed proposals as anchors. Most TAD methods [49, 19, 45, 4, 52, 51] use segments as intermediate proposals. Uniform sampling or pooling is commonly used to extract features from these segments. P-GCN [49] applied max-pooling within local segments to obtain proposal features. G-TAD [45] uniformly divided segments into bins and average-pooled each bin to obtain proposal features. AFSD [19] proposed boundary pooling in the boundary region to refine action features. Segments as final predictions are employed across all TAD frameworks, because segments facilitate the computation of action overlaps and loss functions. In contrast, in this paper, we do not need segments as anchors and directly employ learnable query points as intermediate proposals with iterative refinement. The learnable query points represent the important frames within an action, and action features are extracted only from these keyframes rather than via RoI pooling.
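To make the contrast concrete, the following minimal sketch compares segment-based proposal features obtained by uniform-bin average pooling (in the spirit of G-TAD [45]) with features sampled only at a sparse set of keyframe points via linear interpolation; the function names, the number of bins, and the tensor shapes are illustrative assumptions rather than the exact implementations of the cited methods.

```python
import torch

def segment_bin_pooling(features, start, end, num_bins=8):
    """Average-pool a [T, D] feature sequence over uniform bins of one segment.

    Mirrors the common segment-as-proposal scheme (uniform bins + average
    pooling); `num_bins` is an illustrative choice.
    """
    edges = torch.linspace(start, end, num_bins + 1)   # bin boundaries in frame units
    bins = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        lo_i = int(lo.floor())
        hi_i = max(int(hi.ceil()), lo_i + 1)           # guarantee a non-empty bin
        bins.append(features[lo_i:hi_i].mean(dim=0))
    return torch.stack(bins)                           # [num_bins, D]

def keyframe_point_sampling(features, points):
    """Sample a [T, D] feature sequence at fractional keyframe locations via
    linear interpolation, instead of pooling over a dense segment."""
    T, _ = features.shape
    lo = points.floor().long().clamp(0, T - 1)
    hi = (lo + 1).clamp(max=T - 1)
    w = (points - lo.float()).unsqueeze(-1)            # interpolation weight per point
    return (1 - w) * features[lo] + w * features[hi]   # [num_points, D]

feats = torch.randn(96, 256)                           # T=96 frames, D=256 channels (illustrative)
seg_feat = segment_bin_pooling(feats, start=10.0, end=42.0)
pt_feat = keyframe_point_sampling(feats, torch.tensor([10.0, 17.5, 30.2, 41.8]))
```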
Point-based representation.
Several existing works have used point representations to describe keyframes [12, 38], objects [47, 11], tracks [53], and actions [18]. [12, 38] tackled keyframe selection by applying a greedy algorithm to spatial SIFT keypoints [12] or clustering local extremes of image color/intensity [38]. These methods followed a bottom-up strategy that chooses keyframes based on local cues. In contrast, PointTAD represents an action as a set of temporal points (keyframes). We follow RepPoints [47] in handling the important frames of actions with point representations and refining these points iteratively with action features. Our method directly regresses keyframes from query vectors in a top-down manner for more flexible temporal action detection. Note that PointTAD tackles a different task from RepPoints [47]. We also build PointTAD upon a query-based detector, where a small set of action queries sparsely attends to the frame sequence for potential actions, resulting in an efficient detection framework.
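A minimal sketch of this top-down direction, assuming a single linear head that maps each query vector to a set of normalized temporal point locations; the head design, the dimensions, and the number of points per query are illustrative, not the exact PointTAD head.

```python
import torch
import torch.nn as nn

class QueryPointHead(nn.Module):
    """Regress a set of temporal point locations (in [0, 1]) from each query
    vector: points come top-down from the learned query embedding rather than
    from bottom-up local cues such as keypoints or color extremes."""
    def __init__(self, dim=256, num_points=21):
        super().__init__()
        self.to_points = nn.Linear(dim, num_points)

    def forward(self, query_vectors):                       # [Nq, dim]
        return self.to_points(query_vectors).sigmoid()      # [Nq, num_points] normalized times

head = QueryPointHead()
q = torch.randn(48, 256)        # Nq = 48 action queries (illustrative)
points = head(q)                # fractional keyframe locations per query
```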
Temporal context in videos.
Context aggregation at different levels of semantics is crucial for temporal action modeling [41] and has been discussed in previous TAD methods. G-TAD [45] treated each input snippet as a graph node and applied graph convolutional networks to enhance snippet-level features with global context. ContextLoc [56] handled action semantics in a hierarchy: it updated snippet features with global context, obtained proposal features with frame-wise dynamic modeling within each proposal, and modeled inter-proposal relations with GCNs. Although we consider the same levels of semantic modeling, our method differs from ContextLoc. PointTAD focuses on aggregating temporal cues at multiple levels, with deformable convolution at the point level as well as frame and channel attention at the intra-proposal level. We also apply multi-head self-attention for inter-proposal relation modeling.
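To make the three levels concrete, the sketch below composes a depthwise temporal convolution over the sampled point features (a plain stand-in for point-level deformable convolution), frame and channel attention generated from the query vector at the intra-proposal level, and multi-head self-attention across proposals; all shapes and the attention parameterization are illustrative assumptions rather than the exact PointTAD design.

```python
import torch
import torch.nn as nn

class MultiLevelContextSketch(nn.Module):
    """Illustrative composition of the three context levels. The point-level
    branch uses a plain depthwise temporal convolution in place of deformable
    convolution; the intra-proposal branch mixes the Ns sampled frames with a
    softmax frame attention and a sigmoid channel gate derived from the query
    vector; the inter-proposal branch is multi-head self-attention."""
    def __init__(self, dim=256, num_points=21, num_heads=8):
        super().__init__()
        self.point_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.frame_attn = nn.Linear(dim, num_points)   # per-frame weights from the query vector
        self.chan_attn = nn.Linear(dim, dim)           # per-channel gate from the query vector
        self.inter_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, sampled, q):
        # sampled: [Nq, Ns, D] features at the query points; q: [Nq, D] query vectors
        # Point level: temporal convolution over each proposal's point sequence.
        x = self.point_conv(sampled.transpose(1, 2)).transpose(1, 2)   # [Nq, Ns, D]
        # Intra-proposal level: frame attention, then channel gating.
        fw = self.frame_attn(q).softmax(dim=-1)                        # [Nq, Ns]
        pooled = (fw.unsqueeze(-1) * x).sum(dim=1)                     # [Nq, D]
        q = q + pooled * self.chan_attn(q).sigmoid()
        # Inter-proposal level: self-attention across the Nq query vectors.
        attended, _ = self.inter_attn(q[None], q[None], q[None])
        return q + attended[0]                                         # [Nq, D]

sampled = torch.randn(48, 21, 256)    # Nq=48 proposals, Ns=21 points, D=256 (illustrative)
q = torch.randn(48, 256)
q = MultiLevelContextSketch()(sampled, q)
```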
3 PointTAD
We formulate the task of multi-label temporal action detection (TAD) as a set prediction problem. Formally, given a video clip with $T$ consecutive frames, we predict a set of action instances $\Psi = \{\psi_n = (t^s_n, t^e_n, c_n)\}_{n=1}^{N_q}$, where $N_q$ is the number of learnable queries, $t^s_n, t^e_n$ are the starting and ending timestamps of the $n$-th detected instance, and $c_n$ is its action category. The ground-truth action set to detect is denoted $\hat{\Psi} = \{\hat{\psi}_n = (\hat{t}^s_n, \hat{t}^e_n, \hat{c}_n)\}_{n=1}^{N_g}$, where $\hat{t}^s_n, \hat{t}^e_n$ are the starting and ending timestamps of the $n$-th action, $\hat{c}_n$ is the ground-truth action category, and $N_g$ is the number of ground-truth actions.
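For reference, a small sketch of the prediction and ground-truth structures together with the temporal IoU that action-overlap computation and set matching rely on; the container types and example values are illustrative, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class ActionInstance:
    t_start: float   # starting timestamp t^s_n
    t_end: float     # ending timestamp t^e_n
    label: int       # action category c_n

def temporal_iou(a: ActionInstance, b: ActionInstance) -> float:
    """Temporal IoU between two action instances, as used when matching the
    Nq predictions against the Ng ground-truth actions."""
    inter = max(0.0, min(a.t_end, b.t_end) - max(a.t_start, b.t_start))
    union = (a.t_end - a.t_start) + (b.t_end - b.t_start) - inter
    return inter / union if union > 0 else 0.0

pred = ActionInstance(t_start=3.2, t_end=7.9, label=5)   # one of the Nq predictions
gt = ActionInstance(t_start=3.0, t_end=8.0, label=5)     # one of the Ng ground truths
print(temporal_iou(pred, gt))
```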
The overall architecture of PointTAD is depicted in Fig. 2. PointTAD consists of a video encoder and an action decoder. The model takes three inputs for each sample: an RGB frame sequence of length $T$, a set of learnable query points $P = \{P_i\}_{i=1}^{N_q}$, and query vectors $q \in \mathbb{R}^{N_q \times D}$. The learnable query points explicitly describe the action locations by positioning themselves around action boundaries and semantic keyframes, and the query vectors decode action semantics and locations from the sampled features. In the model, the video encoder extracts video features $X \in \mathbb{R}^{T \times D}$ from the RGB frames. The action decoder contains $L$ stacked decoder layers and takes the query points $P$, query vectors $q$, and video features $X$ as input. Each decoder layer contains two parts: 1) the multi-head self-attention block models the pair-wise relationships of query vectors and establishes inter-proposal modeling for action detection; 2) the Multi-level Interactive Module models the point-level and instance-level semantics with dynamic weights based on the query vectors. Overall, the action decoder aggregates temporal context at the point level, intra-proposal level, and inter-proposal level. Finally, we use two