object, probably because it is very small or highly transpar-
ent. Using hand features, our model implicitly focuses on
these challenging objects.
Our proposed Holistic Interaction Transformer (HIT)
network uses fine-grained context, including person pose,
hands, and objects, to construct a bi-modal interaction struc-
ture. Each modality comprises three main components:
person interaction, object interaction, and hand interaction.
Each of these components learns valuable local action pat-
terns. We then use an Attentive Fusion Mechanism to com-
bine the different modalities before learning temporal infor-
mation from neighboring frames, which helps us better detect the actions occurring in the current frame. We perform experiments on the J-HMDB [13], UCF101-24 [35], MultiSports [18], and AVA [10] datasets, and our method achieves state-of-the-art performance on the first three while remaining competitive with state-of-the-art methods on AVA.
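To make the overall flow concrete, the following is a minimal sketch of one such bi-modal pipeline, with generic cross-attention blocks standing in for the interaction components; the chaining order, the gated fusion, and all shapes are illustrative assumptions rather than the exact HIT design.

```python
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    """Generic cross-attention block: queries attend to a set of context
    features (persons, objects, or hands) and are updated residually."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, queries, context):
        out, _ = self.attn(queries, context, context)
        return self.norm(queries + out)

class Stream(nn.Module):
    """One modality (RGB or pose): person, object, and hand interaction
    applied in sequence to the person features (illustrative ordering)."""
    def __init__(self, d=256):
        super().__init__()
        self.person = InteractionBlock(d)
        self.object = InteractionBlock(d)
        self.hand = InteractionBlock(d)

    def forward(self, persons, objects, hands):
        x = self.person(persons, persons)   # person-person interaction
        x = self.object(x, objects)         # person-object interaction
        x = self.hand(x, hands)             # person-hand interaction
        return x

# Toy inputs: one clip with 2 persons, 4 objects, 3 hands per modality,
# all already projected to a common d=256 embedding.
d = 256
rgb  = {k: torch.randn(1, n, d) for k, n in [("persons", 2), ("objects", 4), ("hands", 3)]}
pose = {k: torch.randn(1, n, d) for k, n in [("persons", 2), ("objects", 4), ("hands", 3)]}

rgb_out  = Stream(d)(rgb["persons"], rgb["objects"], rgb["hands"])
pose_out = Stream(d)(pose["persons"], pose["objects"], pose["hands"])

# Attentive fusion, sketched here as a learned gate over the two modalities;
# temporal attention over neighboring-frame features would follow (omitted).
fuse_gate = nn.Linear(2 * d, d)
gate = torch.sigmoid(fuse_gate(torch.cat([rgb_out, pose_out], dim=-1)))
fused = gate * rgb_out + (1 - gate) * pose_out
print(fused.shape)  # [1, 2, 256]: one fused feature per detected person
```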
The main contributions of this paper can be summarized as
follows:
• We propose a novel framework that combines RGB,
pose and hand features for action detection.
• We introduce a bi-modal Holistic Interaction Trans-
former (HIT) network that combines different kinds of
interactions in an intuitive and meaningful way.
• We propose an Attentive Fusion Module (AFM) that
works as a selective filter to keep the most informa-
tive features from each modality and an Intra-Modality
Aggregator (IMA) for learning useful action represen-
tations within the modalities.
• Our method achieves state-of-the-art performance on
three of the most challenging spatio-temporal action
detection datasets.
2. Related Work
2.1. Video Classification
Video classification is the task of recognizing the activity
happening in a video clip. Usually, the clip spans a few sec-
onds and has a single label. Most recent approaches to this
task use 3D CNNs [1, 5, 6, 41] since they can process the
whole video clip as input, as opposed to considering it as
a sequence of frames [30, 39]. Due to the scarcity of la-
beled video datasets, many researchers rely on models pre-
trained on ImageNet [1, 42, 48] and use them as backbones
to extract video features. Two-stream networks [5, 6] are
another widely used approach to video classification thanks
to their ability to only process a fraction of the input frames,
striking a good balance between accuracy and complexity.
2.2. Spatio-Temporal Action Detection
In recent years, more attention has been given to spatio-
temporal action detection [5, 7, 17, 28, 40]. As the name
(spatio-temporal) suggests, instead of classifying the whole
video into one class, we need to detect the actions in space,
i.e., the actions of everyone in the current frame, and in
time, since each frame might contain a different set of ac-
tions. Most recent works on spatio-temporal action detec-
tion use a 3D CNN backbone [27, 43] to extract video fea-
tures and then crop the person features from the video fea-
tures either using ROI pooling [8] or ROI align [12]. Such
methods discard all the other potentially useful information
contained in the video.
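A minimal sketch of this feature-cropping step is shown below, using RoIAlign from torchvision; the backbone output shape, temporal mean pooling, spatial_scale, and box coordinates are illustrative assumptions, not the setup of any particular method.

```python
import torch
from torchvision.ops import roi_align

# Assumed backbone output: a 3D CNN feature map of shape [B, C, T, H, W];
# person boxes are given for the keyframe in input-image coordinates.
B, C, T, H, W = 2, 256, 8, 16, 22
video_feats = torch.randn(B, C, T, H, W)

# Collapse the temporal axis (mean pooling) so standard 2D RoIAlign applies.
keyframe_feats = video_feats.mean(dim=2)             # [B, C, H, W]

# One list entry per clip, each holding [num_persons, 4] boxes (x1, y1, x2, y2);
# spatial_scale maps image coordinates onto the feature map (stride 16 here).
boxes = [torch.tensor([[ 40.,  20., 120., 200.]]),
         torch.tensor([[ 10.,  30.,  90., 180.],
                       [100.,  25., 170., 210.]])]

person_feats = roi_align(keyframe_feats, boxes,
                         output_size=(7, 7), spatial_scale=1.0 / 16)
print(person_feats.shape)  # [total_persons, C, 7, 7] -> here [3, 256, 7, 7]
```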
2.3. Interaction Modeling
What if the spatio-temporal action detection task really is
an interaction modeling task? In fact, most of our everyday
actions are interactions with our environment (e.g., other
persons, objects, ourselves) and interactions between our
actions (for instance, it is very likely that “open the door” is followed by “close the door”). This interaction modeling idea has spurred a wave of research on how to effectively model interactions for video understanding [28, 40, 43].
Most research in this area uses attention mechanisms. [25, 52] propose the Temporal Relation Network (TRN),
which learns temporal dependencies between frames or, in
other words, the interaction between entities from adjacent
frames. Other methods further model not only temporal but also spatial interactions between different entities within the same frame [26, 40, 43, 49, 53]. Nevertheless, the choice of entities for which to model interactions differs across models.
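As a rough illustration of the temporal side of such interaction modeling, the sketch below lets the features of one entity, tracked across T neighboring frames, attend to each other; the shapes and the single attention layer are illustrative, not the design of any specific published model.

```python
import torch
import torch.nn as nn

# Illustrative temporal interaction: representations of the same entity at
# T neighboring frames attend to one another, so each timestep is informed
# by adjacent frames.
T, d = 8, 256
entity_over_time = torch.randn(1, T, d)          # [batch, frames, dim]

temporal_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
out, weights = temporal_attn(entity_over_time, entity_over_time, entity_over_time)
print(out.shape, weights.shape)                  # [1, 8, 256], [1, 8, 8]
```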
Rather than using only human features, [28, 46] choose to use background information to model interactions between the person in the frame and the context. They still
crop the persons’ features but do not discard the remaining
background features. Such an approach provides rich in-
formation about the person’s surroundings. However, while the context says a lot, it might also introduce noise.
Attempting to be more selective about the features to
use, [26, 40] first pass the video frames through an ob-
ject detector, crop both the object and person features, and
then model their interactions. This extra layer of interaction yields better representations than models that only perform human interaction modeling, and it helps with object-related classes such as “work on a computer”. However, such methods still fall short when the objects are too small to be detected or are not present in the current frame.
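A minimal sketch of such person-object interaction modeling is given below, assuming pre-extracted RoI features already pooled to d-dimensional vectors; the feature dimension, the number of detections, and the use of a single cross-attention layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Assumed inputs: RoI-pooled person and object features, flattened to
# d-dimensional vectors (e.g., by global average pooling the 7x7 RoIs).
d = 256
person_feats = torch.randn(3, d)   # 3 detected persons
object_feats = torch.randn(5, d)   # 5 detected objects

# Cross-attention: each person queries the detected objects, so the updated
# person representation aggregates features of the objects it interacts with.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
q = person_feats.unsqueeze(0)      # [1, num_persons, d]
kv = object_feats.unsqueeze(0)     # [1, num_objects, d]
person_object, _ = attn(q, kv, kv) # [1, num_persons, d]

# A residual connection keeps the original person features in the mix.
person_feats = person_feats + person_object.squeeze(0)
print(person_feats.shape)          # [3, 256]
```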
2.4. Multi-modal Action Detection
Most recent action detection frameworks use only RGB
features. The few exceptions, such as [10, 34, 36, 38] and [29], use optical flow to capture motion. [38] employs an