Holistic Interaction Transformer Network for Action Detection
Gueter Josmy Faure1Min-Hung Chen2Shang-Hong Lai1,2
1National Tsing Hua University, Taiwan 2Microsoft AI R&D Center, Taiwan
josmyfaure@gapp.nthu.edu.tw vitec6@gmail.com lai@cs.nthu.edu.tw
Abstract
Actions are about how we interact with the environment, including other people, objects, and ourselves. In this paper, we propose a novel multi-modal Holistic Interaction Transformer Network (HIT) that leverages the largely ignored, but critical, hand and pose information essential to most human actions. The proposed HIT network is a comprehensive bi-modal framework that comprises an RGB stream and a pose stream. Each of them separately models person, object, and hand interactions. Within each sub-network, we introduce an Intra-Modality Aggregation (IMA) module that selectively merges individual interaction units. The resulting features from each modality are then fused using an Attentive Fusion Mechanism (AFM). Finally, we extract cues from the temporal context to better classify the occurring actions using cached memory. Our method significantly outperforms previous approaches on the J-HMDB, UCF101-24, and MultiSports datasets. We also achieve competitive results on AVA. The code will be available at https://github.com/joslefaure/HIT.
1. Introduction
Spatio-temporal action detection is the task of recognizing actions in space and in time. In this regard, it is fundamentally different from, and more challenging than, plain action detection, whose goal is to label an entire video with a single class. A sound spatio-temporal action detection framework aims to deeply learn the information in each video frame to correctly label each person in the frame. It should also keep a link between neighboring frames to better understand activities with continuous properties such as "open" and "close" [1, 5, 14, 30, 41]. In recent years, more robust frameworks have been introduced that explicitly consider the relationship between spatial entities [28, 43], since two persons in the same frame are likely to be interacting with each other. However, using only person features is insufficient for capturing object-related actions (e.g., volleyball spiking). Other works model the relationship not only between the persons in the frame but also with their surrounding objects [26, 40]. These methods have two main shortcomings. First, they rely only on objects with high detection confidence, which might result in ignoring important objects that are too small to be detected or unknown to the off-the-shelf detector. For example, in Figure 1, none of the objects the actors are interacting with are detected. Second, these models struggle to detect actions related to objects not present in the frame. For instance, consider the action "point to (an object)": the object the actor is pointing at may not be in the current frame.

Figure 1: Intuition. This figure exemplifies how essential hand features are for detecting actions. Both persons in the frame are interacting with objects. Still, the instance detector fails to detect the very objects the persons are interacting with (green boxes) and instead picks unimportant ones (dashed grey boxes). However, capturing the hands and everything in between (yellow boxes) gives the model a better idea of the actions being performed by the actors (red boxes): "lift/pick up" (left) and "carry/hold" (right).
Figure 1 illustrates one of our motivations for undertaking this research. Most human actions are contingent on what people do with their hands and on their poses when executing specific actions. The person on the left is "picking up/lifting (something)", an action that is hard to notice even for humans. Still, our model is able to capture it, since we consider the person's hand features and pose (the bending position is typical of someone picking something up). A similar issue occurs with the person on the right, who is "sitting and holding (an object)". The man is holding a cup, but the object detector does not find the
object, probably because it is very small or highly transparent. Using hand features, our model implicitly focuses on these challenging objects.
Our proposed Holistic Interaction Transformer (HIT) network uses fine-grained context, including person pose, hands, and objects, to construct a bi-modal interaction structure. Each modality comprises three main components: person interaction, object interaction, and hand interaction. Each of these components learns valuable local action patterns. We then use an Attentive Fusion Mechanism to combine the different modalities before learning temporal information from neighboring frames, which helps us better detect the actions occurring in the current frame. We perform experiments on the J-HMDB [13], UCF101-24 [35], MultiSports [18], and AVA [10] datasets; our method achieves state-of-the-art performance on the first three while being competitive with SOTA methods on AVA.
The main contributions in this paper can be summarized as follows:
• We propose a novel framework that combines RGB, pose, and hand features for action detection.
• We introduce a bi-modal Holistic Interaction Transformer (HIT) network that combines different kinds of interactions in an intuitive and meaningful way.
• We propose an Attentive Fusion Module (AFM) that works as a selective filter to keep the most informative features from each modality, and an Intra-Modality Aggregator (IMA) for learning useful action representations within the modalities.
• Our method achieves state-of-the-art performance on three of the most challenging spatio-temporal action detection datasets.
2. Related Work
2.1. Video Classification
Video classification consists of recognizing the activity happening in a video clip. Usually, the clip spans a few seconds and has a single label. Most recent approaches to this task use 3D CNNs [1, 5, 6, 41], since they can process the whole video clip as input, as opposed to treating it as a sequence of frames [30, 39]. Due to the scarcity of labeled video datasets, many researchers rely on models pre-trained on ImageNet [1, 42, 48] and use them as backbones to extract video features. Two-stream networks [5, 6] are another widely used approach to video classification thanks to their ability to process only a fraction of the input frames, striking a good balance between accuracy and complexity.
2.2. Spatio-Temporal Action Detection
In recent years, more attention has been given to spatio-temporal action detection [5, 7, 17, 28, 40]. As the name suggests, instead of classifying the whole video into one class, we need to detect actions in space, i.e., the actions of everyone in the current frame, and in time, since each frame might contain a different set of actions. Most recent works on spatio-temporal action detection use a 3D CNN backbone [27, 43] to extract video features and then crop the person features from the video features using either ROI pooling [8] or ROIAlign [12]. Such methods discard all the other potentially useful information contained in the video.
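To make the cropping step above concrete, the following sketch pools a 3D CNN feature map over time and extracts per-actor features with torchvision's roi_align. The tensor shapes, stride, and box coordinates are placeholder assumptions rather than values from any of the cited methods.

```python
import torch
from torchvision.ops import roi_align

# Hypothetical backbone output: (batch, channels, T, H, W) from a 3D CNN.
video_feats = torch.randn(2, 256, 8, 16, 22)

# Collapse the temporal axis so roi_align sees a standard 4D feature map.
pooled = video_feats.mean(dim=2)  # (batch, channels, H, W)

# Person boxes in (batch_index, x1, y1, x2, y2) format, in input-image pixels.
boxes = torch.tensor([
    [0,  40.0,  20.0, 120.0, 200.0],
    [0, 150.0,  30.0, 230.0, 210.0],
    [1,  60.0,  50.0, 140.0, 220.0],
])

# spatial_scale maps image coordinates to feature-map coordinates
# (1/16 assumes a backbone with stride 16; other strides would change it).
person_feats = roi_align(pooled, boxes, output_size=(7, 7),
                         spatial_scale=1.0 / 16, aligned=True)
print(person_feats.shape)  # torch.Size([3, 256, 7, 7])
```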
2.3. Interaction Modeling
What if the spatio-temporal action detection task really is an interaction modeling task? In fact, most of our everyday actions are interactions with our environment (e.g., other persons, objects, ourselves) and interactions between our actions (for instance, it is very likely that "open the door" is followed by "close the door"). The interaction modeling idea has spurred a wave of research on how to effectively model interactions for video understanding [28, 40, 43].
Most research in this area uses the attention mechanism. [25, 52] propose the Temporal Relation Network (TRN), which learns temporal dependencies between frames or, in other words, the interactions between entities from adjacent frames. Other methods further model not just temporal but also spatial interactions between different entities from the same frame [26, 40, 43, 49, 53]. Nevertheless, the choice of entities for which to model the interactions differs by model. Rather than using only human features, [28, 46] choose to use the background information to model interactions between the persons in the frame and the context. They still crop the persons' features but do not discard the remaining background features. Such an approach provides rich information about the person's surroundings. However, while the context says a lot, it might also induce noise.
Attempting to be more selective about the features to use, [26, 40] first pass the video frames through an object detector, crop both the object and person features, and then model their interactions. This extra layer of interaction provides better representations than standalone human interaction modeling and helps with object-related classes such as "work on a computer". However, these methods still fall short when the objects are too small to be detected or are not in the current frame.
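As an illustration of what such person-object interaction modeling can look like, here is a minimal sketch in which cropped person features attend to cropped object features through standard multi-head cross-attention. The feature dimension, head count, and the residual-plus-norm wrapper are assumptions made for this sketch, not the design of the cited methods.

```python
import torch
import torch.nn as nn

class PersonObjectInteraction(nn.Module):
    """Person queries attend to detected-object keys/values (illustrative only)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, person: torch.Tensor, objects: torch.Tensor) -> torch.Tensor:
        # person:  (batch, num_persons, dim) pooled actor features
        # objects: (batch, num_objects, dim) pooled object features
        attended, _ = self.attn(query=person, key=objects, value=objects)
        return self.norm(person + attended)  # residual connection

persons = torch.randn(2, 3, 256)   # 3 actors per clip
objects = torch.randn(2, 5, 256)   # 5 detected objects per clip
print(PersonObjectInteraction()(persons, objects).shape)  # (2, 3, 256)
```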
2.4. Multi-modal Action Detection
Figure 2: Overview of our HIT network. On top of our RGB stream is a 3D CNN backbone which we use to extract video features. Our pose encoder is a spatial transformer model. We compute rich local information from both sub-networks in parallel using person, hand, and object features. We then combine the learned features using an attentive fusion module before modeling their interaction with the global context.

Most recent action detection frameworks use only RGB features. The few exceptions, such as [10, 34, 36, 38] and [29], use optical flow to capture motion. [38] employs an Inception-like model and concatenates RGB and flow features at the Mixed4b layer (early fusion), whereas [10] and [36] use an I3D backbone to separately extract RGB and flow features, then concatenate the two modalities just before the action classifier. While skeleton-based action recognition has been around for a while now [2, 11, 24], as far as we know, no previous works have tackled skeleton-based action detection.
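For readers unfamiliar with the two fusion styles mentioned above, the toy snippet below contrasts them: early fusion concatenates intermediate RGB and flow feature maps along the channel axis, while late fusion concatenates globally pooled stream features just before the classifier. The shapes, layer sizes, and class count are placeholders, not those of [38], [10], or [36].

```python
import torch
import torch.nn as nn

rgb_map  = torch.randn(2, 480, 8, 14, 14)   # intermediate RGB feature map (dummy shape)
flow_map = torch.randn(2, 480, 8, 14, 14)   # intermediate optical-flow feature map (dummy shape)

# Early fusion: merge channel-wise at an intermediate layer, then keep convolving.
early = torch.cat([rgb_map, flow_map], dim=1)           # (2, 960, 8, 14, 14)

# Late fusion: pool each stream separately and concatenate before the classifier.
pool = nn.AdaptiveAvgPool3d(1)
rgb_vec  = pool(rgb_map).flatten(1)                     # (2, 480)
flow_vec = pool(flow_map).flatten(1)                    # (2, 480)
late = torch.cat([rgb_vec, flow_vec], dim=1)            # (2, 960)
classifier = nn.Linear(960, 80)                         # 80 classes is an assumed example
logits = classifier(late)
print(early.shape, logits.shape)
```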
In this paper, we propose a bi-modal approach to action detection that employs visual and skeleton-based features. Each modality computes a series of interactions, including person, object, and hand interactions, before the two are fused. A temporal interaction module is then applied to the fused features to learn global information regarding neighboring frames.
3. Proposed Method
In this section, we provide a detailed walk-through of our approach. Our Holistic Interaction Transformer (HIT) network is composed of an RGB sub-network and a pose sub-network that run concurrently. Each aims to learn persons' interactions with their surroundings (space) by focusing on the key entities that drive most of our actions (e.g., objects, pose, hands). After fusing the two sub-networks' outputs, we further model how actions evolve in time by looking at cached features from past and future frames. Such a comprehensive activity understanding scheme helps us achieve superior action detection performance.
This section is organized as follows: we first describe the entity selection process in Section 3.1. In Section 3.2, we elaborate on the RGB modality before introducing its pose counterpart in Section 3.3. Further, in Section 3.4, we explain our Attentive Fusion Module (AFM), followed by the Temporal Interaction Unit in Section 3.5.

Figure 3: Illustration of the Interaction module. ⋆ refers to the module-specific inputs, while P̃ refers to the person features in A(P) or the output of the module that comes before A(⋆).

Figure 4: Illustration of the Intra-Modality Aggregator. Features from one unit to the next are first augmented with contextual cues and then filtered.
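The Figure 4 caption describes the Intra-Modality Aggregator as augmenting the features of one interaction unit with contextual cues and then filtering them. The sketch below is one plausible reading of that description, using cross-attention for the augmentation and a learned sigmoid gate for the filtering; it is a hedged approximation, not the authors' exact module.

```python
import torch
import torch.nn as nn

class IntraModalityAggregator(nn.Module):
    """Augment unit features with context, then filter (one possible reading of Fig. 4)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.augment = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, unit: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # unit:    (batch, n, dim) output of one interaction unit
        # context: (batch, m, dim) contextual cues from the preceding unit
        augmented, _ = self.augment(unit, context, context)
        g = self.gate(torch.cat([unit, augmented], dim=-1))  # element-wise filter
        return g * augmented + (1.0 - g) * unit

x = torch.randn(2, 3, 256)
ctx = torch.randn(2, 3, 256)
print(IntraModalityAggregator()(x, ctx).shape)  # (2, 3, 256)
```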
Given an input video V ∈ R^{C×T×H×W}, we extract video features V_b ∈ R^{C×T×H×W} by applying a 3D video backbone. Afterward, using ROIAlign, we crop person features P, object features O, and hand features H from the video features. We also keep a cache of memory features M = [t−S, ..., t−1, t, t+1, ..., t+S], where 2S+1 is the temporal window. In parallel, we use a pose model to extract person keypoints K from each keyframe of the dataset. The RGB and pose sub-networks then compute the RGB features F_rgb and the pose features F_pose, respectively. These features are fused and subsequently used as anchors for learning global context information, yielding F_cls. Finally, our network outputs ŷ = g(F_cls), where g is the classification head. The overall framework is shown in Figure 2.
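To tie the notation together, the following non-authoritative sketch wires up the data flow described in this section with placeholder layers. The sub-networks, fusion module, temporal module, and head g are all stand-ins (plain linear and attention blocks) with assumed shapes, not the components defined later in the paper.

```python
import torch
import torch.nn as nn

class HITSketch(nn.Module):
    """Data-flow sketch only: every layer is a placeholder, not the paper's actual module."""
    def __init__(self, dim: int = 256, num_classes: int = 80):
        super().__init__()
        self.rgb_stream  = nn.Linear(3 * dim, dim)   # person/object/hand interactions (stand-in)
        self.pose_stream = nn.Linear(dim, dim)       # pose sub-network (stand-in)
        self.afm         = nn.Linear(2 * dim, dim)   # attentive fusion (stand-in)
        self.temporal    = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.head        = nn.Linear(dim, num_classes)   # classification head g

    def forward(self, P, O, H, K_feat, M):
        # P, O, H, K_feat: (batch, n, dim) cropped person/object/hand and encoded pose features
        # M: (batch, 2S+1, dim) cached memory features from neighboring frames
        F_rgb  = self.rgb_stream(torch.cat([P, O, H], dim=-1))
        F_pose = self.pose_stream(K_feat)
        fused  = self.afm(torch.cat([F_rgb, F_pose], dim=-1))
        F_cls, _ = self.temporal(fused, M, M)     # query current actors against temporal context
        return self.head(F_cls)                   # y_hat = g(F_cls)

model = HITSketch()
P = O = H = K = torch.randn(2, 3, 256)
M = torch.randn(2, 7, 256)                        # S = 3 -> 2S+1 = 7 cached frames (assumed)
print(model(P, O, H, K, M).shape)                 # (2, 3, 80)
```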
3.1. Entity Selection
HIT consists of two mirroring modalities with distinct modules designed to learn different types of interactions. Human actions are largely based on a person's pose, hand movements (and hand pose), and interactions with the surroundings. Based on these observations, we select human poses and hand bounding boxes as entities for our model, along with