TAMFORMER: MULTI-MODAL TRANSFORMER WITH LEARNED ATTENTION MASK
FOR EARLY INTENT PREDICTION
Nada Osman, Guglielmo Camporese, Lamberto Ballan
Department of Mathematics “Tullio Levi-Civita”, University of Padova, Italy
{nadasalahmahmoud.osman, guglielmo.camporese}@phd.unipd.it
lamberto.ballan@unipd.it
ABSTRACT
Human intention prediction is a growing area of research in which a vision-based system must anticipate an activity in a video. To this end, the model builds a representation of the past and subsequently produces hypotheses about upcoming scenarios. In this work, we focus on pedestrians' early intention prediction: from a current observation of an urban scene, the model predicts the future activity of pedestrians approaching the street. Our method is based on a multi-modal transformer that encodes past observations and produces multiple predictions at different anticipation times. Moreover, we propose to learn the attention masks of our transformer-based model (Temporal Adaptive Mask Transformer) in order to weigh present and past temporal dependencies differently. We evaluate our method on several public benchmarks for early intention prediction, improving prediction performance at different anticipation times compared to previous works.
Index Terms—Action anticipation, multi-modal deep learning, transformers, pedestrian intent prediction
1. INTRODUCTION
In recent years, computer vision algorithms have improved massively thanks to the advent of deep learning, enabling new applications in the context of autonomous driving, video surveillance, and virtual reality. The visual understanding capabilities of deep learning models have been adopted in various domains, from smart cameras used in video surveillance to cognitive systems in robotics and multi-modal sensors for autonomous driving. Moreover, an interesting recent direction involves predicting future activities that can be anticipated from visual content [1, 2, 3]. Applications enabled by models designed for action anticipation include pedestrian intention prediction from a smart camera and egocentric action anticipation from a robotic agent. In this work, we investigate the early intention prediction of pedestrians in an urban environment. In particular, i) we propose a new model for early intent prediction based on a multi-modal transformer; ii) we propose a new mechanism for learning the attention masks inside the transformer that leads to better performance and more efficient computation; and iii) we conduct several experiments and model ablations on different datasets, obtaining state-of-the-art results on the early intent prediction task.
2. RELATED WORK
Action Recognition. Video action recognition is a well-investigated problem that, in recent years, has seen massive improvements thanks to the progress of deep learning. Specifically, traditional hand-crafted video approaches [4, 5, 6, 7, 8] have been replaced by models based on recurrent neural networks [9, 10, 11, 12, 13], 2D CNNs [14, 15, 16], and 3D CNNs [17, 18, 19, 20, 12, 21]. Transformers [22] have also been investigated for spatio-temporal modeling [23, 24, 20], improving state-of-the-art performance on video-related problems, including video action recognition.
Action Anticipation and Intent Prediction. Recently, anticipating actions in videos has gained attention thanks to the development of new methods [1, 25, 3, 26, 27], datasets [28, 29, 30], and applications such as autonomous driving, human-robot interaction, and virtual reality. In particular, in urban environments, pedestrian intent prediction from third-person view cameras is a growing area [29, 30] in which models are designed to predict the future activity of pedestrians.
Temporal Modeling on Vision Problems. Video-based models need to process both spatial and temporal information. Usually, the temporal axis is considered an independent component of the video, and in the model design the spatial information is processed differently from the temporal one. Recent works have proposed to model the temporal dimension at different frame rates [19], with multiple time-scales [25, 31], and with an adaptive frame rate [32]. However, in such works, the temporal sampling strategy of the frames is fixed and treated as a hyper-parameter of the model. For this reason, in this work we explore and propose an adaptive mechanism for weighting the importance of the current and past frames by learning the attention mask inside our transformer model.
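To make this mechanism concrete, the following is a minimal PyTorch sketch of a self-attention layer in which the additive attention mask over the observed frames is a learned parameter, letting the network weigh present and past frames differently. It is an illustrative simplification under our own naming (LearnedMaskAttention, a single attention head, a zero-initialized mask), not the actual TAMformer implementation.

import torch
import torch.nn as nn

class LearnedMaskAttention(nn.Module):
    # Single-head self-attention over T per-frame features where the additive
    # attention mask is a learned parameter rather than a fixed schedule.
    def __init__(self, dim, num_frames):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        # One learnable logit per (query frame, key frame) pair, initialized
        # to zero so training starts from plain (unmasked) attention.
        self.mask_logits = nn.Parameter(torch.zeros(num_frames, num_frames))

    def forward(self, x):
        # x: (batch, T, dim) sequence of per-frame features
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) * self.scale  # (batch, T, T)
        scores = scores + self.mask_logits             # learned present/past re-weighting
        attn = scores.softmax(dim=-1)
        return self.proj(attn @ v)                     # (batch, T, dim)

layer = LearnedMaskAttention(dim=256, num_frames=16)
out = layer(torch.randn(2, 16, 256))  # (2, 16, 256)

Because the mask logits start at zero, the layer initially behaves as plain attention and adds only T x T extra parameters per layer; training can then learn to suppress or emphasize specific past frames end-to-end.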