TAMFORMER: MULTI-MODAL TRANSFORMER WITH LEARNED ATTENTION MASK
FOR EARLY INTENT PREDICTION
Nada Osman, Guglielmo Camporese, Lamberto Ballan
Department of Mathematics “Tullio Levi-Civita”, University of Padova, Italy
{nadasalahmahmoud.osman, guglielmo.camporese}@phd.unipd.it
lamberto.ballan@unipd.it
ABSTRACT
Human intention prediction is a growing area of research in which an activity in a video has to be anticipated by a vision-based system. To this end, the model builds a representation of the past and subsequently produces hypotheses about upcoming scenarios. In this work, we focus on pedestrians' early intention prediction: from a current observation of an urban scene, the model predicts the future activity of pedestrians who approach the street. Our method is based on a multi-modal transformer that encodes past observations and produces multiple predictions at different anticipation times. Moreover, we propose to learn the attention masks of our transformer-based model (Temporal Adaptive Mask Transformer) in order to weigh present and past temporal dependencies differently. We evaluate our method on several public benchmarks for early intention prediction, improving prediction performance at different anticipation times compared to previous works.
Index Terms—Action anticipation, multi-modal deep learning, transformers, pedestrian intent prediction
1. INTRODUCTION
In recent years, computer vision algorithms have improved massively thanks to the advent of deep learning, enabling new applications in the context of autonomous driving, video surveillance, and virtual reality. The visual understanding capabilities of deep learning models have been adopted in various domains, from smart cameras used in video surveillance to cognitive systems in robotics and multi-modal sensors for autonomous driving. Moreover, an interesting recent direction involves predicting future activities that can be anticipated from visual content [1, 2, 3]. Applications enabled by models designed for action anticipation include pedestrian intention prediction from a smart camera and ego-centric action anticipation from a robotic agent. In this work, we investigate the early intention prediction of pedestrians in an urban environment. In particular, i) we propose a new model for early intent prediction based on a multi-modal transformer; ii) we propose a new mechanism for learning the attention masks inside the transformer that leads to better performance and more efficient computation; and iii) we conduct several experiments and model ablations on different datasets, obtaining state-of-the-art results on the early intent prediction task.
2. RELATED WORKS
Action Recognition. Video action recognition is a well-investigated problem that, in recent years, has experienced massive improvements thanks to the progress of deep learning. Specifically, traditional hand-crafted video approaches [4, 5, 6, 7, 8] have been replaced by models based on recurrent neural networks [9, 10, 11, 12, 13], 2D CNNs [14, 15, 16], and 3D CNNs [17, 18, 19, 20, 12, 21]. Transformers [22] have also been investigated for spatio-temporal modeling [23, 24, 20], improving state-of-the-art performance on video-related problems, including video action recognition.
Action Anticipation and Intent Prediction. Recently, anticipating actions in videos has gained attention thanks to the development of new methods [1, 25, 3, 26, 27], datasets [28, 29, 30], and applications such as autonomous driving, human-robot interaction, and virtual reality. In particular, in urban environments, pedestrian intent prediction from third-person cameras is a growing area [29, 30] in which models are designed to predict the future activity of pedestrians.
Temporal Modeling in Vision Problems. Video-based models need to process spatial and temporal information. Usually, the temporal axis is considered an independent component of the video, and in the model design, spatial information is processed differently from temporal information. Recent works proposed to model the temporal dimension at different frame rates [19], with multiple time scales [25, 31], and with an adaptive frame rate [32]. However, in these works, the temporal sampling strategy of the frames is fixed and treated as a hyper-parameter of the model. For this reason, in this work, we explore and propose an adaptive mechanism for weighting the importance of the current and past frames by learning the attention mask inside our transformer model.
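The exact formulation of the learned attention mask is not given in this excerpt; the following is a minimal NumPy sketch of one way a learnable mask could modulate attention over past time steps. The sigmoid gating, the renormalization, and all function and variable names are our assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, mask_logits):
    """Scaled dot-product attention with a learnable temporal mask.

    q, k, v: (T, d) arrays of queries, keys, and values over T time steps.
    mask_logits: (T, T) learnable parameters; entry (i, j) controls how
    much time step j may contribute to the prediction at time step i.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # A sigmoid squashes each learned logit into a [0, 1] gate that scales
    # the attention weight on each (current or past) frame.
    gate = 1.0 / (1.0 + np.exp(-mask_logits))
    weights = softmax(scores, axis=-1) * gate
    # Renormalize so each row is again a distribution over time steps.
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because the gate is differentiable, the mask logits can be trained jointly with the rest of the model, letting it learn how strongly to weigh present versus past frames instead of fixing a temporal sampling strategy as a hyper-parameter.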
arXiv:2210.14714v1 [cs.CV] 26 Oct 2022