Grounded Video Situation Recognition
Zeeshan Khan C. V. Jawahar Makarand Tapaswi
CVIT, IIIT Hyderabad
https://zeeshank95.github.io/grvidsitu
Abstract
Dense video understanding requires answering several questions such as who is
doing what to whom, with what, how, why, and where. Recently, Video Situation
Recognition (VidSitu) was framed as a task for structured prediction of multiple
events, their relationships, and the actions and verb-role pairs attached to
descriptive entities. This task poses several challenges in identifying, disambiguat-
ing, and co-referencing entities across multiple verb-role pairs, but also faces
some challenges of evaluation. In this work, we propose the addition of spatio-
temporal grounding as an essential component of the structured prediction task in
a weakly supervised setting, and present a novel three-stage Transformer model,
VideoWhisperer, that is empowered to make joint predictions. In stage one, we
learn contextualised embeddings for video features in parallel with key objects
that appear in the video clips to enable fine-grained spatio-temporal reasoning.
The second stage sees verb-role queries attend and pool information from object
embeddings, localising answers to questions posed about the action. The final
stage generates these answers as captions to describe each verb-role pair present
in the video. Our model operates on a group of events (clips) simultaneously and
predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When
evaluated on a grounding-augmented version of the VidSitu dataset, we observe a
large improvement in entity captioning accuracy, as well as the ability to localize
verb-roles without grounding annotations at training time.
1 Introduction
At the end of The Dark Knight, we see a short, intense sequence that involves Harvey Dent tossing a coin
while holding a gun, followed by sudden action. Holistic understanding of such a video sequence,
especially one that involves multiple people, requires predicting more than the action label (what
verb). For example, we may wish to answer questions such as who performed the action (agent),
why they are doing it (purpose / goal), how are they doing it (manner), where are they doing it
(location), and even what happens after (multi-event understanding). While humans are able to
perceive the situation and are good at answering such questions, many works often focus on building
tools for doing single tasks, e.g. predicting actions [8] or detecting objects [2, 4] or image/video
captioning [19, 29]. We are interested in assessing how some of these advances can be combined for
a holistic understanding of video clips.
A recent and audacious step towards this goal is the work by Sadhu et al. [28]. They propose Video
Situation Recognition (VidSitu), a structured prediction task over five short clips consisting of three
sub-problems: (i) recognizing the salient actions in the short clips; (ii) predicting roles and their
entities that are part of this action; and (iii) modelling simple event relations such as enable or
cause. Similar to its image-based predecessor, image situation recognition (imSitu [40]), VidSitu is annotated
using Semantic Role Labelling (SRL) [22]. A video (say 10s) is divided into multiple small events
(∼2s) and each event is associated with a salient action verb (e.g. hit). Each verb has a fixed set
of roles or arguments, e.g. agent-Arg0, patient-Arg1, tool-Arg2, location-ArgM(Location), manner-ArgM(Manner), etc.,
and each role is annotated with a free-form text caption, e.g. agent: Blonde Woman, as illustrated in Fig. 1.
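To make this annotation format concrete, a minimal Python sketch of an event record follows, assuming a simple dataclass layout; the field names and example values (loosely following Fig. 1) are illustrative rather than the released VidSitu schema.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Illustrative container for one annotated event: a verb plus role-caption pairs,
# optionally grounded with a (keyframe index, x1, y1, x2, y2) box.
@dataclass
class RolePrediction:
    role: str                      # e.g. "Arg0 (agent)"
    caption: str                   # free-form noun phrase, e.g. "blonde woman"
    box: Optional[Tuple[int, float, float, float, float]] = None

@dataclass
class EventPrediction:
    verb: str                      # e.g. "hit"
    roles: List[RolePrediction] = field(default_factory=list)

@dataclass
class VideoSituation:
    events: List[EventPrediction] = field(default_factory=list)

# Example loosely matching Fig. 1 (Video 2, Event 1):
ev = EventPrediction(verb="hit", roles=[
    RolePrediction("Arg0 (hitter)", "man in armor"),
    RolePrediction("Arg1 (thing hit)", "bald man"),
    RolePrediction("Arg2 (instrument)", "spear"),
    RolePrediction("ArgM (scene)", "arena"),
])
print(ev.verb, [r.caption for r in ev.roles])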
[Figure 1: three example videos with event-level verb-role annotations, e.g. Video 1, Ev-1: Verb ROLL with Arg0 (Roller): Boy in striped shirt, Arg1 (Thing rolled): Himself, ArgM (Direction): back and forth, ArgScene: Backyard; Video 2, Ev-1: Verb HIT with Arg0 (Hitter): Man in armor, Arg1 (Thing hit): Bald man, Arg2 (Instrument): Spear, ArgScene: Arena, and Ev-N: Verb WINCE with Arg0 (Wincer): Bald man, ArgScene: Arena; Video 3, Ev-1: Verb LIFT with Arg0 (Elevator): Blonde woman, Arg1 (Thing lift): Her phone, ArgM (Direction): Up, ArgM (Manner): Quickly, ArgScene: An open field.]
Figure 1: Overview of GVSR: Given a video consisting of multiple events, GVSR requires recognising the action verbs, their corresponding roles, and localising them in the spatio-temporal domain. This is a challenging task as it requires disambiguating between the several roles that the same entity may take in different events, e.g. in Video 2 the bald man is a patient in event 1, but an agent in event N. Moreover, entities that appear in multiple events are co-referenced across all such events. Coloured arguments are grounded in the image with bounding boxes (figure best seen in colour).
Grounded VidSitu.
VidSitu poses
various challenges: long-tailed distri-
bution of both verbs and text phrases,
disambiguating the roles, overcoming
semantic role-noun pair sparsity, and
co-referencing of entities in the en-
tire video. Moreover, there is ambi-
guity in text phrases that refer to the
same unique entity (e.g. “man in white
shirt” or “man with brown hair”). A
model may fail to understand which
attributes are important and may bias
towards a specific caption (or pattern
like shirt color), given the long-tailed
distribution. This is exacerbated when
multiple entities (e.g. agent and patient) have similar attributes and the
model predicts the same caption for
them (see Fig. 1). To remove biases
of the captioning module and gauge
the model’s ability to identify the role,
we propose Grounded Video Situation
Recognition (GVSR) - an extension
of the VidSitu task to include spatio-
temporal grounding. In addition to
predicting the captions for the role-
entity pairs, we now expect the struc-
tured output to contain spatio-temporal localization, currently posed as a weakly-supervised task.
Joint structured prediction.
Previous works [28, 38] modeled the VidSitu tasks separately, e.g. the
ground-truth verb is fed to the SRL task. This setup does not allow for situation recognition on a new
video clip without manual intervention. Instead, in this work, we focus on solving three tasks jointly:
(i) verb classification; (ii) SRL; and (iii) Grounding for SRL. We ignore the original event relation
prediction task in this work, as this can be performed later in a decoupled manner similar to [28].
We propose VideoWhisperer, a new three-stage transformer architecture that enables video under-
standing at a global level through self-attention across all video clips, and generates predictions for
the above three tasks at an event level through localised event-role representations. In the first stage,
we use a Transformer encoder to align and contextualise 2D object features in addition to event-level
video features. These rich features are essential for grounded situation recognition, and are used to
predict both the verb-role pairs and entities. In the second stage, a Transformer decoder models the
role as a query, and applies cross-attention to find the best elements from the contextualised object
features, also enabling visual grounding. Finally, in stage three, we generate the captions for each
role entity. The three-stage network disentangles the three tasks and allows for end-to-end training.
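A rough, runnable sketch of this three-stage flow is given below, using toy dimensions, untrained stand-in modules, and an assumed additive construction of role queries from event and role embeddings; it illustrates the data flow only, not the exact architecture detailed in Section 3.

import torch
import torch.nn as nn

# Toy end-to-end pass through the three stages; all names, sizes, and vocabularies
# are illustrative placeholders.
d, n_events, n_obj, n_verbs, n_roles, vocab = 64, 5, 12, 100, 20, 500

vo_encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, 4, batch_first=True), 2)
verb_head = nn.Linear(d, n_verbs)
role_head = nn.Linear(d, n_roles)                  # multi-label roles per event
ro_decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d, 4, batch_first=True), 2)
caption_head = nn.Linear(d, vocab)                 # stand-in for the caption decoder

x_event, x_obj = torch.randn(1, n_events, d), torch.randn(1, n_obj, d)

# Stage 1: contextualise video (event) and object tokens jointly with self-attention.
ctx = vo_encoder(torch.cat([x_obj, x_event], dim=1))
obj_ctx, ev_ctx = ctx[:, :n_obj], ctx[:, n_obj:]
verb_logits, role_logits = verb_head(ev_ctx), role_head(ev_ctx)

# Stage 2: role queries (event embedding + learned role embedding) cross-attend to objects.
role_emb = nn.Embedding(n_roles, d)
queries = ev_ctx[:, :, None, :] + role_emb.weight[None, None]   # (1, N, R, d)
queries = queries.flatten(1, 2)                                 # (1, N*R, d)
role_ctx = ro_decoder(queries, obj_ctx)                         # attend over object tokens

# Stage 3: caption logits for each (event, role) query (greedy decoding omitted).
caption_logits = caption_head(role_ctx)
print(verb_logits.shape, role_logits.shape, caption_logits.shape)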
Contributions summary.
(i) We present a new framework that combines grounding with SRL
for end-to-end Grounded Video Situation Recognition (GVSR). We will release the grounding
annotations and also include them in the evaluation benchmark. (ii) We design a new three-stage
transformer architecture for joint verb prediction, semantic-role labelling through caption generation,
and weakly-supervised grounding of visual entities. (iii) We propose role prediction and use role
queries contextualised by video embeddings for SRL, circumventing the requirement of ground-truth
verbs or roles, enabling end-to-end GVSR. (iv) At the encoder, we combine object features with video
features and highlight multiple advantages enabling weakly-supervised grounding and improving the
quality of SRL captions, leading to a 22-point jump in CIDEr score in comparison to a video-only
baseline [28]. (v) Finally, we present extensive ablation experiments to analyze our model. Our
model achieves state-of-the-art results on the VidSitu benchmark.
2 Related Work
Image Situation Recognition.
Situation Recognition in images was first proposed by [10], who created datasets to understand actions along with localisation of objects and people. Another line of work, imSitu [40], proposed situation recognition via semantic role labelling by leveraging the linguistic frameworks FrameNet [3] and WordNet [20] to formalize situations in the form of verb-role-noun triplets. Recently, grounding has been incorporated into image situation recognition [24] to add a level of understanding for the predicted SRL. Situation recognition requires global understanding of the entire scene, where the verbs, roles, and nouns interact with each other to predict a coherent output. Therefore, several approaches used CRFs [40], LSTMs [24], and Graph Neural Networks [14] to model the global dependencies among verbs and roles. Recently, various Transformer [33] based methods have been proposed that claim large performance improvements [6, 7, 36].
Video Situation Recognition.
Recently, imSitu was extended to videos as VidSitu [28], a large-scale video dataset based on short movie clips spanning multiple events. Compared to image situation recognition, VidSRL not only requires understanding the action and the entities involved in a single frame, but also needs to coherently understand the entire video while predicting event-level verb-SRLs and co-referencing the entities participating across events. Sadhu et al. [28] propose to use standard video backbones for feature extraction followed by multiple but separate Transformers to model all the tasks individually, using the ground-truth of the previous task to model the next. A concurrent work to this submission, [38], proposes to improve upon the video features by pretraining the low-level video backbone using contrastive learning objectives, and pretraining the high-level video contextualiser using event mask prediction tasks, resulting in large performance improvements on SRL. Our goals are different from the above two works: we propose to learn and predict all three tasks simultaneously. To achieve this, we predict verb-role pairs on the fly and design a new role query contextualised by video embeddings to model SRL. This eliminates the need for ground-truth verbs and enables end-to-end situation recognition in videos. We also propose to learn contextualised object and video features enabling weakly-supervised grounding for SRL, which was not supported by previous works.
Video Understanding.
Video understanding is a broad area of research, dominantly involving tasks like action recognition [5, 8, 9, 30, 35, 37], localisation [16, 17], object grounding [27, 39], question answering [32, 41], video captioning [26], and spatio-temporal detection [9, 31]. These tasks involve visual temporal understanding in a sparse uni-dimensional way. In contrast, GVSR involves a hierarchy of tasks, coming together to provide a fixed structure, enabling dense situation recognition. The proposed task requires global video understanding through event-level predictions and fine-grained details to recognise all the entities involved, the roles they play, and simultaneously ground them. Note that our work on grounding is different from classical spatio-temporal video grounding [39, 42] or referring-expression based segmentation [11] as they require a text query as input. In our case, both the text and the bounding box (grounding) are predicted jointly by the model.
3 VideoWhisperer for Grounded Video Situation Recognition
We now present the details of our three-stage Transformer model, VideoWhisperer. A visual overview
is presented in Fig. 2. For brevity, we request the reader to refer to [33] for the now-popular details of
self- and cross-attention layers used in Transformer encoders and decoders.
Preliminaries. Given a video $V$ consisting of several short events $E = \{e_i\}$, the complete situation in $V$ is characterised by 3 tasks. (i) Verb classification requires predicting the action label $v_i$ associated with each event $e_i$; (ii) Semantic role labelling (SRL) involves guessing the nouns (captions) $C_i = \{C_{ik}\}$ for the various roles $R_i = \{r \mid r \in \mathcal{P}(v_i), r \in \mathcal{R}\}$ associated with the verb $v_i$ ($\mathcal{P}$ is a mapping function from verbs to a set of roles based on VidSitu (extended PropBank [22]) and $\mathcal{R}$ is the set of all roles); and (iii) Spatio-temporal grounding of each visual role-noun prediction $C_{ij}$ is formulated as selecting one among several bounding box proposals $B$ obtained from sub-sampled keyframes of the video. We evaluate this against ground-truth annotations done at a keyframe level.
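As a small numerical illustration of task (iii), grounding reduces to picking one proposal from $B$ for a predicted role-noun, which can then be compared with the keyframe-level ground-truth box; the box format, scores, and IoU threshold below are assumptions made for illustration.

import torch

# Grounding as proposal selection: pick the highest-scoring candidate box for a
# role-noun and check it against the keyframe-level ground-truth via IoU.
def iou(a, b):
    # a, b: (4,) boxes as (x1, y1, x2, y2)
    x1, y1 = torch.maximum(a[0], b[0]), torch.maximum(a[1], b[1])
    x2, y2 = torch.minimum(a[2], b[2]), torch.minimum(a[3], b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area = lambda z: (z[2] - z[0]) * (z[3] - z[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

proposals = torch.tensor([[0.1, 0.1, 0.4, 0.8],   # candidate boxes B on one keyframe
                          [0.5, 0.2, 0.9, 0.9]])
scores = torch.tensor([0.3, 0.7])                 # model's score per proposal for this role-noun
gt_box = torch.tensor([0.45, 0.25, 0.95, 0.85])   # keyframe-level ground-truth annotation
picked = proposals[scores.argmax()]
print(float(iou(picked, gt_box)) > 0.5)           # counts as correctly grounded at IoU 0.5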
3.1 Contextualised Video and Object Features (Stage 1)
GVSR is a challenging task that requires coherently modelling spatio-temporal information to understand the salient action, determine the semantic role-noun pairs involved with the action, and simultaneously localise them.
[Figure 2: architecture diagram. Event-wise videos $\{e_i\}$ and sampled frames $\{f_t\}$ are encoded by video and object backbones into N event embeddings and T*M object embeddings (M objects per frame), augmented with event temporal and 2D object position embeddings. The video-object encoder (VO, self-attention) outputs contextualised embeddings used for per-event verb classification (e.g. Run, Breathe) and multi-label role classification. Role queries $r_1, \ldots, r_k$ feed the role-object decoder (RO, cross + self-attention) with event-aware cross-attention over the T*M object proposals, and a captioning decoder (C, self + cross-attention) generates a caption per role, e.g. "Man in gray shirt", "On a road", "Woman in blue shirt".]
Figure 2: VideoWhisperer: We present a new 3-stage Transformer for GVSR. Stage 1 learns contextualised object and event embeddings through a video-object Transformer encoder (VO), which are used to predict the verb-role pairs for each event. Stage 2 models all the predicted roles by creating role queries contextualised by event embeddings, and attends to all the object proposals through a role-object Transformer decoder (RO) to find the best entity that represents a role. The output embeddings are fed to a captioning Transformer decoder (C) to generate captions for each role. Transformer RO's cross-attention ranks all the object proposals, enabling localization for each role.
Different from previous works that operate only on event-level video features, we propose to model both the event- and object-level features simultaneously. We use a pretrained video backbone $\phi_{vid}$ to extract event-level video embeddings $x^e_i = \phi_{vid}(e_i)$. For representing objects, we subsample frames $F = \{f_t\}_{t=1}^{T}$ from the entire video $V$. We use a pretrained object detector $\phi_{obj}$ and extract the top $M$ object proposals from every frame. The box locations (along with timestamps) and the corresponding features are
$$B = \{b_{mt}\}, \quad m = 1, \ldots, M, \; t = 1, \ldots, T, \qquad \text{and} \qquad \{x^o_{mt}\}_{m=1}^{M} = \phi_{obj}(f_t), \quad \text{respectively.} \tag{1}$$
The subset of frames associated with an event $e_i$ is computed based on the event's timestamps,
$$F_i = \{f_t \mid e_i^{\mathrm{start}} \le t \le e_i^{\mathrm{end}}\}. \tag{2}$$
Specifically, at a sampling rate of 1 fps, a video $V$ of 10s, and events $e_i$ of 2s each, we associate 3 frames with each event such that the border frames are shared. We can extend this association to all object proposals based on the frame in which they appear and denote this as $B_i$.
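As a sanity check of this association, the snippet below reproduces the 1 fps / 10s video / 2s event arithmetic, assuming frames are sampled at integer timestamps, so that each event receives three frames and neighbouring events share a border frame.

# Frame-event association: a frame f_t belongs to event e_i if
# e_i.start <= t <= e_i.end, so border frames are shared between neighbours.
def frames_per_event(num_events=5, event_len=2.0, fps=1.0, duration=10.0):
    frame_times = [t / fps for t in range(int(duration * fps) + 1)]  # 0, 1, ..., 10
    events = [(i * event_len, (i + 1) * event_len) for i in range(num_events)]
    return [[t for t in frame_times if start <= t <= end] for start, end in events]

print(frames_per_event())
# [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [8.0, 9.0, 10.0]]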
Video-Object Transformer Encoder (VO).
Since the object and video embeddings come from different spaces, we align and contextualise them with a Transformer encoder [33]. Event-level position embeddings $PE_i$ are added to both representations, event $x^e_i$ and object $x^o_{mt}$ ($t \in F_i$). In addition, 2D object position embeddings $PE_{mt}$ are added to the object embeddings $x^o_{mt}$. Together, they help capture spatio-temporal information. The object and video tokens are passed through multiple self-attention layers to produce contextualised event and object embeddings:
$$[\ldots, o'_{mt}, \ldots, e'_i, \ldots] = \mathrm{Transformer}_{VO}\left([\ldots, x^o_{mt} + PE_i + PE_{mt}, \ldots, x^e_i + PE_i, \ldots]\right). \tag{3}$$
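A minimal PyTorch sketch of Eq. (3) follows, assuming the 2D position embedding $PE_{mt}$ is a learned linear projection of the normalised box coordinates and timestamp; the layer counts, widths, and that projection choice are placeholders rather than the paper's configuration.

import torch
import torch.nn as nn

class VOEncoder(nn.Module):
    """Sketch of the video-object encoder (Eq. 3); hyper-parameters are placeholders."""
    def __init__(self, d_model=512, n_events=5, n_layers=3, n_heads=8):
        super().__init__()
        self.event_pos = nn.Embedding(n_events, d_model)   # PE_i (event index)
        self.box_pos = nn.Linear(5, d_model)                # PE_mt from (x1, y1, x2, y2, t)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x_event, x_obj, obj_event_idx, obj_boxes):
        # x_event: (B, N, d)     event-level video embeddings x^e_i
        # x_obj:   (B, T*M, d)   object embeddings x^o_mt
        # obj_event_idx: (B, T*M) index of the event each object token belongs to
        # obj_boxes: (B, T*M, 5)  normalised box coordinates + timestamp
        ev_ids = torch.arange(x_event.size(1), device=x_event.device)
        x_event = x_event + self.event_pos(ev_ids)                              # add PE_i
        x_obj = x_obj + self.event_pos(obj_event_idx) + self.box_pos(obj_boxes)  # PE_i + PE_mt
        tokens = torch.cat([x_obj, x_event], dim=1)          # joint token sequence
        out = self.encoder(tokens)                           # self-attention, Eq. (3)
        return out[:, :x_obj.size(1)], out[:, x_obj.size(1):]  # (o'_mt, e'_i)

# Tiny smoke test with random features.
B, N, T, M, d = 2, 5, 11, 4, 512
enc = VOEncoder(d_model=d, n_events=N)
o_ctx, e_ctx = enc(torch.randn(B, N, d), torch.randn(B, T * M, d),
                   torch.randint(0, N, (B, T * M)), torch.rand(B, T * M, 5))
print(o_ctx.shape, e_ctx.shape)   # torch.Size([2, 44, 512]) torch.Size([2, 5, 512])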
Verb and role classification.
Each contextualised event embedding $e'_i$ is empowered not only to combine information across neighboring events but also to focus on key objects that may be relevant. We predict the action label for each event by passing it through a 1-hidden-layer MLP,
$$\hat{v}_i = \mathrm{MLP}_e(e'_i). \tag{4}$$
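Eq. (4) corresponds to a standard classification head; a sketch with placeholder hidden width and verb vocabulary size is shown below. The multi-label role classification head of Fig. 2 can be built analogously with a sigmoid output.

import torch
import torch.nn as nn

# One-hidden-layer MLP head for per-event verb classification (Eq. 4);
# hidden size and number of verb classes are placeholders.
class VerbHead(nn.Module):
    def __init__(self, d_model=512, hidden=512, n_verbs=1000):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_verbs))
    def forward(self, e_ctx):          # e_ctx: (B, N, d) contextualised events e'_i
        return self.mlp(e_ctx)         # (B, N, n_verbs) verb logits per event

e_ctx = torch.randn(2, 5, 512)
logits = VerbHead()(e_ctx)
print(logits.argmax(-1).shape)         # predicted verb index per event: torch.Size([2, 5])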