
ArgM (manner), etc., and each role is annotated with a free-form text caption, e.g. agent: Blonde Woman, as illustrated in Fig. 1.
Figure 1: Overview of GVSR: Given a video consisting of multiple events, GVSR requires recognising the action verbs, their corresponding roles, and localising them in the spatio-temporal domain. This is a challenging task as it requires disambiguating between several roles that the same entity may take in different events, e.g. in Video 2 the bald man is a patient in event 1, but an agent in event N. Moreover, the entities present in multiple events are co-referenced in all such events. Coloured arguments are grounded in the image with bounding boxes (figure best seen in colour).
Grounded VidSitu. VidSitu poses various challenges: a long-tailed distribution of both verbs and text phrases, disambiguating the roles, overcoming semantic role-noun pair sparsity, and co-referencing of entities in the entire video. Moreover, there is ambiguity in text phrases that refer to the same unique entity (e.g. "man in white shirt" or "man with brown hair"). A model may fail to understand which attributes are important and may be biased towards a specific caption (or a pattern such as shirt colour), given the long-tailed distribution. This is exacerbated when multiple entities (e.g. agent and patient) have similar attributes and the model predicts the same caption for them (see Fig. 1). To remove biases of the captioning module and to gauge the model's ability to identify the role, we propose Grounded Video Situation Recognition (GVSR), an extension of the VidSitu task to include spatio-temporal grounding. In addition to predicting the captions for the role-entity pairs, we now expect the structured output to contain spatio-temporal localisation, currently posed as a weakly-supervised task.
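To make the expected structured output concrete, the sketch below shows one plausible per-event representation. The field names, verb label, role labels, box format, and entity identifiers are illustrative assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class GroundedRole:
    """One semantic role within an event, e.g. Arg0 (agent)."""
    role: str                 # e.g. "Arg0 (agent)", "ArgM (manner)"
    caption: str              # free-form text phrase, e.g. "blonde woman"
    entity_id: int            # shared across events for co-referenced entities
    # frame index -> (x1, y1, x2, y2) box; empty for non-visual roles such as manner
    boxes: Dict[int, Tuple[float, float, float, float]] = field(default_factory=dict)

@dataclass
class GroundedEvent:
    """GVSR structured output for a single event of the video."""
    verb: str                 # action verb, e.g. "push"
    roles: List[GroundedRole]

# Hypothetical example mirroring Fig. 1: the same entity (entity_id=1) is the
# patient of this event but may be the agent of a later event in the same video.
event = GroundedEvent(
    verb="push",
    roles=[
        GroundedRole("Arg0 (agent)", "blonde woman", entity_id=0,
                     boxes={12: (0.10, 0.20, 0.45, 0.95)}),
        GroundedRole("Arg1 (patient)", "bald man", entity_id=1,
                     boxes={12: (0.50, 0.15, 0.90, 0.95)}),
    ],
)
```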
Joint structured prediction. Previous works [28, 38] modelled the VidSitu tasks separately, e.g. the ground-truth verb is fed to the SRL task. This setup does not allow for situation recognition on a new video clip without manual intervention. Instead, in this work, we focus on solving three tasks jointly: (i) verb classification; (ii) SRL; and (iii) grounding for SRL. We ignore the original event-relation prediction task in this work, as it can be performed later in a decoupled manner, similar to [28].
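As a rough illustration of what training these tasks jointly could look like (a minimal sketch, not our exact formulation; the loss composition and weights are assumptions), one could combine cross-entropy for verb classification with teacher-forced token-level cross-entropy for caption generation:

```python
import torch.nn.functional as F

def joint_gvsr_loss(verb_logits, verb_targets, caption_logits, caption_targets,
                    verb_weight=1.0, caption_weight=1.0):
    """Hypothetical joint objective for verb classification and SRL captioning.

    Grounding is weakly supervised, so no box-level term appears here;
    localisation is read off the decoder's cross-attention instead.
    verb_logits: (num_events, num_verbs), verb_targets: (num_events,)
    caption_logits: (num_tokens, vocab_size), caption_targets: (num_tokens,)
    """
    l_verb = F.cross_entropy(verb_logits, verb_targets)
    l_caption = F.cross_entropy(caption_logits, caption_targets, ignore_index=-100)
    return verb_weight * l_verb + caption_weight * l_caption
```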
We propose VideoWhisperer, a new three-stage transformer architecture that enables video understanding at a global level through self-attention across all video clips, and generates predictions for the above three tasks at an event level through localised event-role representations. In the first stage, we use a Transformer encoder to align and contextualise 2D object features in addition to event-level video features. These rich features are essential for grounded situation recognition, and are used to predict both the verb-role pairs and the entities. In the second stage, a Transformer decoder models the role as a query and applies cross-attention to find the best-matching elements among the contextualised object features, also enabling visual grounding. Finally, in stage three, we generate the captions for each role-entity pair. The three-stage network disentangles the three tasks and allows for end-to-end training.
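The sketch below illustrates this three-stage layout in PyTorch. It is a simplified approximation, not the exact VideoWhisperer implementation: the layer counts, hidden sizes, fixed set of learned role queries, and captioning interface are all assumptions.

```python
import torch
import torch.nn as nn

class ThreeStageGVSR(nn.Module):
    """Schematic three-stage pipeline: contextualise features, decode roles, caption."""

    def __init__(self, d_model=512, num_verbs=1000, num_roles=6, vocab_size=10000):
        super().__init__()
        # Stage 1: encoder contextualises 2D object features together with
        # event-level video features via self-attention across the whole video.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        self.verb_head = nn.Linear(d_model, num_verbs)

        # Stage 2: decoder treats roles as queries; its cross-attention over the
        # contextualised object features provides the weakly-supervised grounding.
        self.role_queries = nn.Embedding(num_roles, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.role_decoder = nn.TransformerDecoder(dec_layer, num_layers=3)

        # Stage 3: an autoregressive captioning decoder per role-entity representation.
        cap_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.caption_decoder = nn.TransformerDecoder(cap_layer, num_layers=3)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.token_head = nn.Linear(d_model, vocab_size)

    def forward(self, object_feats, video_feats, caption_tokens):
        # object_feats: (B, num_objects, d), video_feats: (B, num_events, d),
        # caption_tokens: (B, num_roles, T) ground-truth tokens for teacher forcing.
        ctx = self.encoder(torch.cat([object_feats, video_feats], dim=1))
        obj_ctx = ctx[:, :object_feats.size(1)]
        vid_ctx = ctx[:, object_feats.size(1):]
        verb_logits = self.verb_head(vid_ctx)                 # (B, num_events, num_verbs)

        B = ctx.size(0)
        queries = self.role_queries.weight.unsqueeze(0).expand(B, -1, -1)
        role_entity = self.role_decoder(queries, obj_ctx)     # (B, num_roles, d)

        _, R, T = caption_tokens.shape
        tok = self.token_embed(caption_tokens).flatten(0, 1)  # (B*R, T, d)
        mem = role_entity.flatten(0, 1).unsqueeze(1)          # (B*R, 1, d)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=tok.device), 1)
        out = self.caption_decoder(tok, mem, tgt_mask=causal)
        caption_logits = self.token_head(out).view(B, R, T, -1)
        return verb_logits, role_entity, caption_logits
```

In the full model, the role queries are derived from the predicted verb-role pairs and contextualised by the video embeddings, rather than being a fixed learned set as in this sketch.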
Contributions summary. (i) We present a new framework that combines grounding with SRL for end-to-end Grounded Video Situation Recognition (GVSR). We will release the grounding annotations and also include them in the evaluation benchmark. (ii) We design a new three-stage transformer architecture for joint verb prediction, semantic-role labelling through caption generation, and weakly-supervised grounding of visual entities. (iii) We propose role prediction and use role queries contextualised by video embeddings for SRL, circumventing the requirement of ground-truth verbs or roles and enabling end-to-end GVSR. (iv) At the encoder, we combine object features with video features and highlight multiple advantages: enabling weakly-supervised grounding and improving the quality of SRL captions, leading to a 22-point jump in CIDEr score in comparison to a video-only baseline [28]. (v) Finally, we present extensive ablation experiments to analyse our model. Our model achieves state-of-the-art results on the VidSitu benchmark.