
ArgM (manner), etc., and each role is annotated with a free-form text caption, e.g. agent: Blonde Woman, as illustrated in Fig. 1.
Figure 1: Overview of GVSR: Given a video consisting of multiple events, GVSR requires recognising the action verbs, their corresponding roles, and localising them in the spatio-temporal domain. This is a challenging task as it requires disambiguating between several roles that the same entity may take in different events, e.g. in Video 2 the bald man is a patient in event 1, but an agent in event N. Moreover, the entities present in multiple events are co-referenced in all such events. Coloured arguments are grounded in the image with bounding boxes (figure best seen in colour).
Grounded VidSitu. VidSitu poses various challenges: a long-tailed distribution of both verbs and text phrases, disambiguating the roles, overcoming semantic role-noun pair sparsity, and co-referencing of entities in the entire video. Moreover, there is ambiguity in text phrases that refer to the same unique entity (e.g. "man in white shirt" or "man with brown hair"). A model may fail to understand which attributes are important and may be biased towards a specific caption (or a pattern such as shirt colour), given the long-tailed distribution. This is exacerbated when multiple entities (e.g. agent and patient) have similar attributes and the model predicts the same caption for them (see Fig. 1). To remove biases of the captioning module and to gauge the model's ability to identify the role, we propose Grounded Video Situation Recognition (GVSR), an extension of the VidSitu task to include spatio-temporal grounding. In addition to predicting the captions for the role-entity pairs, we now expect the structured output to contain spatio-temporal localisation, currently posed as a weakly-supervised task.
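To make the expected structured output concrete, the sketch below shows one plausible per-event representation. The field names, verb label, role labels, box format, and entity identifiers are illustrative assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class GroundedRole:
    """One semantic role within an event, e.g. Arg0 (agent)."""
    role: str                 # e.g. "Arg0 (agent)", "ArgM (manner)"
    caption: str              # free-form text phrase, e.g. "blonde woman"
    entity_id: int            # shared across events for co-referenced entities
    # frame index -> (x1, y1, x2, y2) box; empty for non-visual roles such as manner
    boxes: Dict[int, Tuple[float, float, float, float]] = field(default_factory=dict)

@dataclass
class GroundedEvent:
    """GVSR structured output for a single event of the video."""
    verb: str                 # action verb, e.g. "push"
    roles: List[GroundedRole]

# Hypothetical example mirroring Fig. 1: the same entity (entity_id=1) is the
# patient of this event but may be the agent of a later event in the same video.
event = GroundedEvent(
    verb="push",
    roles=[
        GroundedRole("Arg0 (agent)", "blonde woman", entity_id=0,
                     boxes={12: (0.10, 0.20, 0.45, 0.95)}),
        GroundedRole("Arg1 (patient)", "bald man", entity_id=1,
                     boxes={12: (0.50, 0.15, 0.90, 0.95)}),
    ],
)
```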
Joint structured prediction. Previous works [28, 38] modelled the VidSitu tasks separately, e.g. the ground-truth verb is fed to the SRL task. This setup does not allow for situation recognition on a new video clip without manual intervention. Instead, in this work, we focus on solving three tasks jointly: (i) verb classification; (ii) SRL; and (iii) grounding for SRL. We ignore the original event-relation prediction task in this work, as it can be performed later in a decoupled manner, similar to [28].
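As a rough illustration of what training these tasks jointly could look like (a minimal sketch, not our exact formulation; the loss composition and weights are assumptions), one could combine cross-entropy for verb classification with teacher-forced token-level cross-entropy for caption generation:

```python
import torch.nn.functional as F

def joint_gvsr_loss(verb_logits, verb_targets, caption_logits, caption_targets,
                    verb_weight=1.0, caption_weight=1.0):
    """Hypothetical joint objective for verb classification and SRL captioning.

    Grounding is weakly supervised, so no box-level term appears here;
    localisation is read off the decoder's cross-attention instead.
    verb_logits: (num_events, num_verbs), verb_targets: (num_events,)
    caption_logits: (num_tokens, vocab_size), caption_targets: (num_tokens,)
    """
    l_verb = F.cross_entropy(verb_logits, verb_targets)
    l_caption = F.cross_entropy(caption_logits, caption_targets, ignore_index=-100)
    return verb_weight * l_verb + caption_weight * l_caption
```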
We propose VideoWhisperer, a new three-stage transformer architecture that enables video understanding at a global level through self-attention across all video clips, and generates predictions for the above three tasks at an event level through localised event-role representations. In the first stage, we use a Transformer encoder to align and contextualise 2D object features in addition to event-level video features. These rich features are essential for grounded situation recognition, and are used to predict both the verb-role pairs and the entities. In the second stage, a Transformer decoder models the role as a query and applies cross-attention to find the best-matching elements among the contextualised object features, also enabling visual grounding. Finally, in stage three, we generate the captions for each role-entity pair. The three-stage network disentangles the three tasks and allows for end-to-end training.
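The sketch below illustrates this three-stage layout in PyTorch. It is a simplified approximation, not the exact VideoWhisperer implementation: the layer counts, hidden sizes, fixed set of learned role queries, and captioning interface are all assumptions.

```python
import torch
import torch.nn as nn

class ThreeStageGVSR(nn.Module):
    """Schematic three-stage pipeline: contextualise features, decode roles, caption."""

    def __init__(self, d_model=512, num_verbs=1000, num_roles=6, vocab_size=10000):
        super().__init__()
        # Stage 1: encoder contextualises 2D object features together with
        # event-level video features via self-attention across the whole video.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        self.verb_head = nn.Linear(d_model, num_verbs)

        # Stage 2: decoder treats roles as queries; its cross-attention over the
        # contextualised object features provides the weakly-supervised grounding.
        self.role_queries = nn.Embedding(num_roles, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.role_decoder = nn.TransformerDecoder(dec_layer, num_layers=3)

        # Stage 3: an autoregressive captioning decoder per role-entity representation.
        cap_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.caption_decoder = nn.TransformerDecoder(cap_layer, num_layers=3)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.token_head = nn.Linear(d_model, vocab_size)

    def forward(self, object_feats, video_feats, caption_tokens):
        # object_feats: (B, num_objects, d), video_feats: (B, num_events, d),
        # caption_tokens: (B, num_roles, T) ground-truth tokens for teacher forcing.
        ctx = self.encoder(torch.cat([object_feats, video_feats], dim=1))
        obj_ctx = ctx[:, :object_feats.size(1)]
        vid_ctx = ctx[:, object_feats.size(1):]
        verb_logits = self.verb_head(vid_ctx)                 # (B, num_events, num_verbs)

        B = ctx.size(0)
        queries = self.role_queries.weight.unsqueeze(0).expand(B, -1, -1)
        role_entity = self.role_decoder(queries, obj_ctx)     # (B, num_roles, d)

        _, R, T = caption_tokens.shape
        tok = self.token_embed(caption_tokens).flatten(0, 1)  # (B*R, T, d)
        mem = role_entity.flatten(0, 1).unsqueeze(1)          # (B*R, 1, d)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=tok.device), 1)
        out = self.caption_decoder(tok, mem, tgt_mask=causal)
        caption_logits = self.token_head(out).view(B, R, T, -1)
        return verb_logits, role_entity, caption_logits
```

In the full model, the role queries are derived from the predicted verb-role pairs and contextualised by the video embeddings, rather than being a fixed learned set as in this sketch.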
Contributions summary. (i) We present a new framework that combines grounding with SRL for end-to-end Grounded Video Situation Recognition (GVSR). We will release the grounding annotations and also include them in the evaluation benchmark. (ii) We design a new three-stage transformer architecture for joint verb prediction, semantic-role labelling through caption generation, and weakly-supervised grounding of visual entities. (iii) We propose role prediction and use role queries contextualised by video embeddings for SRL, circumventing the requirement of ground-truth verbs or roles and enabling end-to-end GVSR. (iv) At the encoder, we combine object features with video features and highlight multiple advantages: enabling weakly-supervised grounding and improving the quality of SRL captions, leading to a 22-point jump in CIDEr score in comparison to a video-only baseline [28]. (v) Finally, we present extensive ablation experiments to analyse our model. Our model achieves state-of-the-art results on the VidSitu benchmark.