Graph2Vid: Flow graph to Video Grounding for
Weakly-supervised Multi-Step Localization
Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez,
Afsaneh Fazly, and Allan D. Jepson
Samsung AI Center
Abstract. In this work, we consider the problem of weakly-supervised
multi-step localization in instructional videos. An established approach
to this problem is to rely on a given list of steps. However, in reality,
there is often more than one way to execute a procedure successfully, by
following the set of steps in slightly varying orders. Thus, for successful
localization in a given video, recent works require the actual order
of procedure steps in the video to be provided by human annotators
at both training and test times. Instead, here, we only rely on generic
procedural text that is not tied to a specific video. We represent the
various ways to complete the procedure by transforming the list of in-
structions into a procedure flow graph which captures the partial order
of steps. Using flow graphs reduces both training- and test-time
annotation requirements. To this end, we introduce the new problem of
flow graph to video grounding. In this setup, we seek the optimal step
ordering consistent with the procedure flow graph and a given video.
To solve this problem, we propose a new algorithm - Graph2Vid - that
infers the actual ordering of steps in the video and simultaneously lo-
calizes them. To show the advantage of our proposed formulation, we
extend the CrossTask dataset with procedure flow graph information.
Our experiments show that Graph2Vid is both more efficient than the
baselines and yields strong step localization results, without the need for
step order annotation.
Keywords: procedures, flow graphs, instructional videos, localization
1 Introduction
Understanding video content from procedural activities has recently seen a surge
in interest with various applications including future anticipation [34,13], proce-
dure planning [5,1], question answering [43] and multi-step localization [45,38,24,23,10].
In this work, we tackle multi-step localization, i.e., inferring the temporal loca-
tion of procedure steps present in the video. Since fully-supervised approaches
[12,22,38] entail expensive labeling efforts, several recent works perform step lo-
calization with weak supervision. The alignment-based approaches [30,4,10] are
of particular interest here as for each video they only require the knowledge of
step order to yield framewise step localization.
arXiv:2210.04996v2 [cs.CV] 31 Oct 2022
[Fig. 1 graphic: two panels, "Linear procedure to video alignment (existing)" vs. "Procedure flow-graph to video grounding (proposed)", over numbered pizza-making steps: Prepare ingredients; Make dough, let it rest; Spread crust; Make pizza sauce; Add sauce + toppings.]
Fig. 1: Graph-to-Sequence Grounding. (top) Instructional videos do not always
strictly follow a prototypical procedure order (e.g., a recipe). (bottom) We therefore
propose a new setup where procedural text is parsed into a flow graph that is subsequently
grounded to the video to temporally localize all steps using our novel algorithm.
However, all such alignment-based approaches share a common issue. They all
assume that a given procedure follows a strict order, which is often not the case.
For example, in the task of making a pizza, one can either start with the steps
for making the dough and then those for making the sauce, or vice versa, before
finally putting the two preparations together. Since the general procedure (e.g.,
recipe) does not define a unique order of steps, the alignment-based approaches
rely on human annotations to provide the exact step order for each video. In
other words, step localization via alignment requires using per-video step order
annotations during inference, which limits the practical value of this setup.
To address this limitation, we propose a new approach for step localization that does not
rely on per-video step order annotation. Instead, it uses the general procedure
description, common to all the videos of the same category (e.g., the recipe of
making pizza independent of the video sequence), to localize procedure steps
present in any video. Fig. 1 illustrates the proposed problem setup. We propose
to represent a procedure using a flow graph [33,18], i.e., a graph-based procedure
representation that encodes the partial order of instruction steps and captures
all the feasible ways to execute a procedure. This leads us to the novel prob-
lem of multi-step localization from instructional videos under the graph-based
setting, which we call flow graph to video grounding. To support the evaluation
of our work we extend the widely used CrossTask dataset [45] with recipes and
corresponding flow graphs. Importantly, in this work, the flow graphs are ob-
tained by parsing procedural text (e.g., a recipe) freely available online using an
off-the-shelf parser, which makes the annotation step automatic and reduces the
amount of human annotation even further.
To achieve our goal of step localization from flow graphs, we introduce a
novel solution for graph-to-sequence grounding - Graph2Vid. Graph2Vid is an
algorithm that, given a video and a procedure flow graph, infers the temporal
location of every instruction step such that the resulting step sequence is
consistent with the procedure flow graph. Our proposed solution grounds each step in
the video by: (i) expanding the original flow graph into a meta-graph that
concisely captures all topological sorts [40] of the original graph, and (ii) applying
a novel graph-to-sequence alignment algorithm to find the best alignment path
in the meta-graph with the given video. Importantly, our alignment algorithm
has the ability to “drop” video frames from the alignment, in case there is not
a good match among the graph nodes, which effectively models the no-action
behavior. Moreover, our Graph2Vid algorithm naturally admits a differentiable
approximation and can be used as a loss function for training video representations
using flow graph supervision. As we show in Section 3, this can further
improve step localization performance with flow graphs.
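As a rough illustration of how such a differentiable approximation is typically obtained (a generic smooth-min sketch in the style of soft alignment methods; the function name and the gamma parameterization are our assumptions, not the paper's exact formulation), the hard min inside the dynamic program can be replaced with a log-sum-exp relaxation:

```python
import numpy as np

def softmin(values, gamma=1.0):
    """Smooth, differentiable relaxation of min, as used in soft-DTW-style
    methods:  softmin_gamma(a) = -gamma * log(sum_i exp(-a_i / gamma)).
    Approaches min(a) as gamma -> 0.
    """
    a = np.asarray(values, dtype=float) / -gamma
    m = a.max()  # stabilize the log-sum-exp numerically
    return -gamma * (m + np.log(np.exp(a - m).sum()))
```

As gamma approaches 0, the relaxation recovers the hard min and hence the discrete algorithm.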
Contributions. In summary, the main contributions of this work are fourfold.
1. We introduce flow graph to video grounding - a new task of multi-step lo-
calization in instructional videos given generic procedure flow graphs.
2. We extend the CrossTask dataset by associating procedure text with each
category and parsing the instructions into a procedure flow graph.
3. We propose a new graph-to-sequence grounding algorithm (i.e., Graph2Vid)
and show that Graph2Vid outperforms baseline approaches in step localiza-
tion performance and efficiency.
4. We show Graph2Vid can be used as a loss function to supervise video rep-
resentations with flow graphs.
The code will be made available at github.com/SamsungLabs/Graph2Vid.
2 Related work
Sequence-to-sequence alignment. Sequence alignment has recently seen grow-
ing interest across various tasks [35,7,4,11,14,3,6,2], in particular, the meth-
ods seeking global alignment between sequences by relying on Dynamic Time
Warping (DTW) [31,7,4,14,6]. Some of these methods propose differentiable
approximations of the discrete operations (i.e., the min operator) to enable
training with DTW [7,14]. Others allow DTW to handle outliers in the
sequences [25,3,32,28,36,26]. Of particular note, the recently proposed Drop-DTW
algorithm [10] combines the benefits of all those methods as it allows dropping
outliers occurring anywhere in the sequence, while still maintaining the abil-
ity of DTW to do one-to-many matching and enabling differentiable variations.
However, like most other sequence alignment algorithms, Drop-DTW matches
sequence elements with each other in a linear order and does not consider possible
element permutations within each sequence. In this work, we propose to extend
Drop-DTW to work with partially ordered sequences. This is achieved by representing
one of the sequences as a directed acyclic graph, thereby relaxing the strict order
requirement.
Graph-to-sequence alignment. Aligning graphs to sequences is an important
topic in computer science. One of the pioneering works in this area proposed a
Dynamic Programming (DP) based solution for pattern matching where the target
text is represented as a graph [27]. Many follow-up works extend this original
idea by enhancing the alignment procedure. Examples include admitting additional
dimensions in the DP tables for each alternative path [20], improving the
efficiency of the alignment algorithm [29,16] or explicitly allowing gaps in the
alignment, thereby achieving sub-sequence to graph matching [17]. A common
limitation among all these methods is the assumption that only one of the paths
in the graph aligns to the query sequence, while alternative paths and their cor-
responding nodes do not appear in the query sequence. Therefore, the goal in
graph-to-sequence alignment is to find the specific path that best aligns with the
query sequence. In contrast, we consider the novel problem of graph-to-sequence
grounding. In particular, we consider the task where all nodes in the graph have a
match in the query sequence and therefore our task is to ground each node in the
sequence, while finding the optimal traversal in the graph that best aligns with
the sequence. This problem is strictly harder than graph-to-sequence alignment
and cannot be readily tackled by existing algorithms.
Video multi-step localization. The task of video multi-step localization has
gained a lot of attention in recent years [23,21,10], particularly thanks to the
availability of instructional video datasets that support this research area [45,38,44].
The task consists of determining the start and end times of all steps present in
the video, based on a description of the procedure depicted in the video. Some
methods rely on full supervision using fine-grained labels indicating start and end
times of each step (e.g., [12,22,38]). However, these methods require extensive
labeling efforts. Instead, other methods propose weakly supervised approaches
where only step order information is needed to yield framewise step localization
[15,8,30,4,45,10]. However, these methods lack flexibility as they require
exact order information to solve the step localization task. Here, we propose
a more flexible approach where only partial order information, as given by a
procedure flow graph, is required to localize each step. In particular, given a
procedure flow graph, describing all possible step permutations that result in
successful procedure execution, our method localizes the steps in a given video,
by automatically grounding the steps of the graph in the video.
3 Our approach
In this section, we describe our approach for flow graph to video grounding.
We start with a motivation and formal definition of our proposed flow graph to
sequence grounding problem. Next, we describe in detail our proposed solution
to tackle the task of video multi-step localization using flow graphs.
3.1 Background
Ordered steps to video alignment. If the true order of steps in a video
(i.e., as they happen in a video) is given, the task of step grounding reduces to a
well-defined problem of steps-to-video alignment, which can be solved with
existing sequence alignment methods. In particular, the recent Drop-DTW [10]
algorithm suits the task particularly well thanks to a unique set of desirable
properties: (i) it operates on sequences of continuous vectors (such as video and
step embeddings), (ii) it permits one-to-many matching, allowing multiple video
the sequence, which in turn allows for ignoring video frames that are unrelated
to the sequence of steps. In Drop-DTW, the alignment is formulated as mini-
mization of the total match cost between the video clips and instruction steps.
It is solved using dynamic programming and can be made differentiable (see
Alg. 1 in [10]). That is, given a video, x, and a sequence of steps, v, Drop-DTW
returns the alignment cost, c, and alignment matrix, M, indicating the corre-
spondences between steps and video segments.
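To make the flavor of this dynamic program concrete, here is a heavily simplified sketch of a Drop-DTW-style alignment cost (illustrative only, not Alg. 1 of [10]; the cost-matrix layout, the fixed drop cost, and all names are our assumptions):

```python
import numpy as np

def drop_dtw_cost(cost, drop_cost):
    """Simplified Drop-DTW-style DP (an illustrative sketch, not Alg. 1 of [10]).

    cost[k, n] -- match cost between step k and clip n (K x N matrix).
    drop_cost  -- fixed cost of dropping (ignoring) a clip.
    Returns the minimal total cost of aligning all K steps, in order, to the
    N clips, where each clip is either matched to one step (a step may cover
    several consecutive clips) or dropped.
    """
    K, N = cost.shape
    D = np.full((K + 1, N + 1), np.inf)  # D[k, n]: best cost of explaining
    D[0, 0] = 0.0                        # the first n clips with k steps
    for n in range(1, N + 1):
        D[0, n] = D[0, n - 1] + drop_cost  # clips before step 1 are dropped
    for k in range(1, K + 1):
        for n in range(1, N + 1):
            D[k, n] = min(
                D[k - 1, n - 1] + cost[k - 1, n - 1],  # start step k at clip n
                D[k, n - 1] + cost[k - 1, n - 1],      # continue step k (one-to-many)
                D[k, n - 1] + drop_cost,               # drop clip n
            )
    return D[K, N]
```

Backtracking through the same table would recover the step-to-segment correspondences; only the cost is computed here for brevity.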
Procedure flow graphs. In more realistic settings, the steps of many
procedures, such as cooking recipes, are given in only a partial
order. Specifically, the partial ordering dictates that certain steps need to be
completed before other steps are started, but that other subsets of steps can
be done in any order. For example, when making a salad, one can
cut the tomatoes and cucumbers in either order; however, both
ingredients must be cut before mixing them into the salad. This is an
example of a procedure with partially ordered steps; i.e., there are multiple valid
ways to complete the procedure, all of which can be conveniently represented
with a flow graph.
A procedure flow graph is a Directed Acyclic Graph (DAG) G = (V, E),
where V is a set of nodes and E is a set of directed edges. Each node v_i ∈ V
represents a procedure step, and an edge e_j ∈ E connecting v_k and v_l declares
that step v_k must be completed before v_l begins in any execution of the
instructions. If a node v_k has multiple ancestors, all the corresponding steps must
be completed before beginning step v_k. In this work, we assume that
G has a single root node and a single sink node; for convenience, we automatically add
them to the graph if they are not already present. From the definition of the flow
graph, it follows that every topological sort [40] of the nodes in G (see Fig. 2,
step 2) is a valid way to complete the procedure. This is an important property
that forms the foundation of our Graph2Vid approach, described next.
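The topological-sort property can be illustrated on the earlier salad example (a brute-force sketch for illustration only; the helper name and toy graph are ours):

```python
from itertools import permutations

def topological_sorts(nodes, edges):
    """Enumerate all topological sorts of a DAG given as (nodes, edges).

    Brute force for illustration: checks every permutation against the
    precedence constraints, so it is exponential in len(nodes).
    """
    sorts = []
    for order in permutations(nodes):
        pos = {v: i for i, v in enumerate(order)}
        if all(pos[u] < pos[v] for u, v in edges):
            sorts.append(order)
    return sorts

# Toy salad flow graph: both ingredients must be cut before mixing.
nodes = ["cut tomatoes", "cut cucumbers", "mix salad"]
edges = [("cut tomatoes", "mix salad"), ("cut cucumbers", "mix salad")]
```

Here the two valid sorts cut the tomatoes and cucumbers in either order, with mixing always last; each sort is one feasible way to execute the procedure.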
Flow graph to video grounding. We define the task of grounding a flow
graph G in a video x = [x_i]_{i=1}^{N}, where N is the total number of frames, as the
task of finding a disjoint set of corresponding video segments, s_l = [x_i]_{i=start_l}^{end_l},
one for each node v_l ∈ G of the flow graph, such that the resulting segmentation
conforms to the flow graph. Specifically, for a pair of resulting video segments
(s_i, s_j), segment s_i can only occur before s_j in the video if the corresponding
node v_i is a predecessor of v_j in the flow graph G. In this work, we assume that
every procedure step v_l appears in the video exactly once.
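A candidate segmentation can thus be validated against the flow graph by checking disjointness and the precedence constraint on every edge (a minimal sketch under our own data layout, not the paper's code: each segment as (start, end) clip indices, end exclusive):

```python
def conforms_to_flow_graph(segments, edges):
    """Check that a video segmentation respects a flow graph's partial order.

    segments -- dict mapping each node to its (start, end) clip indices,
                with end exclusive.
    edges    -- list of (u, v) pairs, meaning step u must finish before
                step v begins.
    """
    # Segments must be pairwise disjoint.
    spans = sorted(segments.values())
    for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
        if s2 < e1:
            return False
    # Every edge's precedence constraint must hold.
    return all(segments[u][1] <= segments[v][0] for u, v in edges)

# Pizza example: dough and sauce in either order, both before assembling.
edges = [("make dough", "assemble"), ("make sauce", "assemble")]
ok = {"make sauce": (0, 10), "make dough": (10, 25), "assemble": (25, 40)}
bad = {"make dough": (0, 10), "assemble": (10, 25), "make sauce": (25, 40)}
```

In the second segmentation the sauce is made after assembling, violating an edge of the graph, so the check fails.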