Graph2Vid: Flow graph to Video Grounding for
Weakly-supervised Multi-Step Localization
Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez,
Afsaneh Fazly, and Allan D. Jepson
Samsung AI Center
Abstract. In this work, we consider the problem of weakly-supervised
multi-step localization in instructional videos. An established approach
to this problem is to rely on a given list of steps. However, in reality,
there is often more than one way to execute a procedure successfully, by
following the set of steps in slightly varying orders. Thus, for successful
localization in a given video, recent works require the actual order
of procedure steps in the video to be provided by human annotators
at both training and test times. Instead, here, we only rely on generic
procedural text that is not tied to a specific video. We represent the
various ways to complete the procedure by transforming the list of in-
structions into a procedure flow graph which captures the partial order
of steps. Using flow graphs reduces both training- and test-time
annotation requirements. To this end, we introduce the new problem of
flow graph to video grounding. In this setup, we seek the optimal step
ordering consistent with the procedure flow graph and a given video.
To solve this problem, we propose a new algorithm - Graph2Vid - that
infers the actual ordering of steps in the video and simultaneously lo-
calizes them. To show the advantage of our proposed formulation, we
extend the CrossTask dataset with procedure flow graph information.
Our experiments show that Graph2Vid is both more efficient than the
baselines and yields strong step localization results, without the need for
step order annotation.
Keywords: procedures, flow graphs, instructional videos, localization
1 Introduction
Understanding video content from procedural activities has recently seen a surge
in interest with various applications including future anticipation [34,13], proce-
dure planning [5,1], question answering [43] and multi-step localization [45,38,24,23,10].
In this work, we tackle multi-step localization, i.e., inferring the temporal loca-
tion of procedure steps present in the video. Since fully-supervised approaches
[12,22,38] entail expensive labeling efforts, several recent works perform step lo-
calization with weak supervision. The alignment-based approaches [30,4,10] are
of particular interest here as for each video they only require the knowledge of
step order to yield framewise step localization.
arXiv:2210.04996v2 [cs.CV] 31 Oct 2022
[Fig. 1 graphic: two panels, "Linear procedure to video alignment (existing)" vs. "Procedure flow-graph to video grounding (proposed)", over numbered pizza-making steps: Prepare ingredients; Make dough, let it rest; Spread crust; Make pizza sauce; Add sauce + toppings.]
Fig. 1: Graph-to-Sequence Grounding. (top) Instructional videos do not always
strictly follow a prototypical procedure order (e.g., a recipe). (bottom) We therefore
propose a new setup where procedural text is parsed into a flow graph that is subsequently
grounded to the video to temporally localize all steps using our novel algorithm.
However, all such alignment-based approaches share a common issue. They all
assume that a given procedure follows a strict order, which is often not the case.
For example, in the task of making a pizza, one can either start with the steps
for making the dough and then those for making the sauce, or vice versa, before
finally putting the two preparations together. Since the general procedure (e.g.,
recipe) does not define a unique order of steps, the alignment-based approaches
rely on human annotations to provide the exact step order for each video. In
other words, step localization via alignment requires using per-video step order
annotations during inference, which limits the practical value of this setup.
To address this limitation, we propose a new approach for step localization that does not
rely on per-video step order annotation. Instead, it uses the general procedure
description, common to all the videos of the same category (e.g., the recipe of
making pizza independent of the video sequence), to localize procedure steps
present in any video. Fig. 1 illustrates the proposed problem setup. We propose
to represent a procedure using a flow graph [33,18], i.e., a graph-based procedure
representation that encodes the partial order of instruction steps and captures
all the feasible ways to execute a procedure. This leads us to the novel prob-
lem of multi-step localization from instructional videos under the graph-based
setting, which we call flow graph to video grounding. To support the evaluation
of our work we extend the widely used CrossTask dataset [45] with recipes and
corresponding flow graphs. Importantly, in this work, the flow graphs are ob-
tained by parsing procedural text (e.g., a recipe) freely available online using an
off-the-shelf parser, which makes the annotation step automatic and reduces the
amount of human annotation even further.
To achieve our goal of step localization from flow graphs, we introduce a
novel solution for graph-to-sequence grounding - Graph2Vid. Graph2Vid is an
algorithm that, given a video and a procedure flow graph, infers the temporal
location of every instruction step such that the resulting step sequence is
consistent with the procedure flow graph. Our proposed solution grounds each step in
the video by: (i) expanding the original flow graph into a meta-graph that
concisely captures all topological sorts [40] of the original graph, and (ii) applying
a novel graph-to-sequence alignment algorithm to find the best alignment path
in the meta-graph with the given video. Importantly, our alignment algorithm
has the ability to “drop” video frames from the alignment, in case there is not
a good match among the graph nodes, which effectively models the no-action
behavior. Moreover, our Graph2Vid algorithm naturally admits a differentiable
approximation and can be used as a loss function for training video representations
using flow graph supervision. As we show in Section 3, this can further
improve step localization performance with flow graphs.
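As a rough illustration of how such a differentiable approximation is typically obtained (a generic smooth-min sketch in the style of soft alignment methods; the function name and the gamma parameterization are our assumptions, not the paper's exact formulation), the hard min inside the dynamic program can be replaced with a log-sum-exp relaxation:

```python
import numpy as np

def softmin(values, gamma=1.0):
    """Smooth, differentiable relaxation of min, as used in soft-DTW-style
    methods:  softmin_gamma(a) = -gamma * log(sum_i exp(-a_i / gamma)).
    Approaches min(a) as gamma -> 0.
    """
    a = np.asarray(values, dtype=float) / -gamma
    m = a.max()  # stabilize the log-sum-exp numerically
    return -gamma * (m + np.log(np.exp(a - m).sum()))
```

As gamma approaches 0, the relaxation recovers the hard min and hence the discrete algorithm.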
Contributions. In summary, the main contributions of this work are fourfold.
1. We introduce flow graph to video grounding - a new task of multi-step lo-
calization in instructional videos given generic procedure flow graphs.
2. We extend the CrossTask dataset by associating procedure text with each
category and parsing the instructions into a procedure flow graph.
3. We propose a new graph-to-sequence grounding algorithm (i.e., Graph2Vid)
and show that Graph2Vid outperforms baseline approaches in step localiza-
tion performance and efficiency.
4. We show Graph2Vid can be used as a loss function to supervise video rep-
resentations with flow graphs.
The code will be made available at github.com/SamsungLabs/Graph2Vid.
2 Related work
Sequence-to-sequence alignment. Sequence alignment has recently seen grow-
ing interest across various tasks [35,7,4,11,14,3,6,2], in particular, the meth-
ods seeking global alignment between sequences by relying on Dynamic Time
Warping (DTW) [31,7,4,14,6]. Some of these methods propose differentiable
approximations of the discrete operations (i.e., the min operator) to enable
training with DTW [7,14]. Others allow DTW to handle outliers in the
sequences [25,3,32,28,36,26]. Of particular note, the recently proposed Drop-DTW
algorithm [10] combines the benefits of all those methods as it allows dropping
outliers occurring anywhere in the sequence, while still maintaining the abil-
ity of DTW to do one-to-many matching and enabling differentiable variations.
However, like most other sequence alignment algorithms, Drop-DTW matches
sequence elements with each other in a linear order and does not consider possible
element permutations within each sequence. In this work, we propose to extend
Drop-DTW to work with partially ordered sequences. This is achieved by representing
one of the sequences as a directed acyclic graph, thereby relaxing the strict order
requirement.
Graph-to-sequence alignment. Aligning graphs to sequences is an important
topic in computer science. One of the pioneering works in this area proposed a
Dynamic Programming (DP) based solution for pattern matching where the target
text is represented as a graph [27]. Many follow-up works extend this original
idea by enhancing the alignment procedure. Examples include admitting additional
dimensions in the DP tables for each alternative path [20], improving the
efficiency of the alignment algorithm [29,16] or explicitly allowing gaps in the
alignment, thereby achieving sub-sequence to graph matching [17]. A common
limitation among all these methods is the assumption that only one of the paths
in the graph aligns to the query sequence, while alternative paths and their cor-
responding nodes do not appear in the query sequence. Therefore, the goal in
graph-to-sequence alignment is to find the specific path that best aligns with the
query sequence. In contrast, we consider the novel problem of graph-to-sequence
grounding. In particular, we consider the task where all nodes in the graph have a
match in the query sequence and therefore our task is to ground each node in the
sequence, while finding the optimal traversal in the graph that best aligns with
the sequence. This problem is strictly harder than graph-to-sequence alignment
and cannot be readily tackled by existing algorithms.
Video multi-step localization. The task of video multi-step localization has
gained a lot of attention in recent years [23,21,10], particularly thanks to the
availability of instructional video datasets that support this research area [45,38,44].
The task consists of determining the start and end times of all steps present in
the video, based on a description of the procedure depicted in the video. Some
methods rely on full supervision using fine-grained labels indicating start and end
times of each step (e.g., [12,22,38]). However, these methods require extensive
labeling efforts. Instead, other methods propose weakly supervised approaches
where only step order information is needed to yield framewise step localization
[15,8,30,4,45,10]. However, these methods lack flexibility as they require
exact order information to solve the step localization task. Here, we propose
a more flexible approach where only partial order information, as given by a
procedure flow graph, is required to localize each step. In particular, given a
procedure flow graph, describing all possible step permutations that result in
successful procedure execution, our method localizes the steps in a given video,
by automatically grounding the steps of the graph in the video.
3 Our approach
In this section, we describe our approach for flow graph to video grounding.
We start with a motivation and formal definition of our proposed flow graph to
sequence grounding problem. Next, we describe in detail our proposed solution
to tackle the task of video multi-step localization using flow graphs.
3.1 Background
Ordered steps to video alignment. If the true order of steps in a video
(i.e., as they happen in a video) is given, the task of step grounding reduces to a
well-defined problem of steps-to-video alignment, which can be solved with
existing sequence alignment methods. In particular, the recent Drop-DTW [10]
algorithm suits the task particularly well thanks to a unique set of desirable
properties: (i) it operates on sequences of continuous vectors (such as video and
step embeddings), (ii) it permits one-to-many matching, allowing multiple video
the sequence, which in turn allows for ignoring video frames that are unrelated
to the sequence of steps. In Drop-DTW, the alignment is formulated as mini-
mization of the total match cost between the video clips and instruction steps.
It is solved using dynamic programming and can be made differentiable (see
Alg. 1 in [10]). That is, given a video, x, and a sequence of steps, v, Drop-DTW
returns the alignment cost, c, and alignment matrix, M, indicating the corre-
spondences between steps and video segments.
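To make the flavor of this dynamic program concrete, here is a heavily simplified sketch of a Drop-DTW-style alignment cost (illustrative only, not Alg. 1 of [10]; the cost-matrix layout, the fixed drop cost, and all names are our assumptions):

```python
import numpy as np

def drop_dtw_cost(cost, drop_cost):
    """Simplified Drop-DTW-style DP (an illustrative sketch, not Alg. 1 of [10]).

    cost[k, n] -- match cost between step k and clip n (K x N matrix).
    drop_cost  -- fixed cost of dropping (ignoring) a clip.
    Returns the minimal total cost of aligning all K steps, in order, to the
    N clips, where each clip is either matched to one step (a step may cover
    several consecutive clips) or dropped.
    """
    K, N = cost.shape
    D = np.full((K + 1, N + 1), np.inf)  # D[k, n]: best cost of explaining
    D[0, 0] = 0.0                        # the first n clips with k steps
    for n in range(1, N + 1):
        D[0, n] = D[0, n - 1] + drop_cost  # clips before step 1 are dropped
    for k in range(1, K + 1):
        for n in range(1, N + 1):
            D[k, n] = min(
                D[k - 1, n - 1] + cost[k - 1, n - 1],  # start step k at clip n
                D[k, n - 1] + cost[k - 1, n - 1],      # continue step k (one-to-many)
                D[k, n - 1] + drop_cost,               # drop clip n
            )
    return D[K, N]
```

Backtracking through the same table would recover the step-to-segment correspondences; only the cost is computed here for brevity.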
Procedure flow graphs. In more realistic settings, the steps of many
procedures, such as cooking recipes, are given in only a partial
order. Specifically, the partial ordering dictates that certain steps need to be
completed before other steps are started, but that other subsets of steps can
be done in any order. For example, when making a salad, one can
cut the tomatoes and cucumbers in either order; however, both
ingredients must be cut before mixing them into the salad. This is an
example of a procedure with partially ordered steps; i.e., there are multiple valid
ways to complete the procedure, all of which can be conveniently represented
with a flow graph.
A procedure flow graph is a Directed Acyclic Graph (DAG) G = (V, E),
where V is a set of nodes and E is a set of directed edges. Each node v_i ∈ V
represents a procedure step, and an edge e_j ∈ E connecting v_k and v_l declares
that step v_k must be completed before v_l begins in any execution of the
instructions. If a node v_k has multiple ancestors, all the corresponding steps must
be completed before beginning step v_k. In this work, we assume that
G has a single root node and a single sink node; for convenience, we automatically add
them to the graph if they are not already present. From the definition of the flow
graph, it follows that every topological sort [40] of the nodes in G (see Fig. 2,
step 2) is a valid way to complete the procedure. This is an important property
that forms the foundation of our Graph2Vid approach, described next.
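The topological-sort property can be illustrated on the earlier salad example (a brute-force sketch for illustration only; the helper name and toy graph are ours):

```python
from itertools import permutations

def topological_sorts(nodes, edges):
    """Enumerate all topological sorts of a DAG given as (nodes, edges).

    Brute force for illustration: checks every permutation against the
    precedence constraints, so it is exponential in len(nodes).
    """
    sorts = []
    for order in permutations(nodes):
        pos = {v: i for i, v in enumerate(order)}
        if all(pos[u] < pos[v] for u, v in edges):
            sorts.append(order)
    return sorts

# Toy salad flow graph: both ingredients must be cut before mixing.
nodes = ["cut tomatoes", "cut cucumbers", "mix salad"]
edges = [("cut tomatoes", "mix salad"), ("cut cucumbers", "mix salad")]
```

Here the two valid sorts cut the tomatoes and cucumbers in either order, with mixing always last; each sort is one feasible way to execute the procedure.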
Flow graph to video grounding. We define the task of grounding a flow
graph G in a video x = [x_i]_{i=1}^{N}, where N is the total number of frames, as the
task of finding a disjoint set of corresponding video segments, s_l = [x_i]_{i=start_l}^{end_l},
one for each node v_l ∈ G of the flow graph, such that the resulting segmentation
conforms to the flow graph. Specifically, for a pair of resulting video segments
(s_i, s_j), segment s_i can only occur before s_j in the video if the corresponding
node v_i is a predecessor of v_j in the flow graph G. In this work, we assume that
every procedure step v_l appears in the video exactly once.
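A candidate segmentation can thus be validated against the flow graph by checking disjointness and the precedence constraint on every edge (a minimal sketch under our own data layout, not the paper's code: each segment as (start, end) clip indices, end exclusive):

```python
def conforms_to_flow_graph(segments, edges):
    """Check that a video segmentation respects a flow graph's partial order.

    segments -- dict mapping each node to its (start, end) clip indices,
                with end exclusive.
    edges    -- list of (u, v) pairs, meaning step u must finish before
                step v begins.
    """
    # Segments must be pairwise disjoint.
    spans = sorted(segments.values())
    for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
        if s2 < e1:
            return False
    # Every edge's precedence constraint must hold.
    return all(segments[u][1] <= segments[v][0] for u, v in edges)

# Pizza example: dough and sauce in either order, both before assembling.
edges = [("make dough", "assemble"), ("make sauce", "assemble")]
ok = {"make sauce": (0, 10), "make dough": (10, 25), "assemble": (25, 40)}
bad = {"make dough": (0, 10), "assemble": (10, 25), "make sauce": (25, 40)}
```

In the second segmentation the sauce is made after assembling, violating an edge of the graph, so the check fails.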