See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction
Maria Attarian1,2,3, Advaya Gupta1,2, Ziyi Zhou1,2, Wei Yu1,2, Igor Gilitschenski1, Animesh Garg1,2
Abstract: Cognitive planning is the structural decomposition
of complex tasks into a sequence of future behaviors. In the
computational setting, performing cognitive planning entails
grounding plans and concepts in one or more modalities in order
to leverage them for low level control. Since real-world tasks
are often described in natural language, we devise a cognitive
planning algorithm via language-guided video prediction. Current
video prediction models do not support conditioning on natural
language instructions. Therefore, we propose a new video prediction
architecture which leverages the power of pre-trained transformers.
The network is endowed with the ability to ground concepts based
on natural language input with generalization to unseen objects. We
demonstrate the effectiveness of this approach on a new simulation
dataset, where each task is defined by a high-level action described
in natural language. Our experiments compare our method against
one video generation baseline without planning or action grounding
and showcase significant improvements. Our ablation studies
highlight the improved generalization to unseen objects that natural language embeddings offer for concept grounding, as well as the importance of planning towards visual “imagination” of a task.
I. INTRODUCTION
Cognitive planning is one of the core abilities that allows
humans to carry out complex tasks through formulation, evaluation
and selection of a sequence of actions and expected percepts
to achieve a desired goal [1]. The ability to look ahead and to
conditionally predict the expected perceptual input is necessary
for goal-conditioned planning. However, at times the intermediate
steps involved may not directly relate to achieving this goal.
Consider the scenario illustrated in Fig. 1: We have two fruits and
we would like to place one in the box. This may elicit the thought
to “pick up the apple”, “move it over the box”, and “place the apple inside”. Such a plan may also trigger visual and other sensory associations when planning the corresponding actions with only an approximate world-model. Cognitive planning can thus be
thought of as a combination of two tasks: (i) high-level planning
with abstract actions, and (ii) concept grounding of the planned
sequence of actions. This is subsequently followed by physical
execution of the abstract action sequence with feedback-control
using the grounding as reference. We thus approach the problem
of task completion as having three phases: (a) high-level planning,
(b) cognitive grounding, and (c) closed-loop control. In this
work, we address the first two phases of this pipeline and assess
our ability to perform video generation via natural language
instruction and conceptual reasoning about a scene. Applying our
results towards low level robotic control is left as future work.
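To make this three-phase decomposition concrete, the following is a minimal Python sketch of the intended data flow; the function names and signatures are illustrative assumptions rather than interfaces defined in this paper, and phase (c) is only stubbed out since low-level control is left as future work.

```python
from typing import List

def plan_high_level(task_description: str) -> List[str]:
    """Phase (a): decompose a natural language task into step-by-step instructions,
    e.g. "put a fruit in the box" -> ["pick up the apple",
    "move it over the box", "place the apple inside"]."""
    raise NotImplementedError("produced by a pre-trained language planner")

def ground_in_vision(instructions: List[str], first_frame) -> List:
    """Phase (b): 'imagine' the expected observation after each instruction
    via language-conditioned video prediction."""
    raise NotImplementedError

def execute_with_feedback(imagined_frames: List) -> None:
    """Phase (c): closed-loop control using the grounded subgoals as reference.
    Left as future work; the imagined frames could instead be consumed by
    observation-only imitation learning methods."""
    raise NotImplementedError

def complete_task(task_description: str, first_frame) -> None:
    instructions = plan_high_level(task_description)         # phase (a)
    imagined = ground_in_vision(instructions, first_frame)   # phase (b)
    execute_with_feedback(imagined)                          # phase (c)
```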
Cognitive grounding has multiple forms [2]. For our example
this takes the form of grounding concepts in the visual space.
What does “apple” represent? What does “move it over the
box” entail? We are grounding these concepts in vision when
*equal contribution
1University of Toronto, 2Vector Institute, 3Google
Fig. 1. Intuition behind cognitive planning as performed by humans. A high-level task is broken down into steps, each performed in sequence. The subgoals of the various steps are met in the process leading up to the final goal.
we imagine what they look like. Such “imagination” of future
observations of subgoals could be leveraged by various Imitation
Learning approaches like observation-only behavior cloning [3], LbW-kP [4], and Transporter Nets [5] to perform these tasks. This
motivates us to implement a computational analog of cognitive
planning, which, in our three phase model, performs the first two
phases: planning and grounding.
Concretely, our work proposes a novel architecture for
performing cognitive planning based on an initial visual
observation and a natural language task description. This involves
producing a high-level plan of actions by generating natural
language commands, which are used for visual grounding through
language-conditioned video prediction. An architectural diagram
of our method is illustrated in Fig. 2.
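As a rough sketch of this iterative grounding loop (mirroring the process in Fig. 2), the Python snippet below assumes generic planner, text_encoder, image_encoder, predictor, and decoder modules and a simple concatenation-based fusion; these names, signatures, and the fusion choice are assumptions for exposition, not the exact See-PP components.

```python
from typing import List
import torch

def imagine_subgoals(task: str, first_frame: torch.Tensor,
                     planner, text_encoder, image_encoder,
                     predictor, decoder) -> List[torch.Tensor]:
    """Autoregressively 'imagine' one predicted observation per planned step."""
    instructions = planner(task)           # high-level plan, e.g. ["pick up the apple", ...]
    frame = first_frame
    imagined = []
    for step in instructions:
        lang_emb = text_encoder(step)      # language embedding of the current instruction
        img_emb = image_encoder(frame)     # embedding of the current observation
        fused = torch.cat([img_emb, lang_emb], dim=-1)  # assumed fusion by concatenation
        frame = decoder(predictor(fused))  # predicted next observation
        imagined.append(frame)             # fed back as the next initial frame
    return imagined
```

Each predicted frame is fed back in as the next observation, so the length of the rollout is fixed by the number of planned instructions.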
Summary of Contributions.
1) A novel architecture that combines the planning power of pre-trained transformer models with visual concept grounding to tackle computational cognitive planning as a video prediction problem.
2) A simulation dataset of a robotic arm performing spelling of various 4-letter words on a board.
3) Evaluations that demonstrate the value of such a framework towards better semantic generalization in video prediction.
Fig. 2. Architecture diagram of See-PP. The high-level task description is passed through the planner, which suggests an ordered list of sequential goal instructions. The first of these is converted to a language embedding and combined with the image embedding of the initial observation, generated by the encoder. The predicted observation output by the predictor and decoder is used as the new initial frame, to be combined with the next language instruction. This process repeats until the end of the step-by-step plan.
II. RELATED WORK
Video prediction and generation.
Video prediction and generation have been extensively studied in the computer vision community. ConvLSTMs have been used to learn representations of video sequences on patches of image pixels as well as high-level representations [6], and they have also proven effective for video prediction in settings where objects are controlled or influenced by actions [7]. Hierarchical and non-hierarchical VAEs [8], [9], [10] are another interesting direction of approaches
that have yielded state-of-the-art results in video prediction. [11] conducts pixel transformations from previous frames, explicitly modeling motion, whereas [12] attempts to segregate foreground from background with the use of generative adversarial networks. Stochastic video prediction approaches such as [8], [13], [14] have yielded advancements in the field; however, these approaches neither support language instructions nor provide a way to control predictions based on text directives. Existing stochastic video prediction models instead predict a variety of possible futures with no determinism tied to a task definition. In comparison, approaches such as [15] support tokenized instructions for relatively simpler task settings while also assuming access to ground-truth task segmentation. Finally, other approaches draw inspiration from language models [16], [17], [18]. Similar to some prior work, our approach employs VAEs but operates on visual and language representations combined, attempting to perform feature fusion from the two modalities.
Concept and action grounding.
Concept learning draws inspiration from Cognitive Science and attempts to study how humans conceptualize the world [19]. Many approaches study concept and action grounding by operating directly in the image space or implicitly on some latent space [20], [21]. Even though such methods have shown promising results, we follow the vein of approaches that explore the combination of language and vision representations for concept and action learning, such as [22], [23], [24]. Such approaches received even more attention after the emergence of the CLIP architecture for learning visual concepts from natural language supervision [25]. Our intuition around concept and action grounding is most closely related to [26]. In this work, a shared embedding is generated by concatenating a reduced-dimensionality language embedding and vision embedding derived from large pre-trained state-of-the-art models. This shared embedding is subsequently employed to conceptually represent the scene.
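As a minimal sketch of such a shared embedding, assume frozen pre-trained language and vision encoders whose outputs are linearly projected to a reduced dimensionality and concatenated; the dimensions and module names below are illustrative, not the exact choices of [26] or of our model.

```python
import torch
from torch import nn

class SharedEmbedding(nn.Module):
    """Concatenate reduced-dimensionality language and vision embeddings
    produced by large pre-trained models (assumed frozen)."""

    def __init__(self, lang_dim: int = 512, vis_dim: int = 512, reduced_dim: int = 128):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, reduced_dim)  # reduce the language embedding
        self.vis_proj = nn.Linear(vis_dim, reduced_dim)    # reduce the vision embedding

    def forward(self, lang_emb: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
        z_lang = self.lang_proj(lang_emb)
        z_vis = self.vis_proj(vis_emb)
        # The concatenated vector conceptually represents the scene and can
        # condition a downstream (e.g. VAE-based) video predictor.
        return torch.cat([z_vis, z_lang], dim=-1)
```

Conditioning a VAE-style predictor on this fused vector is the spirit of the feature fusion described in the video prediction paragraph above.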
Task planning.
Task and procedure planning is defined as the problem of predicting a sequence of actions that will accomplish a goal when starting from some initial state. Several recent works attempt to perform planning from pixel observations. InfoGAN [27] seeks to learn causality in the data structure and extracts latent features that can then be used as state representations, whereas PlaNet [28] performs action planning in latent space via a model-based agent that learns its environment dynamics. UPN [29] directly predicts an action plan through gradient-descent trajectory optimization in the latent space. Another stream of approaches, such as [30], studies planning in real-world instructional videos. Finally, several methods combine vision and language towards grounded planning [31], [32], [33]. In contrast, our work segments instructional and visual planning into two modularized sub-problems, similar to [34]. However, our approach differs from [34] in that our generation model learns to “imagine” its future states rather than choose them from a defined set of possible states, while for instruction planning we focus on purely language-based planning, as in [35].