
See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction
Maria Attarian1,2,3∗, Advaya Gupta1,2∗, Ziyi Zhou1,2, Wei Yu1,2, Igor Gilitschenski1, Animesh Garg1,2
Abstract— Cognitive planning is the structural decomposition
of complex tasks into a sequence of future behaviors. In the
computational setting, performing cognitive planning entails
grounding plans and concepts in one or more modalities in order
to leverage them for low-level control. Since real-world tasks
are often described in natural language, we devise a cognitive
planning algorithm via language-guided video prediction. Current
video prediction models do not support conditioning on natural
language instructions. Therefore, we propose a new video prediction
architecture which leverages the power of pre-trained transformers.
The network is endowed with the ability to ground concepts based
on natural language input and to generalize to unseen objects. We
demonstrate the effectiveness of this approach on a new simulation
dataset, where each task is defined by a high-level action described
in natural language. Our experiments compare our method against
a video generation baseline without planning or action grounding
and showcase significant improvements. Our ablation studies
highlight the improved generalization to unseen objects that natural
language embeddings offer for concept grounding, as well as the
importance of planning for the visual “imagination” of a task.
I. INTRODUCTION
Cognitive planning is one of the core abilities that allows
humans to carry out complex tasks through formulation, evaluation
and selection of a sequence of actions and expected percepts
to achieve a desired goal [1]. The ability to look ahead and to
conditionally predict the expected perceptual input is necessary
for goal-conditioned planning. However, at times the intermediate
steps involved may not directly relate to achieving this goal.
Consider the scenario illustrated in Fig. 1: We have two fruits and
we would like to place one in the box. This may elicit the thought
to “pick up the apple”, “move it over the box”, and “place the apple
inside”. Such a plan may also trigger visual and other sensory
associations when planning the corresponding actions with only
an approximate world-model. Viewed this way, cognitive planning can be
thought of as a combination of two tasks: (i) high-level planning
with abstract actions, and (ii) concept grounding of the planned
sequence of actions. This is subsequently followed by physical
execution of the abstract action sequence with feedback control
using the grounding as reference. We thus approach the problem
of task completion as having three phases: (a) high-level planning,
(b) cognitive grounding, and (c) closed-loop control. In this
work, we address the first two phases of this pipeline and assess
our ability to perform video generation via natural language
instruction and conceptual reasoning about a scene. Applying our
results to low-level robotic control is left as future work.
Cognitive grounding has multiple forms [2]. For our example
this takes the form of grounding concepts in the visual space.
What does “apple” represent? What does “move it over the
box” entail? We are grounding these concepts in vision when
we imagine what they look like.
*equal contribution
1University of Toronto, 2Vector Institute, 3Google
Fig. 1. Intuition behind cognitive planning as performed by humans. A high-level
task is broken down into steps, each performed in sequence. The subgoals of the
various steps are met in the process leading up to the final goal.
Such “imagination” of future
observations of subgoals could be leveraged by various Imitation
Learning approaches like observation-only behavior cloning [3],
LbW-kP [4], and Transporter Nets [5] to perform these tasks. This
motivates us to implement a computational analog of cognitive
planning, which, in our three-phase model, performs the first two
phases: planning and grounding.
Concretely, our work proposes a novel architecture for
performing cognitive planning based on an initial visual
observation and a natural language task description. This involves
producing a high-level plan of actions by generating natural
language commands, which are used for visual grounding through
language-conditioned video prediction. An architectural diagram
of our method is illustrated in Fig. 2.
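The following is a minimal, illustrative sketch of this two-stage pipeline in PyTorch. All module names, dimensions, and the stand-ins for the pre-trained planner and the frozen text encoder are hypothetical placeholders for exposition, not the implementation evaluated in this paper.

import torch
import torch.nn as nn

class LanguageConditionedPredictor(nn.Module):
    """Predicts the next frame from the current frame and a subgoal text embedding."""
    def __init__(self, text_dim=512, hidden=256):
        super().__init__()
        # Encode a 64x64 RGB frame into a compact visual feature.
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 16 * 16, hidden))
        # Fuse the visual feature with the language embedding of the subgoal.
        self.fuse = nn.Linear(hidden + text_dim, hidden)
        # Decode the fused feature back into a predicted 64x64 frame.
        self.frame_dec = nn.Sequential(
            nn.Linear(hidden, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, frame, text_emb):
        fused = torch.relu(self.fuse(torch.cat([self.frame_enc(frame), text_emb], dim=-1)))
        return self.frame_dec(fused)

def plan_subgoals(task):
    # Stand-in for the pre-trained transformer planner: decompose the high-level
    # task description into a sequence of natural-language subgoal commands.
    return [f"subgoal {i}: {task}" for i in range(3)]

def embed_text(command, dim=512):
    # Stand-in for a frozen pre-trained sentence encoder that maps a subgoal
    # command to a fixed-size embedding.
    torch.manual_seed(abs(hash(command)) % (2 ** 31))
    return torch.randn(1, dim)

# Roll out one predicted frame per planned subgoal, starting from the observation.
predictor = LanguageConditionedPredictor()
frame = torch.rand(1, 3, 64, 64)  # initial visual observation
for command in plan_subgoals("place one fruit in the box"):
    frame = predictor(frame, embed_text(command))  # grounded visual "imagination"

In the full system, the planner and the text encoder would be pre-trained transformer models, and the predictor would generate a short clip per subgoal rather than a single frame.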
Summary of Contributions.
1) A novel architecture that combines the planning power
of pre-trained transformer models with visual concept
grounding to tackle computational cognitive planning as
a video prediction problem.
2) A simulation dataset of a robotic arm spelling various
4-letter words on a board.
3) Evaluations that demonstrate the value of such a framework
towards better semantic generalization in video prediction.
II. RELATED WORK
Video prediction and generation.
Video prediction and generation has been extensively studied
in the computer vision community. ConvLSTMs have been used for
learning representations of video sequences on patches of image
pixels as well as high-level representations [6], while they have
also proven effective for video prediction in settings where
objects are controlled