
See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction
Maria Attarian1,2,3∗, Advaya Gupta1,2∗, Ziyi Zhou1,2, Wei Yu1,2, Igor Gilitschenski1, Animesh Garg1,2
Abstract— Cognitive planning is the structural decomposition
of complex tasks into a sequence of future behaviors. In the
computational setting, performing cognitive planning entails
grounding plans and concepts in one or more modalities in order
to leverage them for low-level control. Since real-world tasks
are often described in natural language, we devise a cognitive
planning algorithm via language-guided video prediction. Current
video prediction models do not support conditioning on natural
language instructions. Therefore, we propose a new video prediction
architecture which leverages the power of pre-trained transformers.
The network is endowed with the ability to ground concepts based
on natural language input and to generalize to unseen objects. We
demonstrate the effectiveness of this approach on a new simulation
dataset, where each task is defined by a high-level action described
in natural language. Our experiments compare our method against
a video generation baseline without planning or action grounding
and showcase significant improvements. Our ablation studies
highlight the improved generalization to unseen objects that natural
language embeddings offer for concept grounding, as well as the
importance of planning for the visual “imagination” of a task.
I. INTRODUCTION
Cognitive planning is one of the core abilities that allows
humans to carry out complex tasks through formulation, evaluation
and selection of a sequence of actions and expected percepts
to achieve a desired goal [1]. The ability to look ahead and to
conditionally predict the expected perceptual input is necessary
for goal-conditioned planning. However, at times the intermediate
steps involved may not directly relate to achieving this goal.
Consider the scenario illustrated in Fig. 1: We have two fruits and
we would like to place one in the box. This may elicit the thought
to “pick up the apple”, “move it over the box”, and “place the apple
inside”. Such a plan may also trigger visual and other sensory
associations when planning the corresponding actions with only
an approximate world-model. Viewed this way, cognitive planning can be
thought of as a combination of two tasks: (i) high-level planning
with abstract actions, and (ii) concept grounding of the planned
sequence of actions. This is subsequently followed by physical
execution of the abstract action sequence with feedback control
using the grounding as reference. We thus approach the problem
of task completion as having three phases: (a) high-level planning,
(b) cognitive grounding, and (c) closed-loop control. In this
work, we address the first two phases of this pipeline and assess
our ability to perform video generation via natural language
instruction and conceptual reasoning about a scene. Applying our
results to low-level robotic control is left as future work.
Cognitive grounding has multiple forms [2]. For our example
this takes the form of grounding concepts in the visual space.
What does “apple” represent? What does “move it over the
box” entail? We are grounding these concepts in vision when
we imagine what they look like.
*equal contribution
1University of Toronto, 2Vector Institute, 3Google
Fig. 1. Intuition behind cognitive planning as performed by humans. A high-level
task is broken down into steps, each performed in sequence. The subgoals of the
various steps are met in the process leading up to the final goal.
Such “imagination” of future
observations of subgoals could be leveraged by various Imitation
Learning approaches like observation-only behavior cloning [3],
LbW-kP [4], and Transporter Nets [5] to perform these tasks. This
motivates us to implement a computational analog of cognitive
planning, which, in our three-phase model, performs the first two
phases: planning and grounding.
Concretely, our work proposes a novel architecture for
performing cognitive planning based on an initial visual
observation and a natural language task description. This involves
producing a high-level plan of actions by generating natural
language commands, which are used for visual grounding through
language-conditioned video prediction. An architectural diagram
of our method is illustrated in Fig. 2.
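The following is a minimal, illustrative sketch of this two-stage pipeline in PyTorch. All module names, dimensions, and the stand-ins for the pre-trained planner and the frozen text encoder are hypothetical placeholders for exposition, not the implementation evaluated in this paper.

import torch
import torch.nn as nn

class LanguageConditionedPredictor(nn.Module):
    """Predicts the next frame from the current frame and a subgoal text embedding."""
    def __init__(self, text_dim=512, hidden=256):
        super().__init__()
        # Encode a 64x64 RGB frame into a compact visual feature.
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 16 * 16, hidden))
        # Fuse the visual feature with the language embedding of the subgoal.
        self.fuse = nn.Linear(hidden + text_dim, hidden)
        # Decode the fused feature back into a predicted 64x64 frame.
        self.frame_dec = nn.Sequential(
            nn.Linear(hidden, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, frame, text_emb):
        fused = torch.relu(self.fuse(torch.cat([self.frame_enc(frame), text_emb], dim=-1)))
        return self.frame_dec(fused)

def plan_subgoals(task):
    # Stand-in for the pre-trained transformer planner: decompose the high-level
    # task description into a sequence of natural-language subgoal commands.
    return [f"subgoal {i}: {task}" for i in range(3)]

def embed_text(command, dim=512):
    # Stand-in for a frozen pre-trained sentence encoder that maps a subgoal
    # command to a fixed-size embedding.
    torch.manual_seed(abs(hash(command)) % (2 ** 31))
    return torch.randn(1, dim)

# Roll out one predicted frame per planned subgoal, starting from the observation.
predictor = LanguageConditionedPredictor()
frame = torch.rand(1, 3, 64, 64)  # initial visual observation
for command in plan_subgoals("place one fruit in the box"):
    frame = predictor(frame, embed_text(command))  # grounded visual "imagination"

In the full system, the planner and the text encoder would be pre-trained transformer models, and the predictor would generate a short clip per subgoal rather than a single frame.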
Summary of Contributions.
1) A novel architecture that combines the planning power
of pre-trained transformer models with visual concept
grounding to tackle computational cognitive planning as
a video prediction problem.
2) A simulation dataset of a robotic arm spelling various
4-letter words on a board.
3) Evaluations that demonstrate the value of such a framework
towards better semantic generalization in video prediction.
II. RELATED WORK
Video prediction and generation.
Video prediction and generation has been extensively studied
in the computer vision community. ConvLSTMs have been used for
learning representations of video sequences on patches of image
pixels as well as high-level representations [6], while they have
also proven effective for video prediction in settings where
objects are controlled