
man Activities, which to the best of our knowledge is the first dataset studying the above challenges. Specifically, we extend the Charades dataset (Sigurdsson et al., 2016) with intents via
crowd-sourcing. As it is practically infeasible to
find all possible future action sequences given an
intent and a video clip of initial activities, we pro-
pose to evaluate all systems by letting each of
them answer multi-choice comprehension ques-
tions (MQA) without training them on those ques-
tions. Given an intent and a video clip showing
initial activities, each multi-choice question pro-
vides a fixed number of future action sequences
as possible answers. A system is then asked to
select the most plausible action sequence among
them. We show that model rankings obtained with the MQAs correlate strongly with rankings obtained by asking human assessors to directly judge the predicted action sequences. For training, we
provide both a dataset for end-to-end training of
sequence forecasting and a multimodal knowledge
base (MKB) built from that dataset, which is also
the first video-based multimodal knowledge base
for human activities to the best of our knowledge.
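To make the MQA evaluation above concrete, the following Python sketch shows how a single question is answered and how accuracy is computed. The score function (a model's plausibility estimate for a candidate action sequence given the intent and the video clip) and the question fields are hypothetical names used only for illustration, not the interface of our actual systems.

    from typing import Callable, List, Sequence

    # Hypothetical interface: score(intent, clip, sequence) returns a
    # plausibility score for one candidate future action sequence.
    ScoreFn = Callable[[str, object, List[str]], float]

    def answer_mqa(intent: str, clip: object,
                   candidates: Sequence[List[str]],
                   score: ScoreFn) -> int:
        """Pick the index of the most plausible candidate action sequence."""
        return max(range(len(candidates)),
                   key=lambda i: score(intent, clip, candidates[i]))

    def mqa_accuracy(questions, score: ScoreFn) -> float:
        """Fraction of multi-choice questions answered correctly. Each question
        is assumed to carry an intent, a video clip, candidate action sequences,
        and the index of the plausible (gold) sequence."""
        correct = sum(
            answer_mqa(q.intent, q.clip, q.candidates, score) == q.gold_index
            for q in questions
        )
        return correct / len(questions)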
We conduct the first empirical study to inves-
tigate compositional generalization for the target
task. As baselines, we adapt three strong end-to-
end deep generative models for this task and pro-
pose a neurosymbolic planning baseline using the
MKB. The model is neurosymbolic because it com-
bines both deep neural networks and symbolic rea-
soning (Garcez and Lamb, 2020). Given a video of initial activities and an intent, the deep models generate the top-k relevant action sequences, while the neurosymbolic planning model sends the intent and the action sequence recognized from the video as a query to the MKB and retrieves the top-k relevant action sequences. Each model then selects the most plausible answer by performing probabilistic reasoning over these relevant action sequences.
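As a rough sketch of this selection step (not our exact formulation; the overlap-based match probability below is only a placeholder assumption), each candidate answer can be scored by its support under a uniform mixture over the top-k relevant action sequences:

    import math
    from typing import List, Sequence

    def match_logprob(candidate: List[str], retrieved: List[str]) -> float:
        """Toy, smoothed log-probability that a candidate sequence agrees with
        a retrieved one (Jaccard-style overlap standing in for a learned model)."""
        overlap = len(set(candidate) & set(retrieved))
        union = len(set(candidate) | set(retrieved))
        return math.log((overlap + 1) / (union + 1))

    def select_answer(candidates: Sequence[List[str]],
                      relevant_topk: Sequence[List[str]]) -> int:
        """Return the index of the candidate with the highest marginal support,
        treating the top-k relevant sequences as a uniform mixture."""
        def support(candidate: List[str]) -> float:
            logps = [match_logprob(candidate, r) for r in relevant_topk]
            m = max(logps)  # log-sum-exp for numerical stability
            return (m + math.log(sum(math.exp(lp - m) for lp in logps))
                    - math.log(len(relevant_topk)))
        return max(range(len(candidates)), key=lambda i: support(candidates[i]))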
We conduct extensive experiments and obtain the following key experimental results:
• We compare the evaluation results using MQA with those of human evaluation. The results of both methods are well aligned; thus, MQA is reliable without requiring human effort.
• The likelihood functions of the deep generative models are not able to reliably infer which answers are plausible. In contrast, probabilistic reasoning is an effective method to improve compositional generalization.
• Although information from both modalities is useful and complementary, all baselines rely heavily on intents in textual form and fail to effectively exploit visual information from the video clips.
2 Related Work
Vision-Language Planning Task
Vision Lan-
guage Navigation (VLN) was among the first
widely used goal-oriented vision-language tasks,
requiring AI agents to navigate in an environment
without interaction by reasoning on the given in-
struction (Anderson et al., 2018; Hermann et al., 2020; Misra et al., 2018; Jain et al., 2019). Re-
cently, further goal-oriented vision-language tasks
have been proposed. The Vision and Dialogue
History Navigation (VDHN) task (De Vries et al., 2018; Nguyen and Daumé III, 2019; Thomason et al., 2020), which is similar to VLN, requires agents to reason over the instructions across multiple time steps. Other tasks such as Embodied Question Answering (EQA; Das et al. 2018; Wijmans et al. 2019), Embodied Object Referral (EOR; Qi et al. 2020b; Chen et al. 2019), and Embodied Goal-directed Manipulation (EGM; Shridhar et al. 2020; Kim et al. 2020; Suhr et al. 2019) require reasoning over and interpreting the instruction together with observations or object interactions in the environment. However,
we argue that there are other ways to learn to plan
without practising. Our task is one example of
this, requiring agents to reason over the observa-
tion without performing actions.
Vision-Language Planning Datasets
As existing vision-language planning datasets emphasize teaching embodied AI to perform tasks as humans do, they are constructed with interactive AI in mind. VLN datasets (Anderson et al., 2018) initially explored planning tasks, using textual instructions as step-by-step abstract guides with minimal interaction with the environment. Extending the VLN task, VDHN datasets (De Vries et al., 2018) provide an interactive textual dialogue between a speaker and a receiver over multiple steps. The EQA task (Das et al., 2018) goes a step further by providing data in an object-centric QA manner, pushing systems to understand the given environment through object retrieval. The EOR task (Qi et al., 2020b) provides object-centric datasets with detailed instructions, aiming to localize the relevant objects accurately. The closest benchmark to ours is ALFRED (Shridhar et al., 2021) from