ViLPAct: A Benchmark for Compositional Generalization
on Multimodal Human Activities
Terry Yue Zhuo1and Yaqing Liao2and Yuecheng Lei2
Lizhen Qu1*and Gerard de Melo3
Xiaojun Chang4and Yazhou Ren2and Zenglin Xu5,6*
1Monash University 2University of Electronic Science and Technology of China
3HPI/University of Potsdam 4University of Technology Sydney
5Harbin Institute of Technology, Shenzhen 6Peng Cheng Lab
*Corresponding authors: lizhen.qu@monash.edu, xuzenglin@hit.edu.cn
Abstract
We introduce ViLPAct, a novel vision-
language benchmark for human activity plan-
ning. It is designed for a task where em-
bodied AI agents can reason and forecast fu-
ture actions of humans based on video clips
about their initial activities and intents in
text. The dataset consists of 2.9k videos from
Charades extended with intents via crowd-
sourcing, a multi-choice question test set, and
four strong baselines. One of the baselines im-
plements a neurosymbolic approach based on
a multi-modal knowledge base (MKB), while
the other ones are deep generative models
adapted from recent state-of-the-art (SOTA)
methods. According to our extensive exper-
iments, the key challenges are compositional
generalization and effective use of information
from both modalities. Our benchmark is available at https://github.com/terryyz/ViLPAct.
1 Introduction
"He wants to keep his food fresh." Intent The old man is
now standing
in the kitchen.
Holding
some
food
Putting
some food
somewhere
Holding
some
clothes
Opening
a
refrigerator
Holding a
knife
Cook
some
food
Opening
a
refrigerator
Putting
some food
somewhere
Holding
some
food
What
should
come
next?
Observation
Figure 1: In daily life scenarios, an agent should be
aware of future actions that will likely be taken by
the user based on what it has observed. In this ex-
ample, inputs of intent and observation are colored in
green, while potential future action sequences are high-
lighted in orange. The first two sequences contain ac-
tions which do not align with the human intent. Thus,
the agent needs to automatically detect which future ac-
tions are plausible by understanding the user’s intent.
One of the ultimate goals of Artificial Intelligence is to build intelligent agents capable of accurately understanding humans’ actions and intents, so that they can better serve us (Kong and Fu, 2018; Zhuo et al., 2023). Newly emerging applications in
robotics and multi-modal planning, such as Ama-
zon Astro, have demonstrated a strong need to
understand human behavior in multimodal envi-
ronments. On the one hand, such an agent, e.g.
an elderly care service bot, needs to understand
human activities and anticipate human behaviors
based on users’ intents. Here the intents may be
estimated based on previous activities or articu-
lated verbally by users. The anticipated behaviors
may be used for risk assessment (e.g. falling of
elderly people) and to facilitate collaboration with
humans. On the other hand, recent advances in
robotics show that it is possible to let robots learn
new tasks directly from observed human behav-
ior without robot demonstrations (Yu et al.,2018;
Sharma et al.,2019). However, that line of work fo-
cuses on imitating observed human actions without
anticipating future activities.
To promote research on action forecasting based
on intents, we propose the vision-language plan-
ning task for human behaviors. As shown in Fig. 1,
given an intent in textual form and a short video
clip, an agent anticipates which actions a human is
likely to take. We consider intents as given because
there is already ample research on intent identifi-
cation (Pandey and Aghav,2020) and automatic
speech recognition (Malik et al.,2021). To the best
of our knowledge, there is no dataset to evaluate
models for this task.
The task poses two major challenges. First, there
are often multiple plausible action sequences satis-
fying an intent. Second, it is highly unlikely that a
training dataset can cover all possible combinations
of actions for a given intent. Hence, models need
to acquire compositional generalization (Fodor and
Pylyshyn,1988), the capability to generalize to un-
seen action sequences composed of known actions.
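To make this requirement concrete, the sketch below shows one way to build a compositional split: every individual action appears in training, but some test sequences contain combinations of actions that never co-occur in any training sequence. This is a minimal illustration under our own assumptions (the greedy heuristic and the name make_compositional_split are ours, not part of the released benchmark).

from itertools import combinations

def make_compositional_split(sequences, test_fraction=0.2):
    """Greedy heuristic for a compositional split (illustration only).

    A sequence goes to the test set if it contains at least one pair of
    actions that never co-occurs in any training sequence collected so
    far, while every individual action in it is already observed in
    training.  A production split would re-check novelty against the
    final training set.
    """
    n_test = int(len(sequences) * test_fraction)
    # Longer sequences first, so training covers more individual actions.
    ordered = sorted(sequences, key=len, reverse=True)
    train, test = [], []
    seen_pairs = set()      # action pairs co-occurring in some training sequence
    seen_actions = set()    # individual actions observed in training
    for seq in ordered:
        pairs = set(combinations(sorted(set(seq)), 2))
        novel_combination = not pairs <= seen_pairs
        known_actions = set(seq) <= seen_actions
        if len(test) < n_test and novel_combination and known_actions:
            test.append(seq)  # unseen combination of known actions
        else:
            train.append(seq)
            seen_pairs |= pairs
            seen_actions |= set(seq)
    return train, test

if __name__ == "__main__":
    toy = [
        ("open_fridge", "take_food", "eat_food", "close_fridge"),
        ("take_food", "wash_dish", "eat_food"),
        ("open_fridge", "take_food", "close_fridge"),
        ("wash_dish", "close_fridge"),
        ("open_fridge", "eat_food"),
    ]
    train, test = make_compositional_split(toy, test_fraction=0.2)
    print("train:", train)
    print("test:", test)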
In this work, we construct a dataset called ViLPAct for Vision-Language Planning of human Activities, which to the best of our knowledge is the first dataset studying the above challenges. Specifically, we extend the Charades
dataset (Sigurdsson et al.,2016) with intents via
crowd-sourcing. As it is practically infeasible to
find all possible future action sequences given an
intent and a video clip of initial activities, we pro-
pose to evaluate all systems by letting each of
them answer multi-choice comprehension ques-
tions (MQA) without training them on those ques-
tions. Given an intent and a video clip showing
initial activities, each multi-choice question pro-
vides a fixed number of future action sequences
as possible answers. A system is then asked to
select the most plausible action sequence among
them. We show that the rankings of all models
using the MQAs correlate strongly with those ob-
tained by asking human assessors to directly ob-
serve estimated action sequences. For training, we
provide both a dataset for end-to-end training of
sequence forecasting and a multimodal knowledge
base (MKB) built from that dataset, which is also
the first video-based multimodal knowledge base
for human activities to the best of our knowledge.
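To make the MQA protocol concrete, the following minimal sketch evaluates any system that exposes a plausibility score over candidate action sequences; the data layout (fields such as "intent", "choices", "answer_idx") and the scoring interface are our own illustrative assumptions, not the benchmark's actual API.

from typing import Callable, Dict, List, Sequence

# A "question" pairs an intent and an initial video clip with a fixed
# number of candidate future action sequences, exactly one of which is
# marked as the ground-truth continuation.
Question = Dict[str, object]

def mqa_accuracy(
    questions: List[Question],
    score_fn: Callable[[str, str, Sequence[str]], float],
) -> float:
    """Evaluate a system on multi-choice questions.

    score_fn(intent, video_path, action_sequence) returns a plausibility
    score; the system's answer is its highest-scoring choice.  No training
    on the questions themselves is involved.
    """
    correct = 0
    for q in questions:
        scores = [score_fn(q["intent"], q["video"], choice)
                  for choice in q["choices"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == q["answer_idx"])
    return correct / len(questions)

if __name__ == "__main__":
    # Toy example with a trivially intent-matching scorer.
    toy_questions = [{
        "intent": "He wants to keep his food fresh.",
        "video": "videos/Z6LYG.mp4",
        "choices": [
            ["holding clothes", "opening a refrigerator"],
            ["opening a refrigerator", "putting food somewhere"],
        ],
        "answer_idx": 1,
    }]
    dummy = lambda intent, video, seq: float("food" in " ".join(seq))
    print("accuracy:", mqa_accuracy(toy_questions, dummy))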
We conduct the first empirical study to inves-
tigate compositional generalization for the target
task. As baselines, we adapt three strong end-to-
end deep generative models for this task and pro-
pose a neurosymbolic planning baseline using the
MKB. The model is neurosymbolic because it com-
bines both deep neural networks and symbolic rea-
soning (Garcez and Lamb,2020). Given a video
of initial activities and an intent, the deep models
generate the top-k relevant action sequences, while the neurosymbolic planning model sends the intent and the action sequence recognized from the video as the query to the MKB, followed by retrieving the top-k relevant action sequences. Each model
selects the most plausible answers by performing
probabilistic reasoning over the relevant action se-
quences. We conduct extensive experiments and
obtain the following key experimental results:
• We compare the evaluation results using MQA with the ones of human evaluation. The results of both methods are well aligned. Thus, MQA is reliable without requiring human effort.
• The likelihood functions of the deep generative models are not able to reliably infer which answers are plausible. In contrast, probabilistic reasoning is an effective method to improve compositional generalization.
• Despite information from both modalities being useful and complementary, all baselines heavily rely on intents in textual form but fail to effectively exploit visual information from video clips.
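Returning to the baselines sketched above, the following toy example illustrates the retrieve-then-reason idea: the top-k action sequences (from a generative model or an MKB query) induce a simple smoothed transition distribution, and each multi-choice answer is scored under it. This is our own simplification for illustration, not the formulation used by the released baselines.

from collections import Counter
from typing import Dict, List, Sequence, Tuple

def transition_model(topk: List[Tuple[Sequence[str], float]],
                     alpha: float = 0.1) -> Dict[Tuple[str, str], float]:
    """Estimate smoothed action-transition probabilities from the top-k
    retrieved/generated sequences, each paired with a relevance weight."""
    counts: Counter = Counter()
    context: Counter = Counter()
    vocab = set()
    for seq, weight in topk:
        vocab.update(seq)
        for prev, nxt in zip(seq, seq[1:]):
            counts[(prev, nxt)] += weight
            context[prev] += weight
    v = max(len(vocab), 1)
    return {
        (p, n): (counts[(p, n)] + alpha) / (context[p] + alpha * v)
        for p in vocab for n in vocab
    }

def sequence_score(candidate: Sequence[str],
                   probs: Dict[Tuple[str, str], float],
                   floor: float = 1e-6) -> float:
    """Product of transition probabilities along a candidate answer."""
    score = 1.0
    for prev, nxt in zip(candidate, candidate[1:]):
        score *= probs.get((prev, nxt), floor)
    return score

if __name__ == "__main__":
    topk = [(["open_fridge", "take_food", "close_fridge"], 0.9),
            (["open_fridge", "take_food", "eat_food"], 0.7)]
    probs = transition_model(topk)
    answers = [["open_fridge", "take_food", "close_fridge"],
               ["hold_clothes", "open_door"]]
    best = max(answers, key=lambda a: sequence_score(a, probs))
    print("most plausible:", best)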
2 Related Work
Vision-Language Planning Task
Vision Lan-
guage Navigation (VLN) was among the first
widely used goal-oriented vision-language tasks,
requiring AI agents to navigate in an environment
without interaction by reasoning on the given in-
struction (Anderson et al.,2018;Hermann et al.,
2020;Misra et al.,2018;Jain et al.,2019). Re-
cently, further goal-oriented vision-language tasks
have been proposed. The Vision and Dialogue
History Navigation (VDHN) task (De Vries et al.,
2018;Nguyen and Daumé III,2019;Thomason
et al.,2020), which is similar to VLN, requires
agents to reason on the instructions over multiple
time steps. Other tasks such as Embodied Ques-
tion Answering (EQA; Das et al. 2018;Wijmans
et al. 2019), Embodied Object Referral (EOR; Qi
et al. 2020b;Chen et al. 2019) and Embodied Goal-
directed Manipulation (EGM; Shridhar et al. 2020;
Kim et al. 2020;Suhr et al. 2019) rely on reasoning
and interpreting the instruction with observation
or object interaction in the environment. However,
we argue that there are other ways to learn to plan
without practising. Our task is one example of
this, requiring agents to reason over the observa-
tion without performing actions.
Vision-Language Planning Datasets
As exist-
ing vision-language planning datasets emphasize
teaching embodied AI to perform the task like hu-
mans, they are constructed with interactive AI in
mind. VLN (Anderson et al.,2018) datasets ini-
tially started exploring planning tasks with the tex-
tual instruction as a step-by-step abstract guide and
minimal interaction with the environment. Extend-
ing the VLN task, VDHN (De Vries et al.,2018)
datasets provide an interactive textual dialogue be-
tween the speaker and the receiver in multiple steps.
The EQA (Das et al.,2018) task takes this a step
further by providing data in an object-centric QA
manner, advancing systems to understand the given
environment through object retrieval. The EOR (Qi
et al.,2020b) task designs object-centric datasets
with detailed instructions, aiming at localizing the
relevant objects accurately. The closest benchmark
to ours is ALFRED (Shridhar et al.,2021) from
the EGM task, which lets embodied agents decide
on actions and objects to be manipulated based
on detailed instructions. However, in our setting,
we ask intelligent systems to predict the most rea-
sonable future action sequence based on human
intents and answers in a Multiple Choice Question
Answering (MQA) format. During prediction, we
still give systems the flexibility to consider various
combinations of actions and objects.
Vision-Language Planning Modeling
Accord-
ing to Francis et al. (2021), several approaches have
been used for planning. Greedy search in end-to-
end models has been reported in several studies to
work well in goal-oriented tasks (Fried et al.,2018;
Das et al.,2018;Shridhar et al.,2020;Anderson
et al.,2018). Task progress monitoring (Ma et al.,
2019) is another method to tackle planning: it allows models to backtrack on actions if the current action is found to be suboptimal. Mapping (Anderson et al., 2019) has also been proposed for efficient planning via sensors. Topological and Exploration planning (Deng et al., 2020; Ke et al., 2019) models planning in a sym-
bolic manner. When goals are provided as several
sub-goals, a divide and conquer strategy (Misra
et al.,2018;Shridhar et al.,2020;Suhr et al.,2019)
may be invoked to perform sub-task planning. In
our work, we highlight another potential approach,
knowledge base retrieval. As we construct an MKB
containing various action sequences with detailed
features, intelligent agents can retrieve the most
suitable sequence from the MKB in order to perform planning.
3 Dataset Construction
We adopt videos from Charades (Sigurdsson
et al.,2016) and solicit intents for videos via crowd-
sourcing. We consider videos that have action
sequences of sufficient length appearing in both
initial video clips and answers, which results in a
dataset comprising 2,912 videos. The dataset is
split into training/validation/test sets with a ratio of
70%, 10%, 20%. On the training dataset, we build
an MKB by incorporating structural and concep-
tual information. On the test dataset, we collect a
set of MQAs for model evaluation. The evaluation
with MQAs is in fact an adversarial testing method,
widely used for quality estimation in machine trans-
lation (Kanojia et al.,2021). Herein, the ability of
a model to discriminate between correct outputs
and meaning-changing perturbations is predictive
of its overall performance, not just its robustness.
Thus MQAs are applied only for testing.
3.1 Data Normalization and Filtering
Charades is a large-scale video dataset of daily indoor activities collected via Amazon Mechanical Turk2 (AMT). The average length of videos is approximately 30 seconds. It involves interactions with 46 object classes and contains 157 action classes, which are also referred to as actions for short. Each action is represented as a verb phrase,
such as “pouring into a cup". This dataset is chosen
because i) it contains a sufficient number of long
action sequences of human daily activities; ii) the
intents are easily identifiable, as the activities in the
videos are based on scripts; iii) there are rich anno-
tations of videos that can be leveraged for dataset
construction. The details of action sequence selec-
tion in videos are presented in Appendix 7.1, with
the goal of choosing core action sequences having
clear human goals.
In order to assess the quality of extracted action
sequences, we randomly sample 100 videos from
the test set for manual inspection. The primary
action sequence of each video is evaluated in terms
of three criteria: i) if all actions of a sequence occur
in the video; ii) if the actions of a sequence appear
in the same order as in the video; iii) if a sequence
has any actions missing between the first and the
last action. In total, we determined that 94 videos
have all actions of their action sequences covered
in the video. The actions of 92 videos appear in
the same order as in the videos. Furthermore, 85
videos have no actions missing between the first
and the last action of their sequences. Thus, the
quality of such action sequences is adequate for VL
planning evaluation.
Following prior work (Ng and Fernando,2020),
we consider the first 20% of a video as its initial
visual state and aim to forecast future actions ap-
pearing in the remaining part of the video for a
given intent. To have at least one future action
per video, we retain only videos that contain at
least one action sequence comprising more than
three actions. As a result, we obtain 2,912 such
videos, each of which is associated with one action
sequence of length longer than three.
2https://www.mturk.com
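The filtering rule above can be expressed as a short script: cut each video at 20% of its duration, treat actions starting before the cut as the observed initial state, and keep the video only if its action sequence has more than three actions and at least one future action. The annotation field names below (duration, actions, start, label) are hypothetical and chosen for illustration.

from typing import Dict, List

def split_and_filter(videos: List[Dict], min_len: int = 4,
                     observed_ratio: float = 0.2) -> List[Dict]:
    """Return kept videos, each with an 'observed' prefix (actions that
    start in the first 20% of the video) and a 'future' suffix to forecast."""
    kept = []
    for v in videos:
        cut = v["duration"] * observed_ratio
        actions = sorted(v["actions"], key=lambda a: a["start"])
        if len(actions) < min_len:
            continue  # need an action sequence longer than three actions
        observed = [a["label"] for a in actions if a["start"] < cut]
        future = [a["label"] for a in actions if a["start"] >= cut]
        if not future:
            continue  # require at least one future action to forecast
        kept.append({**v, "observed": observed, "future": future})
    return kept

if __name__ == "__main__":
    toy = [{"id": "Z6LYG", "duration": 42.0,
            "actions": [{"label": "opening a refrigerator", "start": 2.0},
                        {"label": "taking food from somewhere", "start": 10.0},
                        {"label": "putting a dish somewhere", "start": 20.0},
                        {"label": "closing a refrigerator", "start": 36.3}]}]
    print(split_and_filter(toy))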
3.2 Intent Annotation
An intent may be defined as “something that you want and plan to do” (Cambridge Dictionary, https://dictionary.cambridge.org/). Philosophers distinguish between future-directed intents and present-directed
ones (Cohen and Levesque,1990). The former
guide the planning of actions, while the latter
causally produce behavior. As the focus of this
work is anticipating and planning actions, we
encourage crowd-workers to also provide future-
directed intents.
We recruit crowd-workers to annotate videos
with future-directed and present-directed intents.
Each annotator is provided with a full video clip
and the associated action sequence. They are in-
structed to answer the question what the person
wants to do by taking the actions in the video. Ev-
ery annotator is asked to submit two intents. One
of them should describe which activity the person
intends to take, such as “drink a glass of water”.
The other one needs to be at a high-level, such as
“quench the thirst” or “be thirsty”. The permitted formats are either “S/He wants to + do_something” or “S/He is + feeling”. Thus, the annotators are
encouraged to provide future-directed intents by
differentiating them from ones causally leading to
behaviours. To ensure the quality of intent annota-
tions, we randomly assign three crowd-workers to
write intents per video. Constructing the final set of intent annotations involved a rigorous validation and selection process. One of the
authors acted as an expert annotator, and conducted
a thorough review of all crowd-sourced intents to
identify and select the most reasonable annotations
as the final results. The validation process was com-
pleted in three rounds, yielding increasingly higher
percentages of reasonable annotations, with 82%,
94% and 100% respectively for each round. The
annotations that did not meet the required criteria
were discarded and not included in the final dataset.
This rigorous validation process ensured that the
final dataset comprises high-quality and rele-
vant annotations, providing a robust foundation for
subsequent modeling and analysis.
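Because the two permitted surface forms are fixed, the format constraint can be checked automatically. The regular expressions below are a minimal sketch of such a validator (our own illustration, not the quality-control procedure actually used).

import re

# The two permitted intent formats from the annotation guidelines:
#   "S/He wants to + do_something"   (future-directed)
#   "S/He is + feeling"              (present-directed)
SUBJECT = r"(?:S/He|She|He)"
FUTURE_DIRECTED = re.compile(rf"^{SUBJECT} wants to \S.+$", re.IGNORECASE)
PRESENT_DIRECTED = re.compile(rf"^{SUBJECT} is \S.+$", re.IGNORECASE)

def intent_format(intent: str) -> str:
    """Classify an annotated intent string by its surface form."""
    text = intent.strip()
    if FUTURE_DIRECTED.match(text):
        return "future-directed"
    if PRESENT_DIRECTED.match(text):
        return "present-directed"
    return "invalid"

if __name__ == "__main__":
    for s in ["S/He wants to drink a glass of water.",
              "S/He is thirsty.",
              "Drink water"]:
        print(s, "->", intent_format(s))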
3.3 Multimodal Knowledge Base
We construct the MKB of human activities based on the training set and validation set by taking
a neurosymbolic approach. The main challenges
herein are twofold: i) how to represent multimodal information from videos, action names, and intents
adequately to facilitate information retrieval; ii)
how to model shared knowledge of multimodal in-
formation. For the former, we allow both string and
embedding based retrieval methods by attaching
neural representations of video clips and texts to
symbols of actions and action sequences. For the
latter, we employ the classical planning language
STRIPS (Bylander,1994) and neural prototypes to
encode abstract properties of actions.
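For the embedding-based retrieval route, the sketch below ranks MKB action sequences by cosine similarity between intent embeddings. The section later specifies BERT with the [CLS] representation for intents; the specific checkpoint bert-base-uncased, the cosine ranking, and the kb_entries layout are our own assumptions for illustration.

import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-uncased" is an assumption; the paper only specifies BERT with
# the [CLS] representation for intents.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def cls_embedding(text: str) -> torch.Tensor:
    """Return the [CLS] token representation of a piece of text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [hidden_size]

def retrieve_topk(query_intent: str, kb_entries, k: int = 5):
    """Rank MKB action sequences by cosine similarity between the query
    intent embedding and each entry's intent embedding.  kb_entries is a
    list of (action_sequence, intent_text) pairs; in a real MKB the entry
    embeddings would be precomputed and cached rather than re-encoded."""
    q = cls_embedding(query_intent)
    scored = []
    for actions, intent_text in kb_entries:
        e = cls_embedding(intent_text)
        sim = torch.nn.functional.cosine_similarity(q, e, dim=0).item()
        scored.append((sim, actions))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:k]

if __name__ == "__main__":
    kb = [(["opening a refrigerator", "putting food somewhere"],
           "S/He wants to keep the food fresh"),
          (["holding a blanket", "lying on a sofa"],
           "S/He wants to take a nap")]
    print(retrieve_topk("He wants to keep his food fresh.", kb, k=1))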
At the core of the MKB is a knowledge graph G = (V, E), where the node set V comprises four types of nodes: action classes, action video clips, action sequences, and action sequence videos, while the edge set E contains edges reflecting relationships between nodes.
An action class ac is the abstraction of an action described in the language of STRIPS. The attributes of an action class include its ID, its name τ, its precondition set PRE, its add effect set ADD, and its delete effect set DEL. An action is executed only if its preconditions are satisfied. The effect sets ADD and DEL of an action class describe the add and delete operations applied to the current state after executing the action. For example, the precondition of Closing a refrigerator is isOpen(refrigerator), ADD = isClosed(refrigerator), and DEL = isOpen(refrigerator). In this way, the properties described in STRIPS capture the shared knowledge of each action class.
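The STRIPS-style attributes translate directly into data. The sketch below encodes the Closing a refrigerator example from the text and applies it to a symbolic state; the ActionClass container and the apply helper are our own illustrative rendering.

from dataclasses import dataclass
from typing import FrozenSet, Set

@dataclass(frozen=True)
class ActionClass:
    """A STRIPS-style action class: preconditions, add effects, delete effects."""
    action_id: str
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]

    def applicable(self, state: Set[str]) -> bool:
        # An action is executed only if its preconditions hold in the state.
        return self.pre <= state

    def apply(self, state: Set[str]) -> Set[str]:
        if not self.applicable(state):
            raise ValueError(f"preconditions of {self.name} not satisfied")
        return (state - self.delete) | self.add

CLOSE_FRIDGE = ActionClass(
    action_id="c142",
    name="Closing a refrigerator",
    pre=frozenset({"isOpen(refrigerator)"}),
    add=frozenset({"isClosed(refrigerator)"}),
    delete=frozenset({"isOpen(refrigerator)"}),
)

if __name__ == "__main__":
    state = {"isOpen(refrigerator)", "holding(food)"}
    print(CLOSE_FRIDGE.apply(state))
    # -> {'isClosed(refrigerator)', 'holding(food)'}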
[Figure 2 near here. The figure shows an MKB fragment for video Z6LYG: an action sequence video node (video ID Z6LYG, start 1.9, end 42), action video clip nodes with their video IDs, action IDs, and start/end times (e.g. video 24XHS, action c142, 36.30 to 42.00), the action class c142 (Closing a refrigerator) with pre: IsOpen(Refrigerator), add: IsClosed(Refrigerator), del: IsOpen(Refrigerator), and an action sequence node with a future-directed intent ("S/He is hungry"), a present-directed intent ("Eat food"), and the ordered actions c143: Opening a refrigerator -> c156: Someone is eating something -> c063: Taking food from somewhere -> c119: Putting a dish/es somewhere -> c142: Closing a refrigerator.]
Figure 2: An example action sequence in the MKB.
An action sequence comprises a future-directed
intent, a present-directed intent, and a sequence of
action IDs. An intent is represented by both a word
sequence and the distributed representation of the
word sequence. We obtain the distributed repre-
sentation of an intent by applying BERT (Devlin
et al.,2018) and utilizing the representation of the
CLS token. The collection of action sequences can