ViLPAct: A Benchmark for Compositional Generalization
on Multimodal Human Activities
Terry Yue Zhuo1and Yaqing Liao2and Yuecheng Lei2
Lizhen Qu1*and Gerard de Melo3
Xiaojun Chang4and Yazhou Ren2and Zenglin Xu5,6*
1Monash University 2University of Electronic Science and Technology of China
3HPI/University of Potsdam 4University of Technology Sydney
5Harbin Institute of Technology, Shenzhen 6Peng Cheng Lab
*Corresponding authors: lizhen.qu@monash.edu, xuzenglin@hit.edu.cn
Abstract
We introduce ViLPAct, a novel vision-
language benchmark for human activity plan-
ning. It is designed for a task where em-
bodied AI agents can reason and forecast fu-
ture actions of humans based on video clips
about their initial activities and intents in
text. The dataset consists of 2.9k videos from
Charades extended with intents via crowd-
sourcing, a multi-choice question test set, and
four strong baselines. One of the baselines im-
plements a neurosymbolic approach based on
a multi-modal knowledge base (MKB), while
the other ones are deep generative models
adapted from recent state-of-the-art (SOTA)
methods. According to our extensive exper-
iments, the key challenges are compositional
generalization and effective use of information
from both modalities. Our benchmark is available at https://github.com/terryyz/ViLPAct.
1 Introduction
"He wants to keep his food fresh." Intent The old man is
now standing
in the kitchen.
Holding
some
food
Putting
some food
somewhere
Holding
some
clothes
Opening
a
refrigerator
Holding a
knife
Cook
some
food
Opening
a
refrigerator
Putting
some food
somewhere
Holding
some
food
What
should
come
next?
Observation
Figure 1: In daily life scenarios, an agent should be
aware of future actions that will likely be taken by
the user based on what it has observed. In this ex-
ample, inputs of intent and observation are colored in
green, while potential future action sequences are high-
lighted in orange. The first two sequences contain ac-
tions which do not align with the human intent. Thus,
the agent needs to automatically detect which future ac-
tions are plausible by understanding the user’s intent.
One of the ultimate goals of Artificial Intelligence is to build intelligent agents capable of accurately understanding humans’ actions and intents, so that they can better serve us (Kong and Fu, 2018; Zhuo et al., 2023). Newly emerging applications in
robotics and multi-modal planning, such as Ama-
zon Astro, have demonstrated a strong need to
understand human behavior in multimodal envi-
ronments. On the one hand, such an agent, e.g.
an elderly care service bot, needs to understand
human activities and anticipate human behaviors
based on users’ intents. Here the intents may be
estimated based on previous activities or articu-
lated verbally by users. The anticipated behaviors
may be used for risk assessment (e.g. falling of
elderly people) and to facilitate collaboration with
humans. On the other hand, recent advances in
robotics show that it is possible to let robots learn
new tasks directly from observed human behav-
ior without robot demonstrations (Yu et al.,2018;
Sharma et al.,2019). However, that line of work fo-
cuses on imitating observed human actions without
anticipating future activities.
To promote research on action forecasting based
on intents, we propose the vision-language plan-
ning task for human behaviors. As shown in Fig. 1,
given an intent in textual form and a short video
clip, an agent anticipates which actions a human is
likely to take. We consider intents as given because
there is already ample research on intent identifi-
cation (Pandey and Aghav,2020) and automatic
speech recognition (Malik et al.,2021). To the best
of our knowledge, there is no dataset to evaluate
models for this task.
The task poses two major challenges. First, there
are often multiple plausible action sequences satis-
fying an intent. Second, it is highly unlikely that a
training dataset can cover all possible combinations
of actions for a given intent. Hence, models need
to acquire compositional generalization (Fodor and
Pylyshyn,1988), the capability to generalize to un-
seen action sequences composed of known actions.
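To make this requirement concrete, the sketch below shows one way to build a compositional split: every individual action appears in training, but some test sequences contain combinations of actions that never co-occur in any training sequence. This is a minimal illustration under our own assumptions (the greedy heuristic and the name make_compositional_split are ours, not part of the released benchmark).

from itertools import combinations

def make_compositional_split(sequences, test_fraction=0.2):
    """Greedy heuristic for a compositional split (illustration only).

    A sequence goes to the test set if it contains at least one pair of
    actions that never co-occurs in any training sequence collected so
    far, while every individual action in it is already observed in
    training.  A production split would re-check novelty against the
    final training set.
    """
    n_test = int(len(sequences) * test_fraction)
    # Longer sequences first, so training covers more individual actions.
    ordered = sorted(sequences, key=len, reverse=True)
    train, test = [], []
    seen_pairs = set()      # action pairs co-occurring in some training sequence
    seen_actions = set()    # individual actions observed in training
    for seq in ordered:
        pairs = set(combinations(sorted(set(seq)), 2))
        novel_combination = not pairs <= seen_pairs
        known_actions = set(seq) <= seen_actions
        if len(test) < n_test and novel_combination and known_actions:
            test.append(seq)  # unseen combination of known actions
        else:
            train.append(seq)
            seen_pairs |= pairs
            seen_actions |= set(seq)
    return train, test

if __name__ == "__main__":
    toy = [
        ("open_fridge", "take_food", "eat_food", "close_fridge"),
        ("take_food", "wash_dish", "eat_food"),
        ("open_fridge", "take_food", "close_fridge"),
        ("wash_dish", "close_fridge"),
        ("open_fridge", "eat_food"),
    ]
    train, test = make_compositional_split(toy, test_fraction=0.2)
    print("train:", train)
    print("test:", test)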
In this work, we construct a dataset called ViLPAct for Vision-Language Planning of human Activities, which to the best of our knowledge is the first dataset studying the above challenges. Specifically, we extend the Charades
dataset (Sigurdsson et al.,2016) with intents via
crowd-sourcing. As it is practically infeasible to
find all possible future action sequences given an
intent and a video clip of initial activities, we pro-
pose to evaluate all systems by letting each of
them answer multi-choice comprehension ques-
tions (MQA) without training them on those ques-
tions. Given an intent and a video clip showing
initial activities, each multi-choice question pro-
vides a fixed number of future action sequences
as possible answers. A system is then asked to
select the most plausible action sequence among
them. We show that the rankings of all models
using the MQAs correlate strongly with those ob-
tained by asking human assessors to directly ob-
serve estimated action sequences. For training, we
provide both a dataset for end-to-end training of
sequence forecasting and a multimodal knowledge
base (MKB) built from that dataset, which is also
the first video-based multimodal knowledge base
for human activities to the best of our knowledge.
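To make the MQA protocol concrete, the following minimal sketch evaluates any system that exposes a plausibility score over candidate action sequences; the data layout (fields such as "intent", "choices", "answer_idx") and the scoring interface are our own illustrative assumptions, not the benchmark's actual API.

from typing import Callable, Dict, List, Sequence

# A "question" pairs an intent and an initial video clip with a fixed
# number of candidate future action sequences, exactly one of which is
# marked as the ground-truth continuation.
Question = Dict[str, object]

def mqa_accuracy(
    questions: List[Question],
    score_fn: Callable[[str, str, Sequence[str]], float],
) -> float:
    """Evaluate a system on multi-choice questions.

    score_fn(intent, video_path, action_sequence) returns a plausibility
    score; the system's answer is its highest-scoring choice.  No training
    on the questions themselves is involved.
    """
    correct = 0
    for q in questions:
        scores = [score_fn(q["intent"], q["video"], choice)
                  for choice in q["choices"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == q["answer_idx"])
    return correct / len(questions)

if __name__ == "__main__":
    # Toy example with a trivially intent-matching scorer.
    toy_questions = [{
        "intent": "He wants to keep his food fresh.",
        "video": "videos/Z6LYG.mp4",
        "choices": [
            ["holding clothes", "opening a refrigerator"],
            ["opening a refrigerator", "putting food somewhere"],
        ],
        "answer_idx": 1,
    }]
    dummy = lambda intent, video, seq: float("food" in " ".join(seq))
    print("accuracy:", mqa_accuracy(toy_questions, dummy))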
We conduct the first empirical study to inves-
tigate compositional generalization for the target
task. As baselines, we adapt three strong end-to-
end deep generative models for this task and pro-
pose a neurosymbolic planning baseline using the
MKB. The model is neurosymbolic because it com-
bines both deep neural networks and symbolic rea-
soning (Garcez and Lamb,2020). Given a video
of initial activities and an intent, the deep models
generate the top-k relevant action sequences, while the neurosymbolic planning model sends the intent and the action sequence recognized from the video as the query to the MKB, followed by retrieving the top-k relevant action sequences. Each model
selects the most plausible answers by performing
probabilistic reasoning over the relevant action se-
quences. We conduct extensive experiments and
obtain the following key experimental results:
• We compare the evaluation results using MQA with the ones of human evaluation. The results of both methods are well aligned. Thus, MQA is reliable without requiring human effort.
• The likelihood functions of the deep generative models are not able to reliably infer which answers are plausible. In contrast, probabilistic reasoning is an effective method to improve compositional generalization.
• Despite information from both modalities being useful and complementary, all baselines heavily rely on intents in textual form but fail to effectively exploit visual information from video clips.
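Returning to the baselines sketched above, the following toy example illustrates the retrieve-then-reason idea: the top-k action sequences (from a generative model or an MKB query) induce a simple smoothed transition distribution, and each multi-choice answer is scored under it. This is our own simplification for illustration, not the formulation used by the released baselines.

from collections import Counter
from typing import Dict, List, Sequence, Tuple

def transition_model(topk: List[Tuple[Sequence[str], float]],
                     alpha: float = 0.1) -> Dict[Tuple[str, str], float]:
    """Estimate smoothed action-transition probabilities from the top-k
    retrieved/generated sequences, each paired with a relevance weight."""
    counts: Counter = Counter()
    context: Counter = Counter()
    vocab = set()
    for seq, weight in topk:
        vocab.update(seq)
        for prev, nxt in zip(seq, seq[1:]):
            counts[(prev, nxt)] += weight
            context[prev] += weight
    v = max(len(vocab), 1)
    return {
        (p, n): (counts[(p, n)] + alpha) / (context[p] + alpha * v)
        for p in vocab for n in vocab
    }

def sequence_score(candidate: Sequence[str],
                   probs: Dict[Tuple[str, str], float],
                   floor: float = 1e-6) -> float:
    """Product of transition probabilities along a candidate answer."""
    score = 1.0
    for prev, nxt in zip(candidate, candidate[1:]):
        score *= probs.get((prev, nxt), floor)
    return score

if __name__ == "__main__":
    topk = [(["open_fridge", "take_food", "close_fridge"], 0.9),
            (["open_fridge", "take_food", "eat_food"], 0.7)]
    probs = transition_model(topk)
    answers = [["open_fridge", "take_food", "close_fridge"],
               ["hold_clothes", "open_door"]]
    best = max(answers, key=lambda a: sequence_score(a, probs))
    print("most plausible:", best)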
2 Related Work
Vision-Language Planning Task
Vision Lan-
guage Navigation (VLN) was among the first
widely used goal-oriented vision-language tasks,
requiring AI agents to navigate in an environment
without interaction by reasoning on the given in-
struction (Anderson et al.,2018;Hermann et al.,
2020;Misra et al.,2018;Jain et al.,2019). Re-
cently, further goal-oriented vision-language tasks
have been proposed. The Vision and Dialogue
History Navigation (VDHN) task (De Vries et al.,
2018;Nguyen and Daumé III,2019;Thomason
et al.,2020), which is similar to VLN, requires
agents to reason on the instructions over multiple
time steps. Other tasks such as Embodied Ques-
tion Answering (EQA; Das et al. 2018;Wijmans
et al. 2019), Embodied Object Referral (EOR; Qi
et al. 2020b;Chen et al. 2019) and Embodied Goal-
directed Manipulation (EGM; Shridhar et al. 2020;
Kim et al. 2020;Suhr et al. 2019) rely on reasoning
and interpreting the instruction with observation
or object interaction in the environment. However,
we argue that there are other ways to learn to plan
without practising. Our task is one example of
this, requiring agents to reason over the observa-
tion without performing actions.
Vision-Language Planning Datasets
As exist-
ing vision-language planning datasets emphasize
teaching embodied AI to perform the task like hu-
mans, they are constructed with interactive AI in
mind. VLN (Anderson et al.,2018) datasets ini-
tially started exploring planning tasks with the tex-
tual instruction as a step-by-step abstract guide and
minimal interaction with the environment. Extend-
ing the VLN task, VDHN (De Vries et al.,2018)
datasets provide an interactive textual dialogue be-
tween the speaker and the receiver in multiple steps.
The EQA (Das et al.,2018) task takes this a step
further by providing data in an object-centric QA
manner, advancing systems to understand the given
environment through object retrieval. The EOR (Qi
et al.,2020b) task designs object-centric datasets
with detailed instructions, aiming at localizing the
relevant objects accurately. The closest benchmark
to ours is ALFRED (Shridhar et al.,2021) from
the EGM task, which lets embodied agents decide
on actions and objects to be manipulated based
on detailed instructions. However, in our setting,
we ask intelligent systems to predict the most rea-
sonable future action sequence based on human
intents and answers in a Multiple Choice Question
Answering (MQA) format. During prediction, we
still give systems the flexibility to consider various
combinations of actions and objects.
Vision-Language Planning Modeling
Accord-
ing to Francis et al. (2021), several approaches have
been used for planning. Greedy search in end-to-
end models has been reported in several studies to
work well in goal-oriented tasks (Fried et al.,2018;
Das et al.,2018;Shridhar et al.,2020;Anderson
et al.,2018). Task progress monitoring (Ma et al.,
2019) is another method to tackle planning: it allows models to backtrack on actions if the current action is found to be suboptimal. Mapping (Anderson et al., 2019) has also been proposed for efficient planning via sensors. Topological and Exploration planning (Deng et al., 2020; Ke et al., 2019) models planning in a sym-
bolic manner. When goals are provided as several
sub-goals, a divide and conquer strategy (Misra
et al.,2018;Shridhar et al.,2020;Suhr et al.,2019)
may be invoked to perform sub-task planning. In
our work, we highlight another potential approach,
knowledge base retrieval. As we construct an MKB
containing various action sequences with detailed
features, intelligent agents can retrieve the most
suitable sequence from the MKB in order to perform planning.
3 Dataset Construction
We adopt videos from Charades (Sigurdsson
et al.,2016) and solicit intents for videos via crowd-
sourcing. We consider videos that have action
sequences of sufficient length appearing in both
initial video clips and answers, which results in a
dataset comprising 2,912 videos. The dataset is
split into training/validation/test sets with a ratio of
70%, 10%, 20%. On the training dataset, we build
an MKB by incorporating structural and concep-
tual information. On the test dataset, we collect a
set of MQAs for model evaluation. The evaluation
with MQAs is in fact an adversarial testing method,
widely used for quality estimation in machine trans-
lation (Kanojia et al.,2021). Herein, the ability of
a model to discriminate between correct outputs
and meaning-changing perturbations is predictive
of its overall performance, not just its robustness.
Thus MQAs are applied only for testing.
3.1 Data Normalization and Filtering
Charades is a large-scale video dataset of daily indoor activities collected via Amazon Mechanical Turk2 (AMT). The average length of videos is approximately 30 seconds. It involves interactions with 46 object classes and contains 157 action classes, which are also referred to as actions for short. Each action is represented as a verb phrase,
such as “pouring into a cup". This dataset is chosen
because i) it contains a sufficient number of long
action sequences of human daily activities; ii) the
intents are easily identifiable, as the activities in the
videos are based on scripts; iii) there are rich anno-
tations of videos that can be leveraged for dataset
construction. The details of action sequence selec-
tion in videos are presented in Appendix 7.1, with
the goal of choosing core action sequences having
clear human goals.
In order to assess the quality of extracted action
sequences, we randomly sample 100 videos from
the test set for manual inspection. The primary
action sequence of each video is evaluated in terms
of three criteria: i) if all actions of a sequence occur
in the video; ii) if the actions of a sequence appear
in the same order as in the video; iii) if a sequence
has any actions missing between the first and the
last action. In total, we determined that 94 videos
have all actions of their action sequences covered
in the video. The actions of 92 videos appear in
the same order as in the videos. Furthermore, 85
videos have no actions missing between the first
and the last action of their sequences. Thus, the
quality of such action sequences is adequate for VL
planning evaluation.
Following prior work (Ng and Fernando,2020),
we consider the first 20% of a video as its initial
visual state and aim to forecast future actions ap-
pearing in the remaining part of the video for a
given intent. To have at least one future action
per video, we retain only videos that contain at
least one action sequence comprising more than
three actions. As a result, we obtain 2,912 such
videos, each of which is associated with one action
sequence of length longer than three.
2https://www.mturk.com
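The filtering rule above can be expressed as a short script: cut each video at 20% of its duration, treat actions starting before the cut as the observed initial state, and keep the video only if its action sequence has more than three actions and at least one future action. The annotation field names below (duration, actions, start, label) are hypothetical and chosen for illustration.

from typing import Dict, List

def split_and_filter(videos: List[Dict], min_len: int = 4,
                     observed_ratio: float = 0.2) -> List[Dict]:
    """Return kept videos, each with an 'observed' prefix (actions that
    start in the first 20% of the video) and a 'future' suffix to forecast."""
    kept = []
    for v in videos:
        cut = v["duration"] * observed_ratio
        actions = sorted(v["actions"], key=lambda a: a["start"])
        if len(actions) < min_len:
            continue  # need an action sequence longer than three actions
        observed = [a["label"] for a in actions if a["start"] < cut]
        future = [a["label"] for a in actions if a["start"] >= cut]
        if not future:
            continue  # require at least one future action to forecast
        kept.append({**v, "observed": observed, "future": future})
    return kept

if __name__ == "__main__":
    toy = [{"id": "Z6LYG", "duration": 42.0,
            "actions": [{"label": "opening a refrigerator", "start": 2.0},
                        {"label": "taking food from somewhere", "start": 10.0},
                        {"label": "putting a dish somewhere", "start": 20.0},
                        {"label": "closing a refrigerator", "start": 36.3}]}]
    print(split_and_filter(toy))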
3.2 Intent Annotation
An intent may be defined as “something that you want and plan to do” (Cambridge Dictionary, https://dictionary.cambridge.org/). Philosophers distinguish between future-directed intents and present-directed
ones (Cohen and Levesque,1990). The former
guide the planning of actions, while the latter
causally produce behavior. As the focus of this
work is anticipating and planning actions, we
encourage crowd-workers to also provide future-
directed intents.
We recruit crowd-workers to annotate videos
with future-directed and present-directed intents.
Each annotator is provided with a full video clip
and the associated action sequence. They are in-
structed to answer the question what the person
wants to do by taking the actions in the video. Ev-
ery annotator is asked to submit two intents. One
of them should describe which activity the person
intends to take, such as “drink a glass of water”.
The other one needs to be at a high-level, such as
“quench the thirst” or “be thirsty”. The permitted formats are either “S/He wants to + do_something” or “S/He is + feeling”. Thus, the annotators are
encouraged to provide future-directed intents by
differentiating them from ones causally leading to
behaviours. To ensure the quality of intent annota-
tions, we randomly assign three crowd-workers to
write intents per video. Constructing the final set of intent annotations involved a rigorous validation and selection process. One of the
authors acted as an expert annotator, and conducted
a thorough review of all crowd-sourced intents to
identify and select the most reasonable annotations
as the final results. The validation process was com-
pleted in three rounds, yielding increasingly higher
percentages of reasonable annotations, with 82%,
94% and 100% respectively for each round. The
annotations that did not meet the required criteria
were discarded and not included in the final dataset.
This rigorous validation process ensured that the
final dataset comprises high-quality and rele-
vant annotations, providing a robust foundation for
subsequent modeling and analysis.
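Because the two permitted surface forms are fixed, the format constraint can be checked automatically. The regular expressions below are a minimal sketch of such a validator (our own illustration, not the quality-control procedure actually used).

import re

# The two permitted intent formats from the annotation guidelines:
#   "S/He wants to + do_something"   (future-directed)
#   "S/He is + feeling"              (present-directed)
SUBJECT = r"(?:S/He|She|He)"
FUTURE_DIRECTED = re.compile(rf"^{SUBJECT} wants to \S.+$", re.IGNORECASE)
PRESENT_DIRECTED = re.compile(rf"^{SUBJECT} is \S.+$", re.IGNORECASE)

def intent_format(intent: str) -> str:
    """Classify an annotated intent string by its surface form."""
    text = intent.strip()
    if FUTURE_DIRECTED.match(text):
        return "future-directed"
    if PRESENT_DIRECTED.match(text):
        return "present-directed"
    return "invalid"

if __name__ == "__main__":
    for s in ["S/He wants to drink a glass of water.",
              "S/He is thirsty.",
              "Drink water"]:
        print(s, "->", intent_format(s))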
3.3 Multimodal Knowledge Base
We construct the MKB of human activities based on the training set and validation set by taking
a neurosymbolic approach. The main challenges
herein are twofold: i) how to represent multimodal information from videos, action names, and intents
adequately to facilitate information retrieval; ii)
how to model shared knowledge of multimodal in-
formation. For the former, we allow both string and
embedding based retrieval methods by attaching
neural representations of video clips and texts to
symbols of actions and action sequences. For the
latter, we employ the classical planning language
STRIPS (Bylander,1994) and neural prototypes to
encode abstract properties of actions.
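For the embedding-based retrieval route, the sketch below ranks MKB action sequences by cosine similarity between intent embeddings. The section later specifies BERT with the [CLS] representation for intents; the specific checkpoint bert-base-uncased, the cosine ranking, and the kb_entries layout are our own assumptions for illustration.

import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-uncased" is an assumption; the paper only specifies BERT with
# the [CLS] representation for intents.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def cls_embedding(text: str) -> torch.Tensor:
    """Return the [CLS] token representation of a piece of text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [hidden_size]

def retrieve_topk(query_intent: str, kb_entries, k: int = 5):
    """Rank MKB action sequences by cosine similarity between the query
    intent embedding and each entry's intent embedding.  kb_entries is a
    list of (action_sequence, intent_text) pairs; in a real MKB the entry
    embeddings would be precomputed and cached rather than re-encoded."""
    q = cls_embedding(query_intent)
    scored = []
    for actions, intent_text in kb_entries:
        e = cls_embedding(intent_text)
        sim = torch.nn.functional.cosine_similarity(q, e, dim=0).item()
        scored.append((sim, actions))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:k]

if __name__ == "__main__":
    kb = [(["opening a refrigerator", "putting food somewhere"],
           "S/He wants to keep the food fresh"),
          (["holding a blanket", "lying on a sofa"],
           "S/He wants to take a nap")]
    print(retrieve_topk("He wants to keep his food fresh.", kb, k=1))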
At the core of the MKB is a knowledge graph G = (V, E), where the node set V comprises four types of nodes: action classes, action video clips, action sequences, and action sequence videos, while the edge set E contains edges reflecting relationships between nodes.
An action class ac is the abstraction of an action described in the language of STRIPS. The attributes of an action class include its ID, its name τ, its precondition set PRE, its add effect set ADD, and its delete effect set DEL. An action is executed only if its preconditions are satisfied. The effect sets ADD and DEL of an action class describe the add and delete operations applied to the current state after executing the action. For example, the precondition of Closing a refrigerator is isOpen(refrigerator), ADD = isClosed(refrigerator), and DEL = isOpen(refrigerator). In this way, the properties described in STRIPS capture the shared knowledge of each action class.
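The STRIPS-style attributes translate directly into data. The sketch below encodes the Closing a refrigerator example from the text and applies it to a symbolic state; the ActionClass container and the apply helper are our own illustrative rendering.

from dataclasses import dataclass
from typing import FrozenSet, Set

@dataclass(frozen=True)
class ActionClass:
    """A STRIPS-style action class: preconditions, add effects, delete effects."""
    action_id: str
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]

    def applicable(self, state: Set[str]) -> bool:
        # An action is executed only if its preconditions hold in the state.
        return self.pre <= state

    def apply(self, state: Set[str]) -> Set[str]:
        if not self.applicable(state):
            raise ValueError(f"preconditions of {self.name} not satisfied")
        return (state - self.delete) | self.add

CLOSE_FRIDGE = ActionClass(
    action_id="c142",
    name="Closing a refrigerator",
    pre=frozenset({"isOpen(refrigerator)"}),
    add=frozenset({"isClosed(refrigerator)"}),
    delete=frozenset({"isOpen(refrigerator)"}),
)

if __name__ == "__main__":
    state = {"isOpen(refrigerator)", "holding(food)"}
    print(CLOSE_FRIDGE.apply(state))
    # -> {'isClosed(refrigerator)', 'holding(food)'}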
[Figure 2 near here. The figure shows an MKB fragment for video Z6LYG: an action sequence video node (video ID Z6LYG, start 1.9, end 42), action video clip nodes with their video IDs, action IDs, and start/end times (e.g. video 24XHS, action c142, 36.30 to 42.00), the action class c142 (Closing a refrigerator) with pre: IsOpen(Refrigerator), add: IsClosed(Refrigerator), del: IsOpen(Refrigerator), and an action sequence node with a future-directed intent ("S/He is hungry"), a present-directed intent ("Eat food"), and the ordered actions c143: Opening a refrigerator -> c156: Someone is eating something -> c063: Taking food from somewhere -> c119: Putting a dish/es somewhere -> c142: Closing a refrigerator.]
Figure 2: An example action sequence in the MKB.
An action sequence comprises a future-directed
intent, a present-directed intent, and a sequence of
action IDs. An intent is represented by both a word
sequence and the distributed representation of the
word sequence. We obtain the distributed repre-
sentation of an intent by applying BERT (Devlin
et al.,2018) and utilizing the representation of the
CLS token. The collection of action sequences can