
Table 1: A comparison between EgoTaskQA and existing video question-answering benchmarks. We use "World"
for world-model-related information, including action preconditions, post-effects, and dependencies. We use
FPV as short for egocentric (first-person-view) videos and TPV for third-person-view videos, MC as short for
multiple-choice question answering, and OP for open-answer question answering.
Dataset | View | Real-world | World | Intents & Goals | Multi-agent | Descriptive | Predictive | Explanatory | Counterfactual | Answer type | # Questions
MarioQA [42] | TPV | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | OP | 188K
Pororo-QA [43] | TPV | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 9K
CLEVRER [44] | TPV | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | OP+MC | 282K
Env-QA [45] | FPV | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 85K
MovieQA [46] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 14K
Social-IQ [47] | TPV | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | MC | 7.5K
TVQA [48] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 152.5K
TVQA+ [49] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 29.4K
MSVD-QA [50] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 50.5K
MSRVTT-QA [50] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 243K
Video-QA [51] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 175K
ActivityNet-QA [52] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 58K
TGIF-QA [53] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | MC | 165.2K
How2QA [54] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | MC | 44K
HowToVQA69M [55] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 69M
AGQA [56] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 3.6M
NExT-QA [57] | TPV | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | OP+MC | 52K
STAR [58] | TPV | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | MC | 60K
EgoVQA [59] | FPV | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | OP+MC | 520
EgoTaskQA (Ours) | FPV | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | OP | 40K
• We construct a balanced video question-answering benchmark, EgoTaskQA, to measure models' capability in understanding action dependencies and effects, intents and goals, as well as beliefs in multi-agent scenarios. We procedurally generate four challenging types of questions (descriptive, predictive, explanatory, and counterfactual) with both direct and indirect references for our benchmark and potential research on video-grounded compositional reasoning (a toy sketch of such reference-aware question generation follows this list).
• We devise challenging benchmarking splits over EgoTaskQA to provide a systematic evaluation of goal-oriented reasoning and indirect reference understanding. We experiment with various state-of-the-art video reasoning models, show their performance gap compared with humans, and analyze their strengths and weaknesses to promote future research on goal-oriented task understanding.
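As a toy illustration only, not the actual generation pipeline of this paper, the sketch below shows how a template-based generator could instantiate one descriptive question using either a direct object name or an indirect, action-derived reference; every annotation field, template, and name here is hypothetical.

```python
# Hypothetical annotation record for a single object: a direct name plus an
# indirect description derived from the action that last changed the object.
obj = {
    "name": "the cup",                        # direct reference
    "indirect": "the object the man washed",  # indirect reference
    "state_after": "clean",
}

# Hypothetical template for one of the four question types (descriptive).
TEMPLATES = {
    "descriptive": "What is the state of {ref} after the action?",
}

def instantiate(record, qtype="descriptive", indirect=False):
    """Fill a question template with a direct or indirect object reference."""
    ref = record["indirect"] if indirect else record["name"]
    return TEMPLATES[qtype].format(ref=ref), record["state_after"]

print(instantiate(obj, indirect=False))
# ('What is the state of the cup after the action?', 'clean')
print(instantiate(obj, indirect=True))
# ('What is the state of the object the man washed after the action?', 'clean')
```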
2 Related Work
Action as Inverse Planning
Action understanding has been seen as an inverse planning problem on agents' mental states [14, 15]. Early studies formulate it as reasoning over first-order logic formulae that describe actions' preconditions and post-effects [16, 17]. This symbolic formalism was later paired with domain-specific languages and algorithms to become a mainstay of robotics planning [18, 19]. In computer vision, similar attempts have been made to link visual observations with world states and actions [20–22]. Various methods treat actions as transformations on images to solve action-state recognition [23–27] and video prediction [28–30]. With the emerging interest in language-grounded understanding, Zellers et al. [31] proposed PIGLeT to study the binding between images, world states, and action descriptions. Padmakumar et al. [32] further study the problem of language understanding and task execution by designing an intelligent embodied agent that can chat during task execution. However, these works are mostly limited to atomic actions, missing the important action dependencies in task execution. To tackle this problem, instructional videos [33–36] are studied for their goal-oriented multi-step activities. In these videos, external knowledge [37, 38] can be used as guidance for advanced tasks like temporal dynamics learning [39] and visually grounded planning [40, 41]. Unfortunately, these videos highlight the instructions and contain no task-level noise, making them much simpler than the partially observable, highly parallel, multi-agent environments that humans learn from and that are presented in our benchmark. These complexities make goal-oriented action understanding a challenging task that remains to be solved.
Egocentric Vision
Egocentric vision offers a unique perspective for actively engaging with the world. Aside from traditional video understanding tasks like video summarization [60, 61], activity recognition [62–64], and future anticipation [65–69], egocentric videos provide fine-grained information for tasks like human-object interaction understanding [70–76] and gaze/attention prediction [77, 10]. Because they naturally reflect partial observability, egocentric videos are also used for social understanding tasks such as joint attention modeling [78, 79], perspective taking [80, 81], and communicative modeling [82, 7]. However, with various egocentric datasets curated over the