EgoTaskQA: Understanding Human Tasks in
Egocentric Videos
Baoxiong Jia1,2*
baoxiongjia@ucla.edu
Ting Lei2,3*
ting_lei@pku.edu.cn
Song-Chun Zhu2,3,4
sczhu@bigai.ai
Siyuan Huang2
syhuang@bigai.ai
1UCLA Center for Vision, Cognition, Learning, and Autonomy (VCLA)
2Beijing Institute for General Artificial Intelligence (BIGAI)
3Institute for Artificial Intelligence, Peking University
4Department of Automation, Tsinghua University
https://sites.google.com/view/egotaskqa
Abstract
Understanding human tasks through video observations is an essential capability
of intelligent agents. The challenges of such capability lie in the difficulty of
generating a detailed understanding of situated actions, their effects on object states
(i.e., state changes), and their causal dependencies. These challenges are further
aggravated by the natural parallelism from multi-tasking and partial observations
in multi-agent collaboration. Most prior works leverage action localization or
future prediction as an indirect metric for evaluating such task understanding from
videos. To make a direct evaluation, we introduce the EgoTaskQA benchmark that
provides a single home for the crucial dimensions of task understanding through
question-answering on real-world egocentric videos. We meticulously design
questions that target the understanding of (1) action dependencies and effects,
(2) intents and goals, and (3) agents’ beliefs about others. These questions are
divided into four types, including descriptive (what status?), predictive (what will?),
explanatory (what caused?), and counterfactual (what if?) to provide diagnostic
analyses on spatial, temporal, and causal understandings of goal-oriented tasks.
We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gap between their performance and that of humans in understanding complex goal-oriented egocentric videos. We hope this effort will drive the vision community to move forward on goal-oriented video understanding and reasoning.
1 Introduction
The study of human motion perception has suggested that humans perceive motion as goal-directed behaviors rather than plain pattern movements [1-3]. Developmental psychologists [4] categorized such an ability into two distinct mechanisms: (1) action-effect associations, in which desired effects activate the corresponding actions; and (2) simulative procedures, which argue that goal attribution comes from planning under the rational action principle in others' shoes. Both mechanisms require detailed knowledge of action dependencies and effects, agents' intents and goals, and beliefs about other agents. With such knowledge playing crucial roles in human cognitive development, learning it from visual observation is pivotal for building more intelligent agents.
*Work done during internship at BIGAI.
36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks.
arXiv:2210.03929v1 [cs.CV] 8 Oct 2022
Figure 1: An overview of EgoTaskQA. We show an illustrative scenario where two subjects collaborate to make and drink cereal. Based on egocentric observations, we generate questions on these seen or unseen video intervals with different question types, targeting various semantics and question scopes. Note that we use both direct (e.g., when the person opens something...) and indirect (e.g., the action before getting something...) references to actions and objects, where the same color indicates the same referred actions (best viewed in color).
Taking a closer look at how humans learn from interacting with the world, we locate objects, change their positions, and manipulate them in various ways, all presumably under visual control from an egocentric perspective [5]. This unique first-person experience provides essential visual cues for human attention and goal-oriented task understanding. Moreover, egocentric perception naturally reflects how humans reason and act in a partially observable environment, making it the most readily available source for learning actions, tasks [6], and belief modeling [7]. The past few years have witnessed significant progress in egocentric video understanding, especially action recognition and future anticipation [8-13]. However, these two tasks merely cover the tip of the iceberg, considering how humans learn from visual observations to obtain knowledge for more profound tasks such as learning world models, planning for desired goals, and building beliefs about others. Given their essential roles in human cognitive development, we argue for a benchmark that addresses these missing dimensions in egocentric activity understanding.
Hence, we present EgoTaskQA, a challenging egocentric, goal-oriented video question-answering benchmark based on the LEMMA dataset [11]. The LEMMA dataset collects egocentric videos of goal-oriented, multi-agent collaborative activities with fine-grained action and task annotations. By extending LEMMA with annotations of object states, human-object and multi-agent relationships, and causal dependency structures between actions, we design questions that target three specific scopes: (1) actions with world state transitions and their dependencies, (2) agents' intents and goals in task execution, and (3) agents' beliefs about others in collaboration, providing an in-depth evaluation metric for task understanding. These questions are procedurally generated in four types, descriptive, predictive, explanatory, and counterfactual, to systematically test models' capabilities over the spatial, temporal, and causal domains of goal-oriented task understanding. To avoid spurious correlations in questions, we include both direct and indirect references to actions and objects. We further balance the answer distribution by the reasoning type of questions and carefully design benchmarking train/test splits to provide a systematic test of goal-oriented reasoning and indirect reference understanding; see Fig. 1 for an example and Sec. 3 for more details.
As shown in Tab. 1, EgoTaskQA complements existing video reasoning benchmarks on various
dimensions. With models exhibiting large performance gaps compared with humans, we devise
diagnostic experiments to reveal both the easy and challenging spots in our benchmark. We hope
such designs and analyses will foster new insights into goal-oriented activity understanding.
Contributions
In summary, our main contributions are three-fold:
We extend the LEMMA dataset with annotations of object status, human-object and multi-agent
relationships to facilitate egocentric activity understanding. We further generate causal dependency
structures between actions to provide ground truth for procedural task understanding.
Table 1: A comparison between EgoTaskQA and existing video question-answering benchmarks. We use “world”
for world model-related information, including action preconditions, post-effects, and dependencies. We use
FPV as short for egocentric and TPV for third-person-view videos. We use MC as short for multiple-choice
question-answering, and OP for open-answer question-answering.
Dataset | View | Real-world | World | Intents & Goals | Multi-agent | Descriptive | Predictive | Explanatory | Counterfactual | Answer type | # Questions
MarioQA [42] | TPV | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | OP | 188K
Pororo-QA [43] | TPV | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 9K
CLEVRER [44] | TPV | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | OP+MC | 282K
Env-QA [45] | FPV | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 85K
MovieQA [46] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 14K
Social-IQ [47] | TPV | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | MC | 7.5K
TVQA [48] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 152.5K
TVQA+ [49] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 29.4K
MSVD-QA [50] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 50.5K
MSRVTT-QA [50] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 243K
Video-QA [51] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 175K
ActivityNet-QA [52] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 58K
TGIF-QA [53] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | MC | 165.2K
How2QA [54] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | MC | 44K
HowToVQA69M [55] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 69M
AGQA [56] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 3.6M
NExT-QA [57] | TPV | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | OP+MC | 52K
STAR [58] | TPV | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | MC | 60K
EgoVQA [59] | FPV | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | OP+MC | 520
EgoTaskQA (Ours) | FPV | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | OP | 40K
We construct a balanced video question-answering benchmark, EgoTaskQA, to measure models’
capability in understanding action dependencies and effects, intents and goals, as well as beliefs
in multi-agent scenarios. We procedurally generate four challenging types of questions (descrip-
tive, predictive, explanatory, and counterfactual) with both direct and indirect references for our
benchmark and potential research on video-grounded compositional reasoning.
We devise challenging benchmarking splits over EgoTaskQA to provide a systematic evaluation
of goal-oriented reasoning and indirect reference understanding. We experiment with various
state-of-the-art video reasoning models, show their performance gap compared with humans,
and analyze their strengths and weaknesses to promote future research on
goal-oriented task understanding.
2 Related Work
Action as Inverse Planning
Action understanding has been viewed as an inverse planning problem over agents' mental states [14, 15]. Early studies formulate it as reasoning on first-order logic formulae that describe actions' preconditions and post-effects [16, 17]. This symbolic formalism was later paired with domain-specific languages and algorithms to become a mainstay of robotics planning [18, 19]. In computer vision, similar attempts have been made to link visual observations with world states and actions [20-22]. Various methods treat actions as transformations on images to solve action-state recognition [23-27] and video prediction [28-30]. With the emerging interest in language-grounded understanding, Zellers et al. [31] proposed PIGLeT to study the binding between images, world states, and action descriptions. Padmakumar et al. [32] further study the problem of language understanding and task execution by designing an intelligent embodied agent that can chat during task execution. However, these works are mostly limited to atomic actions and miss the important action dependencies in task execution. To tackle this problem, instructional videos [33-36] have been studied for their goal-oriented multi-step activities. In these videos, external knowledge [37, 38] can be used as guidance for advanced tasks like temporal dynamics learning [39] and visually grounded planning [40, 41]. Unfortunately, these videos highlight the instructions and include no task-level noise, making them much simpler than the partially observable, highly parallel, multi-agent environments that humans learn from, as presented in our benchmark. These complexities make goal-oriented action understanding a challenging task that remains to be solved.
Egocentric Vision
Egocentric vision offers a unique perspective for actively engaging with the world. Aside from traditional video understanding tasks like video summarization [60, 61], activity recognition [62-64], and future anticipation [65-69], egocentric videos provide fine-grained information for tasks like human-object interaction understanding [70-76] and gaze/attention prediction [77, 10]. As they naturally reflect partial observability, egocentric videos are also used for social understanding tasks such as joint attention modeling [78, 79], perspective taking [80, 81], and communicative modeling [82, 7]. However, although various egocentric datasets have been curated over the last decade [8, 60, 9], data and detailed annotations for human tasks are still largely missing. Large-scale daily lifelog datasets like EPIC-KITCHENS [12] and Ego4D [13] cover certain aspects of action dependencies, effects, and social scenarios in their recordings, but are unsuitable for detailed annotation due to their size. The other stream of datasets collects activities by providing coarse task instructions to either a single actor [83] or multiple collaborating agents [11]. These datasets annotate tasks and compositional actions to reveal agents' execution and collaboration processes in multi-step goal-directed tasks. Despite all the preferred characteristics of these goal-oriented activity videos, none of them addresses action dependencies and effects or multi-agent belief modeling.
Video Question-Answering Benchmarks
Visual question-answering can be designed to evaluate a wide spectrum of model capabilities, spanning from visual concept recognition and spatial relationship reasoning [84-87], abstract reasoning [88-93], to commonsense reasoning [94, 95]. In the temporal domain, synthetic environments are used to generate questions that involve simple action-effect reasoning [42, 43]. Crowdsourced videos [53, 52, 48, 55] are used for collecting questions on basic spatial-temporal reasoning capabilities like event counting [53], grounding [49], and episodic memory [13]. Recent advances in video question-answering aim for more profound reasoning capabilities. Gao et al. [45] leverage an indoor synthetic environment to generate questions on spatial relationships and simple action-effect reasoning from an egocentric perspective. Xiao et al. [57] design NExT-QA, which contains questions about knowledge of the past, present, and future in both the temporal and causal domains. Grunde-McLaughlin et al. [56] programmatically generate questions for compositional spatial-temporal reasoning and generalization. Wu et al. [58] focus on short atomic action clips for situated reasoning. Yi et al. [44] generate synthetic videos for studying counterfactual predictions about collisions. Zadeh et al. [47] collect questions for social intelligence evaluation. Nevertheless, none of these benchmarks addresses the aforementioned critical dimensions of goal-oriented activity understanding from a real-world egocentric perspective.
3 The EgoTaskQA Benchmark
The EgoTaskQA benchmark contains 40K balanced question-answer pairs selected from 368K programmatically generated questions over 2K egocentric videos. We target the crucial dimensions for understanding goal-oriented human tasks, including action effects and dependencies, intents and goals, and multi-agent belief modeling. We further evaluate models' capabilities to
describe, explain, anticipate, and make counterfactual predictions about goal-oriented events. A
detailed comparison between EgoTaskQA and existing benchmarks is shown in Tab. 1.
3.1 Data Collection
We select egocentric videos from the LEMMA dataset [11] as base video sources. Compared to
similar egocentric datasets, human activities in LEMMA are highly goal-oriented and multi-tasked.
These activities contain rich human-object interactions and action dependencies in both single-agent
and two-agent collaboration scenarios. We take advantage of these desired characteristics and
augment LEMMA with ground truths of object states, relationships, and agents’ beliefs about others.
More specifically, we augment LEMMA on the following aspects:
World States
We focus on world states consisting of object states, object-object relationships, and human-object relationships. First, we build the vocabulary of relationships and state attributes from activity knowledge defined in previous works [37, 96]. We manually filter irrelevant relationships and attributes by removing dataset-specific (e.g., under the car) and detailed numerical (e.g., cut in three) relationships. Next, we group similar relationships to obtain 48 relationships and 14 object attributes. This vocabulary covers spatial relationships (e.g., on top of), object affordances (e.g., openable), and time-varying attributes (e.g., shape). We build on top of action annotations from LEMMA and use Amazon Mechanical Turk (AMT) to annotate this information before and after the changing action for all time-varying objects. With these annotations, we reconstruct the transition chain for each interacted object and obtain its temporal status. We provide the complete list of relationships and object attributes in the supplementary.
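To make the transition-chain reconstruction concrete, below is a minimal Python sketch of how per-action before/after annotations could be ordered into an object's temporal status. The `StateAnnotation` fields, attribute names, and function names are hypothetical illustrations, not the benchmark's actual data schema or tooling.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StateAnnotation:
    """Hypothetical record: a worker labels one object's attributes and
    relationships immediately before and after the action that changes it."""
    action_id: str
    object_name: str
    start_frame: int
    before: Dict[str, str]  # e.g., {"contains": "nothing", "open": "closed"}
    after: Dict[str, str]   # e.g., {"contains": "cereal", "open": "open"}


def build_transition_chain(annotations: List[StateAnnotation], object_name: str):
    """Order all before/after annotations of one object by time to recover its
    temporal status, i.e., the chain of state transitions caused by actions."""
    records = sorted(
        (a for a in annotations if a.object_name == object_name),
        key=lambda a: a.start_frame,
    )
    chain = []
    for a in records:
        # Keep only the attributes that actually changed: these are the
        # observable effects of the action on this object.
        changed = {
            k: (a.before.get(k), v) for k, v in a.after.items()
            if v != a.before.get(k)
        }
        chain.append({"action": a.action_id, "effects": changed})
    return chain
```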
Multi-agent Relationships
To capture how two agents (actor and helper) collaborate on the same task, we annotate basic information about objects' visibility and the actor's awareness of the helper. For each object that the actor operates on, we annotate its visibility to the helper by providing synchronized videos from both agents' views to AMT workers.
Figure 2: We use two actions A1: “get cup from microwave” and A2: “put cup to the other person” from Person 1’s video in Fig. 1 as an example to visualize annotations in EgoTaskQA. We annotate states and relationships for objects changed by actions, as well as human-object and multi-agent relationships. After obtaining the “before” and “after” annotations, we examine which attributes of the objects were changed by the action and what its preconditions and post-effects are. We determine the causal dependency between actions by checking whether there exists an object for which the post-effect of one action fulfills the preconditions of another. In this case, the state change of “cup” determines that A1 and A2 are causally dependent (best viewed in color).
For the actor's awareness of others, we instruct AMT workers to first go through the egocentric videos of both agents to get familiar with the actions performed by the actor and the helper. Next, we ask AMT workers to replay the actor's video and annotate, for each action segment, whether the actor can see the helper, or, if the helper is not in sight, whether the actor is aware of the helper's action. As this annotation is usually subjective, we take the majority vote of three workers as ground truth.
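As a small illustration of this aggregation step, the sketch below takes the majority vote over three workers' labels for one action segment; the label strings are illustrative placeholders rather than the actual annotation vocabulary.

```python
from collections import Counter


def majority_vote(labels):
    """Aggregate three workers' subjective labels for one action segment
    into a single ground-truth label."""
    assert len(labels) == 3, "each segment is judged by exactly three workers"
    label, _ = Counter(labels).most_common(1)[0]
    return label


# Two of three workers judged the actor to be aware of the helper's action.
print(majority_vote(["aware", "aware", "unaware"]))  # -> aware
```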
Causal Trace
Based on the annotated transition chain of objects, we generate causal traces for each action with rules. By checking whether the post-effect of one action fulfills the preconditions of another, we classify the causal relationship between two actions as unrelated, related, or causally dependent; see Fig. 2 for an illustration and the supplementary for detailed explanations. Given a video, we run this dependency check for each pair of actions. Next, we generate a video-level dependency tree by recursively checking sequential dependency relationships and use it as the ground-truth dependency structure for subsequent explanatory and counterfactual question generation.
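The following Python sketch illustrates the spirit of this pairwise dependency check. The dictionary layout for actions and the exact criterion separating "related" from "causally dependent" are assumptions made for illustration; the precise rules are given in the paper's supplementary.

```python
def classify_dependency(a1, a2):
    """Relation from action a1 (earlier) to action a2 (later).

    Each action is a hypothetical dict such as
      {"id": "A1", "objects": ["cup"],
       "effects": {"cup": {"location": "hand"}},
       "preconditions": {"cup": {"location": "microwave"}}}.
    The split between "related" and "causally_dependent" below is an assumed
    simplification of the rules detailed in the supplementary.
    """
    shared = set(a1["objects"]) & set(a2["objects"])
    if not shared:
        return "unrelated"
    for obj in shared:
        post = a1["effects"].get(obj, {})
        pre = a2["preconditions"].get(obj, {})
        # a2 causally depends on a1 if some post-effect of a1 on this object
        # fulfills one of a2's preconditions.
        if any(post.get(attr) == val for attr, val in pre.items()):
            return "causally_dependent"
    # The same object(s) are touched, but no precondition is fulfilled.
    return "related"


def build_dependency_edges(actions):
    """Run the pairwise check over every ordered pair of actions in a clip."""
    edges = []
    for i, a1 in enumerate(actions):
        for a2 in actions[i + 1:]:
            rel = classify_dependency(a1, a2)
            if rel != "unrelated":
                edges.append((a1["id"], a2["id"], rel))
    return edges
```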
In total, we augment LEMMA with 30K annotated before states, after states, and person annotation
blocks as shown in Fig. 2. We then segment the videos in LEMMA into clips with lengths of
around 25 seconds for question generation. This design helps generate interesting clips with partially
observed environmental constraints (e.g., the cup is already washed when the person pours juice), and
visual hints for future actions (e.g., cutting watermelon into dice instead of pieces for making juice
rather than eating it directly). Meanwhile, we keep our videos reasonably long, with an average of 5
actions per clip to cover sufficient information for action dependency inference and future prediction.
We provide more details about data collection and annotation statistics in supplementary.
3.2 Question-Answer Generation
We use machine-generated questions to evaluate models’ task understanding capabilities. We focus
on the transition chain of each interacted object, especially which actions cause changes to objects and how these changes together contribute to a multi-step task; see examples in Fig. 1.
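To illustrate what "machine-generated" means here, the snippet below fills simple question templates from an annotated action and interacted object. The template strings and function are hypothetical simplifications; the benchmark's actual generation program and answer functionals are more elaborate and also use indirect references.

```python
# Hypothetical templates keyed by question type; the real benchmark also uses
# indirect references (e.g., "the action before getting something") and
# programmatic answer functionals over the annotations.
TEMPLATES = {
    "descriptive": "What is the state of the {obj} after the person {action}?",
    "predictive": "What will the person do after they {action}?",
    "explanatory": "Why did the person {action}?",
    "counterfactual": "What would happen to the {obj} if the person did not {action}?",
}


def generate_question(q_type: str, action: str, obj: str) -> str:
    """Fill one template with an action and an interacted object drawn from
    the annotated transition chain."""
    return TEMPLATES[q_type].format(action=action, obj=obj)


# Example over the cereal-making scenario in Fig. 1:
print(generate_question("counterfactual", "pour milk into the cup", "cup"))
```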
Question Design
We design questions that pinpoint three scopes: (1) action preconditions, post-effects, and their dependencies; (2) agents' intents and goals; and (3) agents' beliefs about others. Similar to [44], we categorize our questions over these three scopes into four types to systematically test models' capabilities over the spatial, temporal, and causal domains of task understanding: