
Table 1: A comparison between EgoTaskQA and existing video question-answering benchmarks. We use "World"
for world-model-related information, including action preconditions, post-effects, and dependencies. We use
FPV as short for egocentric (first-person-view) videos and TPV for third-person-view videos, MC as short for
multiple-choice question answering, and OP for open-answer question answering.
Dataset | View | Real-world | World | Intents & Goals | Multi-agent | Descriptive | Predictive | Explanatory | Counterfactual | Answer type | # Questions
MarioQA [42] | TPV | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | OP | 188K
Pororo-QA [43] | TPV | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 9K
CLEVRER [44] | TPV | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | OP+MC | 282K
Env-QA [45] | FPV | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 85K
MovieQA [46] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 14K
Social-IQ [47] | TPV | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | MC | 7.5K
TVQA [48] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 152.5K
TVQA+ [49] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | MC | 29.4K
MSVD-QA [50] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 50.5K
MSRVTT-QA [50] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 243K
Video-QA [51] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 175K
ActivityNet-QA [52] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 58K
TGIF-QA [53] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | MC | 165.2K
How2QA [54] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | MC | 44K
HowToVQA69M [55] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 69M
AGQA [56] | TPV | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | OP | 3.6M
NExT-QA [57] | TPV | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | OP+MC | 52K
STAR [58] | TPV | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | MC | 60K
EgoVQA [59] | FPV | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | OP+MC | 520
EgoTaskQA (Ours) | FPV | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | OP | 40K
• We construct a balanced video question-answering benchmark, EgoTaskQA, to measure models' capability in understanding action dependencies and effects, intents and goals, as well as beliefs in multi-agent scenarios. We procedurally generate four challenging types of questions (descriptive, predictive, explanatory, and counterfactual) with both direct and indirect references for our benchmark and potential research on video-grounded compositional reasoning (a toy sketch of such reference-aware question generation follows this list).
• We devise challenging benchmarking splits over EgoTaskQA to provide a systematic evaluation of goal-oriented reasoning and indirect reference understanding. We experiment with various state-of-the-art video reasoning models, show their performance gap compared with humans, and analyze their strengths and weaknesses to promote future research on goal-oriented task understanding.
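As a toy illustration only, not the actual generation pipeline of this paper, the sketch below shows how a template-based generator could instantiate one descriptive question using either a direct object name or an indirect, action-derived reference; every annotation field, template, and name here is hypothetical.

```python
# Hypothetical annotation record for a single object: a direct name plus an
# indirect description derived from the action that last changed the object.
obj = {
    "name": "the cup",                        # direct reference
    "indirect": "the object the man washed",  # indirect reference
    "state_after": "clean",
}

# Hypothetical template for one of the four question types (descriptive).
TEMPLATES = {
    "descriptive": "What is the state of {ref} after the action?",
}

def instantiate(record, qtype="descriptive", indirect=False):
    """Fill a question template with a direct or indirect object reference."""
    ref = record["indirect"] if indirect else record["name"]
    return TEMPLATES[qtype].format(ref=ref), record["state_after"]

print(instantiate(obj, indirect=False))
# ('What is the state of the cup after the action?', 'clean')
print(instantiate(obj, indirect=True))
# ('What is the state of the object the man washed after the action?', 'clean')
```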
2 Related Work
Action as Inverse Planning
Action understanding has been seen as an inverse planning problem on agents' mental states [14, 15]. Early studies formulate it as reasoning over first-order logic formulae that describe actions' preconditions and post-effects [16, 17]. This symbolic formalism was later paired with domain-specific languages and algorithms to become a mainstay of robotics planning [18, 19]. In computer vision, similar attempts have been made to link visual observations with world states and actions [20–22]. Various methods treat actions as transformations on images to solve action-state recognition [23–27] and video prediction [28–30]. With the emerging interest in language-grounded understanding, Zellers et al. [31] proposed PIGLeT to study the binding between images, world states, and action descriptions. Padmakumar et al. [32] further study the problem of language understanding and task execution by designing an intelligent embodied agent that can chat during task execution. However, these works are mostly limited to atomic actions, missing the important action dependencies in task execution. To tackle this problem, instructional videos [33–36] are studied for their goal-oriented multi-step activities. In these videos, external knowledge [37, 38] can be used as guidance for advanced tasks like temporal dynamics learning [39] and visually grounded planning [40, 41]. Unfortunately, these videos highlight the instructions and contain no task-level noise, making them much simpler than the partially observable, highly parallel, multi-agent environments that humans learn from and that are presented in our benchmark. These complexities make goal-oriented action understanding a challenging task that remains to be solved.
Egocentric Vision
Egocentric vision offers a unique perspective for actively engaging with the world. Aside from traditional video understanding tasks like video summarization [60, 61], activity recognition [62–64], and future anticipation [65–69], egocentric videos provide fine-grained information for tasks like human-object interaction understanding [70–76] and gaze/attention prediction [77, 10]. Because they naturally reflect partial observability, egocentric videos are also used for social understanding tasks such as joint attention modeling [78, 79], perspective taking [80, 81], and communicative modeling [82, 7]. However, with various egocentric datasets curated over the