In summary, our main contributions are as follows:
(1) We introduce the Memory Question Answering (MemQA) task for humans, which tests human visual spatial memory. We created the Visually Grounded Memory Assistant Dataset, which contains over 6k instances of humans performing the MemQA task. To the best of our knowledge, this is the largest dataset on visually-grounded memory assistance for humans.
(2) We perform an in-depth analysis of the conditions under which humans ask for assistance.
(3) We develop baseline models for the task of predicting whether participants will ask for assistance or navigate on their own, as well as their accuracy in answering the MemQA questions.
2 RELATED WORK
2.1 Research and models of human memory
Human memory is often classified based on the length of storage
(sensory, short, and long term) and the ability to communicate the
contents of memory. While contents of declarative memory can be
stated explicitly, procedural or perceptual-motor skills cannot be
so stated [3]. Declarative memory is often further classified into
episodic (life events) and semantic (language and symbol-based
knowledge). In addition to this classification scheme, computa-
tional theory has led to the development of process models for
the encoding, storage, and retrieval of memories. These models,
typically built on the basis of association networks, blur the lines
in the typology and capture important patterns in data [16, 17]. In
particular, the encoding and recall of memories is highly depen-
dent on spatio-temporal context and the memory networks built
with this structure seem to associate representations across the
memory typology described above. For example, people can often
report the context in which they learned to tie their shoe (proce-
dural and episodic memory) and shoe brands are likely faster to
recall when a person is tying their shoe than when they are zipping
up their jacket (indicating context-specific effects of procedural and
semantic memory).
The complexity of these association networks has, historically,
been difficult to study in detail due to methodological limitations. In
particular, studying memory at scale and in real-world, visually-rich
contexts is a significant challenge. By collecting data on visual mem-
ory tasks at scale on Amazon Mechanical Turk (AMT) in complex
3D environments, our dataset provides an important step toward
overcoming these limitations. AMT studies have also been used in
the past to study the memorability of images [15, 19]. However, these
studies focus on purely visual features underlying memorability
and do not account for context-driven or task-driven visual memory
encoding.
The most relevant cognitive-science research for our task focuses
on how people learn to navigate environments. People build and
store mental maps that allow for more efficient navigation on future
visits to that location. These maps are built from mixtures of sensory
cues that include landmarks, optical flow, as well as non-visual
cues [13, 14, 32]. What features are used and how they are encoded
is not fully understood [6, 10, 18]. Our data provide a rich source of
information to develop our understanding of what visual features
are encoded and stored in human spatial maps. Understanding
what features are used by humans may enable the development
of better navigation systems in mechanical autonomous agents.
Such agents are now being developed in a research program on
embodied question answering (EQA) [11, 30, 31], which was the
primary machine-learning inspiration for the present study.
2.2 Embodied Perception and Question
Answering
Recently, the computer-perception community has opened a new
field of embodied perception, where agents learn to perform tasks in
3D simulated environments in an end-to-end manner from raw pixel
data. These tasks include target-driven navigation [33], instruction-based
visual navigation [2], and embodied and interactive question
answering (EQA) [11, 30, 31]. In a typical EQA task setup, an agent
is spawned at a random location in a novel building and asked
a question about an object or room such as “What color is the
car?”. The agent has no prior knowledge or representation of the
building or objects, and it must navigate to find the object and
then answer the question correctly. Doing so requires learning a
robust navigation system and accurate visual inference to answer
the question. The task was designed to measure an agent’s ability
to perform visually grounded navigation and demonstrate
semantic understanding of the environment. We hypothesize that
observing humans in the EQA task will lend insight into human
spatial memory and semantic understanding of the environment.
To this end, we expanded the EQA dataset to include five types of
questions: location, existence, color, count, and comparison. We
then use the new EQA questions to create a new task for humans
called Memory Question Answering (MemQA). In the MemQA
task, participants are given a short video of a fly-through of an
environment and are then asked to answer multiple EQA questions
about that environment within a time constraint.
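For concreteness, a single MemQA episode can be thought of as a fly-through video paired with a timed set of questions drawn from the five types above. The sketch below illustrates one possible representation; the class and field names (MemQAEpisode, time_limit_s, etc.) and the example time limit are our own illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative sketch of one MemQA episode; names and the example time
# limit are assumptions for exposition, not the dataset's actual schema.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class QuestionType(Enum):
    LOCATION = "location"
    EXISTENCE = "existence"
    COLOR = "color"
    COUNT = "count"
    COMPARISON = "comparison"


@dataclass
class MemQAQuestion:
    text: str             # e.g. "What color is the car?"
    qtype: QuestionType
    answer: str           # ground-truth answer used for scoring


@dataclass
class MemQAEpisode:
    flythrough_video: str                      # path to the environment fly-through clip
    questions: List[MemQAQuestion] = field(default_factory=list)
    time_limit_s: float = 120.0                # hypothetical per-episode time constraint
    # While answering, a participant may navigate on their own or request
    # assistance; both choices (and their answers) are logged for analysis.
```

Grouping the video, the question set, and the time constraint into one episode record mirrors how the task is presented to participants and keeps each trial self-contained.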
Apart from enabling the study of how humans perform naviga-
tion tasks and encode spatio-temporal information, our task also
allows humans to ask for assistance when needed. Recent research
in embodied and autonomous agents has also explored the utility
of seeking assistance [22, 23]. Nguyen and Daumé III developed
a navigation task where agents could ask for natural language
assistance [22]. Their goal was to develop mobile agents that can
leverage help from humans to accomplish more complex tasks than
the agents could complete entirely on their own. They also explored
attaching a cost to each request for assistance, aiming to learn an
optimal policy for requesting assistance within a
limited budget of requests. They did not gather human assistance
dialogue but instead supplemented the assistance with the instruc-
tions from the Room2Room task [2]. Unlike this work, our focus
is on understanding when humans might seek assistance in tasks
that require visual memory encoding. The ability to understand
when a user has forgotten something about the local environment
and thus might need assistance would be crucial for the next
generation of contextual, personal assistants.
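The budgeted-assistance idea from this line of work can be made concrete with a minimal, threshold-based decision rule: request help only when confidence is low and requests remain in the budget. The sketch below is our own illustration under assumed names (confidence, request_budget, threshold); it is not the learned policy of [22] nor a component of our baseline models.

```python
# Minimal sketch of a budgeted ask-for-assistance rule; the threshold and
# budget values are illustrative assumptions, not a learned policy.
def should_ask_for_assistance(confidence: float,
                              requests_used: int,
                              request_budget: int,
                              threshold: float = 0.5) -> bool:
    """Request help only when uncertain and the request budget allows it."""
    if requests_used >= request_budget:
        return False  # budget exhausted: the agent must act on its own
    return confidence < threshold


# Example: an uncertain agent with one request remaining asks for help.
print(should_ask_for_assistance(confidence=0.3, requests_used=0,
                                request_budget=1))  # True
```

A learned policy would replace the fixed threshold with a value estimated from data, trading off the cost of each request against expected task performance.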
2.3 Automated interaction and AI Assistance
Research in human–robot interaction and simulations of human–AI
cooperative systems has helped identify and make progress toward
some of the major challenges in automated interaction systems. As