Learning a Visually Grounded Memory Assistant

Meera Hahn1,2, Kevin Carlberg2, Ruta Desai2, James Hillis2
1Georgia Institute of Technology 2Facebook Reality Labs
meerahahn@gatech.edu, {carlberg, rutadesai, jmchillis}@fb.com
ABSTRACT
We introduce a novel interface for large-scale collection of human memory and assistance. Using the 3D Matterport simulator we create realistic indoor environments in which we have people perform specific embodied memory tasks that mimic daily household activities. This interface was then deployed on Amazon Mechanical Turk, allowing us to test and record human memory, navigation, and needs for assistance at a scale that was previously impossible. Using the interface we collect the ‘Visually Grounded Memory Assistant Dataset’, which is aimed at developing our understanding of (1) the information people encode during navigation of 3D environments and (2) the conditions under which people ask for memory assistance. Additionally, we experiment with predicting when people will ask for assistance using models trained on hand-selected visual and semantic features. This provides an opportunity to build stronger ties between the machine-learning and cognitive-science communities through learned models of human perception, memory, and cognition.
KEYWORDS
Assistance, Navigation, Visual Memory, Visual Question Answering
1 INTRODUCTION
Automated interaction with humans in everyday activity remains a significant challenge for artificial intelligence (AI). Current interactive systems take many forms and operate on different time scales; examples include shopping recommendations, conversational AI, autonomous vehicles, and social robots. These models typically only work within a narrow range of conditions. For example, while autonomous vehicles can interact effectively with other drivers in highway conditions, they are not yet ready for city streets, where environments are less predictable. An even more significant challenge is presented by the prospect of all-day wearable augmented reality (AR) glasses: the ideal is an automated system that can offer assistance in any context. Unlike existing mobile devices, AR glasses could have access to information from the internet and detailed information about the local physical context, including user actions from a first-person viewpoint. Simultaneous access to both sources of information opens new opportunities and challenges for the development of collaborative human–AI systems. The AI required for such a system would likely require a “theory of mind” similar to that of humans, whereby people infer the goals and cognitive states of others and take strategic, context-dependent actions that may be cooperative or adversarial in nature.
Proc. of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), A. Ricci, W. Yeoh, N. Agmon, B. An (eds.), May 29 – June 2, 2023, London, United Kingdom. © 2023 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.
Here we present a dataset aimed at providing insight into the conditions under which people request assistance to recall facts about the local environment. We focus on memory because enhancing human memory is one of the primary uses of computing technology. Our data provide insight into (1) the kinds of features people encode during navigation, (2) the difficulty of different types of questions, and (3) the conditions under which people will ask for assistance.
To gain insight into the kind of local assistance people would like to receive, we gave participants exposure to a 3D environment (a "fly-through" of a Matterport3D environment [7]). They were then asked questions about the environment, as illustrated in Figure 1. For each question, participants could either (1) answer the question immediately, (2) navigate back to the location where the answer could be discerned, or (3) pay for an assistant to bring them back to that location. A minimal sketch of this per-question choice is given below.
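As a concrete illustration, the per-question choice could be encoded as a small enumeration. This is a hedged sketch only; the names and the cost value are hypothetical and do not reflect the study interface's actual data schema or payment scheme.

```python
# Illustrative encoding of the three per-question response options described
# above. Names and the cost value are placeholders, not the actual interface.
from enum import Enum

class ResponseOption(Enum):
    ANSWER_IMMEDIATELY = "answer"      # answer from memory, no extra cost
    NAVIGATE_BACK = "navigate"         # walk back to where the answer is visible
    PAY_FOR_ASSISTANCE = "assistance"  # pay an assistant to be brought back

ASSISTANCE_COST = 1  # placeholder cost unit charged per assistance request
```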
Figure 1: Conceptual depiction of the study with the 3D environment, the option for requesting assistance, and the human–assistant interaction.
In this paper, we present summary statistics and results of models (which employ hand-selected features) that predict whether participants will ask for assistance. The features used in these models were selected based primarily on intuition. Ultimately, we aim to formulate models of human perceptual and memory systems that are based on established computational models of human perception and cognition (e.g., [8, 20]). We hypothesize that this knowledge will help address a core challenge of developing accurate priors for when people will ask for assistance in memory and navigation tasks. Such priors provide a foundation for more generalizable models and inferring model parameters from less data (i.e., low-shot learning).
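To make the modeling setup concrete, the following is a minimal sketch of such a baseline: a logistic-regression classifier over a small vector of hand-selected features per question, predicting a binary "asked for assistance" label. The feature matrix, its dimensionality, and the labels below are illustrative placeholders, not the dataset's actual schema or results.

```python
# Sketch of a hand-selected-feature baseline for predicting assistance requests.
# Features and labels are random placeholders; substitute the dataset's
# per-question feature vectors and assistance labels in practice.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X = np.random.rand(6000, 5)              # placeholder: one row per (participant, question)
y = np.random.randint(0, 2, size=6000)   # placeholder: 1 = asked for assistance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```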
In summary, our main contributions are as follows:
(1) We introduce the Memory Question Answering (MemQA) task for humans, which tests human visual-spatial memory. We created the Visually Grounded Memory Assistant Dataset, which contains over 6k instances of humans performing the MemQA task. To the best of our knowledge, this is the largest dataset on visually grounded memory assistance for humans.
(2) We perform an in-depth analysis of the conditions under which humans ask for assistance.
(3) We develop baseline models for the task of predicting whether participants will ask for assistance or navigate on their own, as well as their accuracy in answering the MemQA questions.
2 RELATED WORK
2.1 Research and models of human memory
Human memory is often classified based on the length of storage (sensory, short-term, and long-term) and the ability to communicate the contents of memory. While the contents of declarative memory can be stated explicitly, procedural or perceptual-motor skills cannot be so stated [3]. Declarative memory is often further classified into episodic (life events) and semantic (language- and symbol-based knowledge). In addition to this classification scheme, computational theory has led to the development of process models for the encoding, storage, and retrieval of memories. These models, typically built on the basis of association networks, blur the lines in the typology and capture important patterns in data [16, 17]. In particular, the encoding and recall of memories is highly dependent on spatio-temporal context, and the memory networks built with this structure seem to associate representations across the memory typology described above. For example, people can often report the context in which they learned to tie their shoe (procedural and episodic memory), and shoe brands are likely faster to recall when a person is tying their shoe than when they are zipping up their jacket (indicating context-specific effects of procedural and semantic memory).
The complexity of these association networks has, historically, been difficult to study in detail due to methodological limitations. In particular, studying memory at scale and in real-world, visually rich contexts is a significant challenge. By collecting data on visual memory tasks at scale on Amazon Mechanical Turk (AMT) in complex 3D environments, our dataset provides an important step toward overcoming these limitations. AMT studies have also been used in the past to study the memorability of images [15, 19]. However, these studies focus on purely visual features underlying memorability and do not account for context-driven or task-driven visual memory encoding.
The most relevant cognitive-science research for our task focuses on how people learn to navigate environments. People build and store mental maps that allow for more efficient navigation on future visits to a location. These maps are built from mixtures of sensory cues that include landmarks, optical flow, and non-visual cues [13, 14, 32]. What features are used and how they are encoded is not fully understood [6, 10, 18]. Our data provide a rich source of information to develop our understanding of what visual features are encoded and stored in human spatial maps. Understanding what features are used by humans may enable the development of better navigation systems in mechanical autonomous agents. Such agents are now being developed in a research program on embodied question answering (EQA) [11, 30, 31], which was the primary machine-learning inspiration for the present study.
2.2 Embodied Perception and Question Answering
Recently, the computer-perception community has opened a new field of embodied perception, in which agents learn to perform tasks in 3D simulated environments in an end-to-end manner from raw pixel data. These tasks include target-driven navigation [33], instruction-based visual navigation [2], and embodied and interactive question answering (EQA) [11, 30, 31]. In a typical EQA task setup, an agent is spawned at a random location in a novel building and asked a question about an object or room, such as "What color is the car?". The agent has no prior knowledge or representation of the building or objects, and it must navigate to find the object and then answer the question correctly. The task involves learning a robust navigation system and accurate visual inference to answer the question, and it was designed as a good measure of an agent's ability to perform visually grounded navigation and semantic understanding of the environment. We hypothesize that observing humans in the EQA task will lend insight into human spatial memory and semantic understanding of the environment. To this end, we expanded the EQA dataset to include five types of questions: location, existence, color, count, and comparison. We then use the new EQA questions to create a new task for humans called Memory Question Answering (MemQA). In the MemQA task, participants are given a short fly-through video of an environment and are then asked to solve multiple EQA questions about that environment within a time constraint.
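For illustration, a single MemQA trial as described above could be represented roughly as follows. This is a sketch under stated assumptions: the field names and types are hypothetical and are not the dataset's actual schema.

```python
# Hypothetical representation of one MemQA trial: a timed fly-through exposure
# followed by several EQA-style questions of five types. Field names are
# illustrative, not the released dataset's schema.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class QuestionType(Enum):
    LOCATION = "location"
    EXISTENCE = "existence"
    COLOR = "color"
    COUNT = "count"
    COMPARISON = "comparison"

@dataclass
class MemQAQuestion:
    text: str                       # e.g. "What color is the car?"
    qtype: QuestionType
    response: Optional[str] = None  # participant's answer, if given
    asked_for_assistance: bool = False

@dataclass
class MemQATrial:
    environment_id: str            # Matterport3D environment shown in the fly-through
    flythrough_seconds: float      # length of the exposure video
    time_limit_seconds: float      # time constraint for answering
    questions: List[MemQAQuestion] = field(default_factory=list)
```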
Apart from enabling the study of how humans perform navigation tasks and encode spatio-temporal information, our task also allows humans to ask for assistance when needed. Recent research on embodied and autonomous agents has also explored the utility of seeking assistance [22, 23]. Nguyen and Daumé III developed a navigation task in which agents could ask for natural-language assistance [22]. Their goal was to develop mobile agents that can leverage help from humans, potentially to accomplish more complex tasks than the agents could on their own. They also explored attaching a cost to each request for assistance, with the goal of learning an optimal policy for requesting assistance under a limited budget of requests. They did not gather human assistance dialogue but instead supplemented the assistance with instructions from the Room2Room task [2]. Unlike this work, our focus is on understanding when humans might seek assistance in tasks that require visual memory encoding. The ability to understand when a user has forgotten something about the local environment, and thereby might need assistance, would be crucial for the next generation of contextual personal assistants.
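The budget-constrained request idea discussed above can be sketched as a trivial thresholded policy: request help only when confidence in answering is low and the request budget is not exhausted. The threshold and the confidence estimate below are placeholders, not the policy learned in [22].

```python
# Hedged sketch of a budget-limited assistance-request rule. The confidence
# estimate and threshold are illustrative; a learned policy would replace them.
def should_request_assistance(confidence: float,
                              requests_used: int,
                              request_budget: int,
                              threshold: float = 0.4) -> bool:
    """Return True if one assistance request should be spent now."""
    if requests_used >= request_budget:
        return False
    return confidence < threshold
```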
2.3 Automated interaction and AI Assistance
Research in human–robot interaction and simulations of human–AI
cooperative systems has helped identify and make progress toward
some of the major challenges in automated interaction systems. As