In summary, our main contributions are as follows:
(1) We introduce the Memory Question Answering (MemQA) task for humans, which tests human visual spatial memory. We created the Visually Grounded Memory Assistant Dataset, which contains over 6k instances of humans performing the MemQA task. To the best of our knowledge, this is the largest dataset on visually-grounded memory assistance for humans.
(2) We perform an in-depth analysis of the conditions under which humans ask for assistance.
(3) We develop baseline models for the task of predicting whether participants will ask for assistance or navigate on their own, as well as their accuracy in answering the MemQA questions.
2 RELATED WORK
2.1 Research and models of human memory
Human memory is often classified based on the length of storage
(sensory, short, and long term) and the ability to communicate the
contents of memory. While contents of declarative memory can be
stated explicitly, procedural or perceptual-motor skills cannot be
so stated [3]. Declarative memory is often further classified into
episodic (life events) and semantic (language and symbol-based
knowledge). In addition to this classification scheme, computa-
tional theory has led to the development of process models for
the encoding, storage, and retrieval of memories. These models,
typically built on the basis of association networks, blur the lines
in the typology and capture important patterns in data [16, 17]. In
particular, the encoding and recall of memories is highly depen-
dent on spatio-temporal context and the memory networks built
with this structure seem to associate representations across the
memory typology described above. For example, people can often
report the context in which they learned to tie their shoe (proce-
dural and episodic memory) and shoe brands are likely faster to
recall when a person is tying their shoe than when they are zipping
up their jacket (indicating context-specific effects of procedural and
semantic memory).
The complexity of these association networks has, historically,
been difficult to study in detail due to methodological limitations. In
particular, studying memory at scale and in real-world, visually-rich
contexts is a significant challenge. By collecting data on visual mem-
ory tasks at scale on Amazon Mechanical Turk (AMT) in complex
3D environments, our dataset provides an important step toward
overcoming these limitations. AMT studies have also been used in
the past to study the memorability of images [15, 19]. However, these
studies focus on purely visual features underlying memorability
and do not account for context-driven or task-driven visual memory
encoding.
The most relevant cognitive-science research for our task focuses
on how people learn to navigate environments. People build and
store mental maps that allow for more efficient navigation on future
visits to that location. These maps are built from mixtures of sensory
cues that include landmarks, optical flow, as well as non-visual
cues [13, 14, 32]. What features are used and how they are encoded
is not fully understood [6, 10, 18]. Our data provide a rich source of
information to develop our understanding of what visual features
are encoded and stored in human spatial maps. Understanding
what features are used by humans may enable the development
of better navigation systems in mechanical autonomous agents.
Such agents are now being developed in a research program on
embodied question answering (EQA) [11, 30, 31], which was the
primary machine-learning inspiration for the present study.
2.2 Embodied Perception and Question
Answering
Recently, the computer-perception community has opened a new
field of embodied perception, where agents learn to perform tasks in
3D simulated environments in an end-to-end manner from raw pixel
data. These tasks include target-driven navigation [33], instruction-based
visual navigation [2], and embodied and interactive question
answering (EQA) [11, 30, 31]. In a typical EQA task setup, an agent
is spawned at a random location in a novel building and asked
a question about an object or room such as “What color is the
car?”. The agent has no prior knowledge or representation of the
building or objects, and it must navigate to find the object and
then answer the question correctly. Doing so requires learning a
robust navigation system and accurate visual inference to answer
the question. The task was designed to measure an agent’s ability
to perform visually grounded navigation and demonstrate
semantic understanding of the environment. We hypothesize that
observing humans in the EQA task will lend insight into human
spatial memory and semantic understanding of the environment.
To this end, we expanded the EQA dataset to include five types of
questions: location, existence, color, count, and comparison. We
then use the new EQA questions to create a new task for humans
called Memory Question Answering (MemQA). In the MemQA
task, participants are given a short video of a fly-through of an
environment and are then asked to answer multiple EQA questions
about that environment within a time constraint.
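For concreteness, a single MemQA episode can be thought of as a fly-through video paired with a timed set of questions drawn from the five types above. The sketch below illustrates one possible representation; the class and field names (MemQAEpisode, time_limit_s, etc.) and the example time limit are our own illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative sketch of one MemQA episode; names and the example time
# limit are assumptions for exposition, not the dataset's actual schema.
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class QuestionType(Enum):
    LOCATION = "location"
    EXISTENCE = "existence"
    COLOR = "color"
    COUNT = "count"
    COMPARISON = "comparison"


@dataclass
class MemQAQuestion:
    text: str             # e.g. "What color is the car?"
    qtype: QuestionType
    answer: str           # ground-truth answer used for scoring


@dataclass
class MemQAEpisode:
    flythrough_video: str                      # path to the environment fly-through clip
    questions: List[MemQAQuestion] = field(default_factory=list)
    time_limit_s: float = 120.0                # hypothetical per-episode time constraint
    # While answering, a participant may navigate on their own or request
    # assistance; both choices (and their answers) are logged for analysis.
```

Grouping the video, the question set, and the time constraint into one episode record mirrors how the task is presented to participants and keeps each trial self-contained.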
Apart from enabling the study of how humans perform naviga-
tion tasks and encode spatio-temporal information, our task also
allows humans to ask for assistance when needed. Recent research
in embodied and autonomous agents has also explored the utility
of seeking assistance [22, 23]. Nguyen and Daumé III developed
a navigation task where agents could ask for natural language
assistance [22]. Their goal was to develop mobile agents that can
leverage help from humans to accomplish more complex tasks than
the agents could complete entirely on their own. They also explored
attaching a cost to each request for assistance, aiming to learn an
optimal policy for requesting assistance within a
limited budget of requests. They did not gather human assistance
dialogue but instead supplemented the assistance with the instruc-
tions from the Room2Room task [2]. Unlike this work, our focus
is on understanding when humans might seek assistance in tasks
that require visual memory encoding. The ability to understand
when a user has forgotten something about the local environment
and thus might need assistance would be crucial for the next
generation of contextual, personal assistants.
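The budgeted-assistance idea from this line of work can be made concrete with a minimal, threshold-based decision rule: request help only when confidence is low and requests remain in the budget. The sketch below is our own illustration under assumed names (confidence, request_budget, threshold); it is not the learned policy of [22] nor a component of our baseline models.

```python
# Minimal sketch of a budgeted ask-for-assistance rule; the threshold and
# budget values are illustrative assumptions, not a learned policy.
def should_ask_for_assistance(confidence: float,
                              requests_used: int,
                              request_budget: int,
                              threshold: float = 0.5) -> bool:
    """Request help only when uncertain and the request budget allows it."""
    if requests_used >= request_budget:
        return False  # budget exhausted: the agent must act on its own
    return confidence < threshold


# Example: an uncertain agent with one request remaining asks for help.
print(should_ask_for_assistance(confidence=0.3, requests_used=0,
                                request_budget=1))  # True
```

A learned policy would replace the fixed threshold with a value estimated from data, trading off the cost of each request against expected task performance.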
2.3 Automated interaction and AI Assistance
Research in human–robot interaction and simulations of human–AI
cooperative systems has helped identify and make progress toward
some of the major challenges in automated interaction systems. As