Finally, we provide an empirical demonstration of evaluating a memory augmentation strategy for GPT-2 [13] using human behavioral data and identify its limitations and strengths to inform future research.
2 Considerations for biologically-inspired memory augmentation
2.1 Memory-augmentation from a cognitive lens
We argue that specifying appropriate linking hypotheses across domains will not only facilitate
novel biologically-inspired approaches, but will also provide a way to empirically evaluate different
hypotheses. In cognitive neuroscience, a linking hypothesis is a proposition for a formal mapping
between a neurobiological state and a psychological state [14, 15], such as the firing of a single
neuron leading to a visual percept. A central aim of biologically-inspired AI is to formulate linking
hypotheses between a component in an AI system and a well-defined aspect of cognition. Strong
linking hypotheses should lead to a formal and quantifiable mapping between a representation in an
AI system and some neurobiological/psychological data, as has been demonstrated in some cases for
computer vision [16–18] and natural language [19–22]. These linking hypotheses must be specified at
the correct level of analysis [15, 23], e.g., a modification of the equations to perform similarity search
on a database in a retrieval-augmented system should map to research on the biological mechanisms
of memory retrieval. In our view, proper accounts would be best derived through decomposing
the problem into computational subroutines appropriate for comparison across domains. Many AI
systems already assume a linking hypothesis between ANNs and human cognition without explicitly
stating it as a hypothesis or evaluating it. Here, we briefly explore some of these hypotheses in
memory-augmented Transformers and propose possible mappings to findings in the human literature.
We divide memory-augmented Transformers into two general types. A static memory stores infor-
mation in a corpus of fixed size and content (e.g., a Wikipedia knowledge base), which the model learns
to retrieve from during training [24, 25, 9]. The contents of a static memory are not modified,
although they can be encoded in different formats such as raw text or embeddings [25, 26]. Dynamic
memory mechanisms store new information from inputs as they are processed by the model. Training
the network involves learning both storage and retrieval policies. For example, new information may
be remembered or forgotten on the basis of input properties or model activations. Furthermore, inputs
may be transformed in some manner (e.g., through compression) before being stored in the external
memory [7]. Both static and dynamic memory-augmented Transformers have shown significant
improvements over non-augmented models when making predictions over long texts [7, 8, 27, 28].
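To make this division concrete, the following minimal Python sketch contrasts the two memory types; the class names, capacity limit, storage probability, and FIFO forgetting rule are illustrative assumptions rather than any specific published architecture.

import numpy as np

class StaticMemory:
    """Fixed corpus of precomputed embeddings; contents are never modified."""
    def __init__(self, corpus_embeddings):
        self.keys = corpus_embeddings  # (N, d) array, fixed at construction

    def retrieve(self, query, k=5):
        # similarity search over the frozen corpus (cosine similarity)
        sims = self.keys @ query / (
            np.linalg.norm(self.keys, axis=1) * np.linalg.norm(query) + 1e-8)
        return np.argsort(-sims)[:k]

class DynamicMemory:
    """Stores (possibly transformed) representations of new inputs as they arrive."""
    def __init__(self, dim, capacity=1024):
        self.keys = np.empty((0, dim))
        self.capacity = capacity

    def write(self, hidden_state, keep_prob=0.5):
        # a learned storage policy would set keep_prob; here it is a placeholder
        if np.random.rand() < keep_prob:
            stored = hidden_state / np.linalg.norm(hidden_state)  # toy "compression"
            self.keys = np.vstack([self.keys, stored])[-self.capacity:]  # FIFO forgetting
        # retrieval over self.keys would then proceed as in StaticMemory.retrieve

In this toy contrast, the static memory differs only in that its keys are frozen after construction, whereas the dynamic memory must also learn when to write and when to discard.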
These augmentation strategies do not map cleanly to the types of memory commonly delineated in
cognitive theories of human memory [29]. That said, classical memory taxonomies are often the
source of AI inspiration, with papers citing work on short- vs. long-term memory or episodic memory
[30, 31, 10, 11]. In our view, a static memory could be like human semantic memory if it uses a
knowledge base, or it could be a fairly direct analog of episodic memory if it stores previously seen
examples [24]. Instead, our proposed division focuses on the subprocesses thought to be involved
in human long-term memory: encoding, consolidation, and retrieval [32]. Different strategies for
memory augmentation will therefore pursue different implementations of each subprocess, and can
draw direct inspiration from studies of that specific subprocess. Current work on memory-augmented
Transformers has already proposed separate mechanisms for each subprocess, although there is often
no direct link to human data. For example, there is a growing literature on retrieval augmentation
[8, 9, 24, 25, 33–35] that proposes similarity search as the retrieval mechanism. Other work has
proposed specific encoding policies which determine what to store and forget, either by exploiting
the attention weights [7] or learning which memories to forget [36].
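As a sketch of how such an encoding policy might look in code, the function below gates writes to an external memory by the amount of attention each token received, loosely in the spirit of the attention-based selection in [7]; the function name and threshold value are hypothetical choices for illustration, not the published method.

import numpy as np

def attention_gated_write(memory, hidden_states, attn_weights, threshold=0.05):
    """Store only tokens whose mean received attention exceeds a threshold.

    memory        : list collecting stored token representations
    hidden_states : (seq_len, d) token representations from one segment
    attn_weights  : (seq_len, seq_len) attention matrix (rows are queries)
    threshold     : illustrative cut-off on mean attention received
    """
    received = attn_weights.mean(axis=0)  # mean attention each token receives
    for token_vec, score in zip(hidden_states, received):
        if score > threshold:
            memory.append(token_vec)
    return memory

An analogous retrieval step would then perform similarity search over the stored vectors, as in the retrieval-augmentation work cited above.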
2.2 Incorporating insights from human memory via policy modifications
Here we discuss some findings from the human memory literature to demonstrate how they may be
used to inform policy modifications in memory-augmented Transformers. Lexical properties (e.g.,
written frequency, word length, animacy, etc.) serve as strong predictors of subsequent memory for
individual words and lists [37–39]. Furthermore, humans have been shown to have the remarkable
ability to recognize whether they have previously seen an image drawn from a set of up to 10,000 images
after only a single exposure [40]. The properties that determine the memorability of an image are thought to be
multifaceted, including high-level properties such as emotional valence [41] and overall semantic