Memory in humans and deep language models:
Linking hypotheses for model augmentation
Omri Raccah
Intel Labs
New York University
or409@nyu.edu
Phoebe Chen
New York University
hc2896@nyu.edu
Ted L. Willke
Intel Labs
ted.willke@intel.com
David Poeppel
New York University
dp101@nyu.edu
Vy A. Vo
Intel Labs
vy.vo@intel.com
Abstract
The computational complexity of the self-attention mechanism in Transformer
models significantly limits their ability to generalize over long temporal durations.
Memory-augmentation, or the explicit storing of past information in external memory for subsequent predictions, has become a constructive avenue for mitigating this limitation. We argue that memory-augmented Transformers can benefit substantially from considering insights from the literature on human memory. We detail an approach for integrating evidence from the human memory system through the specification of cross-domain linking hypotheses. We then provide an empirical demonstration to evaluate the use of surprisal as a linking hypothesis, and further identify the limitations of this approach to inform future research.
1 Introduction
Transformer model architectures [1] have become an indispensable tool in several domains, such as natural language processing [2], image processing [3], and reinforcement learning [4, 5]. A
widely acknowledged scaling limitation of the self-attention mechanism is that its computational
complexity scales quadratically with the size of the attention window. This limits the model’s ability
to capture long-range dependencies in data such as books, scientific articles, or code. Several efficient
Transformers have been proposed to address this principal limitation [6]. A subset of these focuses on augmenting the network with an external memory, which we henceforth refer to as memory-augmented Transformers [7–10]. Notably, knowledge in artificial neural networks (ANNs) is thought to be implicitly stored in the parameters of a pre-trained model, requiring ever-larger networks to store more facts [11, 12]. In memory-augmented Transformers, however, information is more explicitly
stored in an external memory and retrieved when making predictions. This may increase the capacity
of the network to represent knowledge about the world, in addition to helping it capture information
over long temporal durations. Unlike temporal convolutions, which require a pre-specified width,
external memories can be stored for arbitrary durations and retrieved when relevant.
This property is also true of human memory, which demonstrates the remarkable ability to generalize
over an immense amount of information in written documents and over events in one’s life. The rich
literature on memory in psychology and neuroscience presents ample opportunities for augmenting
ANNs with biologically-inspired memory. Here, we aim to lay some groundwork for understanding
memory across fields, and describe some practical considerations for effectively integrating findings
from cognitive neuroscience into Transformers, with a particular focus on language models (LMs).
However, note that several of these considerations can be applied to other models and domains in AI.
Workshop on Memory in Artificial and Real Intelligence (MemARI) at NeurIPS 2022.
arXiv:2210.01869v3 [cs.CL] 28 Nov 2022
Finally, we provide an empirical demonstration of evaluating a memory augmentation strategy for
GPT-2 [13] using human behavioral data and identify its limitations and strengths to inform future research.
2 Considerations for biologically-inspired memory augmentation
2.1 Memory-augmentation from a cognitive lens
We argue that specifying appropriate linking hypotheses across domains will not only facilitate
novel biologically-inspired approaches, but will also provide a way to empirically evaluate different
hypotheses. In cognitive neuroscience, a linking hypothesis is a proposition for a formal mapping
between a neurobiological state and a psychological state [14, 15], such as the firing of a single
neuron leading to a visual percept. A central aim of biologically-inspired AI is to formulate linking
hypotheses between a component in an AI system and a well-defined aspect of cognition. Strong
linking hypotheses should lead to a formal and quantifiable mapping between a representation in an
AI system and some neurobiological/psychological data, as has been demonstrated in some cases for
computer vision [16–18] and natural language [19–22]. These linking hypotheses must be specified at the correct level of analysis [15, 23]; e.g., a modification of the equations to perform similarity search
on a database in a retrieval-augmented system should map to research on the biological mechanisms
of memory retrieval. In our view, proper accounts would be best derived through decomposing
the problem into computational subroutines appropriate for comparison across domains. Many AI
systems already assume a linking hypothesis between ANNs and human cognition without explicitly
stating them as hypotheses or evaluating them. Here, we briefly explore some of these hypotheses in
memory-augmented Transformers and propose possible mappings to findings in the human literature.
We divide memory-augmented Transformers into two general types. A static memory stores information in a corpus of fixed size and content (e.g., a Wikipedia knowledge base), which it learns
to retrieve from during training [24, 25, 9]. The contents of a static memory are not modified, although they can be encoded in different formats, such as raw text or embeddings [25, 26]. Dynamic memory mechanisms, by contrast, store new information from the inputs as the model processes them. Training
the network involves learning both storage and retrieval policies. For example, new information may
be remembered or forgotten on the basis of input properties or model activations. Furthermore, inputs
may be transformed in some manner (e.g., through compression) before being stored in the external
memory [7]. Both static and dynamic memory-augmented Transformers have shown significant
improvements over non-augmented models when making predictions over long texts [7, 8, 27, 28].
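For concreteness, the storage-and-retrieval loop of the dynamic case can be sketched in a few lines. The class name, the FIFO forgetting rule, and the inner-product retrieval below are illustrative choices of our own, not details taken from any of the cited systems:

```python
import numpy as np

class DynamicMemory:
    """Toy dynamic external memory: stores key/value embeddings as inputs
    arrive, subject to a capacity limit, and retrieves entries by
    dot-product similarity to a query."""

    def __init__(self, capacity, dim):
        self.capacity = capacity
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def store(self, key, value):
        # Encoding policy: here everything is stored; a learned policy
        # could instead gate on input properties or model activations.
        self.keys = np.vstack([self.keys, key])
        self.values = np.vstack([self.values, value])
        if len(self.keys) > self.capacity:  # simple FIFO forgetting
            self.keys = self.keys[-self.capacity:]
            self.values = self.values[-self.capacity:]

    def retrieve(self, query, k=1):
        # Retrieval policy: top-k nearest neighbors by inner product,
        # mirroring the similarity search used in retrieval augmentation.
        scores = self.keys @ query
        top = np.argsort(scores)[::-1][:k]
        return self.values[top]
```

In a real memory-augmented Transformer, the keys and values would be hidden-state vectors, and the store/retrieve decisions would be learned rather than hard-coded.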
These augmentation strategies do not map cleanly to the types of memory commonly delineated in
cognitive theories of human memory [29]. That said, classical memory taxonomies are often the source of AI inspiration, with papers citing work on short- vs. long-term memory or episodic memory [30, 31, 10, 11]. In our view, a static memory could be like human semantic memory if it uses a knowledge base, or it could be a fairly direct analog of episodic memory if it stores previously seen examples [24]. Instead, our proposed division focuses on the subprocesses thought to be involved in human long-term memory: encoding, consolidation, and retrieval [32]. Different strategies for
memory augmentation will therefore pursue different implementations of each subprocess, and can
draw direct inspiration from studies of that specific subprocess. Current work on memory-augmented
Transformers has already proposed separate mechanisms for each subprocess, although there is often
no direct link to human data. For example, there is a growing literature on retrieval augmentation
[8, 9, 24, 25, 33–35] that proposes similarity search as the retrieval mechanism. Other work has
proposed specific encoding policies which determine what to store and forget, either by exploiting
the attention weights [7] or learning which memories to forget [36].
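A minimal sketch of such an attention-based encoding policy is given below; the aggregation rule (summing attention received per token) and the keep fraction are illustrative assumptions, not the mechanism of any particular cited model:

```python
import numpy as np

def select_tokens_to_store(token_ids, attn_weights, keep_frac=0.5):
    """Toy encoding policy: retain the past tokens that received the most
    total attention. `attn_weights` is a (queries x past_tokens) matrix."""
    # Total attention each past token received, summed over queries.
    received = np.asarray(attn_weights).sum(axis=0)
    k = max(1, int(len(token_ids) * keep_frac))
    # Take the top-k most-attended tokens, then restore original order.
    keep = np.sort(np.argsort(received)[::-1][:k])
    return [token_ids[i] for i in keep]
```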
2.2 Incorporating insights from human memory via policy modifications
Here we discuss some findings from the human memory literature to demonstrate how they may be
used to inform policy modifications in memory-augmented Transformers. Lexical properties (e.g.,
written-frequency, word length, animacy, etc.) serve as strong predictors of subsequent memory for
individual words and lists [37–39]. Furthermore, humans can recognize, after only a single exposure, whether they have previously seen a given image among a set of up to 10,000 images [40]. The properties that determine the memorability of an image are thought to be multifaceted, including high-level properties such as emotional valence [41] and overall semantic meaning [42, 43]. If some property is directly computable from the inputs, it can be efficiently
used as a biologically-plausible encoding policy in memory-augmented models. Recent work in
cognitive neuroscience has also been focused on uncovering the process by which humans segment
continuous experience into composite events in memory, known as event segmentation [44–47].
This evidence can also inform encoding policies for model augmentation, as studies have shown
preferential encoding at event boundaries. Furthermore, this area of research can be leveraged to
inform storage policies, which delineate how sequential information with ordered constituents is
structured or formatted in memory. Lastly, retrieval policies, or the manner by which information is
read from an existing memory store, can take practical influence from human memory. For example,
items that share a temporal or semantic context during encoding are retrieved sequentially with
relation to one another [48, 49]. These examples provide theoretically and empirically motivated
hypotheses for memory-augmentation. Next, we demonstrate the evaluation of a specific linking
hypothesis.
3 Evaluating a candidate linking hypothesis for memory augmentation
Surprisal
The loss function of an LM estimates the negative log likelihood of an upcoming word
given its context. In information theoretic terms, this is known as surprisal. Some have proposed that
next-word prediction is a fundamental computational process that occurs during human language
processing [50–52], and have shown evidence that LM-estimated surprise predicts behavioral [53–55] and neural data [51]. Surprise (or unsigned prediction error) is also theorized to play a critical role in memory and learning, and experimental evidence supports this notion [56–58]. Word surprisal in particular may predict human memory during natural language comprehension [59, 60]. Since
surprise is a readily available quantity in LMs, we test its feasibility as a linking hypothesis by
examining human behavioral data in a memory experiment. If model-based surprisal can predict
human memory, it could be a practical and effective memory encoding policy for augmented models.
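As a sketch, per-token surprisal falls directly out of a language model's output logits; since log probabilities add, word surprisal is obtained by summing over a word's sub-tokens. The function below is a generic illustration (the function name is ours), written against raw logit arrays rather than any specific model API:

```python
import numpy as np

def token_surprisal(logits, target_ids):
    """Per-token surprisal (negative log probability, in nats) from LM
    output logits of shape (tokens, vocab). This is exactly the per-token
    term of the standard LM training loss."""
    logits = np.asarray(logits, dtype=float)
    # Log-softmax with the usual max subtraction for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids]
```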
Dataset of human recall behavior
We used a public dataset collected by Michelmann et al. from
two groups of participants [61]. The first group ("story-exposure"; N = 50) listened to a naturally spoken story containing 965 words. Participants then completed a cloze task [62] similar to an
autoregressive LM objective, in which they were given 10 words from the story and asked to predict
the final word. This task was administered, in order, for every word in the story starting with the third word, which limits the available context for words at the beginning of the story (Appendix A). The second
group ("no exposure"; N = 50) completed the same cloze task but had not been exposed to the story beforehand. The memory effect is the difference in performance across the two groups. For a
full account of the methods, see Appendix A.
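The group comparison can be sketched with a simple one-tailed permutation test on per-word (or per-participant) cloze scores. This is a generic stand-in for illustration; the paper's exact statistical procedure may differ:

```python
import numpy as np

def one_tailed_permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Tests whether group_a's mean exceeds group_b's by repeatedly
    shuffling the group labels and counting how often a shuffled
    difference matches or exceeds the observed one."""
    rng = np.random.default_rng(seed)
    observed = group_a.mean() - group_b.mean()
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if perm[:n_a].mean() - perm[n_a:].mean() >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```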
Figure 1: Behavioral results. (A) Cloze performance increases as a function of story exposure across individual words. Black lines indicate the mean. (B) Histogram of memory improvement across words (signed difference between the story-exposure and no-exposure groups).
For each word tested in the story, cosine similarity was computed between the GloVe embeddings [63, 61] of the responded word and the correct answer, and averaged across participants. In contrast to a binary scoring approach (correct vs. incorrect), this allows partial credit to be assigned for responses that are semantically similar to the correct answer (Appendix A). Replicating the findings in Michelmann et al. [61], we found that the story-exposure group significantly outperformed the no-exposure group in guessing the correct words (p < 0.001; one-tailed test; Figure 1).
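This partial-credit scoring reduces to a cosine similarity between embedding vectors. A minimal sketch (the function name is ours; in practice the vectors would come from a pretrained GloVe file):

```python
import numpy as np

def soft_cloze_score(response_vec, answer_vec):
    """Partial-credit cloze score: cosine similarity between the word
    embeddings of the response and the correct answer. An exact match
    scores 1; semantically related responses score between 0 and 1
    rather than being marked simply wrong."""
    r = np.asarray(response_vec, dtype=float)
    a = np.asarray(answer_vec, dtype=float)
    return float(r @ a / (np.linalg.norm(r) * np.linalg.norm(a)))
```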
3.1 Model-based word surprisal is related to human memory for spoken narratives
We next tested the effect of word surprisal on cloze performance. We used GPT-2 to estimate surprisal
for each of the 1033 story tokens and combined sub-tokens for each word. We found that word
surprisal shows a robust inverse correlation in both the no-exposure (R² = 0.61; p < 0.001) and