Characterizing Verbatim Short-Term Memory in Neural Language Models
Kristijan Armeni
Johns Hopkins University
karmeni1@jhu.edu
Christopher Honey
Johns Hopkins University
chris.honey@jhu.edu
Tal Linzen
New York University
linzen@nyu.edu
Abstract

When a language model is trained to predict natural language sequences, its prediction at each moment depends on a representation of prior context. What kind of information about the prior context can language models retrieve? We tested whether language models could retrieve the exact words that occurred previously in a text. In our paradigm, language models (transformers and an LSTM) processed English text in which a list of nouns occurred twice. We operationalized retrieval as the reduction in surprisal from the first to the second list. We found that the transformers retrieved both the identity and ordering of nouns from the first list. Further, the transformers' retrieval was markedly enhanced when they were trained on a larger corpus and with greater model depth. Lastly, their ability to index prior tokens was dependent on learned attention patterns. In contrast, the LSTM exhibited less precise retrieval, which was limited to list-initial tokens and to short intervening texts. The LSTM's retrieval was not sensitive to the order of nouns and it improved when the list was semantically coherent. We conclude that transformers implemented something akin to a working memory system that could flexibly retrieve individual token representations across arbitrary delays; conversely, the LSTM maintained a coarser and more rapidly-decaying semantic gist of prior tokens, weighted toward the earliest items.
1 Introduction
Language models (LMs) are computational systems trained to predict upcoming tokens based on past context. To perform this task well, they must construct a coherent representation of the text, which requires establishing relationships between words that occur at non-adjacent time points.
Despite their simple learning objective, LMs based on contemporary artificial neural network architectures perform well in contexts that require maintenance and retrieval of dependencies spanning multiple words. For example, LMs learn to correctly match the grammatical number of the subject and a corresponding verb across intervening words: they prefer the correct "The girls standing at the desk are tall" to the incorrect "The girls standing at the desk is tall" (Linzen et al., 2016; Marvin and Linzen, 2018; Gulordava et al., 2018; Futrell et al., 2018). The ability to maintain context across multiple words is likely to be a central factor explaining the success of these models, potentially following fine-tuning, in natural language processing tasks (Devlin et al., 2019; Brown et al., 2020).

[Figure 1 schematic: the LM input sequence consists of a preface string ("Before the meeting, Mary wrote down a list of words:"), the first list ("county, muscle, vapor."), intervening text ("After the meeting, Mary took a break and..."), a prompt string ("After she got back, she read the list again:"), and the second list ("county, muscle, vapor."). The panels pose three questions: 1) How detailed is LM memory of nouns (identity and ordering)? 2) How resilient is LM memory to the size and content of intervening text? 3) How invariant is LM memory with respect to the content of noun lists?]

Figure 1: Characterizing verbatim memory retrieval in neural language models. In our paradigm, language models processed English text in which a list of nouns occurred twice. We operationalized retrieval as the reduction in surprisal from the first to the second list presentation. We measured retrieval while varying: a) set size, b) the structure of the second list, c) the length of the intervening text, and d) the content and structure of the intervening text.
The work discussed above has shown that LMs extract linguistically meaningful signals and that, over the course of learning, they develop a short-term memory capacity: the ability to store and access recent past context for processing, possibly akin to the working memory systems thought to enable flexible human cognitive capacities (Baddeley, 2003). What is the nature of the memory processes that LMs learn? Are these memory processes able to access individual tokens from the recent past verbatim, or is the memory system more implicit, so that only an aggregate gist of the prior context is available to subsequent processing?
Here, we introduce a paradigm (Fig. 1), inspired by benchmark tasks for models of human short-term memory (Oberauer et al., 2018), for characterizing the short-term memory abilities of LMs. We apply it to two neural LM architectures that possess the architectural ingredients to hold past items in memory: attention-based transformers (Vaswani et al., 2017) and long short-term memory networks (LSTMs; Hochreiter and Schmidhuber, 1997). Whereas LSTMs incorporate the past by reusing the results of processing from previous time steps through dedicated memory cells, transformers use the internal representations of each of the previous tokens as input. These architectural ingredients alone, however, are not sufficient for a model to have memory. We hypothesize that whether or not the model puts this memory capacity to use depends on whether the training task (next-word prediction) requires it: the parameters controlling the activation of context representations and the subsequent retrieval computations are in both cases learned.
Our goal is to determine whether and when the LMs we study maintain and retrieve verbatim representations of individual prior tokens. First, we measure the detail of the context representation: does the LM maintain a verbatim representation of all prior tokens and their order, or does it instead combine multiple prior tokens into a summary representation, like a semantic gist? Second, we consider the resilience of the memory to interference: after how many intervening tokens does the representation of prior context become inaccessible? Third, we consider the content-invariance of the context representations: does the resilience of prior context depend on the semantic coherence of the prior information, or can arbitrary and unrelated information sequences be retrieved?
2 Related Work
Previous studies examined how properties of linguistic context influenced next-word prediction accuracy in transformer and LSTM LMs trained on English text. Khandelwal et al. (2018) showed that LSTM LMs use a window of approximately 200 tokens of past context, and word order information of the past 50 words, in the service of predicting the next token in natural language sequences. Subramanian et al. (2020) applied a similar analysis to a transformer LM and showed that LM loss on test-set sequences was not sensitive to context perturbations beyond 50 tokens. O'Connor and Andreas (2021) investigated whether fine-grained lexical and sentential features of context are used for next-word prediction in transformer LMs. They showed that transformers rely predominantly on local word co-occurrence statistics (e.g. trigram ordering) and the presence of open-class parts of speech (e.g. nouns), and less on the global structure of context (e.g. sentence ordering) and the presence of closed-class parts of speech (e.g. function words). In contrast with these studies, which focused on how specific features of past context affect LM performance on novel input at test time, our paradigm tests the ability of LMs to retrieve nouns that are exactly repeated from prior context.
In a separate line of work bearing on memory maintenance in LSTMs, Lakretz et al. (2019, 2021) studied an LSTM's capacity to track subject-verb agreement dependencies. They showed that LSTM LMs relied on a small number of hidden units and on the gating mechanisms that control memory contents. Here, we are similarly concerned with the memory characteristics that support LM performance, but, akin to behavioral tests in cognitive science, we infer the functional properties of LM memory by manipulating properties of the repeated noun lists and observing the effects these manipulations have on the behavior (surprisal) of the LM rather than on its internal representations.
A third related area of research proposes architectural innovations that augment RNNs and LSTMs with dedicated memory components (e.g. Weston et al., 2015; Yogatama et al., 2018) or improve the handling of context and memory in transformers (see Tay et al., 2020, for a review). Here, we are not concerned with improving architectures, but with developing a paradigm that allows us to study how LMs put their memory systems to use, whether those systems are implicit or explicit.
3 Methods
3.1 Paradigm: Lists of Nouns in Context
Noun lists were embedded in brief vignettes (Figure 1, A and B). Each vignette opened with a preface string (e.g. "Before the meeting, Mary wrote down the following list of words:"). This string was followed by a list of nouns (the first list), which were separated by commas; the list-final noun was followed by a full stop (e.g. "county, muscle, vapor."). The first list was followed by an intervening text, which continued the narrative established by the preface string ("After the meeting, she took a break and had a cup of coffee."). The intervening text was followed by a short prompt string (e.g. "After she got back, she read the list again:"), after which another list of nouns, either identical to the first list or different from it, was presented (we refer to this list as the second list). The full vignettes are provided in Section A.1 of the Appendix.
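To make the construction concrete, the following is a minimal sketch of how a vignette of this form could be assembled. The string templates mirror the examples above, but the helper function and variable names are illustrative, not taken from the released code.

```python
def build_vignette(first_list, second_list, intervening_text):
    """Assemble a vignette: preface + first list + intervening text + prompt + second list.

    Illustrative reconstruction of the paradigm in Section 3.1, not the authors' implementation.
    """
    preface = "Before the meeting, Mary wrote down the following list of words: "
    prompt = "After she got back, she read the list again: "
    first = ", ".join(first_list) + "."   # nouns separated by commas, list-final full stop
    second = ", ".join(second_list) + "."
    return preface + first + " " + intervening_text + " " + prompt + second

vignette = build_vignette(
    first_list=["county", "muscle", "vapor"],
    second_list=["county", "muscle", "vapor"],  # identical second list (repeat condition)
    intervening_text="After the meeting, she took a break and had a cup of coffee.",
)
print(vignette)
```

Swapping in a permuted or novel second list yields the other conditions described in Section 4.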
3.2 Semantic Coherence of Noun Lists
We used two types of word lists: arbitrary and semantically coherent. Arbitrary word lists (e.g. "device, singer, picture") were composed of randomly sampled nouns from the Toronto word pool [1]. Semantically coherent word lists were sampled from the categorized noun word pool [2], which contains 32 lists, each of which contains 32 semantically related nouns (e.g. "robin, sparrow, heron, ..."). All noun lists used in the experiments are reported in Tables 1 and 2 of the Appendix.
After ensuring there were at least 10 valid, in-vocabulary nouns per semantic set (as this was the maximal list length we considered), we were able to construct 23 noun lists. Finally, to reduce the variance attributable to tokens occurring in specific positions, we generated 10 "folds" of each list by circularly shifting the tokens in the first list 10 times. In this way, each noun in each list was tested in all possible ordinal positions. This procedure resulted in a total of 23 × 10 = 230 noun lists.
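The fold construction can be illustrated with a short sketch; the rotation logic follows the description above, while the function name is ours.

```python
def circular_folds(nouns):
    """Generate one fold per rotation of the list, so that every noun
    appears in every ordinal position across folds (Section 3.2)."""
    return [nouns[i:] + nouns[:i] for i in range(len(nouns))]

folds = circular_folds(["robin", "sparrow", "heron"])  # 3-item example; the experiments used 10
# [['robin', 'sparrow', 'heron'], ['sparrow', 'heron', 'robin'], ['heron', 'robin', 'sparrow']]
```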
3.3 Language Models
LSTM
We used an adaptive weight-dropped (AWD) LSTM released by Merity et al. (2018) [3], which had 3 hidden layers, 400-dimensional input embeddings, 1840-dimensional hidden states, and a vocabulary size of 267,735. The model contained 182.3 million trainable parameters. It was trained on the Wikitext-103 corpus (Merity et al., 2016) and achieved a test-set perplexity of 41.8. Full training hyperparameters are reported in Section A.4 of the Appendix.

[1] http://memory.psych.upenn.edu/files/wordpools/nouns.txt
[2] http://memory.psych.upenn.edu/files/wordpools/catwpool.txt
[3] Our code is available at https://github.com/KristijanArmeni/verbatim-memory-in-NLMs. Our experiment data are available at https://doi.org/10.17605/OSF.IO/5GY7X.
Transformer
We trained a transformer LM on the Wikitext-103 benchmark. We retrained the BPE tokenizer on the concatenated Wikitext-103 training, evaluation, and test sets; the resulting vocabulary had 28,439 entries. We trained both the 12-layer GPT-2 architecture (known as "GPT-2 small", 107.7 million trainable parameters) and, as a point of comparison, smaller 1-, 3-, and 6-layer transformers (29.7, 43.9, and 65.2 million trainable parameters, respectively). The context window was set to 1024 tokens and the embedding dimension was kept at 768 across the architectures. The perplexities of the 12-, 6-, 3-, and 1-layer models on the Wikitext-103 test set were 40.6, 51.5, 60.1, and 95.1, respectively. The full transformer training details are reported in Section A.5 of the Appendix.
We also evaluated the transformer LM pretrained by Radford et al. (2019), accessed through the Hugging Face Transformers library (Wolf et al., 2020). We refer to this model simply as GPT-2. It was trained on the WebText corpus, which consists of approximately 8 million online documents. We used the GPT-2-small checkpoint, which has 12 attention layers and a 768-dimensional embedding layer. The model contains 124 million parameters and has a vocabulary of 50,257 entries. We used the maximum context size of 1024 tokens.
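For readers who want to reproduce this setup, the pretrained checkpoint can be loaded with the Hugging Face Transformers library along the following lines; this is a minimal sketch assuming the standard "gpt2" checkpoint identifier, which corresponds to GPT-2 small.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load the pretrained GPT-2 small checkpoint (12 layers, 768-dimensional embeddings,
# 50,257-entry vocabulary) and its byte-pair-encoding tokenizer.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()  # evaluation mode: disable dropout; no gradients are needed
```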
3.4 Surprisal
For each token w_t in our sequence, we computed the negative log likelihood (surprisal): surprisal(w_t) = -log2 P(w_t | w_1, ..., w_{t-1}). In cases when the transformer byte-pair encoding tokenizer split a noun into multiple tokens (e.g. "sparrow" might be split into "sp" and "arrow"), we summed the surprisals of the resulting tokens.
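As an illustration of this measure, the sketch below computes per-token surprisal in bits with the GPT-2 model and tokenizer loaded in the previous sketch; summing the values of a noun's subword pieces then yields the word-level surprisal described above. The function name is ours, and the snippet is an illustration rather than the released analysis code.

```python
import torch
import torch.nn.functional as F

def token_surprisals(text, model, tokenizer):
    """Return (token, surprisal in bits) for each BPE token after the first."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]      # shape: (1, seq_len)
    with torch.no_grad():
        logits = model(ids).logits                               # (1, seq_len, vocab)
    # Log-probability of each token given its preceding context.
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    nll = -logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)  # natural log
    bits = nll / torch.log(torch.tensor(2.0))                     # convert to base 2
    tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())[1:]
    return list(zip(tokens, bits[0].tolist()))

# Example: a noun split into several BPE pieces gets one word-level surprisal
# by summing the surprisals of its pieces, as described in Section 3.4.
pairs = token_surprisals("She read the list again: county, muscle, vapor.", model, tokenizer)
```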
Quantifying retrieval: repeat surprisal
To quantify how the memory trace of the first list affected the model's expectations on the second list, we measured the ratio between the surprisal on the second list and the surprisal on the first list: repeat surprisal = s̄(L2) / s̄(L1) × 100, where s̄(L1) refers to the mean surprisal across non-initial nouns in the first list and s̄(L2) to the mean surprisal across non-initial nouns in the second list. We take a reduction in surprisal on second lists to indicate the extent to which an LM has retrieved tokens from the first list.
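Given per-noun surprisals for the two lists, the repeat-surprisal ratio is a one-line computation. In this sketch, `first_list_bits` and `second_list_bits` are hypothetical arrays of word-level surprisals for the non-initial nouns of each list, obtained e.g. from the surprisal sketch above.

```python
def repeat_surprisal(first_list_bits, second_list_bits):
    """Ratio (in percent) of mean surprisal on the second list to the first list;
    values well below 100 indicate retrieval of the first list."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(second_list_bits) / mean(first_list_bits) * 100
```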
4 Transformer Results
We first describe the results of our experiments with the two largest transformer models: the off-the-shelf GPT-2 and the 12-layer transformer we trained. LSTM results are discussed in Section 5, and results with the smaller transformers are discussed towards the end of this section.
The transformers retrieved prior nouns and their order; this capacity improved when the model was trained on a larger corpus.
We tested whether the transformers could retrieve the identity and order of 10-token noun lists (arbitrary or semantically coherent). To this end, we constructed vignettes in which the second list was either (a) identical to the first list, (b) a permutation of the first list, or (c) a list of novel nouns not present in the first list [4]. We then measured retrieval as the reduction in surprisal from the first to the second list. When the two transformers were presented with second lists that were repeated versions of the first ones (blue in Fig. 2, B and C), token-by-token surprisal decreased compared to novel tokens, suggesting that the transformers were able to access verbatim representations of past nouns from context. When the second list was a permutation of the first one, surprisal was higher compared to when it was repeated, indicating that the transformers expected the nouns to be ordered as in the first list. Training set size played an important role in supporting verbatim recall: surprisal differences were considerably smaller for the transformer trained on the Wikitext-103 corpus (Fig. 2, B) compared to GPT-2 (Fig. 2, C).
In order to contextualize the magnitude of these retrieval effects, we computed the relative surprisal across all tokens in the lists except the first one (Fig. 3). When the first and second lists were identical (e.g. with N = 10 arbitrary nouns), the Wikitext-103 transformer's median relative surprisal was 88% of the first list, compared to 92% for the permuted lists and 99% for the novel lists. In GPT-2, repeat surprisal was only 2% of the first list, much lower than the 58% for the permuted lists and the 96% for the novel lists.
[4] Novel nouns in the string were introduced by randomly selecting a list of nouns from one of the 22 remaining lists in the noun pool. In semantically coherent lists, novel nouns were drawn from a different semantic category than the nouns in the first list.
Retrieval in GPT-2 was robust to the exact phrasing of the text that introduced the lists. Replacing the subject "Mary" with "John" in the vignette, replacing the colon with a comma, or randomly permuting the preface or the prompt strings did not affect the results (Fig. 7, right, Appendix A). By contrast, the same perturbations reduced retrieval effects for the Wikitext-103 transformer (Fig. 7, left, Appendix A), supporting the conclusion that a larger training corpus contributes to the robustness of transformer retrieval.
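As an illustration of the kind of perturbation used here (not the authors' exact procedure), such variants can be generated by simple string edits on the preface and prompt templates:

```python
import random

preface = "Before the meeting, Mary wrote down the following list of words:"

swapped_subject = preface.replace("Mary", "John")  # change the subject
comma_variant = preface.replace(":", ",")          # replace the colon with a comma
words = preface.split()
random.shuffle(words)                              # randomly permute the preface words
permuted_preface = " ".join(words)
```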
Transformer retrieval was robust to the number of items being retrieved.
In studies of human short-term memory, performance degrades as the number of items that need to be retained increases ("set-size effects", Oberauer et al. 2018). Is our LMs' short-term memory similarly taxed by increasing the set size? We varied the number of tokens to be held in memory with N_tokens ∈ {3, 5, 7, 10}. For this comparison, the length of the intervening text was kept at 26 tokens. The results reported in Fig. 3 show that for GPT-2, verbatim recall was, for the most part, consistent across the different set sizes. Repeat surprisal increased monotonically with set size only when the order of nouns in the second list, whether semantically coherent or arbitrary, was permuted [5]. For the smaller Wikitext-103 transformer, repeat surprisal showed a slight increase with set size, further indicating that retrieval robustness increases with training corpus size.
Transformer retrieval was robust to the length and content of intervening text, but scrambling the intervening text reduced retrieval of order information.
For how long are individual items retained in the memory of the LM? We tested this by varying the length of the intervening text for N_tokens ∈ {26, 99, 194, 435} (see Fig. 1, panel B). To generate longer intervening text samples, we continued the narrative established by the initial preface string ("Before the meeting, Mary wrote down the following list of words:"). All intervening text strings ended with the same prompt string ("When she got back, she read the list again:"), which introduced the second list.
[5] This increase in surprisal with set size for permuted sequences is to be expected, of course, because, if the model has perfect memory of the list of tokens but cannot predict the order in which they will reoccur, then its probability of guessing the next item in a permuted list where k items have yet to be observed will be 1/k, and the mean value of k is larger for larger set sizes.
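To see how this baseline scales, the sketch below works out the expected surprisal on a permuted list under the idealized assumption of footnote 5 (perfect memory of the set, uniform guessing over the items not yet observed). It illustrates the argument only; it is not a quantity computed in the paper.

```python
import math

def expected_permuted_surprisal(n):
    """Mean surprisal (bits) over the non-initial positions of a permuted n-item list,
    assuming a uniform 1/k guess over the k items not yet observed."""
    remaining = range(n - 1, 0, -1)  # k at non-initial positions: n-1, ..., 1
    return sum(math.log2(k) for k in remaining) / (n - 1)

for n in (3, 5, 7, 10):
    print(n, round(expected_permuted_surprisal(n), 2))
# The expected surprisal grows with set size, consistent with the monotonic
# increase in repeat surprisal observed for permuted lists.
```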