Characterizing Verbatim Short-Term Memory in Neural Language Models
Kristijan Armeni
Johns Hopkins University
karmeni1@jhu.edu
Christopher Honey
Johns Hopkins University
chris.honey@jhu.edu
Tal Linzen
New York University
linzen@nyu.edu
Abstract

When a language model is trained to predict natural language sequences, its prediction at each moment depends on a representation of prior context. What kind of information about the prior context can language models retrieve? We tested whether language models could retrieve the exact words that occurred previously in a text. In our paradigm, language models (transformers and an LSTM) processed English text in which a list of nouns occurred twice. We operationalized retrieval as the reduction in surprisal from the first to the second list. We found that the transformers retrieved both the identity and ordering of nouns from the first list. Further, the transformers' retrieval was markedly enhanced when they were trained on a larger corpus and with greater model depth. Lastly, their ability to index prior tokens was dependent on learned attention patterns. In contrast, the LSTM exhibited less precise retrieval, which was limited to list-initial tokens and to short intervening texts. The LSTM's retrieval was not sensitive to the order of nouns and it improved when the list was semantically coherent. We conclude that transformers implemented something akin to a working memory system that could flexibly retrieve individual token representations across arbitrary delays; conversely, the LSTM maintained a coarser and more rapidly-decaying semantic gist of prior tokens, weighted toward the earliest items.
1 Introduction
Language models (LMs) are computational systems trained to predict upcoming tokens based on past context. To perform this task well, they must construct a coherent representation of the text, which requires establishing relationships between words that occur at non-adjacent time points.
Despite their simple learning objective, LMs based on contemporary artificial neural network architectures perform well in contexts that require maintenance and retrieval of dependencies spanning multiple words. For example, LMs learn to correctly match the grammatical number of the subject and a corresponding verb across intervening words: they prefer the correct "The girls standing at the desk are tall" to the incorrect "The girls standing at the desk is tall" (Linzen et al., 2016; Marvin and Linzen, 2018; Gulordava et al., 2018; Futrell et al., 2018). The ability to maintain context across multiple words is likely to be a central factor explaining the success of these models, potentially following fine-tuning, in natural language processing tasks (Devlin et al., 2019; Brown et al., 2020).

[Figure 1 schematic: the LM input sequence consists of a preface string ("Before the meeting, Mary wrote down a list of words:"), the first list ("county, muscle, vapor."), intervening text ("After the meeting, Mary took a break and..."), a prompt string ("After she got back, she read the list again:"), and the second list ("county, muscle, vapor."). The panels pose three questions: 1) How detailed is LM memory of nouns (identity and ordering)? 2) How resilient is LM memory to the size and content of intervening text? 3) How invariant is LM memory with respect to the content of noun lists?]

Figure 1: Characterizing verbatim memory retrieval in neural language models. In our paradigm, language models processed English text in which a list of nouns occurred twice. We operationalized retrieval as the reduction in surprisal from the first to the second list presentation. We measured retrieval while varying: a) set size, b) the structure of the second list, c) the length of the intervening text, and d) the content and structure of the intervening text.
The work discussed above has shown that LMs extract linguistically meaningful signals and that, over the course of learning, they develop a short-term memory capacity: the ability to store and access recent past context for processing, possibly akin to the working memory systems thought to enable flexible human cognitive capacities (Baddeley, 2003). What is the nature of the memory processes that LMs learn? Are these memory processes able to access individual tokens from the recent past verbatim, or is the memory system more implicit, so that only an aggregate gist of the prior context is available to subsequent processing?
Here, we introduce a paradigm (Fig. 1), inspired by benchmark tasks for models of human short-term memory (Oberauer et al., 2018), for characterizing the short-term memory abilities of LMs. We apply it to two neural LM architectures that possess the architectural ingredients to hold past items in memory: attention-based transformers (Vaswani et al., 2017) and long short-term memory networks (LSTMs; Hochreiter and Schmidhuber, 1997). Whereas LSTMs incorporate the past by reusing the results of processing from previous time steps through dedicated memory cells, transformers use the internal representations of each of the previous tokens as input. These architectural ingredients alone, however, are not sufficient for a model to have memory. We hypothesize that whether or not the model puts this memory capacity to use depends on whether the training task (next-word prediction) requires it: the parameters controlling the activation of context representations and the subsequent retrieval computations are in both cases learned.
Our goal is to determine whether and when the LMs we study maintain and retrieve verbatim representations of individual prior tokens. First, we measure the detail of the context representation: does the LM maintain a verbatim representation of all prior tokens and their order, or does it instead combine multiple prior tokens into a summary representation, like a semantic gist? Second, we consider the resilience of the memory to interference: after how many intervening tokens does the representation of prior context become inaccessible? Third, we consider the content-invariance of the context representations: does the resilience of prior context depend on the semantic coherence of the prior information, or can arbitrary and unrelated information sequences be retrieved?
2 Related Work
Previous studies examined how properties of linguistic context influenced next-word prediction accuracy in transformer and LSTM LMs trained on English text. Khandelwal et al. (2018) showed that LSTM LMs use a window of approximately 200 tokens of past context, and word order information of the past 50 words, in the service of predicting the next token in natural language sequences. Subramanian et al. (2020) applied a similar analysis to a transformer LM and showed that LM loss on test-set sequences was not sensitive to context perturbations beyond 50 tokens. O'Connor and Andreas (2021) investigated whether fine-grained lexical and sentential features of context are used for next-word prediction in transformer LMs. They showed that transformers rely predominantly on local word co-occurrence statistics (e.g. trigram ordering) and the presence of open-class parts of speech (e.g. nouns), and less on the global structure of context (e.g. sentence ordering) and the presence of closed-class parts of speech (e.g. function words). In contrast with these studies, which focused on how specific features of past context affect LM performance on novel input at test time, our paradigm tests the ability of LMs to retrieve nouns that are exactly repeated from prior context.
In a separate line of work bearing on memory maintenance in LSTMs, Lakretz et al. (2019, 2021) studied an LSTM's capacity to track subject-verb agreement dependencies. They showed that LSTM LMs relied on a small number of hidden units and on the gating mechanisms that control memory contents. Here, we are similarly concerned with the memory characteristics that support LM performance, but, akin to behavioral tests in cognitive science, we infer the functional properties of LM memory by manipulating properties of the repeated noun lists and observing the effects these manipulations have on the behavior (surprisal) of the LM rather than on its internal representations.
A third related area of research proposes architectural innovations that augment RNNs and LSTMs with dedicated memory components (e.g. Weston et al., 2015; Yogatama et al., 2018) or improve the handling of context and memory in transformers (see Tay et al., 2020, for a review). Here, we are not concerned with improving architectures, but with developing a paradigm that allows us to study how LMs put their memory systems to use, whether those systems are implicit or explicit.
3 Methods
3.1 Paradigm: Lists of Nouns in Context
Noun lists were embedded in brief vignettes (Figure 1, A and B). Each vignette opened with a preface string (e.g. "Before the meeting, Mary wrote down the following list of words:"). This string was followed by a list of nouns (the first list), which were separated by commas; the list-final noun was followed by a full stop (e.g. "county, muscle, vapor."). The first list was followed by an intervening text, which continued the narrative established by the preface string ("After the meeting, she took a break and had a cup of coffee."). The intervening text was followed by a short prompt string (e.g. "After she got back, she read the list again:"), after which another list of nouns, either identical to the first list or different from it, was presented (we refer to this list as the second list). The full vignettes are provided in Section A.1 of the Appendix.
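To make the construction concrete, the following is a minimal sketch of how a vignette of this form could be assembled. The string templates mirror the examples above, but the helper function and variable names are illustrative, not taken from the released code.

```python
def build_vignette(first_list, second_list, intervening_text):
    """Assemble a vignette: preface + first list + intervening text + prompt + second list.

    Illustrative reconstruction of the paradigm in Section 3.1, not the authors' implementation.
    """
    preface = "Before the meeting, Mary wrote down the following list of words: "
    prompt = "After she got back, she read the list again: "
    first = ", ".join(first_list) + "."   # nouns separated by commas, list-final full stop
    second = ", ".join(second_list) + "."
    return preface + first + " " + intervening_text + " " + prompt + second

vignette = build_vignette(
    first_list=["county", "muscle", "vapor"],
    second_list=["county", "muscle", "vapor"],  # identical second list (repeat condition)
    intervening_text="After the meeting, she took a break and had a cup of coffee.",
)
print(vignette)
```

Swapping in a permuted or novel second list yields the other conditions described in Section 4.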
3.2 Semantic Coherence of Noun Lists
We used two types of word lists: arbitrary and semantically coherent. Arbitrary word lists (e.g. "device, singer, picture") were composed of randomly sampled nouns from the Toronto word pool [1]. Semantically coherent word lists were sampled from the categorized noun word pool [2], which contains 32 lists, each of which contains 32 semantically related nouns (e.g. "robin, sparrow, heron, ..."). All noun lists used in the experiments are reported in Tables 1 and 2 of the Appendix.
After ensuring there were at least 10 valid, in-vocabulary nouns per semantic set (as this was the maximal list length we considered), we were able to construct 23 noun lists. Finally, to reduce the variance attributable to tokens occurring in specific positions, we generated 10 "folds" of each list by circularly shifting the tokens in the first list 10 times. In this way, each noun in each list was tested in all possible ordinal positions. This procedure resulted in a total of 23 × 10 = 230 noun lists.
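The fold construction can be illustrated with a short sketch; the rotation logic follows the description above, while the function name is ours.

```python
def circular_folds(nouns):
    """Generate one fold per rotation of the list, so that every noun
    appears in every ordinal position across folds (Section 3.2)."""
    return [nouns[i:] + nouns[:i] for i in range(len(nouns))]

folds = circular_folds(["robin", "sparrow", "heron"])  # 3-item example; the experiments used 10
# [['robin', 'sparrow', 'heron'], ['sparrow', 'heron', 'robin'], ['heron', 'robin', 'sparrow']]
```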
3.3 Language Models
LSTM
We used an adaptive weight-dropped (AWD) LSTM released by Merity et al. (2018) [3], which had 3 hidden layers, 400-dimensional input embeddings, 1840-dimensional hidden states, and a vocabulary size of 267,735. The model contained 182.3 million trainable parameters. It was trained on the Wikitext-103 corpus (Merity et al., 2016) and achieved a test-set perplexity of 41.8. Full training hyperparameters are reported in Section A.4 of the Appendix.

[1] http://memory.psych.upenn.edu/files/wordpools/nouns.txt
[2] http://memory.psych.upenn.edu/files/wordpools/catwpool.txt
[3] Our code is available at https://github.com/KristijanArmeni/verbatim-memory-in-NLMs. Our experiment data are available at https://doi.org/10.17605/OSF.IO/5GY7X.
Transformer
We trained a transformer LM on the Wikitext-103 benchmark. We retrained the BPE tokenizer on the concatenated Wikitext-103 training, evaluation, and test sets; the resulting vocabulary had 28,439 entries. We trained both the 12-layer GPT-2 architecture (known as "GPT-2 small", 107.7 million trainable parameters) and, as a point of comparison, smaller 1-, 3-, and 6-layer transformers (29.7, 43.9, and 65.2 million trainable parameters, respectively). The context window was set to 1024 tokens and the embedding dimension was kept at 768 across the architectures. The perplexities of the 12-, 6-, 3-, and 1-layer models on the Wikitext-103 test set were 40.6, 51.5, 60.1, and 95.1, respectively. The full transformer training details are reported in Section A.5 of the Appendix.
We also evaluated the transformer LM pretrained by Radford et al. (2019), accessed through the Hugging Face Transformers library (Wolf et al., 2020). We refer to this model simply as GPT-2. It was trained on the WebText corpus, which consists of approximately 8 million online documents. We used the GPT-2-small checkpoint, which has 12 attention layers and a 768-dimensional embedding layer. The model contains 124 million parameters and has a vocabulary of 50,257 entries. We used the maximum context size of 1024 tokens.
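For readers who want to reproduce this setup, the pretrained checkpoint can be loaded with the Hugging Face Transformers library along the following lines; this is a minimal sketch assuming the standard "gpt2" checkpoint identifier, which corresponds to GPT-2 small.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load the pretrained GPT-2 small checkpoint (12 layers, 768-dimensional embeddings,
# 50,257-entry vocabulary) and its byte-pair-encoding tokenizer.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()  # evaluation mode: disable dropout; no gradients are needed
```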
3.4 Surprisal
For each token w_t in our sequence, we computed the negative log likelihood (surprisal): surprisal(w_t) = -log2 P(w_t | w_1, ..., w_{t-1}). In cases when the transformer byte-pair encoding tokenizer split a noun into multiple tokens (e.g. "sparrow" might be split into "sp" and "arrow"), we summed the surprisals of the resulting tokens.
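As an illustration of this measure, the sketch below computes per-token surprisal in bits with the GPT-2 model and tokenizer loaded in the previous sketch; summing the values of a noun's subword pieces then yields the word-level surprisal described above. The function name is ours, and the snippet is an illustration rather than the released analysis code.

```python
import torch
import torch.nn.functional as F

def token_surprisals(text, model, tokenizer):
    """Return (token, surprisal in bits) for each BPE token after the first."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]      # shape: (1, seq_len)
    with torch.no_grad():
        logits = model(ids).logits                               # (1, seq_len, vocab)
    # Log-probability of each token given its preceding context.
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    nll = -logprobs.gather(2, targets.unsqueeze(-1)).squeeze(-1)  # natural log
    bits = nll / torch.log(torch.tensor(2.0))                     # convert to base 2
    tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())[1:]
    return list(zip(tokens, bits[0].tolist()))

# Example: a noun split into several BPE pieces gets one word-level surprisal
# by summing the surprisals of its pieces, as described in Section 3.4.
pairs = token_surprisals("She read the list again: county, muscle, vapor.", model, tokenizer)
```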
Quantifying retrieval: repeat surprisal
To quantify how the memory trace of the first list affected the model's expectations on the second list, we measured the ratio between the surprisal on the second list and the surprisal on the first list: repeat surprisal = s̄(L2) / s̄(L1) × 100, where s̄(L1) refers to the mean surprisal across non-initial nouns in the first list and s̄(L2) to the mean surprisal across non-initial nouns in the second list. We take a reduction in surprisal on second lists to indicate the extent to which an LM has retrieved tokens from the first list.
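Given per-noun surprisals for the two lists, the repeat-surprisal ratio is a one-line computation. In this sketch, `first_list_bits` and `second_list_bits` are hypothetical arrays of word-level surprisals for the non-initial nouns of each list, obtained e.g. from the surprisal sketch above.

```python
def repeat_surprisal(first_list_bits, second_list_bits):
    """Ratio (in percent) of mean surprisal on the second list to the first list;
    values well below 100 indicate retrieval of the first list."""
    mean = lambda xs: sum(xs) / len(xs)
    return mean(second_list_bits) / mean(first_list_bits) * 100
```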
4 Transformer Results
We first describe the results of our experiments with the two largest transformer models: the off-the-shelf GPT-2 and the 12-layer transformer we trained. LSTM results are discussed in Section 5, and results with the smaller transformers are discussed towards the end of this section.
The transformers retrieved prior nouns and their order; this capacity improved when the model was trained on a larger corpus.
We tested whether the transformers could retrieve the identity and order of 10-token noun lists (arbitrary or semantically coherent). To this end, we constructed vignettes in which the second list was either (a) identical to the first list, (b) a permutation of the first list, or (c) a list of novel nouns not present in the first list [4]. We then measured retrieval as the reduction in surprisal from the first to the second list. When the two transformers were presented with second lists that were repeated versions of the first ones (blue in Fig. 2, B and C), token-by-token surprisal decreased compared to novel tokens, suggesting that the transformers were able to access verbatim representations of past nouns from context. When the second list was a permutation of the first one, surprisal was higher compared to when it was repeated, indicating that the transformers expected the nouns to be ordered as in the first list. Training set size played an important role in supporting verbatim recall: surprisal differences were considerably smaller for the transformer trained on the Wikitext-103 corpus (Fig. 2, B) compared to GPT-2 (Fig. 2, C).
In order to contextualize the magnitude of these retrieval effects, we computed the relative surprisal across all tokens in the lists except the first one (Fig. 3). When the first and second lists were identical (e.g. with N = 10 arbitrary nouns), the Wikitext-103 transformer's median relative surprisal was 88% of the first list, compared to 92% for the permuted lists and 99% for the novel lists. In GPT-2, repeat surprisal was only 2% of the first list, much lower than the 58% for the permuted lists and the 96% for the novel lists.
[4] Novel nouns in the string were introduced by randomly selecting a list of nouns from one of the 22 remaining lists in the noun pool. In semantically coherent lists, novel nouns were drawn from a different semantic category than the nouns in the first list.
Retrieval in GPT-2 was robust to the exact phrasing of the text that introduced the lists. Replacing the subject "Mary" with "John" in the vignette, replacing the colon with a comma, or randomly permuting the preface or the prompt strings did not affect the results (Fig. 7, right, Appendix A). By contrast, the same perturbations reduced retrieval effects for the Wikitext-103 transformer (Fig. 7, left, Appendix A), supporting the conclusion that a larger training corpus contributes to the robustness of transformer retrieval.
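As an illustration of the kind of perturbation used here (not the authors' exact procedure), such variants can be generated by simple string edits on the preface and prompt templates:

```python
import random

preface = "Before the meeting, Mary wrote down the following list of words:"

swapped_subject = preface.replace("Mary", "John")  # change the subject
comma_variant = preface.replace(":", ",")          # replace the colon with a comma
words = preface.split()
random.shuffle(words)                              # randomly permute the preface words
permuted_preface = " ".join(words)
```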
Transformer retrieval was robust to the number of items being retrieved.
In studies of human short-term memory, performance degrades as the number of items that need to be retained increases ("set-size effects", Oberauer et al. 2018). Is our LMs' short-term memory similarly taxed by increasing the set size? We varied the number of tokens to be held in memory with N_tokens ∈ {3, 5, 7, 10}. For this comparison, the length of the intervening text was kept at 26 tokens. The results reported in Fig. 3 show that for GPT-2, verbatim recall was, for the most part, consistent across the different set sizes. Repeat surprisal increased monotonically with set size only when the order of nouns in the second list, whether semantically coherent or arbitrary, was permuted [5]. For the smaller Wikitext-103 transformer, repeat surprisal showed a slight increase with set size, further indicating that retrieval robustness increases with training corpus size.
Transformer retrieval was robust to the length and content of intervening text, but scrambling the intervening text reduced retrieval of order information.
For how long are individual items retained in the memory of the LM? We tested this by varying the length of the intervening text for N_tokens ∈ {26, 99, 194, 435} (see Fig. 1, panel B). To generate longer intervening text samples, we continued the narrative established by the initial preface string ("Before the meeting, Mary wrote down the following list of words:"). All intervening text strings ended with the same prompt string ("When she got back, she read the list again:"), which introduced the second list.
[5] This increase in surprisal with set size for permuted sequences is to be expected, of course, because, if the model has perfect memory of the list of tokens but cannot predict the order in which they will reoccur, then its probability of guessing the next item in a permuted list where k items have yet to be observed will be 1/k, and the mean value of k is larger for larger set sizes.
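To see how this baseline scales, the sketch below works out the expected surprisal on a permuted list under the idealized assumption of footnote 5 (perfect memory of the set, uniform guessing over the items not yet observed). It illustrates the argument only; it is not a quantity computed in the paper.

```python
import math

def expected_permuted_surprisal(n):
    """Mean surprisal (bits) over the non-initial positions of a permuted n-item list,
    assuming a uniform 1/k guess over the k items not yet observed."""
    remaining = range(n - 1, 0, -1)  # k at non-initial positions: n-1, ..., 1
    return sum(math.log2(k) for k in remaining) / (n - 1)

for n in (3, 5, 7, 10):
    print(n, round(expected_permuted_surprisal(n), 2))
# The expected surprisal grows with set size, consistent with the monotonic
# increase in repeat surprisal observed for permuted lists.
```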