
from the first list.
4 Transformer Results
We first describe the results of our experiments
with the two largest transformer models, the off-
the-shelf GPT-2 and the 12-layer transformer we
trained; LSTM results are discussed in Section 5,
and results with smaller transformers are discussed
towards the end of this section.
The transformers retrieved prior nouns and
their order; this capacity improved when the
model was trained on a larger corpus.
We
tested whether the transformers could retrieve the
identity and order of 10-token noun lists (arbitrary
or semantically coherent). To this end, we con-
structed vignettes in which the second list was ei-
ther (a) identical to the first list, (b) a permutation
of the first list, or (c) a list of novel nouns not
present in the first list.⁴ We then measured retrieval as the reduction in surprisal from the first to the second list.
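To illustrate the measurement, the sketch below computes token-by-token surprisal with the off-the-shelf GPT-2 via the HuggingFace transformers library; the library choice, the noun list, and the intervening sentence are illustrative assumptions rather than the exact pipeline or stimuli used here.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisal(text):
    """Per-token surprisal (in bits) of each token given its left context."""
    ids = tokenizer(text, return_tensors="pt").input_ids        # shape (1, T)
    with torch.no_grad():
        logits = model(ids).logits                              # shape (1, T, vocab)
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)        # predictions for tokens 2..T
    targets = ids[0, 1:]
    nll = -logprobs[torch.arange(targets.numel()), targets]     # negative log-likelihood (nats)
    return nll / torch.log(torch.tensor(2.0))                   # convert nats to bits

# Placeholder vignette: the preface and prompt strings are quoted from the text,
# but the noun list and intervening sentence are illustrative only.
vignette = (
    "Before the meeting, Mary wrote down the following list of words: "
    "window, table, river, stone. After the meeting, she went for a walk. "
    "When she got back, she read the list again: window, table, river, stone."
)
surprisals = token_surprisal(vignette)   # compare surprisal over the two list spans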
When the two transformers were presented with second lists that were repeated versions of the first ones (blue in Fig. 2, B and C), token-by-token
surprisal decreased compared to novel tokens, sug-
gesting that the transformers were able to access
verbatim representations of past nouns from con-
text. When the second list was a permutation of the
first one, surprisal was higher compared to when
it was repeated, indicating that the transformers
expected the nouns to be ordered as in the first list.
Training set size played an important role in sup-
porting verbatim recall: surprisal differences were
considerably smaller for the transformer trained on
the Wikitext-103 corpus (Fig. 2, B) compared to
GPT-2 (Fig. 2, C).
In order to contextualize the magnitude of these
retrieval effects, we computed the relative surprisal
across all tokens in lists except the first one (Fig. 3).
When the first and second lists were identical (e.g., with N = 10 arbitrary nouns), the Wikitext-103 transformer’s median relative surprisal was 88% of the first list, compared to 92% for the permuted lists and 99% for the novel lists. In GPT-2, repeat surprisal was only 2% of the first list, much lower than the 58% for the permuted lists and the 96% for the novel lists.
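As one plausible reading of this metric (the exact aggregation is not spelled out above), the relative, or repeat, surprisal can be computed as the ratio of second-list to first-list surprisal, expressed as a percentage and skipping the first token of each list:

import numpy as np

def repeat_surprisal(first_list_surprisals, second_list_surprisals):
    # Percentage of first-list surprisal retained on the second list;
    # skipping the first token of each list is an assumption here.
    s1 = np.median(np.asarray(first_list_surprisals)[1:])
    s2 = np.median(np.asarray(second_list_surprisals)[1:])
    return 100.0 * s2 / s1

# Values near 100% indicate no retrieval benefit (as for novel lists);
# values near 0% indicate near-perfect verbatim recall (as for GPT-2 repeats).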
⁴ Novel nouns in the string were introduced by randomly selecting a list of nouns from one of the 22 remaining lists in the noun pool. In semantically coherent lists, novel nouns were drawn from a different semantic category than the nouns in the first list.
Retrieval in GPT-2 was robust to the exact phras-
ing of the text that introduced the lists. Replacing
the subject ‘Mary’ with ‘John’ in the vignette, replacing the colon with a comma, or randomly permuting the preface or the prompt strings did not
affect the results (Fig. 7, right, Appendix A). By
contrast, the same perturbations reduced retrieval
effects for Wikitext-103 (Fig. 7, left, Appendix A), supporting the conclusion that a larger training corpus contributes to the robustness of transformer retrieval.
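These perturbations can be sketched as simple string operations on the preface and prompt strings; word-level shuffling is one assumed reading of ‘randomly permuting’ the strings.

import random

def perturb_vignette(preface, prompt, swap_subject=False,
                     comma_for_colon=False, shuffle_preface=False,
                     shuffle_prompt=False, rng=random):
    # Each flag corresponds to one of the perturbations named above.
    if swap_subject:
        preface = preface.replace("Mary", "John")
    if comma_for_colon:
        preface = preface.replace(":", ",")
        prompt = prompt.replace(":", ",")
    if shuffle_preface:
        words = preface.split()
        rng.shuffle(words)
        preface = " ".join(words)
    if shuffle_prompt:
        words = prompt.split()
        rng.shuffle(words)
        prompt = " ".join(words)
    return preface, prompt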
Transformer retrieval was robust to the num-
ber of items being retrieved.
In studies of hu-
man short-term memory, performance degrades as
the number of items that need to be retained in-
creases (“set-size effects”, Oberauer et al. 2018).
Is our LMs’ short-term memory similarly taxed
by increasing the set size? We varied the number
of tokens to be held in memory with N_tokens ∈ {3, 5, 7, 10}. For this comparison, the length of
the intervening text was kept at 26 tokens. Re-
sults reported in Fig. 3 show that for GPT-2, verbatim recall was, for the most part, consistent across the different set sizes. Repeat surprisal increased monotonically with set size only when the order of nouns in the second list, whether semantically coherent or arbitrary, was permuted.⁵
For the smaller Wikitext-103 transformer, repeat surprisal showed a slight increase with set size, further indicating that retrieval robustness increases with training corpus size.
Transformer retrieval was robust to the length
and content of intervening text, but scrambling
the intervening text reduced retrieval of order
information.
For how long are individual items
retained in the memory of the LM? We tested
this by varying the length of the intervening text
for N_tokens ∈ {26, 99, 194, 435} (see Fig. 1,
panel B). To generate longer intervening text sam-
ples, we continued the narrative established by
the initial preface string (“Before the meeting,
Mary wrote down the following list of words:”).
All intervening text strings ended with the same
prompt string (“When she got back, she read the
list again:”) which introduced the second list.
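For concreteness, a vignette can be assembled from these components as sketched below; the noun list and intervening narrative are placeholders, and trimming the narrative to an approximate token budget is an assumed simplification of how the intervening-text lengths were controlled.

# The preface and prompt strings are quoted from the text above; everything
# else in this sketch is a placeholder.
PREFACE = "Before the meeting, Mary wrote down the following list of words:"
PROMPT = "When she got back, she read the list again:"

def build_vignette(nouns, intervening_text, second_list):
    first = ", ".join(nouns) + "."
    second = ", ".join(second_list) + "."
    return " ".join([PREFACE, first, intervening_text, PROMPT, second])

nouns = ["window", "table", "river", "stone"]                   # placeholder noun list
filler = "After writing the list, Mary went to the meeting."    # placeholder narrative
repeated = build_vignette(nouns, filler, nouns)                 # repeated-list condition
permuted = build_vignette(nouns, filler, nouns[::-1])           # one possible permutation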
⁵ This increase in surprisal with set size for permuted sequences is to be expected: if the model has perfect memory of the list of tokens but cannot predict the order in which they will reoccur, then its probability of guessing the next item in a permuted list where k items have yet to be observed will be 1/k, and the mean value of k is larger for larger set sizes.
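To spell this argument out: under perfect item memory with no order information, the next-item probability when k items remain unobserved is 1/k, so the expected per-token surprisal (in bits) of a permuted list of N items is

\[
  \frac{1}{N}\sum_{k=1}^{N}\log_2 k \;=\; \frac{\log_2 N!}{N},
\]

which increases monotonically with N, consistent with the set-size effect observed for permuted lists.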