Understanding Transformer Memorization Recall Through Idioms
Adi Haviv τ   Ido Cohen τ   Jacob Gidron τ   Roei Schuster µ   Yoav Goldberg αβ   Mor Geva α
τ Tel Aviv University   µ Wild Moose   β Bar-Ilan University   α Allen Institute for AI
adi.haviv@cs.tau.ac.il, roei@wildmoose.ai, pipek@google.com,
{its.ido, jacob.u.gidron, yoav.goldberg}@gmail.com
∗Now at Google Research.
Abstract
To produce accurate predictions, language models (LMs) must balance between generalization and memorization. Yet, little is known about the mechanism by which transformer LMs employ their memorization capacity. When does a model decide to output a memorized phrase, and how is this phrase then retrieved from memory? In this work, we offer the first methodological framework for probing and characterizing recall of memorized sequences in transformer LMs. First, we lay out criteria for detecting model inputs that trigger memory recall, and propose idioms as inputs that typically fulfill these criteria. Next, we construct a dataset of English idioms and use it to compare model behavior on memorized vs. non-memorized inputs. Specifically, we analyze the internal prediction construction process by interpreting the model's hidden representations as a gradual refinement of the output probability distribution. We find that across different model sizes and architectures, memorized predictions are a two-step process: early layers promote the predicted token to the top of the output distribution, and upper layers increase model confidence. This suggests that memorized information is stored and retrieved in the early layers of the network. Last, we demonstrate the utility of our methodology beyond idioms in memorized factual statements. Overall, our work makes a first step towards understanding memory recall, and provides a methodological basis for future studies of transformer memorization.¹

¹Our code and data are available at https://github.com/adihaviv/idiomem/.
1 Introduction
Transformer language models (LMs) memorize instances from their training data (Carlini et al., 2021; Zhang et al., 2021b), and evidence is building that such memorization is an important precondition for their predictive abilities (Lee et al., 2022; Feldman, 2020; Feldman and Zhang, 2020; Raunak et al., 2021; Raunak and Menezes, 2022). Still, it is unknown when models decide to output memorized sequences, and how these sequences are being retrieved internally from memory. Current methods for analyzing memorization (Feldman and Zhang, 2020; Zhang et al., 2021b; Carlini et al., 2022) use definitions that are based on model performance, which changes between models and often also between training runs. Moreover, these methods study memorization behavior in terms of the model's "black-box" behavior rather than deriving a behavioral profile of memory recall itself.
Our first contributions are to provide a definition and construct a dataset that allows probing memorization recall in LMs. We define a set of criteria for identifying memorized sequences that do not depend on model behavior:² sequences that have a single plausible completion that is independent of context and can be inferred only given the entire sequence. We show that many idioms (e.g., "play it by ear") fulfill these conditions, allowing us to probe and analyze memorization behavior. Furthermore, we construct a dataset of such English idioms, dubbed IDIOMEM, and release it publicly for the research community.
Next, to analyze memory recall behavior, we compare the construction process of predictions that involve memory recall with those that do not. To this end, given a LM, we create two sets of memorized and non-memorized idioms from IDIOMEM (Fig. 1, A). We then adopt a view of the transformer inference pass as a gradual refinement of the output probability distribution (Geva et al., 2021; Elhage et al., 2021).

²Literature often purports to "define memorization", resulting in a multitude of technical definitions with subtle differences, although we would expect this concept to be consistent and intuitive. Thus, instead of explicitly defining "memorization", we will define sufficient criteria for detecting memorized instances.
Figure 1: Our methodological framework for probing and analyzing memorized predictions of a given LM: (A) we create two sets of memorized (mem-idiom) and non-memorized (non-mem-idiom) idioms by probing the LM with instances from IDIOMEM, (B) for each instance, we extract hidden features of the prediction computation: the rank and probability of the predicted token across layers, and (C) we compare the prediction process of memorized idioms versus non-memorized idioms and short sequences from Wikipedia (wiki). Memorized predictions exhibit two characteristic phases: candidate promotion and confidence boosting.
Concretely, the token representation at any layer is interpreted as a "hidden" probability distribution over the output vocabulary (Geva et al., 2022) (Fig. 1, B). This interpretation allows tracking the prediction across layers in the evolving distribution. We find a clear difference in model behavior between memorized and non-memorized predictions (Fig. 1, C). This difference persists across different transformer architectures and sizes: retrieval from memory happens in two distinct phases, corresponding to distinct roles of the transformer parameters and layers: (1) candidate promotion, where the memorized prediction's rank in the hidden distribution rises in the first layers, and (2) confidence boosting, where, in the last few layers, the prediction's probability grows substantially faster than before. This is unlike non-memorized predictions, where the two phases are less pronounced and often indistinct. We further confirm these phases of memorized predictions through intervention in the network's FFN sublayers, which have been shown to play an important role in the prediction construction process (Geva et al., 2022; Mickus et al., 2022). Concretely, zeroing out hidden FFN neurons in early layers deteriorates memory recall, while intervention in upper layers does not affect it.
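To make the intervention concrete, the following is a minimal, hedged sketch (not the paper's exact experimental setup): it zeroes the intermediate FFN activations of the first few GPT-2 blocks with forward hooks and checks whether an idiom completion survives. The model size, the number of intervened layers, and the example idiom are illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def zero_ffn(module, inputs, output):
    # `output` is the FFN intermediate projection of a GPT-2 block (c_fc);
    # returning zeros effectively removes that block's hidden FFN neurons.
    return torch.zeros_like(output)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

k = 4  # intervene on the first k layers (illustrative choice)
handles = [block.mlp.c_fc.register_forward_hook(zero_ffn)
           for block in model.transformer.h[:k]]

ids = tokenizer("play it by", return_tensors="pt").input_ids
with torch.no_grad():
    pred = model(ids).logits[0, -1].argmax()
# With early-layer FFNs ablated, the memorized completion " ear" is expected to degrade.
print(tokenizer.decode(pred.item()))

for h in handles:
    h.remove()
```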
Last, we show our findings extend to types of memory recall beyond idioms by applying our method to factual statements from the LAMA-UHN dataset (Poerner et al., 2020) (e.g., "The native language of Jean Marais is French"). For factual statements that were completed correctly by the LM, we observe the same two phases as in memorized idioms, further indicating their connection to memory recall.
To summarize, we construct a novel dataset of idioms, usable for probing LM memorization irrespective of the model architecture or training parameterization. We then design a probing methodology that extracts carefully devised features of the internal inference procedure in transformers. By applying our methodology and using our new dataset, we discover a profile that characterizes memory recall across transformer LMs and types of memorized instances. Our released dataset, probing framework, and findings open the door for future work on transformer memorization, to ultimately demystify the internals of neural memory in LMs.
2 Criteria for Detecting Memory Recall
To study memory recall, we require a set of inputs that trigger this process. Prior work on memorization focused on detecting instances whose inclusion in the training data has a specific influence on model behavior, such as increased accuracy on those instances (Feldman and Zhang, 2020; Magar and Schwartz, 2022; Carlini et al., 2022, 2021, 2019). As a result, memorized instances differ across models and training parameterizations. Our goal is instead to find a stable dataset of sequences for which correctly predicting the completion indicates memorization recall. This will greatly reduce the overhead of studying memorization and facilitate useful comparisons across models and studies.
To build such a dataset, we start by defining a general set of criteria that are predicates on sequence features, entirely independent of the LM being probed. Given a textual sequence of $n$ words, we call the first $n-1$ words the prompt and the $n$-th word the target. We focus on the task of predicting the target given the prompt, i.e., predicting the last word in a sequence given its prefix.³ Such predictions can be based on either generalization or memorization, and we are interested in isolating memorized cases to study model behavior on them. Particularly, we are looking for sequences for which success in this task implies memorization recall.
We argue that the following criteria are sufficient for detecting such memorized sequences:
1. Single target, independent of context: We require that the target is the only correct continuation, regardless of the textual context where the prompt is placed.⁴
2. Irreducible prompt: The target is the single correct completion only if the entire prompt is given exactly. Changing or removing parts of the prompt would make the correct target non-unique.
Claim 2.1. Assume a sequence fulfills the above criteria. Then, if a LM correctly predicts the target, it is highly likely that this prediction involves memory recall.
Justification. First, observe that most natural-language prompts have many possible continuations. For example, consider the sentence "to get there fast, you can take this ____". Likely continuations include "route", "highway", "road", "train", "plane", "advice", inter alia. Note that there are several divergent interpretations or contexts for the prompt, and for each, language offers many different ways to express a similar meaning. A prediction that is a product of generalization, i.e., one derived from context and knowledge of language, always has plausible alternatives, depending on the context and stylistic choice of words. Hence, the relationship between the entire prompt and the target, where the target is the single correct continuation, is something that needs to be memorized rather than derived via generalization. A LM that predicts the single correct continuation either memorized this relationship, or used "cues" from the prompt that happen to provide an indication towards the correct continuation. To illustrate the latter, consider the sequence "it's raining cats and ____", which has a single correct continuation, "dogs", but a LM might predict it without observing this sequence during training, due to the semantic proximity of "cats" and "dogs". Our second criterion excludes such cases by requiring that the correct continuation is only likely given the entire sequence.

Therefore, a LM that correctly completes a sequence that fulfills both criteria is likely to have recalled it from memory.

³In cases where tokenization divides the target into sub-tokens, our task becomes predicting the target's first token.
⁴We assume that contexts are naturally-occurring and not adversarial.
In the next section, we argue that idioms are a special case of such sequences, and are thus useful for studying memorization (§3).
3 The Utility of Idioms for Studying Memorization
An idiom is a group of words with a meaning that is not deducible from the meanings of its individual words. For example, consider the phrase "play it by ear": there is a disconnect between its nonsensical literal meaning (to play something by a human-body organ called 'ear') and its intended idiomatic meaning (to improvise).
A key observation is that idioms often satisfy our criteria (§2), and can therefore be used to probe memorization. First, by definition, idioms are expected to be non-compositional (Dankers et al., 2022). They are special "hard-coded" phrases that carry a specific meaning. As a result, their prompts each have a single correct continuation, regardless of their context (criterion 1). For example, consider the prompt "crying over spilt ____": a generalizing prediction would allow this slot to be filled by any spillable item, like wine, water, or juice, while a memorized prediction will retrieve only milk in this context. Notably, while this is an empirical characterization of many idioms, there might be exceptions, e.g., contexts that are adversarially chosen to change the completion. Second, many idioms are "irreducible"; for example, the sub-sequences "crying over" or "over spilt" by themselves have but a scant connection to the word "milk".
Still, not all idioms fulfill the criteria. For example, even when the idiom is far from literal, its constituents sometimes strongly indicate the correct continuation, as in the case of "it's raining cats and ____" (as explained in §2).
Source          # of Idioms   Idiom Length (words)
MAGPIE          590           4.5 ± 0.9
LIDIOMS         149           5.1 ± 1.2
EF              97            5.6 ± 1.9
EPIE            76            4.4 ± 0.7
Total (unique)  814           4.7 ± 1.8

Table 1: Statistics per data source in IDIOMEM.
To construct a dataset of memorization-probing sequences, we will carefully curate a set of English idioms and filter out ones that do not fulfill our criteria.
3.1 The IDIOMEM Dataset
We begin with existing datasets of English idioms: MAGPIE (Haagsma et al., 2020),⁵ EPIE (Saxena and Paul, 2020), and the English subset of LIDIOMS (Moussallem et al., 2018). We enrich this collection with idioms scraped from the website "Education First" (EF).⁶ We then split each idiom into a prompt containing all but the last word and a target that is the last word. Next, we filter out idioms that do not comply with our criteria (§2) or whose target can be predicted from their prompt based on spurious correlations rather than memorization. To this end, we use three simple rules:
Short idioms. We observe that prompts of idioms with just a few words often have multiple plausible continuations that are not necessarily the idiom's target, violating our first criterion. For example, the prompt "break a ____" has many possible continuations (e.g., "window", "promise", and "heart") in addition to its idiomatic continuation "leg". To exclude such cases, we filter out idioms with fewer than 4 words.
Idioms whose target is commonly predicted from the prompt's sub-sequences. We filter such cases to ensure the prompt fulfills our second criterion (prompt irreducibility). To detect these cases, we use an ensemble of pretrained LMs: GPT2-M, ROBERTA-BASE (Liu et al., 2019), T5-BASE (Kale and Rastogi, 2020), and ELECTRA-BASE-GENERATOR (Clark et al., 2020), and check for each model whether there is an n-gram (1 ≤ n ≤ 4) in the prompt from which the model predicts the target. We filter out idioms for which a majority (≥ 3) of the models predicted the target (for some n-gram).
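As an illustrative sketch of this test, the snippet below uses a single causal LM in place of the four-model ensemble and a greedy top-1 criterion; it also assumes only proper sub-sequences of the prompt are checked (the full prompt is the memorization test itself). These simplifications are our assumptions, not the paper's exact procedure.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def target_predictable_from_ngram(prompt_words, target, max_n=4):
    """Return True if some proper n-gram (1 <= n <= max_n) of the prompt alone
    already makes the model predict the target's first sub-token."""
    target_id = tokenizer.encode(" " + target)[0]
    for n in range(1, max_n + 1):
        if n >= len(prompt_words):  # assumption: skip the full prompt
            break
        for start in range(len(prompt_words) - n + 1):
            ngram = " ".join(prompt_words[start:start + n])
            ids = tokenizer(ngram, return_tensors="pt").input_ids
            with torch.no_grad():
                top_id = model(ids).logits[0, -1].argmax().item()
            if top_id == target_id:
                return True
    return False

# Per the discussion in Sec. 2, "it's raining cats and" -> "dogs" is the kind
# of case this test is meant to flag (and hence filter out of the dataset).
print(target_predictable_from_ngram("it's raining cats and".split(), "dogs"))
```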
⁵We take idioms with annotation confidence of >75% and exclude frequently occurring literal interpretations.
⁶https://www.ef.com/wwen/english-resources/english-idioms/
Prompt                               Target     Pred.   Sim.   IDIOMEM
"make a mountain out of a"           molehill                  ✓
"think outside the"                  box                       ✓
"there's no such thing as a free"    lunch                     ✓
"go back to the drawing"             board                     ✓
"boys will be"                       boys               ✓
"take it or leave"                   it         ✓       ✓

Table 2: Example English idioms included and excluded from IDIOMEM by our filters of predictable target (Pred.) and prompt-target similarity (Sim.).
Idioms whose targets are semantically similar to tokens in the prompt. To further ensure prompt irreducibility, we embed the prompt's tokens and the target token using GloVe word embeddings (Pennington et al., 2014). We measure the cosine similarity between the target token and each token in the prompt separately, and take the maximum over all prompt tokens. We filter out idioms where this value is higher than 0.75 (the threshold was tuned manually using a small validation set of idioms).
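A minimal sketch of this last filter follows; the specific GloVe variant and the gensim loader are our assumptions (the paper only states that GloVe embeddings are used).

```python
import gensim.downloader as api

# Pretrained GloVe vectors; the variant below is an illustrative choice.
glove = api.load("glove-wiki-gigaword-300")

def max_prompt_target_similarity(prompt_words, target):
    """Maximum cosine similarity between the target and any prompt token;
    idioms with a value above 0.75 are filtered out."""
    if target not in glove:
        return 0.0
    sims = [glove.similarity(word, target) for word in prompt_words if word in glove]
    return max(sims, default=0.0)

# "boys will be ____" -> "boys": similarity 1.0 with the prompt token "boys",
# so this idiom is filtered out; "think outside the ____" -> "box" should pass.
print(max_prompt_target_similarity(["boys", "will", "be"], "boys"))
print(max_prompt_target_similarity(["think", "outside", "the"], "box"))
```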
Overall, 55.7% of the idioms were filtered out, including 48.5% by length, 6.1% by the predictable-target test, and an additional 1.6% by the prompt-target similarity test, resulting in a dataset of 814 idioms, named IDIOMEM. Further statistics are provided in Tab. 1, and example idioms in Tab. 2.
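For concreteness, the sketch below shows how IDIOMEM can be used to split idioms into memorized and non-memorized sets for a given LM (Fig. 1, A). The model choice, the greedy top-1 criterion, and the inline idiom list are assumptions for illustration; in practice the prompts and targets would be read from the released dataset.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# A few inline examples standing in for the full IDIOMEM dataset.
idioms = [("play it by", "ear"), ("crying over spilt", "milk"),
          ("go back to the drawing", "board")]

memorized, non_memorized = [], []
for prompt, target in idioms:
    target_id = tokenizer.encode(" " + target)[0]  # target's first sub-token
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        pred_id = model(ids).logits[0, -1].argmax().item()
    (memorized if pred_id == target_id else non_memorized).append((prompt, target))

print(f"{len(memorized)} memorized, {len(non_memorized)} non-memorized")
```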
4 Probing Methodology
Background and Notation. Assume a transformer LM with $L$ layers, a hidden dimension $d$, and an input/output embedding matrix $E \in \mathbb{R}^{|V| \times d}$ over a vocabulary $V$. Denote by $s = \langle s_1, \dots, s_t \rangle$ the input sequence to the LM, and let $h_i^{\ell}$ be the output for token $i$ at layer $\ell$, for all $\ell \in \{1, \dots, L\}$ and $i \in \{1, \dots, t\}$. The model's prediction for a token $s_i$ is obtained by projecting its last hidden representation $h_i^{L}$ to the embedding matrix, i.e., $\mathrm{softmax}(E \cdot h_i^{L})$.
Following Geva et al. (2021, 2022), we interpret the prediction for a token $s_i$ by viewing its corresponding sequence of hidden representations $h_i^{1}, \dots, h_i^{L}$ as an evolving distribution over the vocabulary. Concretely, we read the "hidden" distribution at layer $\ell$ by applying the same projection to the hidden representation at that layer: $p_i^{\ell} = \mathrm{softmax}(E \cdot h_i^{\ell})$. Using this interpretation, we track the probability and rank of the predicted token in the output distribution across layers. A token's rank is its position in the output distribution when tokens are sorted by decreasing probability.