Finding Memo:
Extractive Memorization in Constrained Sequence Generation Tasks
Vikas Raunak Arul Menezes
Microsoft Azure AI
Redmond, Washington
{viraunak,arulm}@microsoft.com
Abstract
Memorization presents a challenge for several constrained Natural Language Generation (NLG) tasks such as Neural Machine Translation (NMT), wherein the proclivity of neural models to memorize noisy and atypical samples reacts adversely with the noisy (web crawled) datasets. However, previous studies of memorization in constrained NLG tasks have only focused on counterfactual memorization, linking it to the problem of hallucinations. In this work, we propose a new, inexpensive algorithm for extractive memorization (exact training data generation under insufficient context) in constrained sequence generation tasks and use it to study extractive memorization and its effects in NMT. We demonstrate that extractive memorization poses a serious threat to NMT reliability by qualitatively and quantitatively characterizing the memorized samples as well as the model behavior in their vicinity. Based on empirical observations, we develop a simple algorithm which elicits non-memorized translations of memorized samples from the same model, for a large fraction of such samples. Finally, we show that the proposed algorithm could also be leveraged to mitigate memorization in the model through finetuning. We have released the code to reproduce our results at https://github.com/vyraun/Finding-Memo.
1 Introduction
Previous studies (Arpit et al., 2017; Feldman, 2020; Zhang et al., 2021a) have shown that neural networks capture regular patterns in the training data (generalization) while simultaneously fitting noisy and atypical samples using brute force (memorization). For constrained Natural Language Generation tasks such as Neural Machine Translation (NMT), which rely heavily on noisy (web crawled) data for training high-capacity neural networks, this creates an inherent reliability problem. For example, memorizations could manifest themselves in the form of catastrophic translation errors on specific samples despite high average model performance (Raunak et al., 2021). It is also likely that the memorization of a specific sample could corrupt the translations of samples in its vicinity. Therefore, exploring, quantifying and alleviating the impact of memorization is of critical importance for improving the reliability of such systems. Yet, most of the work on memorization in natural language processing (NLP) has focused either on classification (Zheng and Jiang, 2022) or on unconstrained generation tasks, predominantly language modeling (Carlini et al., 2021; Zhang et al., 2021b; Kharitonov et al., 2021; Chowdhery et al., 2022; Tirumala et al., 2022; Tänzer et al., 2022; Haviv et al., 2022). In this work, we fill a gap in the literature by developing an analogue of extractive memorization for constrained sequence generation tasks in general and NMT in particular. Our main contributions are:

1. We propose a new, inexpensive algorithm for studying extractive memorization in constrained sequence generation tasks and use it to characterize memorization in NMT.

2. We demonstrate that extractive memorization poses a serious threat to NMT reliability by quantitatively and qualitatively analyzing the memorized samples and the neighborhood effects of such memorization. We also demonstrate that the memorized instances could be used to generate errors in disparate systems.

3. Based on an analysis of the neighborhood effects of memorization, we develop a simple memorization mitigation algorithm which produces non-memorized (higher quality) outputs for a large fraction of memorized samples.

4. We show that the outputs produced by the memorization mitigation algorithm could also be used to directly impart corrective behavior into the model through finetuning.
Repetitions | Total Samples | Memorized | Ratio (%) | Perturb Prefix | Perturb Suffix | Perturb Start
1           | 100,000       | 174       | 0.17      | 17.58 %        | 43.24 %        | 12.29 %
2           | 100,000       | 317       | 0.32      | 11.67 %        | 62.84 %        |  4.98 %
3           |   5,381       |  17       | 0.32      | 28.42 %        | 49.52 %        | 18.82 %
4           |   1,885       |   5       | 0.26      | 27.40 %        | 34.00 %        |  8.00 %
5           |     976       |   7       | 0.72      | 26.67 %        | 70.00 %        | 11.42 %
1-5         | 208,242       | 520       | 0.25      | 16.65 %        | 51.65 %        |  8.00 %

Table 1: Quantifying Extractive Memorization: number of memorized samples (using Algorithm 1) and neighborhood effects of memorization (using Algorithm 2) across different training data frequency buckets.
2 Related Work
Our work is concerned with the phenomenon of memorization in constrained natural language generation in general and NMT in particular. The main challenge in analyzing memorization is to determine which samples have been memorized by the model during training. There exist two key algorithms to elicit memorized samples, each yielding a distinctive operational definition of memorization:

1. Counterfactual Memorization: Feldman and Zhang (2020) study label memorization and propose to estimate the memorization value of a training sample by training multiple models on different random subsets of the training data and then measuring the deviation in the sample's classification accuracy under inclusion/exclusion (a compact restatement follows this list). This definition of memorization was further extended to arbitrary performance measures by Raunak et al. (2021) to study memorization in NMT and by Zhang et al. (2021b) to study memorization in language models. However, a practical limitation of analysis based on this definition is the prohibitive computational cost (multiple model trainings) associated with computing memorization values for each training sample.

2. Extractive Memorization: Carlini et al. (2021) propose a data-extraction based definition of memorization to study memorization in language models. Therein, a training string s is extractable if there exists a prefix c that could exactly generate s under an appropriate sampling strategy (e.g. greedy decoding). This definition has the benefit of being computationally inexpensive, although it doesn't have any existing analogue for constrained natural language generation tasks such as NMT.
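For concreteness, the counterfactual notion in item 1 can be summarized as the following inclusion/exclusion gap. The notation here is ours; the cited works estimate the two quantities with models trained on random subsets of the training data rather than exact leave-one-out retraining:

$$
\mathrm{mem}(x_i, y_i) \;=\; \Pr_{f \sim \mathcal{A}(S)}\big[f(x_i) = y_i\big] \;-\; \Pr_{f \sim \mathcal{A}(S \setminus \{(x_i, y_i)\})}\big[f(x_i) = y_i\big],
$$

where $S$ is the training set and $\mathcal{A}$ the training algorithm; Raunak et al. (2021) and Zhang et al. (2021b) replace the 0/1 accuracy with an arbitrary per-sample performance measure for generation tasks.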
In the next section, we define extractive memorization for constrained sequence generation tasks and apply it to NMT; in Section 4 we estimate the neighborhood effect of such memorizations, and in Section 5 we propose a simple algorithm for recovering correct translations of memorized samples.
3 Extractive Memorization
We present our definition of extractive memorization as Algorithm 1. Analogous to extractive memorization in language models (Carlini et al., 2021), this definition labels an input sentence (source) as memorized if its transduction (translation) could be replicated exactly with a prefix considerably shorter than the full input sentence, under greedy decoding. Operationally, we set the prefix ratio threshold (p) to 0.75.
Algorithm 1: Extractive Memorization in NMT
  Data: Trained NMT model T, training dataset S, prefix ratio threshold p
  Result: Memorized samples M, prefix lengths L
  Greedily translate sources in S using T;
  M1 = sources whose translations match their references;
  Greedily translate prefixes of sources in M1 using T;
  M2 = sources with a prefix producing the reference;
  for each source M2[i] in M2 do
      n = length of the source M2[i];
      l = length of the smallest prefix producing the reference;
      if l / n <= p then
          add M2[i] to M and add l to L;
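A minimal Python sketch of Algorithm 1 is given below; it is not the released implementation (see the repository linked in the abstract). The `translate` argument is a hypothetical greedy-decoding wrapper around the trained NMT model, and prefixes are taken over whitespace-separated words purely for illustration.

```python
def find_memorized(pairs, translate, p=0.75):
    """Sketch of Algorithm 1.

    pairs: iterable of (source, reference) training pairs.
    translate: callable performing greedy decoding with the trained model.
    Returns (M, L): memorized sources and their shortest prefix lengths.
    """
    M, L = [], []
    # Step 1 (M1): keep sources whose full greedy translation matches the reference.
    m1 = [(src, ref) for src, ref in pairs if translate(src) == ref]
    # Step 2 (M2): among those, find the shortest prefix that still yields the reference.
    for src, ref in m1:
        words = src.split()
        n = len(words)
        for l in range(1, n + 1):
            if translate(" ".join(words[:l])) == ref:
                # Flag as memorized only if the prefix is sufficiently shorter
                # than the full source (prefix ratio threshold p).
                if l / n <= p:
                    M.append(src)
                    L.append(l)
                break
    return M, L
```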
Next, we apply this definition of memorization to a strong Transformer-Big (Vaswani et al., 2017) baseline trained on the 48.2M WMT20 En-De parallel corpus (Barrault et al., 2020). We describe the dataset, model and training details in Appendix A.

Qualitatively, we observe that the memorized samples detected by Algorithm 1 mostly consist of low-quality samples – templatized source sentences and noisy translations. To analyze the results quantitatively, similar to Carlini et al. (2022), we bucket the training data pairs in terms of their repetitions in the training data. Owing to the sparsity of data with greater than 5 repetitions, we report results in the range of 1-5 repetitions. Further, for repetition values 1 and 2, we select 100K random samples for