Analyzing the Use of Influence Functions for Instance-Specific Data Filtering in Neural Machine Translation

Tsz Kin Lam*
ICL, Heidelberg University
lam@cl.uni-heidelberg.de

Eva Hasler
Amazon AI Translate
ehasler@amazon.com

Felix Hieber
Amazon AI Translate
fhieber@amazon.com

* Work done during an internship at Amazon.
Abstract
Customer feedback can be an important signal for improving commercial machine translation systems. One solution for fixing specific translation errors is to remove the related erroneous training instances followed by re-training of the machine translation system, which we refer to as instance-specific data filtering. Influence functions (IF) have been shown to be effective in finding such relevant training examples for classification tasks such as image classification, toxic speech detection and entailment. Given a probing instance, IF find influential training examples by measuring the similarity of the probing instance with a set of training examples in gradient space. In this work, we examine the use of influence functions for Neural Machine Translation (NMT). We propose two effective extensions to a state-of-the-art influence function and demonstrate on the sub-problem of copied training examples that IF can be applied more generally than hand-crafted regular expressions.
1 Introduction
Neural Machine Translation (NMT) is the de facto standard for recent high-quality machine translation systems. NMT, however, requires abundant amounts of bi-text for supervised training. One common approach to increase the amount of bi-text is data augmentation (Sennrich et al., 2015; Edunov et al., 2018; He et al., 2019, inter alia). Another approach is the use of web-crawled data (Bañón et al., 2020), but since crawled data is known to be notoriously noisy (Khayrallah and Koehn, 2018; Caswell et al., 2020), a plethora of data filtering techniques (Junczys-Dowmunt, 2018; Wang et al., 2018; Ramírez-Sánchez et al., 2020, inter alia) have been proposed for retaining a cleaner portion of the bi-text for training.
While standard data filtering techniques aim to improve the quality of the overall training data without targeting the translation quality of specific instances, instance-specific data filtering focuses on improving translation quality for a specific set of input sentences via removal of the related training data. In commercial MT, this selected set of sentences can be the problematic translations reported by customers. One simple approach to instance-specific data filtering in NMT is manual filtering, in which human annotators identify translation errors in sentences reported by customers and design filtering schemes, e.g., regular expressions, to search for related training examples to remove from the training set.
In this work, we attempt to apply a more automatable technique called influence functions (IF), which has been shown to be effective for image classification (Koh and Liang, 2017) and certain NLP tasks such as sentiment analysis, entailment and toxic speech detection (Han et al., 2020; Guo et al., 2020). Given a probing example, influence functions (IF) search for the influential training examples by measuring the similarity of the probing example with a set of training examples in gradient space. Schioppa et al. (2021) use a low-rank approximation of the Hessian to speed up the computation of IF and apply the idea of self-influence to NMT. However, self-influence measures whether a training instance is an outlier rather than its similarity with another instance. Akyürek et al. (2022) question the back-tracing ability of IF on the fact-tracing task. They compare IF with heuristics used in Information Retrieval and attribute the worse performance of IF to a problem called saturation. Compared to fact-tracing, the target sides in machine translation can be more diverse, which complicates the application of IF.

We apply an effective type of IF called TracIn (Pruthi et al., 2020) to NMT for instance-specific data filtering and analyze its behaviour by constructing synthetic training examples containing simulated translation errors. In particular, we find that:
- the gradient similarity, also called the influence,[1] is highly sensitive to the network component.
- vanilla IF may not be sufficient to achieve good retrieval performance. We propose two contrastive methods to further improve the performance.
- training examples consisting of copied source sentences have similar gradients even when they are lexically different. This indicates that the use of influence functions can go beyond what can be achieved with regular expressions.
- an effective automation of instance-specific data filtering remains challenging.

[1] In this work, we use gradient similarity and influence interchangeably to denote the result of IF. Note that TracIn is also one type of IF.
To the best of our knowledge, we are the first to investigate the application of IF to instance-specific data filtering for NMT.
2 Method
Influence functions
IF is a technique from robust statistics (Hampel, 1974; Cook and Weisberg, 1982, inter alia). It aims to trace a model's predictions back to the most responsible training examples without repeated re-training of the model, aka Leave-One-Out. Koh and Liang (2017) extend this idea from robust statistics to deep neural networks such that it requires only the gradients of the loss function $L$ and Hessian-vector products, so that the influence $\mathcal{I}(z, z')$ of two examples $z$ and $z'$ is approximated as

$$\mathcal{I}(z, z') \approx \nabla_\theta L(z')^{T} H_{\hat{\theta}}^{-1} \nabla_\theta L(z) \qquad (1)$$

where $\hat{\theta}$ denotes the model parameters at the optimum and $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(\theta)$ is the Hessian of the loss with respect to the model parameters at $\hat{\theta}$. Given $n$ training instances and $p$ model parameters, inverting the Hessian has a complexity of $O(np^2 + p^3)$, which is expensive to compute for deep neural networks. Several methods have been proposed to speed up the computation of IF, e.g., by computing on a training subset selected by KNN-search (Guo et al., 2020), by approximating the Hessian with LiSSA (Agarwal et al., 2017), by computing on a subset of model parameters (Koh and Liang, 2017), or by replacing the Hessian with some other procedure (Pruthi et al., 2020). In this work, we focus on TracIn, which has been shown to be better than some other variants (Han and Tsvetkov, 2020; Schioppa et al., 2021) in terms of retrieval performance.
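For intuition, the following is a toy sketch of equation 1 in PyTorch for a tiny linear model, where the Hessian can be formed and inverted explicitly. The model, loss, damping term and all names are illustrative assumptions rather than the paper's implementation, which operates on a full NMT model and avoids the explicit Hessian.

```python
import torch

# Toy sketch of equation (1): influence of each training example z on a probing
# example z' for a small linear regression model with an explicit, damped Hessian.
torch.manual_seed(0)
dim = 5
theta = torch.randn(dim, requires_grad=True)   # stand-in for the trained parameters \hat{theta}

def loss(example, params):
    x, y = example
    return 0.5 * (x @ params - y) ** 2         # squared error of a linear model

train_set = [(torch.randn(dim), torch.randn(())) for _ in range(20)]
probe = (torch.randn(dim), torch.randn(()))    # the probing example z'

# Hessian of the loss averaged over the training set, damped for invertibility.
hessian = torch.zeros(dim, dim)
for z in train_set:
    hessian += torch.autograd.functional.hessian(lambda p: loss(z, p), theta)
hessian = hessian / len(train_set) + 1e-3 * torch.eye(dim)

# Equation (1): grad(z')^T  H^{-1}  grad(z) for every training example.
grad_probe = torch.autograd.grad(loss(probe, theta), theta)[0]
for i, z in enumerate(train_set):
    grad_train = torch.autograd.grad(loss(z, theta), theta)[0]
    influence = grad_probe @ torch.linalg.solve(hessian, grad_train)
    print(f"train example {i}: influence {influence.item():+.4f}")
```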
TracIn, denoted by $\mathcal{I}_{\mathrm{TracIn}}(z, z')$, replaces the computationally costly Hessian matrix with an identity matrix. The remaining gradient dot product, also called the gradient similarity, is instead computed over $C$ checkpoints, followed by averaging:

$$\mathcal{I}_{\mathrm{TracIn}}(z, z') = \frac{1}{C} \sum_{i=1}^{C} \nabla_\theta L(z')^{T} \nabla_\theta L(z) \qquad (2)$$

In NMT, given the same source sentence, the magnitude of the gradient is in general positively correlated with the length of the target sentence. In order to reduce the effect of the target length, we normalize equation 2 by the product of $\|\nabla_\theta L(z')\|$ and $\|\nabla_\theta L(z)\|$, or equivalently, we compute the cosine similarity of $\nabla_\theta L(z')$ and $\nabla_\theta L(z)$.
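A minimal sketch of this length-normalized TracIn score is shown below. The names `model`, `loss_fn` (returning the per-sentence-pair NMT loss) and `checkpoint_paths` are placeholder assumptions, not taken from the paper's code base.

```python
import torch

def flat_grad(model, loss_fn, example):
    """Gradient of the per-example loss, flattened into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, example)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def tracin_cosine(model, loss_fn, checkpoint_paths, z_train, z_probe):
    """Average cosine similarity of the two loss gradients over C checkpoints."""
    score = 0.0
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state["model"])   # assumes a fairseq-style checkpoint dict
        g_train = flat_grad(model, loss_fn, z_train)
        g_probe = flat_grad(model, loss_fn, z_probe)
        score += torch.nn.functional.cosine_similarity(g_train, g_probe, dim=0).item()
    return score / len(checkpoint_paths)
```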
Given a probing instance $z'$ and its probing gradient $\nabla_\theta L(z')$, instances in the training set that yield a positive value of $\mathcal{I}_{\mathrm{TracIn}}(z, z')$ are called the positively influential training instances (+IFTrain), whereas those that yield a negative value of $\mathcal{I}_{\mathrm{TracIn}}(z, z')$ are called the negatively influential training instances (-IFTrain). Taking a gradient step on +IFTrain reduces the loss on the probing example, while taking a gradient step on -IFTrain increases it. IF can be used for data filtering by removing the +IFTrain examples of low-quality probing samples, since their gradients have a similar direction. Conversely, if the probing sample is of high quality, removing -IFTrain examples from the training data would be expected to increase translation quality w.r.t. the probing sample.
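Continuing the sketch above, filtering against a low-quality probing sample could then look roughly as follows; `train_set`, `bad_probe` and the cut-off `k` are hypothetical names, not values or code from the paper.

```python
# Rank the training set by TracIn score against a low-quality probing translation
# and drop the most positively influential examples (+IFTrain) before re-training.
scores = [(i, tracin_cosine(model, loss_fn, checkpoint_paths, z, bad_probe))
          for i, z in enumerate(train_set)]

# +IFTrain: examples with positive influence, most influential first.
pos_influential = sorted((s for s in scores if s[1] > 0), key=lambda s: -s[1])

k = 100  # hypothetical number of examples to remove, not a value from the paper
to_remove = {i for i, _ in pos_influential[:k]}
filtered_train_set = [z for i, z in enumerate(train_set) if i not in to_remove]
```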
3 Experimental Setting
Model configuration and training
We use
Transformer BASE configuration as described in
Vaswani et al. (2017) with default setting and im-
plementation in FAIRSEQ. We use a sentence-piece
model to create subword units of size 32k. Un-
less otherwise specified, we pre-trained our NMT
on Europarl-v7 data and News Commentary-v12
data in German-English direction from WMT17
for 100 epochs, about 112K updates, using Adam
Shared parameters Non-shared parameters
Samples F ull Emb srcEmb trgEmb output concat
Probing Noch kommt Volkswagen glimpflich durch. 1 1 1 1 1 1
Volkswagen gets off lightly.
1Das £ 1,35 Mrd. teure Projekt soll bis 0.153 0.240 0.006 0.287 0.437 0.339
Mai 2017 fertiggestellt werden
Volkswagen gets off lightly.
2Alle in Frage kommenden Produkte wurden 0.238 0.320 0.013 0.230 0.401 0.319
aus dem Verkauf gezogen.
Volkswagen gets off lightly.
3 Noch kommt Volkswagen glimpflich durch. -0.021 -0.030 -0.149 -0.022 -0.017 -0.040
In 2008, most malware programmes were
still focused on sending out adverts.
4 Noch kommt Volkswagen glimpflich durch. -0.007 -0.016 -0.120 -0.003 0.011 -0.013
We’ve made a complete turnaround.
5 Noch kommt Volkswagen glimpflich durch. 0.950 0.894 0.973 0.927 0.843 0.873
Volkswagen gets off lightly!
6 Noch kommt Volkswagen glimpflich durch!0.899 0.912 0.873 0.915 0.940 0.927
Volkswagen gets off lightly.
Table 1: Example showing the changes of influence by network components. Segments that are marked in red
are perturbed from the probing example. Xindicates the network components used in computing the influence,
concat indicates the concatenation of srcEmb,trgEmb and output.
optimizerion training of 16-bit
2
. The effective
mini-batch size is 4096 x 16 tokens and it takes a
p3.16xlarge
3
machine on AWS 6 hours for training.
We evaluate the MT model on the newstest2017 test set with a checkpoint averaged over the 10 best checkpoints, as measured by the validation loss on the newstest2014-2016 dev set. On the test set, our NMT model with non-shared parameters for the two word embeddings and the output layer scores 29.99 BLEU, whereas the one with shared parameters scores 29.78 BLEU. We use beam search with a beam size of 5 for decoding.
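Checkpoint averaging of this kind can be sketched as below. FAIRSEQ ships its own average_checkpoints script, so this stand-alone version and the assumed fairseq-style "model" key are for illustration only; the file paths are hypothetical.

```python
import torch

def average_checkpoints(paths):
    """Average the parameter tensors of several checkpoints into one state dict."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]  # assumed checkpoint layout
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# Usage with hypothetical paths:
# model.load_state_dict(average_checkpoints(
#     [f"checkpoints/checkpoint.best_{i}.pt" for i in range(10)]))
```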
TracIn
We select 5 checkpoints, i.e., at epochs 5, 8, 15, 30 and 100, for computing TracIn.[4] We select checkpoints which have relatively large changes in the validation loss, i.e., usually in the earlier phase of training, and include the last one to cover information from the end of training. We compute the per-sample gradients with a batch size of 1, parallelized over multiple processes on several g4dn.2xlarge[3] machines on AWS.

[2] We use 32-bit precision to compute the gradient similarity once the training is done.
[3] See https://aws.amazon.com/ec2/instance-types/ for details.
[4] It is tempting to just use the deployed checkpoint to compute the influence. As shown by Koh and Liang (2017), however, the Hessian term in equation 1 captures the effect of model training more accurately than the dot product at the optimal checkpoint. In TracIn, the Hessian is approximated by the average over a set of checkpoints, and we follow their guidelines for checkpoint selection.
4 Experimental results
This section describes our findings on the properties of applying IF to NMT for instance-specific data filtering.

4.1 Sensitivity of gradient similarity to the network components

In previous work, the influence, also called the gradient similarity, is usually computed with respect to a small part of the network parameters, especially the last or the last few layers (Han et al., 2020; Barshan et al., 2020, inter alia). In NMT, we found that the resulting influence is highly sensitive to the network components used in computing the gradients (the gradient components). For illustration, we construct a set of perturbed instances, compute their influence using different gradient components and observe the changes. The perturbed instances are not included during NMT training. This independence between the NMT model and the perturbed instances provides a simpler setting for checking how the gradient components and the perturbed examples affect the influence.
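To make the notion of a gradient component concrete, the sketch below restricts the gradient similarity to a single component (srcEmb, trgEmb or the output projection), in the spirit of the comparison in Table 1. The parameter-name patterns follow common FAIRSEQ Transformer naming and are assumptions, not taken from the paper.

```python
import torch

# Assumed mapping from the component names used in Table 1 to parameter-name
# patterns of a fairseq Transformer; adjust to the actual model's naming.
COMPONENT_PATTERNS = {
    "srcEmb": "encoder.embed_tokens",
    "trgEmb": "decoder.embed_tokens",
    "output": "decoder.output_projection",
}

def component_grad(model, loss_fn, example, component):
    """Loss gradient flattened over the parameters of one component only."""
    params = [p for n, p in model.named_parameters()
              if COMPONENT_PATTERNS[component] in n and p.requires_grad]
    loss = loss_fn(model, example)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def component_influence(model, loss_fn, z_train, z_probe, component):
    """Cosine gradient similarity computed w.r.t. a single network component."""
    g_t = component_grad(model, loss_fn, z_train, component)
    g_p = component_grad(model, loss_fn, z_probe, component)
    return torch.nn.functional.cosine_similarity(g_t, g_p, dim=0).item()
```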