Analyzing the Use of Influence Functions for Instance-Specific Data Filtering in Neural Machine Translation

Tsz Kin Lam*
ICL, Heidelberg University
lam@cl.uni-heidelberg.de

Eva Hasler
Amazon AI Translate
ehasler@amazon.com

Felix Hieber
Amazon AI Translate
fhieber@amazon.com

* Work done during an internship at Amazon.
Abstract
Customer feedback can be an important signal for improving commercial machine translation systems. One solution for fixing specific translation errors is to remove the related erroneous training instances followed by re-training of the machine translation system, which we refer to as instance-specific data filtering. Influence functions (IF) have been shown to be effective in finding such relevant training examples for classification tasks such as image classification, toxic speech detection and entailment. Given a probing instance, IF find influential training examples by measuring the similarity of the probing instance with a set of training examples in gradient space. In this work, we examine the use of influence functions for Neural Machine Translation (NMT). We propose two effective extensions to a state-of-the-art influence function and demonstrate on the sub-problem of copied training examples that IF can be applied more generally than hand-crafted regular expressions.
1 Introduction
Neural Machine Translation (NMT) is the de facto standard for recent high-quality machine translation systems. NMT, however, requires abundant amounts of bi-text for supervised training. One common approach to increase the amount of bi-text is data augmentation (Sennrich et al., 2015; Edunov et al., 2018; He et al., 2019, inter alia). Another approach is the use of web-crawled data (Bañón et al., 2020), but since crawled data is known to be notoriously noisy (Khayrallah and Koehn, 2018; Caswell et al., 2020), a plethora of data filtering techniques (Junczys-Dowmunt, 2018; Wang et al., 2018; Ramírez-Sánchez et al., 2020, inter alia) have been proposed for retaining a cleaner portion of the bi-text for training.
While standard data filtering techniques aim to improve the quality of the overall training data without targeting the translation quality of specific instances, instance-specific data filtering focuses on improving translation quality for a specific set of input sentences via removal of the related training data. In commercial MT, this selected set of sentences can be the problematic translations reported by customers. One simple approach to instance-specific data filtering in NMT is manual filtering, in which human annotators identify translation errors in sentences reported by customers and design filtering schemes, e.g., regular expressions, to search for related training examples to remove from the training set.
In this work, we attempt to apply a more automatable technique called influence functions (IF), which has been shown to be effective for image classification (Koh and Liang, 2017) and certain NLP tasks such as sentiment analysis, entailment and toxic speech detection (Han et al., 2020; Guo et al., 2020). Given a probing example, influence functions (IF) search for the influential training examples by measuring the similarity of the probing example with a set of training examples in gradient space. Schioppa et al. (2021) use a low-rank approximation of the Hessian to speed up the computation of IF and apply the idea of self-influence to NMT. However, self-influence measures whether a training instance is an outlier rather than its similarity with another instance. Akyürek et al. (2022) question the back-tracing ability of IF on the fact-tracing task. They compare IF with heuristics used in Information Retrieval and attribute the worse performance of IF to a problem called saturation. Compared to fact-tracing, the target sides in machine translation can be more diverse, which complicates the application of IF.

We apply an effective type of IF called TracIn (Pruthi et al., 2020) to NMT for instance-specific data filtering and analyze its behaviour by constructing synthetic training examples containing simulated translation errors. In particular, we find that:
- the gradient similarity, also called the influence,[1] is highly sensitive to the network component.
- vanilla IF may not be sufficient to achieve good retrieval performance. We propose two contrastive methods to further improve the performance.
- training examples consisting of copied source sentences have similar gradients even when they are lexically different. This indicates that the use of influence functions can go beyond what can be achieved with regular expressions.
- an effective automation of instance-specific data filtering remains challenging.

[1] In this work, we use gradient similarity and influence interchangeably to denote the result of IF. Note that TracIn is also one type of IF.
To the best of our knowledge, we are the first to investigate the application of IF to instance-specific data filtering for NMT.
2 Method
Influence functions
IF is a technique from robust statistics (Hampel, 1974; Cook and Weisberg, 1982, inter alia). It aims to trace a model's predictions back to the most responsible training examples without repeated re-training of the model, aka Leave-One-Out. Koh and Liang (2017) extend this idea from robust statistics to deep neural networks such that it requires only the gradients of the loss function $L$ and Hessian-vector products, so that the influence $\mathcal{I}(z, z')$ of two examples $z$ and $z'$ is approximated as

$$\mathcal{I}(z, z') \approx \nabla_\theta L(z')^{T} H_{\hat{\theta}}^{-1} \nabla_\theta L(z) \qquad (1)$$

where $\hat{\theta}$ denotes the model parameters at the optimum and $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(\theta)$ is the Hessian of the loss with respect to the model parameters at $\hat{\theta}$. Given $n$ training instances and $p$ model parameters, inverting the Hessian has a complexity of $O(np^2 + p^3)$, which is expensive to compute for deep neural networks. Several methods have been proposed to speed up the computation of IF, e.g., by computing on a training subset selected by KNN-search (Guo et al., 2020), by approximating the Hessian with LiSSA (Agarwal et al., 2017), by computing on a subset of model parameters (Koh and Liang, 2017), or by replacing the Hessian with some other procedure (Pruthi et al., 2020). In this work, we focus on TracIn, which has been shown to be better than some other variants (Han and Tsvetkov, 2020; Schioppa et al., 2021) in terms of retrieval performance.
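For intuition, the following is a toy sketch of equation 1 in PyTorch for a tiny linear model, where the Hessian can be formed and inverted explicitly. The model, loss, damping term and all names are illustrative assumptions rather than the paper's implementation, which operates on a full NMT model and avoids the explicit Hessian.

```python
import torch

# Toy sketch of equation (1): influence of each training example z on a probing
# example z' for a small linear regression model with an explicit, damped Hessian.
torch.manual_seed(0)
dim = 5
theta = torch.randn(dim, requires_grad=True)   # stand-in for the trained parameters \hat{theta}

def loss(example, params):
    x, y = example
    return 0.5 * (x @ params - y) ** 2         # squared error of a linear model

train_set = [(torch.randn(dim), torch.randn(())) for _ in range(20)]
probe = (torch.randn(dim), torch.randn(()))    # the probing example z'

# Hessian of the loss averaged over the training set, damped for invertibility.
hessian = torch.zeros(dim, dim)
for z in train_set:
    hessian += torch.autograd.functional.hessian(lambda p: loss(z, p), theta)
hessian = hessian / len(train_set) + 1e-3 * torch.eye(dim)

# Equation (1): grad(z')^T  H^{-1}  grad(z) for every training example.
grad_probe = torch.autograd.grad(loss(probe, theta), theta)[0]
for i, z in enumerate(train_set):
    grad_train = torch.autograd.grad(loss(z, theta), theta)[0]
    influence = grad_probe @ torch.linalg.solve(hessian, grad_train)
    print(f"train example {i}: influence {influence.item():+.4f}")
```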
TracIn, denoted by $\mathcal{I}_{\mathrm{TracIn}}(z, z')$, replaces the computationally costly Hessian matrix with an identity matrix. The remaining gradient dot product, also called the gradient similarity, is instead computed over $C$ checkpoints, followed by averaging:

$$\mathcal{I}_{\mathrm{TracIn}}(z, z') = \frac{1}{C} \sum_{i=1}^{C} \nabla_\theta L(z')^{T} \nabla_\theta L(z) \qquad (2)$$

In NMT, given the same source sentence, the magnitude of the gradient is in general positively correlated with the length of the target sentence. In order to reduce the effect of the target length, we normalize equation 2 by the product of $\|\nabla_\theta L(z')\|$ and $\|\nabla_\theta L(z)\|$, or equivalently, we compute the cosine similarity of $\nabla_\theta L(z')$ and $\nabla_\theta L(z)$.
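A minimal sketch of this length-normalized TracIn score is shown below. The names `model`, `loss_fn` (returning the per-sentence-pair NMT loss) and `checkpoint_paths` are placeholder assumptions, not taken from the paper's code base.

```python
import torch

def flat_grad(model, loss_fn, example):
    """Gradient of the per-example loss, flattened into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, example)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def tracin_cosine(model, loss_fn, checkpoint_paths, z_train, z_probe):
    """Average cosine similarity of the two loss gradients over C checkpoints."""
    score = 0.0
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state["model"])   # assumes a fairseq-style checkpoint dict
        g_train = flat_grad(model, loss_fn, z_train)
        g_probe = flat_grad(model, loss_fn, z_probe)
        score += torch.nn.functional.cosine_similarity(g_train, g_probe, dim=0).item()
    return score / len(checkpoint_paths)
```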
Given a probing instance $z'$ and its probing gradient $\nabla_\theta L(z')$, instances in the training set that yield a positive value of $\mathcal{I}_{\mathrm{TracIn}}(z, z')$ are called the positively influential training instances (+IFTrain), whereas those that yield a negative value of $\mathcal{I}_{\mathrm{TracIn}}(z, z')$ are called the negatively influential training instances (-IFTrain). Taking a gradient step on +IFTrain reduces the loss on the probing example, while taking a gradient step on -IFTrain increases it. IF can be used for data filtering by removing the +IFTrain examples of low-quality probing samples, since their gradients have a similar direction. Conversely, if the probing sample is of high quality, removing -IFTrain examples from the training data would be expected to increase translation quality w.r.t. the probing sample.
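Continuing the sketch above, filtering against a low-quality probing sample could then look roughly as follows; `train_set`, `bad_probe` and the cut-off `k` are hypothetical names, not values or code from the paper.

```python
# Rank the training set by TracIn score against a low-quality probing translation
# and drop the most positively influential examples (+IFTrain) before re-training.
scores = [(i, tracin_cosine(model, loss_fn, checkpoint_paths, z, bad_probe))
          for i, z in enumerate(train_set)]

# +IFTrain: examples with positive influence, most influential first.
pos_influential = sorted((s for s in scores if s[1] > 0), key=lambda s: -s[1])

k = 100  # hypothetical number of examples to remove, not a value from the paper
to_remove = {i for i, _ in pos_influential[:k]}
filtered_train_set = [z for i, z in enumerate(train_set) if i not in to_remove]
```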
3 Experimental Setting
Model configuration and training
We use
Transformer BASE configuration as described in
Vaswani et al. (2017) with default setting and im-
plementation in FAIRSEQ. We use a sentence-piece
model to create subword units of size 32k. Un-
less otherwise specified, we pre-trained our NMT
on Europarl-v7 data and News Commentary-v12
data in German-English direction from WMT17
for 100 epochs, about 112K updates, using Adam
Shared parameters Non-shared parameters
Samples F ull Emb srcEmb trgEmb output concat
Probing Noch kommt Volkswagen glimpflich durch. 1 1 1 1 1 1
Volkswagen gets off lightly.
1Das £ 1,35 Mrd. teure Projekt soll bis 0.153 0.240 0.006 0.287 0.437 0.339
Mai 2017 fertiggestellt werden
Volkswagen gets off lightly.
2Alle in Frage kommenden Produkte wurden 0.238 0.320 0.013 0.230 0.401 0.319
aus dem Verkauf gezogen.
Volkswagen gets off lightly.
3 Noch kommt Volkswagen glimpflich durch. -0.021 -0.030 -0.149 -0.022 -0.017 -0.040
In 2008, most malware programmes were
still focused on sending out adverts.
4 Noch kommt Volkswagen glimpflich durch. -0.007 -0.016 -0.120 -0.003 0.011 -0.013
We’ve made a complete turnaround.
5 Noch kommt Volkswagen glimpflich durch. 0.950 0.894 0.973 0.927 0.843 0.873
Volkswagen gets off lightly!
6 Noch kommt Volkswagen glimpflich durch!0.899 0.912 0.873 0.915 0.940 0.927
Volkswagen gets off lightly.
Table 1: Example showing the changes of influence by network components. Segments that are marked in red
are perturbed from the probing example. Xindicates the network components used in computing the influence,
concat indicates the concatenation of srcEmb,trgEmb and output.
optimizerion training of 16-bit
2
. The effective
mini-batch size is 4096 x 16 tokens and it takes a
p3.16xlarge
3
machine on AWS 6 hours for training.
We evaluate the MT model on the newstest2017 test set with a checkpoint averaged over the 10 best checkpoints, as measured by the validation loss on the newstest2014-2016 dev set. On the test set, our NMT model with non-shared parameters for the two word embeddings and the output layer scores 29.99 BLEU, whereas the one with shared parameters scores 29.78 BLEU. We use beam search with a beam size of 5 for decoding.
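Checkpoint averaging of this kind can be sketched as below. FAIRSEQ ships its own average_checkpoints script, so this stand-alone version and the assumed fairseq-style "model" key are for illustration only; the file paths are hypothetical.

```python
import torch

def average_checkpoints(paths):
    """Average the parameter tensors of several checkpoints into one state dict."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]  # assumed checkpoint layout
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# Usage with hypothetical paths:
# model.load_state_dict(average_checkpoints(
#     [f"checkpoints/checkpoint.best_{i}.pt" for i in range(10)]))
```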
TracIn
We select 5 checkpoints, i.e., at epochs 5, 8, 15, 30 and 100, for computing TracIn.[4] We select checkpoints which have relatively large changes in the validation loss, i.e., usually in the earlier phase of training, and include the last one to cover information from the end of training. We compute the per-sample gradients with a batch size of 1, parallelized over multiple processes on several g4dn.2xlarge[3] machines on AWS.

[2] We use 32-bit precision to compute the gradient similarity once the training is done.
[3] See https://aws.amazon.com/ec2/instance-types/ for details.
[4] It is tempting to just use the deployed checkpoint to compute the influence. As shown by Koh and Liang (2017), however, the Hessian term in equation 1 captures the effect of model training more accurately than the dot product at the optimal checkpoint. In TracIn, the Hessian is approximated by the average over a set of checkpoints, and we follow their guidelines for checkpoint selection.
4 Experimental results
This section describes our findings on the properties of applying IF to NMT for instance-specific data filtering.

4.1 Sensitivity of gradient similarity to the network components

In previous work, the influence, also called the gradient similarity, is usually computed with respect to a small part of the network parameters, especially the last or the last few layers (Han et al., 2020; Barshan et al., 2020, inter alia). In NMT, we found that the resulting influence is highly sensitive to the network components used in computing the gradients (the gradient components). For illustration, we construct a set of perturbed instances, compute their influence using different gradient components and observe the changes. The perturbed instances are not included during NMT training. This independence between the NMT model and the perturbed instances provides a simpler setting for checking how the gradient components and the perturbed examples affect the influence.
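To make the notion of a gradient component concrete, the sketch below restricts the gradient similarity to a single component (srcEmb, trgEmb or the output projection), in the spirit of the comparison in Table 1. The parameter-name patterns follow common FAIRSEQ Transformer naming and are assumptions, not taken from the paper.

```python
import torch

# Assumed mapping from the component names used in Table 1 to parameter-name
# patterns of a fairseq Transformer; adjust to the actual model's naming.
COMPONENT_PATTERNS = {
    "srcEmb": "encoder.embed_tokens",
    "trgEmb": "decoder.embed_tokens",
    "output": "decoder.output_projection",
}

def component_grad(model, loss_fn, example, component):
    """Loss gradient flattened over the parameters of one component only."""
    params = [p for n, p in model.named_parameters()
              if COMPONENT_PATTERNS[component] in n and p.requires_grad]
    loss = loss_fn(model, example)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def component_influence(model, loss_fn, z_train, z_probe, component):
    """Cosine gradient similarity computed w.r.t. a single network component."""
    g_t = component_grad(model, loss_fn, z_train, component)
    g_p = component_grad(model, loss_fn, z_probe, component)
    return torch.nn.functional.cosine_similarity(g_t, g_p, dim=0).item()
```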