retrieved values are converted into a translation probability distribution over candidate target tokens, denoted as the kNN distribution. Finally, the predicted distribution of the NMT model is interpolated with the kNN distribution via a hyper-parameter λ. Along this line, many efforts have been made to improve kNN-MT (Zheng et al., 2021a; Jiang et al., 2021). Particularly, as shown in Figure 1, adaptive kNN-MT (Zheng et al., 2021a) uses the query-key distances and retrieved pairs to dynamically estimate λ, exhibiting better performance than most kNN-MT models.
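The interpolation described above can be sketched as follows. This is a minimal illustrative implementation; the function name, tensor shapes, and default hyper-parameter values are our own choices, not those of any particular kNN-MT system:

```python
import numpy as np

def knn_mt_interpolate(query, keys, values, nmt_probs,
                       k=4, temperature=10.0, lam=0.5):
    """Sketch of kNN-MT: retrieve k neighbours, build a kNN
    distribution, and interpolate it with the NMT distribution.

    query:     decoder hidden state at the current step, shape (d,)
    keys:      datastore key representations, shape (N, d)
    values:    datastore target-token ids, shape (N,)
    nmt_probs: NMT distribution over the vocabulary, shape (V,)
    """
    # L2 distance from the query to every datastore key.
    dists = np.linalg.norm(keys - query, axis=1)        # (N,)
    knn_idx = np.argsort(dists)[:k]                     # k nearest keys
    # Softmax over negative distances -> weights for retrieved tokens.
    logits = -dists[knn_idx] / temperature
    logits -= logits.max()                              # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()     # (k,)
    # Scatter the weights into a vocabulary-sized kNN distribution.
    p_knn = np.zeros_like(nmt_probs)
    np.add.at(p_knn, values[knn_idx], weights)          # handles duplicates
    # Interpolate with the NMT distribution using hyper-parameter lambda.
    return lam * p_knn + (1 - lam) * nmt_probs
```

In practice the datastore holds millions of keys and the nearest-neighbour search is done with an approximate index (e.g. Faiss) rather than a brute-force sort, but the distribution construction is the same.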
However, we find that existing kNN-MT models often suffer from a serious drawback: model performance deteriorates dramatically due to the underlying noise in retrieved pairs. For example, in Figure 1, the retrieved results may contain unrelated tokens such as "no", which leads to a harmful kNN distribution. Besides, for estimating λ, previous studies (Zheng et al., 2021a; Jiang et al., 2021) only consider the retrieved pairs while ignoring the NMT distribution. Returning to Figure 1, compared with the kNN distribution, the NMT model gives a much higher probability to the ground-truth token "been". Although the kNN distribution is insufficiently accurate, it is still assigned a greater weight than the NMT distribution. Obviously, this is inconsistent with our intuition that when the NMT model has high confidence in its prediction, it needs less help from others, and thus the kNN distribution should receive a lower weight. Moreover, we find that during training, a non-negligible proportion of the retrieved pairs from the datastore do not contain ground-truth tokens, which can cause insufficient training of kNN modules. To sum up, conventional kNN-MT models are vulnerable to noise in the datastore, and we further conduct a preliminary study to validate the above issues. Therefore, dealing with noise remains a significant task for kNN-MT models.
In this paper, we explore a robust kNN-MT model. In terms of model architecture, we explore how to more accurately estimate the kNN distribution and better combine it with the NMT distribution. Concretely, unlike previous studies (Zheng et al., 2021a; Jiang et al., 2021) that only use retrieved pairs to dynamically estimate λ, we additionally use the confidence of the NMT prediction, i.e., the predicted probability of each retrieved token, to calibrate the calculation of λ. Meanwhile, we improve the kNN distribution by integrating this confidence to reduce the effect of noise. Besides, we propose to boost the robustness of our model by randomly adding perturbations to retrieved key representations and augmenting retrieved pairs with pseudo ground-truth tokens. By these means, our proposed approach enables the kNN-MT model to better cope with the noise in retrieved pairs, thus improving its robustness.
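The two architectural ideas above (confidence-weighted neighbours and a confidence-calibrated λ) can be roughly sketched as follows. The specific weighting scheme here, including how confidence enters the neighbour weights and how λ is derived, is our own simplification for illustration, not the paper's exact formulation:

```python
import numpy as np

def confidence_calibrated_combine(nmt_probs, knn_tokens, knn_dists,
                                  temperature=10.0):
    """Sketch: use NMT confidence on retrieved tokens to (1) down-weight
    likely-noisy neighbours and (2) shrink the kNN weight lambda when the
    NMT model is already confident.

    nmt_probs:  NMT distribution over the vocabulary, shape (V,)
    knn_tokens: retrieved target-token ids, shape (k,)
    knn_dists:  query-key distances of the retrieved pairs, shape (k,)
    """
    # NMT confidence = predicted probability of each retrieved token.
    conf = nmt_probs[knn_tokens]                          # (k,)
    # Combine distance and confidence into neighbour weights
    # (one possible scheme: softmax over -dist/T + log conf).
    logits = -knn_dists / temperature + np.log(conf + 1e-9)
    logits -= logits.max()
    weights = np.exp(logits) / np.exp(logits).sum()
    p_knn = np.zeros_like(nmt_probs)
    np.add.at(p_knn, knn_tokens, weights)
    # The more confident the NMT prediction, the smaller the kNN weight.
    lam = 1.0 - nmt_probs.max()
    return lam * p_knn + (1 - lam) * nmt_probs
```

Under this scheme, a neighbour like the unrelated "no" in Figure 1 would receive a small weight (low NMT confidence), and a confident NMT prediction such as "been" would keep most of the final probability mass.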
To investigate the effectiveness and generality of our model, we conduct experiments on several commonly-used benchmarks. Experimental results show that our model significantly outperforms adaptive kNN-MT, the state-of-the-art kNN-MT model, across most domains. Moreover, our model exhibits better performance than adaptive kNN-MT on pruned datastores.
2 Related Work
Retrieval-based approaches leveraging auxiliary sentences have shown effectiveness in improving NMT models. Usually, they first retrieve relevant sentences from a translation memory and then exploit them to boost the NMT model during translation. For example, Tu et al. (2018) maintain a continuous cache storing attention vectors as keys and decoder representations as values; the retrieved values are then used to update the decoder representations. Bapna and Firat (2019) perform n-gram retrieval to identify similar source n-grams from the translation memory, whose corresponding target words are then encoded to update decoder representations. Xia et al. (2019) pack the retrieved target sentences into a compact graph, which is then incorporated into decoder representations. He et al. (2021) propose several Transformer-based encoding methods to vectorize retrieved target sentences. Cai et al. (2021) propose a cross-lingual memory retriever to leverage target-side monolingual translation memory, showing effectiveness in low-resource and domain adaptation scenarios.
Compared with the above studies involving additional training, non-parametric retrieval-augmented approaches (Zhang et al., 2018; Bulté and Tezcan, 2019; Xu et al., 2020) are more flexible and thus attract much attention. According to word alignments, Zhang et al. (2018) retrieve similar source sentences with target words from a translation memory, which are used to increase the probabilities of the collected target words being translated. Both Bulté and Tezcan (2019) and Xu et al. (2020)