Towards Robust k-Nearest-Neighbor Machine Translation
Hui Jiang1, Ziyao Lu2, Fandong Meng2, Chulun Zhou2, Jie Zhou2, Degen Huang3 and Jinsong Su1,4†
1School of Informatics, Xiamen University, China
2Pattern Recognition Center, WeChat AI, Tencent Inc, China
3Dalian University of Technology, China
4Pengcheng Laboratory, China
hjiang@stu.xmu.edu.cn {ziyaolu,fandongmeng,chulunzhou,withtomzhou}@tencent.com
huangdg@dlut.edu.cn jssu@xmu.edu.cn

* Equal contribution.
† Corresponding author.
This work was done when Hui Jiang was interning at the Pattern Recognition Center, WeChat AI, Tencent Inc, China.
Abstract
k-Nearest-Neighbor Machine Translation (kNN-MT) has become an important research direction of NMT in recent years. Its main idea is to retrieve useful key-value pairs from an additional datastore to modify translations without updating the NMT model. However, noise in the retrieved pairs can dramatically deteriorate model performance. In this paper, we conduct a preliminary study and find that this problem results from not fully exploiting the predictions of the NMT model. To alleviate the impact of noise, we propose a confidence-enhanced kNN-MT model with robust training. Concretely, we introduce NMT confidence to refine the modeling of two important components of kNN-MT: the kNN distribution and the interpolation weight. Meanwhile, we inject two types of perturbations into the retrieved pairs for robust training. Experimental results on four benchmark datasets demonstrate that our model not only achieves significant improvements over current kNN-MT models, but also exhibits better robustness. Our code is available at https://github.com/DeepLearnXMU/Robust-knn-mt.
1 Introduction
As a commonly-used paradigm of retrieval-based neural machine translation (NMT), k-Nearest-Neighbor Machine Translation (kNN-MT) has proven effective in many studies (Khandelwal et al., 2021; Zheng et al., 2021a; Jiang et al., 2021; Wang et al., 2022; Meng et al., 2022), and has thus attracted much attention in the machine translation community. The core of kNN-MT is to use an auxiliary datastore containing cached decoder representations and the corresponding target tokens. This datastore can flexibly guide the NMT model to make better predictions, especially for domain adaptation.
[Figure 1: An example of a kNN-MT model using dynamic estimation of λ (Zheng et al., 2021a; Jiang et al., 2021). The source sentence "J'ai été dans ma propre chambre." is translated; retrieved values and their query-key distances from the datastore form the kNN distribution, which is interpolated with the NMT distribution using weights λ = 0.7 and 1 − λ.]
Compared with other retrieval-based paradigms (Tu et al., 2018; Cai et al., 2021), kNN-MT has two advantages: 1) it is more scalable, because we can directly improve the NMT model simply by manipulating the datastore; 2) it is more interpretable, thanks to its observable retrieved pairs.
Generally, kNN-MT involves two main stages: datastore establishment and candidate retrieval. In the first stage, a pre-trained NMT model is used to construct a datastore of key-value pairs, where each key is a decoder representation and each value is the corresponding target token. In the second stage, using the current decoder representation as a query at each time step, the k nearest key-value pairs are retrieved from the datastore. Then, according to the query-key distances, the retrieved values are converted into a translation probability distribution over candidate target tokens, denoted as the kNN distribution. Finally, the predicted distribution of the NMT model is interpolated with the kNN distribution via a hyper-parameter λ. Along this line, many efforts have been made to improve kNN-MT (Zheng et al., 2021a; Jiang et al., 2021). In particular, as shown in Figure 1, adaptive kNN-MT (Zheng et al., 2021a) uses the query-key distances and retrieved pairs to dynamically estimate λ, exhibiting better performance than most kNN-MT models.
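To make these two stages concrete, the following sketch shows a single decoding step of vanilla kNN-MT. It is a minimal illustration of the common formulation (a softmax over negative temperature-scaled distances, as in Khandelwal et al., 2021), not the authors' implementation; the array names and the values of k, the temperature, and λ are placeholders.

```python
import numpy as np

def knn_mt_step(query, keys, values, nmt_probs, k=8, temperature=10.0, lam=0.7):
    """One decoding step of vanilla kNN-MT (illustrative sketch).

    query:     current decoder representation, shape (d,)
    keys:      cached decoder representations in the datastore, shape (N, d)
    values:    target-token ids aligned with the keys, shape (N,)
    nmt_probs: NMT distribution over the vocabulary, shape (V,)
    """
    # Stage 2a: retrieve the k nearest key-value pairs by squared L2 distance.
    dists = ((keys - query) ** 2).sum(axis=1)
    nearest = np.argsort(dists)[:k]

    # Stage 2b: turn distances into the kNN distribution, i.e. a softmax
    # over negative scaled distances, aggregated per retrieved token.
    scores = np.exp(-dists[nearest] / temperature)
    knn_probs = np.zeros_like(nmt_probs)
    for token, score in zip(values[nearest], scores):
        knn_probs[token] += score
    knn_probs /= knn_probs.sum()

    # Final prediction: interpolate the kNN and NMT distributions with lambda.
    return lam * knn_probs + (1 - lam) * nmt_probs
```

In adaptive kNN-MT, the fixed `lam` above is instead predicted at each step by a light network that reads the retrieved distances and value counts.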
However, we find that existing kNN-MT models often suffer from a serious drawback: their performance deteriorates dramatically due to the underlying noise in retrieved pairs. For example, in Figure 1, the retrieved results may contain unrelated tokens such as "no", which leads to a harmful kNN distribution. Besides, when estimating λ, previous studies (Zheng et al., 2021a; Jiang et al., 2021) only consider the retrieved pairs while ignoring the NMT distribution. Returning to Figure 1, compared with the kNN distribution, the NMT model assigns a much higher probability to the ground-truth token "been". Although the kNN distribution is insufficiently accurate, it is still given a greater weight than the NMT distribution. Obviously, this is inconsistent with the intuition that when the NMT model is highly confident in its prediction, it needs less help from the datastore, and thus the kNN distribution should receive a lower weight. Moreover, we find that during training, a non-negligible proportion of the retrieved pairs from the datastore do not contain the ground-truth token, which can cause insufficient training of the kNN modules. To sum up, conventional kNN-MT models are vulnerable to noise in the datastore; we further conduct a preliminary study to validate the above issues. Dealing with this noise therefore remains a significant task for kNN-MT.
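To make this failure mode concrete, here is a toy numeric example in the spirit of Figure 1 (all probabilities are invented for illustration): even when the NMT model is confident in the correct token, a λ estimated without looking at the NMT distribution can let noisy neighbors flip the prediction.

```python
import numpy as np

# Toy vocabulary ["been", "no", "you"]; all probabilities are invented.
nmt_probs = np.array([0.80, 0.05, 0.15])  # NMT is confident in "been"
knn_probs = np.array([0.20, 0.60, 0.20])  # noisy neighbors favor "no"
lam = 0.7  # interpolation weight chosen without consulting NMT confidence

final = lam * knn_probs + (1 - lam) * nmt_probs
# final == [0.38, 0.435, 0.185]: the wrong token "no" now outranks "been".
print(final.argmax())  # 1
```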
In this paper, we explore a robust kNN-MT model. In terms of model architecture, we investigate how to estimate the kNN distribution more accurately and how to better combine it with the NMT distribution. Concretely, unlike previous studies (Zheng et al., 2021a; Jiang et al., 2021) that only use retrieved pairs to dynamically estimate λ, we additionally use the confidence of the NMT prediction, defined as the predicted probability of each retrieved token, to calibrate the calculation of λ. Meanwhile, we improve the kNN distribution by integrating this confidence to reduce the effect of noise. Besides, we propose to boost the robustness of our model by randomly adding perturbations to the retrieved key representations and augmenting retrieved pairs with pseudo ground-truth tokens. By these means, our approach enables the kNN-MT model to better cope with noise in retrieved pairs, thus improving its robustness.
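The sketch below illustrates one plausible reading of these two ideas; the network shape, feature choices, noise scale, and replacement rate are all assumptions for illustration, not the paper's actual design.

```python
import numpy as np
import torch
import torch.nn as nn

class ConfidenceEnhancedLambda(nn.Module):
    """Estimate lambda from both the query-key distances and the NMT
    confidence (predicted probability) of each retrieved token."""

    def __init__(self, k: int, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * k, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, distances: torch.Tensor, confidences: torch.Tensor):
        # distances, confidences: shape (batch, k); output lambda in (0, 1).
        return self.mlp(torch.cat([distances, confidences], dim=-1))

def perturb_retrieved_pairs(keys, values, gold_token, noise_std=0.01,
                            replace_rate=0.2, rng=np.random.default_rng()):
    """Training-time perturbations on retrieved pairs (illustrative).

    keys:       retrieved key representations, shape (k, d)
    values:     retrieved target-token ids, shape (k,)
    gold_token: ground-truth token id at this time step
    """
    # 1) Perturb key representations with small Gaussian noise.
    noisy_keys = keys + rng.normal(0.0, noise_std, size=keys.shape)
    # 2) Randomly overwrite some retrieved values with the (pseudo)
    #    ground-truth token, so training also sees pairs that contain it.
    mask = rng.random(values.shape) < replace_rate
    return noisy_keys, np.where(mask, gold_token, values)
```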
To investigate the effectiveness and generality of our model, we conduct experiments on several commonly-used benchmarks. Experimental results show that our model significantly outperforms adaptive kNN-MT, the state-of-the-art kNN-MT model, across most domains. Moreover, our model also performs better than adaptive kNN-MT on pruned datastores.
2 Related Work
Retrieval-based approaches leveraging auxiliary sentences have shown effectiveness in improving NMT models. Usually, they first retrieve relevant sentences from a translation memory and then exploit them to boost the NMT model when producing a translation. For example, Tu et al. (2018) maintain a continuous cache storing attention vectors as keys and decoder representations as values; the retrieved values are then used to update the decoder representations. Bapna and Firat (2019) perform n-gram retrieval to identify similar source n-grams from the translation memory, whose corresponding target words are then encoded to update decoder representations. Xia et al. (2019) pack the retrieved target sentences into a compact graph, which is then incorporated into decoder representations. He et al. (2021) propose several Transformer-based encoding methods to vectorize retrieved target sentences. Cai et al. (2021) propose a cross-lingual memory retriever to leverage target-side monolingual translation memory, showing effectiveness in low-resource and domain adaptation scenarios.
Compared with the above studies, which involve additional training, non-parametric retrieval-augmented approaches (Zhang et al., 2018; Bulté and Tezcan, 2019; Xu et al., 2020) are more flexible and thus attract much attention. Based on word alignments, Zhang et al. (2018) retrieve source sentences similar to the input, together with their aligned target words, from a translation memory, and then increase the probabilities of these collected target words during translation. Both Bulté and Tezcan (2019) and Xu et al. (2020)