retrieved values are converted into a translation probability distribution over candidate target tokens, denoted as the kNN distribution. Finally, the predicted distribution of the NMT model is interpolated with the kNN distribution via a hyper-parameter λ. Along this line, many efforts have been made to improve kNN-MT (Zheng et al., 2021a; Jiang et al., 2021). Particularly, as shown in Figure 1, adaptive kNN-MT (Zheng et al., 2021a) uses the query-key distances and retrieved pairs to dynamically estimate λ, exhibiting better performance than most kNN-MT models.
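The interpolation described above can be sketched as follows. This is a minimal illustrative implementation; the function name, tensor shapes, and default hyper-parameter values are our own choices, not those of any particular kNN-MT system:

```python
import numpy as np

def knn_mt_interpolate(query, keys, values, nmt_probs,
                       k=4, temperature=10.0, lam=0.5):
    """Sketch of kNN-MT: retrieve k neighbours, build a kNN
    distribution, and interpolate it with the NMT distribution.

    query:     decoder hidden state at the current step, shape (d,)
    keys:      datastore key representations, shape (N, d)
    values:    datastore target-token ids, shape (N,)
    nmt_probs: NMT distribution over the vocabulary, shape (V,)
    """
    # L2 distance from the query to every datastore key.
    dists = np.linalg.norm(keys - query, axis=1)        # (N,)
    knn_idx = np.argsort(dists)[:k]                     # k nearest keys
    # Softmax over negative distances -> weights for retrieved tokens.
    logits = -dists[knn_idx] / temperature
    logits -= logits.max()                              # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()     # (k,)
    # Scatter the weights into a vocabulary-sized kNN distribution.
    p_knn = np.zeros_like(nmt_probs)
    np.add.at(p_knn, values[knn_idx], weights)          # handles duplicates
    # Interpolate with the NMT distribution using hyper-parameter lambda.
    return lam * p_knn + (1 - lam) * nmt_probs
```

In practice the datastore holds millions of keys and the nearest-neighbour search is done with an approximate index (e.g. Faiss) rather than a brute-force sort, but the distribution construction is the same.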
However, we find that existing kNN-MT models often suffer from a serious drawback: model performance deteriorates dramatically due to the underlying noise in retrieved pairs. For example, in Figure 1, the retrieved results may contain unrelated tokens such as "no", which leads to a harmful kNN distribution. Besides, for estimating λ, previous studies (Zheng et al., 2021a; Jiang et al., 2021) only consider the retrieved pairs while ignoring the NMT distribution. Returning to Figure 1, compared with the kNN distribution, the NMT model gives a much higher probability to the ground-truth token "been". Although the kNN distribution is insufficiently accurate, it is still assigned a greater weight than the NMT distribution. Obviously, this is inconsistent with our intuition that when the NMT model has high confidence in its prediction, it needs less help from others, and thus the kNN distribution should receive a lower weight. Moreover, we find that during training, a non-negligible proportion of the retrieved pairs from the datastore do not contain ground-truth tokens, which can cause insufficient training of kNN modules. To sum up, conventional kNN-MT models are vulnerable to noise in the datastore, and we further conduct a preliminary study to validate the above issues. Therefore, dealing with noise remains a significant task for kNN-MT models.
In this paper, we explore a robust kNN-MT model. In terms of model architecture, we explore how to more accurately estimate the kNN distribution and better combine it with the NMT distribution. Concretely, unlike previous studies (Zheng et al., 2021a; Jiang et al., 2021) that only use retrieved pairs to dynamically estimate λ, we additionally use the confidence of the NMT prediction, i.e., the predicted probability of each retrieved token, to calibrate the calculation of λ. Meanwhile, we improve the kNN distribution by integrating this confidence to reduce the effect of noise. Besides, we propose to boost the robustness of our model by randomly adding perturbations to retrieved key representations and augmenting retrieved pairs with pseudo ground-truth tokens. By these means, our proposed approach enables the kNN-MT model to better cope with the noise in retrieved pairs, thus improving its robustness.
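The two architectural ideas above (confidence-weighted neighbours and a confidence-calibrated λ) can be roughly sketched as follows. The specific weighting scheme here, including how confidence enters the neighbour weights and how λ is derived, is our own simplification for illustration, not the paper's exact formulation:

```python
import numpy as np

def confidence_calibrated_combine(nmt_probs, knn_tokens, knn_dists,
                                  temperature=10.0):
    """Sketch: use NMT confidence on retrieved tokens to (1) down-weight
    likely-noisy neighbours and (2) shrink the kNN weight lambda when the
    NMT model is already confident.

    nmt_probs:  NMT distribution over the vocabulary, shape (V,)
    knn_tokens: retrieved target-token ids, shape (k,)
    knn_dists:  query-key distances of the retrieved pairs, shape (k,)
    """
    # NMT confidence = predicted probability of each retrieved token.
    conf = nmt_probs[knn_tokens]                          # (k,)
    # Combine distance and confidence into neighbour weights
    # (one possible scheme: softmax over -dist/T + log conf).
    logits = -knn_dists / temperature + np.log(conf + 1e-9)
    logits -= logits.max()
    weights = np.exp(logits) / np.exp(logits).sum()
    p_knn = np.zeros_like(nmt_probs)
    np.add.at(p_knn, knn_tokens, weights)
    # The more confident the NMT prediction, the smaller the kNN weight.
    lam = 1.0 - nmt_probs.max()
    return lam * p_knn + (1 - lam) * nmt_probs
```

Under this scheme, a neighbour like the unrelated "no" in Figure 1 would receive a small weight (low NMT confidence), and a confident NMT prediction such as "been" would keep most of the final probability mass.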
To investigate the effectiveness and generality of our model, we conduct experiments on several commonly-used benchmarks. Experimental results show that our model significantly outperforms adaptive kNN-MT, the state-of-the-art kNN-MT model, across most domains. Moreover, our model exhibits better performance than adaptive kNN-MT on pruned datastores.
2 Related Work
Retrieval-based approaches leveraging auxiliary sentences have shown effectiveness in improving NMT models. Usually, they first retrieve relevant sentences from a translation memory and then exploit them to boost the NMT model during translation. For example, Tu et al. (2018) maintain a continuous cache storing attention vectors as keys and decoder representations as values; the retrieved values are then used to update the decoder representations. Bapna and Firat (2019) perform n-gram retrieval to identify similar source n-grams from the translation memory, whose corresponding target words are then encoded to update decoder representations. Xia et al. (2019) pack the retrieved target sentences into a compact graph, which is then incorporated into decoder representations. He et al. (2021) propose several Transformer-based encoding methods to vectorize retrieved target sentences. Cai et al. (2021) propose a cross-lingual memory retriever to leverage target-side monolingual translation memory, showing effectiveness in low-resource and domain adaptation scenarios.
Compared with the above studies involving additional training, non-parametric retrieval-augmented approaches (Zhang et al., 2018; Bulté and Tezcan, 2019; Xu et al., 2020) are more flexible and thus attract much attention. According to word alignments, Zhang et al. (2018) retrieve similar source sentences with target words from a translation memory, which are used to increase the probabilities of the collected target words being translated. Both Bulté and Tezcan (2019) and Xu et al. (2020)