
batches. Our LASER3-CO approach has the following steps:
• Pre-train LASER to use as the teacher encoder $\theta_t$ on high-resource languages such as English. Randomly initialize the student encoder and perform distillation with the teacher encoder. After distillation, we obtain the pre-trained student encoder $\theta_s$.
• Fine-tune $\theta_s$ using queue (memory)-based contrastive learning; we use a queue size of $N = 4096$ throughout our experiments. For each input source sentence $x$ and target sentence $y$, encode them with the student and teacher encoders respectively to obtain the representations $q = \theta_s(x)$ and $k_{\mathrm{target}} = k_+ = \theta_t(y)$ (we use "+" as an abbreviation for the positive target sentence). We also encode all $N$ negative sentences in the queue to obtain $k_i = \theta_t(\mathrm{queue}_i),\ i \in [1, N]$.
• Perform normalization on $q$ and $k_i,\ i \in [0, N]$, then compute the contrastive loss using InfoNCE (van den Oord et al., 2018):
\[
\mathcal{L} = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=1}^{N} \exp(q \cdot k_i / \tau)} \tag{1}
\]
Here $\tau$ is the temperature parameter that controls the strength of the penalty; empirically, we found that $\tau = 0.05$ (widely used in previous literature) works well.
• Update the student encoder $\theta_s$ with this loss. Enqueue the most recent target-language (English) sentences and dequeue the earliest sentences if the queue size exceeds the limit $N$ (a sketch of this step is given after this list).
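Concretely, the following is a minimal PyTorch-style sketch of one such fine-tuning step. The names (`contrastive_step`, `student`, `teacher`, `queue`) are illustrative rather than taken from our codebase, and, as in the common MoCo implementation, the loss is computed via cross-entropy over the positive and the $N$ queued negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_step(student, teacher, src_batch, tgt_batch, queue, tau=0.05, N=4096):
    """One queue-based contrastive fine-tuning step (illustrative sketch)."""
    q = student(src_batch)                      # q = theta_s(x), shape (B, d)
    with torch.no_grad():                       # the teacher is frozen
        k_pos = teacher(tgt_batch)              # k_+ = theta_t(y), shape (B, d)

    # Normalize the query, the positive, and the queued negatives.
    q = F.normalize(q, dim=-1)
    k_pos = F.normalize(k_pos, dim=-1)
    k_neg = F.normalize(queue, dim=-1)          # (N, d) previously enqueued targets

    # Logits: positive similarity first, then similarities to all queued negatives.
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)   # (B, 1)
    l_neg = q @ k_neg.t()                           # (B, N)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau

    # Cross-entropy with the positive at index 0 gives the InfoNCE loss.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)

    # Enqueue the newest target embeddings (detached, so the queue carries no
    # gradients) and dequeue the oldest ones beyond the limit N.
    queue = torch.cat([queue, k_pos.detach()], dim=0)[-N:]
    return loss, queue
```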
In LASER3-CO, our training process is very similar to MoCo (He et al., 2020), though we do not use a momentum update for the teacher. Instead, we freeze the pre-trained teacher (LASER in our case), which already has a high-quality representation of English (or any other pivot language), so that English sentences are encoded consistently during distillation. We then align the representations of the student encoder to those of the teacher, allowing gradients to flow only through the student encoder.
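A short sketch of this setup, assuming both encoders are `torch.nn.Module` sentence encoders (the optimizer choice and learning rate are placeholders, not the configuration used in our experiments):

```python
import torch

def setup_for_contrastive_finetuning(teacher: torch.nn.Module, student: torch.nn.Module):
    """Freeze the teacher; only the student is optimized (sketch)."""
    teacher.eval()                       # keep the frozen teacher in inference mode
    for p in teacher.parameters():
        p.requires_grad_(False)          # no gradients flow into the teacher
    # Hand only the student's parameters to the optimizer (lr is a placeholder).
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
    return optimizer
```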
2.2 LASER3-CO-Filter
This architecture corresponds to Figure 2(b). The motivation is that previous work (Robinson et al., 2020) has shown that hard negatives improve representations. In our task, for each parallel sentence pair $(x, y)$, the hard negatives are sentences $y'$ that are hard to distinguish from the true target $y$ (we could also find hard negatives $x'$ in the source language, but because our teacher encoder has a better representation of the target language, English, we focus on target-side hard negatives only). Hard negatives are beneficial because they force the model to learn more complex features in order to distinguish them from the true target sentence.
To find more hard negatives for contrastive fine-tuning, we change LASER3-CO in two ways: (1) we disable shuffling in the data loader and (2) we use a pre-filtering mechanism to filter out bad samples from the queue (we name this model LASER3-CO-Filter, where "Filter" refers to the pre-filtering mechanism). Bad samples are hard-negative samples $y'$ that are too similar to $y$ (e.g., "What is your name" versus "What's your name"). Treating these extremely similar sentences as negative samples would hurt the encoder's representation, so we devise a pre-filtering method to filter them out.
Disable Shuffling
We sort sentences by length (our implementation is based on Fairseq (Ott et al., 2019), which sorts sentences by length by default) and disable shuffling in the data loader so that consecutive batches contain sentences of similar length. As our queue is updated by enqueuing the most recent batch and dequeuing the earliest batch, disabling shuffling makes the queue store sentences of similar length. Because all samples in the queue are used as negative samples to be contrasted with the true target sentence, and because they are of similar length, it is much more likely that we find hard negatives (quantitative analysis is provided in §5 and Figure 4).
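For illustration, a self-contained sketch of such length-sorted, unshuffled batching (in our setup this behavior comes from Fairseq's default length-based sorting rather than custom code; `corpus` and `batch_size` are hypothetical):

```python
def length_sorted_batches(corpus, batch_size=128):
    """Yield consecutive batches of similar-length sentence pairs (sketch)."""
    # Sort (source, target) pairs by target length and do NOT shuffle, so that
    # consecutive batches -- and therefore the queue -- contain similar lengths.
    ordered = sorted(corpus, key=lambda pair: len(pair[1].split()))
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```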
Pre-filtering Mechanism
After disabling shuffling, we show that more hard negatives are found. However, we also find many cases of extremely hard negatives (e.g., "What is your name" versus "What's your name", or "Bring them here to me" versus "Bring it here to me"). Though hard negatives help contrastive fine-tuning obtain better sentence representations, extremely hard negatives would confuse the model and lead to incorrect parameter updates. Therefore, we employ a simple pre-filtering mechanism to prevent these extremely hard negatives from contributing to the contrastive loss: after we encode the target sentence