Multilingual Representation Distillation with Contrastive Learning

Weiting Tan♠  Kevin Heffernan♥  Holger Schwenk♥  Philipp Koehn♠♥
♠Johns Hopkins University  ♥Meta AI Research
{wtan12, phi}@jhu.edu  {kevinheffernan, schwenk}@meta.com
(Work was done during an internship at Meta AI Research.)
Abstract
Multilingual sentence representations from large models encode semantic information from two or more languages and can be used for different cross-lingual information retrieval and matching tasks. In this paper, we integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences (i.e., find semantically similar sentences that can be used as translations of each other). We validate our approach with multilingual similarity search and corpus filtering tasks. Experiments across different low-resource languages show that our method greatly outperforms previous sentence encoders such as LASER, LASER3, and LaBSE.
1 Introduction
With the rise of neural networks, high-dimensional word-level sequence representations play an important role in almost any natural language processing task. Contextual representations from large pre-trained language models (Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2019; Conneau and Lample, 2019) have shown advantages over earlier static embeddings (Mikolov et al., 2013; Pennington et al., 2014). However, they are not pre-trained with sentence-level objectives, and their representations of two different sentences cannot easily be used to indicate semantic similarity. To encode sentence-level information, LASER (Artetxe and Schwenk, 2019b) pools a sentence embedding from the encoder and feeds it to the decoder. Another approach is to use siamese-structured models, where two identical encoders are used to represent sentences of similar meaning and are trained to push their representations close to each other. Sentence-Transformers (Reimers and Gurevych, 2019) are siamese-structured models that are initialized with pre-trained large models such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019).
Figure 1: Student-Teacher Distillation in LASER3. The frozen teacher encoder embeds the English target sentence ("bring them here to me," he said.), the student encoder embeds the source sentence (na ka mea ia, mauria mai ki konei ki ahau.), and the gradient of a cosine loss between the two sentence embeddings updates the student encoder only.
After fine-tuning, Sentence-Transformers improve their sentence representations for cross-lingual tasks. Besides fine-tuning with identical (siamese) encoders, distillation can be used to obtain better representations. Reimers and Gurevych (2020a) extend a monolingual sentence representation into a multilingual representation with model distillation. Similarly, Heffernan et al. (2022) proposed LASER3, a student-teacher architecture that distills information from a pre-trained teacher encoder into a student encoder for new languages. As shown in Figure 1, the distillation process updates the student encoder with the gradient from the cosine loss. Note that the parameters of the teacher encoder, which is already pre-trained on the target language (English in our case), are frozen. Therefore, the target sentence embedding is fixed and the corresponding source embedding is aligned to the target embedding.
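To make the distillation objective concrete, the following sketch shows one training step of this student-teacher setup in PyTorch. It is a minimal illustration rather than the LASER3 implementation: the `student`/`teacher` encoder objects and their `encode` method are hypothetical placeholders; only the frozen teacher, the cosine loss, and the student-only gradient flow follow the description above.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, src_batch, tgt_batch, optimizer):
    """One cosine-loss distillation step (hypothetical encoder API).

    The teacher is frozen; only the student receives gradients from the
    cosine loss between the two sentence embeddings.
    """
    with torch.no_grad():                      # teacher parameters stay fixed
        tgt_emb = teacher.encode(tgt_batch)    # (batch, dim) target (English) embeddings

    src_emb = student.encode(src_batch)        # (batch, dim) source-language embeddings

    # Cosine loss: pull each source embedding toward its fixed target embedding.
    loss = 1.0 - F.cosine_similarity(src_emb, tgt_emb, dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()                            # gradient flows through the student only
    optimizer.step()
    return loss.item()
```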
In this paper, we focus on the quality estimation of parallel sentences in low-resource languages, which requires models to distinguish similar from dissimilar sentence pairs. Contrastive learning is helpful here because its objective not only aligns the representations of similar sentences but also pushes apart the representations of dissimilar sentences, which makes the model more confident about genuinely similar sentences. Additionally, contrastive learning is a form of self-supervision that fits well in our low-resource setting, where only a limited amount of clean data is available.
Figure 2: Visual explanation of LASER3-CO (vanilla) and LASER3-CO-Filter (filtered). Source and target come from the input bitext dataset, and the queue is constructed from earlier batches' samples. In both panels, the student encoder embeds the source (na ka mea ia, mauria mai ki konei ki ahau.) as the query q, the frozen teacher encoder embeds the target ("bring them here to me," he said.) as k_target, and the queued sentences provide negatives k_1, k_2, .... In the (b) filtered version, a pre-filtering mechanism compares the target with each queue sample (similarities such as 0.9, 0.4, 0.3, 0.1 in the figure) and disables the contribution of extremely hard negative samples.
In practice, we integrate contrastive learning into the distillation-based architecture from LASER3 and use our contrastive distillation method to train encoders for low-resource languages. Inspired by He et al. (2020) and Wu et al. (2018), we use a queue to store negative samples as self-supervision to train better encoders. We also employ a pre-filtering mechanism to find hard negative samples and show that they benefit the distillation of sentence representations.
To evaluate different encoders, we rely on multilingual similarity search and corpus filtering tasks. In the multilingual similarity search task, we encode all source and target sentences and use a cosine-based similarity metric called the margin score (Artetxe and Schwenk, 2019a) to pair source and target sentences. In the corpus filtering task, a mined noisy parallel corpus is given; we use the encoder with the margin score to compute a similarity score for each sentence pair, filter out the low-scoring pairs, and use the remaining corpus to train a neural machine translation model whose performance is evaluated with BLEU (Papineni et al., 2002). Compared to previous work, we observe consistent improvements when using contrastive distillation. We also compare our approach with another simple but effective data augmentation technique, back-translation, and show that contrastive distillation achieves comparable or better performance with less data.
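As a rough illustration of how such a margin-based score can be computed, the sketch below implements the ratio-margin variant of Artetxe and Schwenk (2019a) directly on top of cosine similarities. The function name, the choice of k = 4 neighbours, and the brute-force all-pairs search are our own simplifications (large-scale mining typically uses an approximate nearest-neighbour index), so treat it as a sketch of the idea rather than the exact scoring used in the paper.

```python
import torch
import torch.nn.functional as F

def margin_scores(src_emb: torch.Tensor, tgt_emb: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Ratio-margin scores between every source/target embedding pair.

    margin(x, y) = cos(x, y) divided by the mean of the average cosine
    similarity of x and y to their k nearest neighbours in the other language.
    """
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    cos = src @ tgt.T                                  # (n_src, n_tgt) cosine similarities

    # Average similarity of each sentence to its k nearest cross-lingual neighbours.
    avg_src = cos.topk(k, dim=1).values.mean(dim=1)    # (n_src,)
    avg_tgt = cos.topk(k, dim=0).values.mean(dim=0)    # (n_tgt,)

    denom = (avg_src.unsqueeze(1) + avg_tgt.unsqueeze(0)) / 2.0
    return cos / denom

# Similarity search: pair each source sentence with the highest-scoring target;
# corpus filtering: keep only pairs whose score exceeds a chosen threshold.
# scores = margin_scores(src_emb, tgt_emb)
# best_tgt = scores.argmax(dim=1)
```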
2 Approach
To train high-quality sentence representations, we integrate contrastive learning into sentence representation distillation; our contrastive distillation method is visualized in Figure 2. The motivation for using contrastive learning is two-fold:
• The self-supervision from contrastive learning helps representation learning in low-resource settings.
• Contrastive learning enables models to recognize similar and dissimilar sentences, which is crucial for filtering out noisy sentence pairs.
2.1 LASER3-CO
This architecture corresponds to Figure 2(a). We name it LASER3-CO since it integrates contrastive learning into LASER3's distillation pipeline. Following He et al. (2020), we use a queue to store contrastive (negative) samples. The negative samples are used to train encoders so that sentences of different meaning have dissimilar representations. During training, we use previous batches' target-language sentences as negative samples, and when there are too many negative samples in the queue, we remove the samples from the earliest batches. Our LASER3-CO approach has the following steps:
• Pre-train LASER on high-resource languages such as English to use as the teacher encoder θ_t.
• Randomly initialize the student encoder and perform distillation with the teacher encoder. After distillation, we obtain the pre-trained student encoder θ_s.
• Fine-tune θ_s using queue (memory)-based contrastive learning (we use a queue size of N = 4096 throughout our experiments). For each input source sentence x and target sentence y, encode them with the student and teacher encoders respectively to obtain the representations q = θ_s(x) and k_target = k⁺ = θ_t(y) (we use "+" as an abbreviation for the positive target sentence). We also encode all N negative sentences in the queue and obtain their representations k_i = θ_t(queue_i), i ∈ [1, N].
• Normalize q and k_i, i ∈ [0, N], then compute the contrastive loss using InfoNCE (van den Oord et al., 2018):

  L = −log [ exp(q · k⁺ / τ) / Σ_{i=1}^{N} exp(q · k_i / τ) ]    (1)

  Here τ is the temperature parameter that controls the strength of the penalty (empirically, we found that τ = 0.05, which is widely used in previous literature, works well).
• Update the student encoder θ_s with this loss. Enqueue the most recent target-language (English) sentences and dequeue the earliest sentences if the queue size exceeds the limit N.
In LASER3-CO, our training process is very similar to MoCo (He et al., 2020), though we do not use a momentum update for the teacher. Instead, we freeze the pre-trained teacher (LASER in our case), which already has a high-quality representation of English (or any other pivot language), so that English sentences are encoded consistently during distillation. We then align the student encoder's representations to the teacher's, only allowing gradients to flow through the student encoder.
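The sketch below renders one fine-tuning step of this queue-based objective in PyTorch. It is a simplified reading of the steps above rather than the authors' code: the encoder API and batch format are hypothetical, the queue is kept as a tensor of teacher embeddings rather than raw sentences, and the loss follows Eq. (1) with the queued sentences as the N negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_step(student, teacher, src_batch, tgt_batch, queue, optimizer, tau=0.05):
    """One LASER3-CO-style fine-tuning step (hypothetical encoder API).

    `queue` is an (N, dim) tensor holding normalized teacher embeddings of
    earlier batches' English sentences, which serve as negative samples.
    """
    n_neg = queue.size(0)

    with torch.no_grad():                                        # frozen teacher, no momentum update
        k_pos = F.normalize(teacher.encode(tgt_batch), dim=-1)   # positives  (B, dim)

    q = F.normalize(student.encode(src_batch), dim=-1)           # queries    (B, dim)

    l_pos = (q * k_pos).sum(dim=-1)            # (B,)   similarity to the true target
    l_neg = q @ queue.T                        # (B, N) similarity to queued negatives

    # Eq. (1): L = -log( exp(q.k+ / tau) / sum_i exp(q.k_i / tau) )
    loss = (-(l_pos / tau) + torch.logsumexp(l_neg / tau, dim=1)).mean()

    optimizer.zero_grad()
    loss.backward()                            # gradients reach the student only
    optimizer.step()

    # Enqueue the newest target embeddings, dequeue the oldest (fixed size N).
    queue = torch.cat([queue, k_pos], dim=0)[-n_neg:]
    return loss.item(), queue
```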
2.2 LASER3-CO-Filter
This architecture corresponds to Figure 2(b). The motivation is that previous work (Robinson et al., 2020) has shown that hard negatives improve representations. In our task, for each parallel sentence pair (x, y), the hard negatives are sentences y′ that are difficult to distinguish from the true target y (we could also find hard negatives x′ in the source language, but because our teacher encoder has a better representation for the target language, English, we focus on target-side hard negatives only). Hard negatives are beneficial because they force the model to learn more complex features to distinguish them from the true target sentence.
To find more hard negatives for contrastive fine-tuning, we change LASER3-CO in two ways: (1) we disable shuffling in the data loader, and (2) we use a pre-filtering mechanism to filter out bad samples from the queue (we name this model LASER3-CO-Filter, where Filter refers to the pre-filtering mechanism). Bad samples are hard negative samples y′ that are too similar to y (e.g. "What is your name" versus "What's your name"). Treating these extremely similar sentences as negative samples would hurt the encoder's representation, so we devise a pre-filtering method to filter them out.
Disable Shuffling
We sort sentences by length (our implementation is based on Fairseq (Ott et al., 2019), which by default sorts sentences by length) and disable shuffling in the data loader so that consecutive batches contain sentences of similar length. As our queue is updated by enqueuing the most recent batch and dequeuing the earliest batch, disabling shuffling makes the queue store sentences of similar length. Because all samples in the queue are used as negative samples to be contrasted with the true target sentence, and because they are of similar length, it is much more likely that we find hard negatives (quantitative analysis is provided in §5 and Figure 4).
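As a toy illustration of this batching scheme (not the Fairseq data loader itself), the snippet below sorts a corpus by length and yields consecutive batches, so that a queue populated from the most recent batches ends up holding sentences of roughly the same length as the current targets.

```python
def length_sorted_batches(sentences, batch_size):
    """Yield consecutive batches of length-sorted sentences (no shuffling).

    Because batches are consumed in order and the queue always holds the
    most recent ones, the negatives in the queue end up having lengths
    similar to the current targets, making hard negatives more likely.
    """
    ordered = sorted(sentences, key=len)                 # sort by length
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```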
Pre-filtering Mechanism
After disabling shuffling, more hard negatives are found. However, we also found many cases of extremely hard negatives (e.g. "What is your name" versus "What's your name", "Bring them here to me" versus "Bring it here to me"). Though hard negatives help contrastive fine-tuning obtain better sentence representations, extremely hard negatives would confuse the model and lead to incorrect parameter updates. Therefore we employ a simple pre-filtering mechanism to prevent these extremely hard negatives from contributing to the contrastive loss: after we encode the target sentence, we compare it with every sample in the queue and disable the contribution of queue samples that are too similar to the target (Figure 2(b)).
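To make the pre-filtering idea concrete, here is a small sketch of how such a mask could be applied inside the contrastive loss. The similarity threshold of 0.9 and the mask-with-negative-infinity trick are our own illustrative choices, not details confirmed by the excerpt above.

```python
import torch

def filtered_contrastive_loss(q, k_pos, queue, tau=0.05, threshold=0.9):
    """InfoNCE loss with pre-filtering of extremely hard negatives (sketch).

    q:      (B, dim) normalized student embeddings of the source sentences.
    k_pos:  (B, dim) normalized teacher embeddings of the true targets.
    queue:  (N, dim) normalized teacher embeddings of queued negatives.
    Queue entries whose similarity to a row's target exceeds `threshold`
    (likely near-paraphrases of that target) are dropped from its denominator.
    """
    l_pos = (q * k_pos).sum(dim=-1)            # (B,)   similarity to the true target
    l_neg = q @ queue.T                        # (B, N) similarity to queued negatives

    # Pre-filtering: compare each target with every queue sample and disable
    # the contribution of samples that are too similar to that target.
    sim_target_queue = k_pos @ queue.T         # (B, N)
    l_neg = l_neg.masked_fill(sim_target_queue > threshold, float('-inf'))

    # Same form as Eq. (1); masked entries contribute exp(-inf) = 0.
    return (-(l_pos / tau) + torch.logsumexp(l_neg / tau, dim=1)).mean()
```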