
batches. Our LASER3-CO approach has the following steps:
• Pre-train LASER to use as the teacher encoder $\theta_t$ on high-resource languages such as English. Randomly initialize the student encoder and perform distillation with the teacher encoder. After distillation, we obtain the pre-trained student encoder $\theta_s$.
• Fine-tune $\theta_s$ using queue (memory)-based contrastive learning; we use a queue size of $N = 4096$ throughout our experiments. For each input source sentence $x$ and target sentence $y$, encode them with the student and teacher encoders respectively to obtain the representations $q = \theta_s(x)$ and $k_{\mathrm{target}} = k_+ = \theta_t(y)$ (we use "+" as an abbreviation for the positive target sentence). We also encode all $N$ negative sentences in the queue to obtain $k_i = \theta_t(\mathrm{queue}_i),\ i \in [1, N]$.
• Perform normalization on $q$ and $k_i,\ i \in [0, N]$, then compute the contrastive loss using InfoNCE (van den Oord et al., 2018):
\[
\mathcal{L} = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=1}^{N} \exp(q \cdot k_i / \tau)} \tag{1}
\]
Here $\tau$ is the temperature parameter that controls the strength of the penalty; empirically, we found that $\tau = 0.05$ (widely used in previous literature) works well.
• Update the student encoder $\theta_s$ with this loss. Enqueue the most recent target-language (English) sentences and dequeue the earliest sentences if the queue size exceeds the limit $N$ (a sketch of this step is given after this list).
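Concretely, the following is a minimal PyTorch-style sketch of one such fine-tuning step. The names (`contrastive_step`, `student`, `teacher`, `queue`) are illustrative rather than taken from our codebase, and, as in the common MoCo implementation, the loss is computed via cross-entropy over the positive and the $N$ queued negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_step(student, teacher, src_batch, tgt_batch, queue, tau=0.05, N=4096):
    """One queue-based contrastive fine-tuning step (illustrative sketch)."""
    q = student(src_batch)                      # q = theta_s(x), shape (B, d)
    with torch.no_grad():                       # the teacher is frozen
        k_pos = teacher(tgt_batch)              # k_+ = theta_t(y), shape (B, d)

    # Normalize the query, the positive, and the queued negatives.
    q = F.normalize(q, dim=-1)
    k_pos = F.normalize(k_pos, dim=-1)
    k_neg = F.normalize(queue, dim=-1)          # (N, d) previously enqueued targets

    # Logits: positive similarity first, then similarities to all queued negatives.
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)   # (B, 1)
    l_neg = q @ k_neg.t()                           # (B, N)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau

    # Cross-entropy with the positive at index 0 gives the InfoNCE loss.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)

    # Enqueue the newest target embeddings (detached, so the queue carries no
    # gradients) and dequeue the oldest ones beyond the limit N.
    queue = torch.cat([queue, k_pos.detach()], dim=0)[-N:]
    return loss, queue
```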
In LASER3-CO, our training process is very similar to MoCo (He et al., 2020), though we do not use a momentum update for the teacher. Instead, we freeze the pre-trained teacher (LASER in our case), which already has a high-quality representation of English (or any other pivot language), so that English sentences are encoded consistently during distillation. We then align the representations of the student encoder to those of the teacher, allowing gradients to flow only through the student encoder.
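A short sketch of this setup, assuming both encoders are `torch.nn.Module` sentence encoders (the optimizer choice and learning rate are placeholders, not the configuration used in our experiments):

```python
import torch

def setup_for_contrastive_finetuning(teacher: torch.nn.Module, student: torch.nn.Module):
    """Freeze the teacher; only the student is optimized (sketch)."""
    teacher.eval()                       # keep the frozen teacher in inference mode
    for p in teacher.parameters():
        p.requires_grad_(False)          # no gradients flow into the teacher
    # Hand only the student's parameters to the optimizer (lr is a placeholder).
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
    return optimizer
```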
2.2 LASER3-CO-Filter
This architecture corresponds to Figure 2(b). The motivation is that previous work (Robinson et al., 2020) has shown that hard negatives improve representations. In our task, for each parallel sentence pair $(x, y)$, the hard negatives are sentences $y'$ that are hard to distinguish from the true target $y$ (we could also find hard negatives $x'$ in the source language, but because our teacher encoder has a better representation of the target language, English, we focus on target-side hard negatives only). Hard negatives are beneficial because they force the model to learn more complex features in order to distinguish them from the true target sentence.
To find more hard negatives for contrastive fine-tuning, we change LASER3-CO in two ways: (1) we disable shuffling in the data loader and (2) we use a pre-filtering mechanism to filter out bad samples from the queue (we name this model LASER3-CO-Filter, where "Filter" refers to the pre-filtering mechanism). Bad samples are hard-negative samples $y'$ that are too similar to $y$ (e.g., "What is your name" versus "What's your name"). Treating these extremely similar sentences as negative samples would hurt the encoder's representation, so we devise a pre-filtering method to filter them out.
Disable Shuffling
We sort sentences by length (our implementation is based on Fairseq (Ott et al., 2019), which sorts sentences by length by default) and disable shuffling in the data loader so that consecutive batches contain sentences of similar length. As our queue is updated by enqueuing the most recent batch and dequeuing the earliest batch, disabling shuffling makes the queue store sentences of similar length. Because all samples in the queue are used as negative samples to be contrasted with the true target sentence, and because they are of similar length, it is much more likely that we find hard negatives (quantitative analysis is provided in §5 and Figure 4).
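For illustration, a self-contained sketch of such length-sorted, unshuffled batching (in our setup this behavior comes from Fairseq's default length-based sorting rather than custom code; `corpus` and `batch_size` are hypothetical):

```python
def length_sorted_batches(corpus, batch_size=128):
    """Yield consecutive batches of similar-length sentence pairs (sketch)."""
    # Sort (source, target) pairs by target length and do NOT shuffle, so that
    # consecutive batches -- and therefore the queue -- contain similar lengths.
    ordered = sorted(corpus, key=lambda pair: len(pair[1].split()))
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]
```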
Pre-filtering Mechanism
After disabling shuffling, we show that more hard negatives are found. However, we also find many cases of extremely hard negatives (e.g., "What is your name" versus "What's your name", or "Bring them here to me" versus "Bring it here to me"). Though hard negatives help contrastive fine-tuning obtain better sentence representations, extremely hard negatives would confuse the model and lead to incorrect parameter updates. Therefore, we employ a simple pre-filtering mechanism to prevent these extremely hard negatives from contributing to the contrastive loss: after we encode the target sentence