Language Agnostic Multilingual Information Retrieval with Contrastive Learning
Xiyang Hu1, Xinchi Chen2*, Peng Qi2*, Deguang Kong2, Kunlun Liu2, William Yang Wang2, Zhiheng Huang2
1Carnegie Mellon University
2AWS AI Labs
xiyanghu@cmu.edu, {xcc,pengqi,kongdegu,kll}@amazon.com, wyw@amazon.com, zhiheng@amazon.com
Abstract
Multilingual information retrieval (IR) is challenging since annotated training data is costly to obtain in many languages. We present an effective method to train multilingual IR systems when only English IR training data and some parallel corpora between English and other languages are available. We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models' cross-lingual transfer ability. We design a semantic contrastive loss to align representations of parallel sentences that share the same semantics in different languages, and a new language contrastive loss to leverage parallel sentence pairs to remove language-specific information in sentence representations from non-parallel corpora. When trained on English IR data with these losses and evaluated zero-shot on non-English data, our model demonstrates significant improvements in retrieval performance over prior work, while requiring much less computational effort. We also demonstrate the value of our model in a practical setting where parallel corpora are available for only a few languages but remain lacking for many other low-resource languages. Our model works well even with a small number of parallel sentences, and can be used as an add-on module to any backbone and to other tasks.
1 Introduction
Information retrieval (IR) is an important natural language processing task that helps users efficiently gather information from a large corpus (representative downstream tasks include question answering, summarization, search, recommendation, etc.), but developing effective IR systems for all languages is challenging due to the cost of, and therefore lack of, annotated training data in many languages.
1 Work done during an internship at AWS AI Labs.
* These authors contributed equally to this work.
Code: https://github.com/xiyanghu/multilingualIR
Figure 1: (a) The semantic contrastive loss encourages the embeddings of parallel pairs, i.e., sentences that have the same semantics but come from different languages, to be close to each other and away from the remaining negative samples, i.e., sentences with different semantics. (b) The language contrastive loss incorporates the non-parallel corpora in addition to the parallel ones. It encourages the distances from a sentence representation, which can be a sample from either the parallel corpora or the non-parallel corpora, to the two embeddings of a parallel pair to be the same.
While this problem is not unique to IR research (Joshi et al., 2020), constructing IR data is often more costly due to the need to either translate a large text corpus or gather relevance annotations, or both, which makes it difficult to generalize IR models to lower-resource languages.
One solution is to leverage pretrained multilingual language models to encode queries and corpora for multilingual IR tasks (Zhang et al., 2021; Sun and Duh, 2020). One line of work on multilingual representation learning trains a masked language model, sometimes with a next sentence prediction task, on monolingual corpora of many languages; examples include mBERT and XLM-R (Conneau et al., 2020). These models generally do not explicitly learn the alignment across different languages and do not perform effectively in empirical IR experiments. Other works directly leverage multilingual parallel corpora or translation pairs to explicitly align sentences in two languages, such as InfoXLM (Chi et al., 2021) and LaBSE (Feng et al., 2022).
In this work, we propose a semantic contrastive loss and a language contrastive loss that are trained jointly with the information retrieval objective to learn cross-lingual representations that encourage efficient cross-lingual transfer on retrieval tasks. Our semantic contrastive loss aims to align the embeddings of sentences that have the same semantics. It is similar to the standard InfoNCE loss (Oord et al., 2018): it forces the representations of parallel sentence pairs in two languages to be close to each other and away from other negative samples. Our language contrastive loss aims to leverage the non-parallel corpora of languages without any parallel data, which are ignored by the semantic contrastive loss. It addresses the practical scenario wherein parallel corpora are easily accessible for a few languages but are lacking for many low-resource languages. The language contrastive loss encourages the distances from a sentence representation to the two embeddings of a parallel pair to be the same. Figure 1 illustrates how the two losses improve language alignment.
In experiments, we evaluate the zero-shot cross-lingual transfer ability of our model on monolingual information retrieval tasks in 10 different languages. Experimental results show that our proposed method obtains significant gains and can be used as an add-on module to any backbone. We also demonstrate that our method is much more computationally efficient than prior work. Our method works well with only a small number of parallel sentence pairs, and it works well on languages without any parallel corpora.
2 Background: Multilingual DPR
Dense Passage Retriever (DPR) (Karpukhin et al., 2020) uses a dual-encoder structure to encode queries and passages separately for information retrieval. To generalize to multilingual scenarios, we replace DPR's original BERT encoders with a multilingual language model, XLM-R (Conneau et al., 2020), to transfer English training knowledge to other languages.
Concretely, given a batch of $N$ query-passage pairs $(q_i, p_i)$, we consider all other passages $p_j$, $j \neq i$, in the batch irrelevant (negative) passages, and optimize the retrieval loss function as the negative log-likelihood of the gold passage:
$$\mathcal{L}_{\mathrm{IR}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(q_i, p_i)\right)}{\exp\left(\mathrm{sim}(q_i, p_i)\right) + \sum_{j=1, j \neq i}^{N} \exp\left(\mathrm{sim}(q_i, p_j)\right)} \tag{1}$$
where the similarity of two vectors is the cosine similarity, $\mathrm{sim}(u, v) = \frac{u^\top v}{\|u\|\,\|v\|}$.
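For concreteness, the following is a minimal PyTorch sketch of this in-batch negative retrieval loss, assuming the encoders have already produced query and passage embeddings; the batch size and embedding dimension in the toy usage are arbitrary placeholders, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def cosine_sim_matrix(queries: torch.Tensor, passages: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between query and passage embeddings."""
    q = F.normalize(queries, dim=-1)
    p = F.normalize(passages, dim=-1)
    return q @ p.T  # (N, N)

def retrieval_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the gold passage, Eq. (1): the i-th passage is
    the positive for the i-th query; all other in-batch passages act as negatives."""
    sim = cosine_sim_matrix(query_emb, passage_emb)          # (N, N)
    targets = torch.arange(sim.size(0), device=sim.device)   # gold passage index = row index
    return F.cross_entropy(sim, targets)                     # mean of -log softmax over the batch

# Toy usage with random embeddings standing in for XLM-R encoder outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    q = torch.randn(4, 768)   # 4 query embeddings
    p = torch.randn(4, 768)   # 4 gold passage embeddings, aligned by index
    print(retrieval_loss(q, p).item())
```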
3 Contrastive Learning for Cross-Lingual Generalization
The multilingual dense passage retriever only uses English corpora for training. To improve the model's generalization ability to other languages, we leverage two contrastive losses: a semantic contrastive loss and a language contrastive loss. Figure 2 shows our model framework.
Specifically, the semantic contrastive loss (Chen et al., 2020a) pushes the embedding vectors of a pair of parallel sentences close to each other and, at the same time, away from other in-batch samples that have different semantics. The language contrastive loss targets the scenario in which there are no parallel corpora for some languages; it encourages the distances from a sentence embedding to the two embeddings of a parallel pair to be the same.
3.1 Semantic Contrastive Loss
To learn a language-agnostic IR model, we wish to encode sentences that have the same semantics but come from different languages into the same embeddings. For each parallel corpora batch, we do not limit our samples to one specific language pair; instead, we randomly sample different language pairs for a batch. For example, a sampled batch could contain multiple language pairs such as En-Ar, En-Ru, and En-Zh. This strategy increases the difficulty of our contrastive learning and makes training more stable.
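As an illustration of this batching strategy, the sketch below draws each parallel pair from a randomly chosen language pair, so a single batch mixes En-Ar, En-Ru, En-Zh, and so on; the dictionary-of-pairs data structure and function name are hypothetical stand-ins, not the paper's data pipeline.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical parallel data: language pair -> list of (English sentence, translation).
ParallelData = Dict[str, List[Tuple[str, str]]]

def sample_mixed_batch(parallel_data: ParallelData, num_pairs: int,
                       rng: random.Random) -> List[Tuple[str, str, str]]:
    """Sample `num_pairs` parallel sentence pairs, each from a randomly chosen
    language pair, so one batch mixes several language pairs."""
    lang_pairs = list(parallel_data.keys())
    batch = []
    for _ in range(num_pairs):
        lp = rng.choice(lang_pairs)                  # e.g. "en-ar", "en-ru", "en-zh"
        src, tgt = rng.choice(parallel_data[lp])
        batch.append((lp, src, tgt))
    return batch

if __name__ == "__main__":
    data = {
        "en-ar": [("Hello world.", "مرحبا بالعالم.")],
        "en-ru": [("Hello world.", "Привет, мир.")],
        "en-zh": [("Hello world.", "你好，世界。")],
    }
    print(sample_mixed_batch(data, num_pairs=4, rng=random.Random(0)))
```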
Concretely, we randomly sample a mini-batch of $2N$ data points ($N$ here does not have to be the same value as the $N$ in Section 2). The batch contains $N$ pairs of parallel sentences from multiple different languages. Given a positive pair $z_i$ and $z_j$, the embedding vectors of a pair of parallel sentences $(i, j)$ from two languages, the remaining $2(N-1)$ samples are used as negative samples. The semantic contrastive loss for a batch is:
$$\mathcal{L}_{\mathrm{semaCL}} = -\frac{1}{2N} \sum_{(i,j)} \left[ \log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1, k \neq i}^{2N} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} + \log \frac{\exp\left(\mathrm{sim}(z_j, z_i)/\tau\right)}{\sum_{k=1, k \neq j}^{2N} \exp\left(\mathrm{sim}(z_j, z_k)/\tau\right)} \right] \tag{2}$$
where $\tau$ is a temperature hyperparameter.
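A minimal PyTorch sketch of Equation 2 follows, assuming the batch is arranged so that rows $(2m, 2m+1)$ hold the two sides of each parallel pair and every other row serves as one of the $2(N-1)$ negatives; the batch layout and the temperature default are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(z: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Semantic contrastive loss, Eq. (2).

    `z` holds 2N sentence embeddings arranged so that rows (2m, 2m+1) are a
    parallel pair; every other in-batch sample acts as a negative.
    """
    z = F.normalize(z, dim=-1)
    sim = (z @ z.T) / tau                                 # (2N, 2N) cosine similarity / temperature
    two_n = z.size(0)
    # Exclude self-similarity so the denominator sums over k != i.
    self_mask = torch.eye(two_n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # Row 2m has its positive at row 2m+1 and vice versa (XOR flips the last bit).
    targets = torch.arange(two_n, device=z.device) ^ 1
    # Cross-entropy averages -log softmax of the positive over all k != i across the 2N rows.
    return F.cross_entropy(sim, targets)

# Toy usage: 3 parallel pairs (2N = 6) with random embeddings standing in for XLM-R outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    print(semantic_contrastive_loss(torch.randn(6, 768)).item())
```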
[Figure 2 diagram: a query encoder and a passage encoder (e.g., XLM-R) embed queries, passages, and parallel sentences; cosine similarity between the embeddings feeds the information retrieval task, the semantic contrastive loss, and the language contrastive loss.]
Figure 2: Our model framework contains two parts: the main task (IR) and the parallel corpora task. For the main task, we use a dual-encoder dense passage retrieval module for information retrieval. For the parallel corpora task, we adopt the semantic contrastive loss to improve cross-lingual domain adaptation with parallel corpora. We also use the language contrastive loss, which leverages parallel and non-parallel corpora together.
3.2 Language Contrastive Loss
When training multilingual IR systems, we might not always have parallel corpora for all languages of interest. In a realistic scenario, we have easy access to parallel corpora for a few high-resource languages, but no such availability for many low-resource languages. We propose a language contrastive loss to generalize the model's ability to languages that do not have any parallel corpora.
For a batch $B$ consisting of both parallel corpora $P$ and non-parallel corpora $Q$, we denote $z_i$ and $z_j$ as the embeddings of a pair of parallel sentences $(i, j)$ from two languages. We wish the cosine similarity from any other sentence embedding $z_k$ to the two embeddings of a parallel pair to be the same. Therefore, we minimize the following loss:
$$\mathcal{L}_{\mathrm{langCL}} = -\frac{1}{N(N-2)} \sum_{(i,j) \in P} \; \sum_{k \in (P \cup Q) \setminus \{i,j\}} \left[ \log \frac{\exp\left(\mathrm{sim}(z_i, z_k)\right)}{\exp\left(\mathrm{sim}(z_i, z_k)\right) + \exp\left(\mathrm{sim}(z_j, z_k)\right)} + \log \frac{\exp\left(\mathrm{sim}(z_j, z_k)\right)}{\exp\left(\mathrm{sim}(z_i, z_k)\right) + \exp\left(\mathrm{sim}(z_j, z_k)\right)} \right] \tag{3}$$
The optimum is reached when $\mathrm{sim}(z_i, z_k) = \mathrm{sim}(z_j, z_k)$ for all $i, j, k$. Note that the parallel corpus involved is not the target language's parallel corpus. For example, in Equation 3, $i$ and $j$ are in two languages that are parallel with each other, and $k$ is in a third (target) language that does not have any parallel corpus with other languages.
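Below is a minimal PyTorch sketch of Equation 3 under assumed inputs: `pair_emb` holds the parallel pairs as an (N, 2, d) tensor and `other_emb` holds the remaining in-batch embeddings playing the role of $z_k$ (drawn from parallel or non-parallel corpora, excluding the pair itself); the averaging constant is simplified relative to the $1/(N(N-2))$ normalization.

```python
import torch
import torch.nn.functional as F

def language_contrastive_loss(pair_emb: torch.Tensor, other_emb: torch.Tensor) -> torch.Tensor:
    """Language contrastive loss, Eq. (3), in simplified form.

    pair_emb:  (N, 2, d) embeddings of N parallel sentence pairs (z_i, z_j).
    other_emb: (M, d) embeddings of the remaining in-batch sentences z_k, which may
               come from parallel or non-parallel corpora.
    The loss is minimized when sim(z_i, z_k) == sim(z_j, z_k) for every pair and every k.
    """
    pair = F.normalize(pair_emb, dim=-1)           # (N, 2, d)
    other = F.normalize(other_emb, dim=-1)         # (M, d)
    # Cosine similarities from each side of each pair to every other sentence: (N, 2, M).
    sims = torch.einsum("npd,md->npm", pair, other)
    # Softmax over the pair dimension compares sim(z_i, z_k) against sim(z_j, z_k).
    log_probs = F.log_softmax(sims, dim=1)         # (N, 2, M)
    # Sum the two log terms for every (pair, k) combination, then average.
    return -log_probs.sum(dim=1).mean()

# Toy usage: 3 parallel pairs and 5 other in-batch sentences with random embeddings.
if __name__ == "__main__":
    torch.manual_seed(0)
    print(language_contrastive_loss(torch.randn(3, 2, 768), torch.randn(5, 768)).item())
```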
3.3 Semantic vs. Language Contrastive Losses
While both the semantic contrastive loss and the language contrastive loss serve to align the representations of parallel sentences and remove language bias, they achieve this goal differently: one contrasts against in-batch negative samples, while the other uses in-batch parallel examples to constrain the target language embeddings. Moreover, a key property of the language contrastive loss is that, as long as some parallel corpus is available, we can use this loss to remove language bias from the representations of sentences for which no parallel data exists, which makes it more broadly applicable.
4 Training
The two contrastive losses are applied to the passage encoder only. Experiments show that applying them to both the passage encoder and the query encoder results in unstable optimization, with abrupt jumps in the training loss curves.
The joint loss combining the information retrieval loss, the semantic contrastive loss, and the language contrastive loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{IR}} + w_s \mathcal{L}_{\mathrm{semaCL}} + w_l \mathcal{L}_{\mathrm{langCL}}, \tag{4}$$
where $w_s$ and $w_l$ are the weights of the semantic contrastive loss and the language contrastive loss, which need to be tuned for different tasks.
We train our model on 8 Nvidia Tesla V100 32GB GPUs with a batch size of 48. We use the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a learning rate of $10^{-5}$. For the three losses $\mathcal{L}_{\mathrm{IR}}$, $\mathcal{L}_{\mathrm{semaCL}}$, and $\mathcal{L}_{\mathrm{langCL}}$, we sequentially calculate the losses and their gradients. We use $w_s = 0.01$ and $w_l = 0.001$; these hyperparameters are determined through a simple grid search.
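The sketch below shows one way the joint objective in Equation 4 and the sequential gradient computation could be wired together; the loss callables, toy encoder, and update function are hypothetical placeholders rather than the paper's actual training code, and the gradients accumulated by the three backward calls are equivalent to backpropagating the weighted sum once.

```python
import torch
from torch.optim import AdamW

# Training hyperparameters reported in Section 4.
LEARNING_RATE = 1e-5
BETAS = (0.9, 0.999)
W_S, W_L = 0.01, 0.001   # weights for the semantic / language contrastive losses

def joint_update(compute_ir_loss, compute_sema_loss, compute_lang_loss, optimizer):
    """One optimization step for L = L_IR + w_s * L_semaCL + w_l * L_langCL (Eq. 4).

    Each loss is computed and backpropagated in turn, so only one loss graph is
    alive at a time; the accumulated gradients match those of the weighted sum.
    """
    optimizer.zero_grad()
    compute_ir_loss().backward()
    (W_S * compute_sema_loss()).backward()
    (W_L * compute_lang_loss()).backward()
    optimizer.step()

# Toy usage: a linear layer stands in for the encoders, and dummy scalar losses
# stand in for the IR and contrastive objectives.
if __name__ == "__main__":
    torch.manual_seed(0)
    encoder = torch.nn.Linear(16, 8)
    optimizer = AdamW(encoder.parameters(), lr=LEARNING_RATE, betas=BETAS)
    x = torch.randn(4, 16)
    joint_update(
        compute_ir_loss=lambda: encoder(x).pow(2).mean(),
        compute_sema_loss=lambda: encoder(x).abs().mean(),
        compute_lang_loss=lambda: encoder(x).mean().abs(),
        optimizer=optimizer,
    )
    print("one joint update completed")
```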