Language Agnostic Multilingual Information Retrieval with Contrastive Learning
Xiyang Hu1, Xinchi Chen2*, Peng Qi2*, Deguang Kong2, Kunlun Liu2, William Yang Wang2, Zhiheng Huang2
1Carnegie Mellon University
2AWS AI Labs
xiyanghu@cmu.edu, {xcc,pengqi,kongdegu,kll}@amazon.com, wyw@amazon.com, zhiheng@amazon.com
Abstract
Multilingual information retrieval (IR) is challenging since annotated training data is costly to obtain in many languages. We present an effective method to train multilingual IR systems when only English IR training data and some parallel corpora between English and other languages are available. We leverage parallel and non-parallel corpora to improve the pretrained multilingual language models' cross-lingual transfer ability. We design a semantic contrastive loss to align representations of parallel sentences that share the same semantics in different languages, and a new language contrastive loss to leverage parallel sentence pairs to remove language-specific information in sentence representations from non-parallel corpora. When trained on English IR data with these losses and evaluated zero-shot on non-English data, our model demonstrates significant improvements in retrieval performance over prior work, while requiring much less computational effort. We also demonstrate the value of our model in a practical setting where parallel corpora are available for only a few languages but remain lacking for many other low-resource languages. Our model works well even with a small number of parallel sentences, and can be used as an add-on module to any backbone and to other tasks.
1 Introduction
Information retrieval (IR) is an important natural language processing task that helps users efficiently gather information from a large corpus (representative downstream tasks include question answering, summarization, search, recommendation, etc.), but developing effective IR systems for all languages is challenging due to the cost of, and therefore lack of, annotated training data in many languages.
1 Work done during an internship at AWS AI Labs.
* These authors contributed equally to this work.
Code: https://github.com/xiyanghu/multilingualIR
Figure 1: (a) The semantic contrastive loss encourages the embeddings of parallel pairs, i.e., sentences that have the same semantics but come from different languages, to be close to each other and away from the remaining negative samples, i.e., sentences with different semantics. (b) The language contrastive loss incorporates the non-parallel corpora in addition to the parallel ones. It encourages the distances from a sentence representation, which can be a sample from either the parallel corpora or the non-parallel corpora, to the two embeddings of a parallel pair to be the same.
While this problem is not unique to IR research (Joshi et al., 2020), constructing IR data is often more costly due to the need to either translate a large text corpus or gather relevance annotations, or both, which makes it difficult to generalize IR models to lower-resource languages.
One solution is to leverage pretrained multilingual language models to encode queries and corpora for multilingual IR tasks (Zhang et al., 2021; Sun and Duh, 2020). One line of work on multilingual representation learning trains a masked language model, sometimes with a next sentence prediction task, on monolingual corpora of many languages; examples include mBERT and XLM-R (Conneau et al., 2020). These models generally do not explicitly learn the alignment across different languages and do not perform effectively in empirical IR experiments. Other works directly leverage multilingual parallel corpora or translation pairs to explicitly align sentences in two languages, such as InfoXLM (Chi et al., 2021) and LaBSE (Feng et al., 2022).
In this work, we propose a semantic contrastive loss and a language contrastive loss that are trained jointly with the information retrieval objective to learn cross-lingual representations that encourage efficient cross-lingual transfer on retrieval tasks. Our semantic contrastive loss aims to align the embeddings of sentences that have the same semantics. It is similar to the standard InfoNCE loss (Oord et al., 2018): it forces the representations of parallel sentence pairs in two languages to be close to each other and away from other negative samples. Our language contrastive loss aims to leverage the non-parallel corpora of languages without any parallel data, which are ignored by the semantic contrastive loss. It addresses the practical scenario wherein parallel corpora are easily accessible for a few languages but are lacking for many low-resource languages. The language contrastive loss encourages the distances from a sentence representation to the two embeddings of a parallel pair to be the same. Figure 1 illustrates how the two losses improve language alignment.
In experiments, we evaluate the zero-shot cross-lingual transfer ability of our model on monolingual information retrieval tasks in 10 different languages. Experimental results show that our proposed method obtains significant gains and can be used as an add-on module to any backbone. We also demonstrate that our method is much more computationally efficient than prior work. Our method works well with only a small number of parallel sentence pairs, and it works well on languages without any parallel corpora.
2 Background: Multilingual DPR
Dense Passage Retriever (DPR) (Karpukhin et al., 2020) uses a dual-encoder structure to encode queries and passages separately for information retrieval. To generalize to multilingual scenarios, we replace DPR's original BERT encoders with a multilingual language model, XLM-R (Conneau et al., 2020), to transfer English training knowledge to other languages.
Concretely, given a batch of $N$ query-passage pairs $(q_i, p_i)$, we consider all other passages $p_j$, $j \neq i$, in the batch irrelevant (negative) passages, and optimize the retrieval loss function as the negative log-likelihood of the gold passage:
$$\mathcal{L}_{\mathrm{IR}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(q_i, p_i)\right)}{\exp\left(\mathrm{sim}(q_i, p_i)\right) + \sum_{j=1, j \neq i}^{N} \exp\left(\mathrm{sim}(q_i, p_j)\right)} \tag{1}$$
where the similarity of two vectors is the cosine similarity, $\mathrm{sim}(u, v) = \frac{u^\top v}{\|u\|\,\|v\|}$.
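For concreteness, the following is a minimal PyTorch sketch of this in-batch negative retrieval loss, assuming the encoders have already produced query and passage embeddings; the batch size and embedding dimension in the toy usage are arbitrary placeholders, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def cosine_sim_matrix(queries: torch.Tensor, passages: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between query and passage embeddings."""
    q = F.normalize(queries, dim=-1)
    p = F.normalize(passages, dim=-1)
    return q @ p.T  # (N, N)

def retrieval_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the gold passage, Eq. (1): the i-th passage is
    the positive for the i-th query; all other in-batch passages act as negatives."""
    sim = cosine_sim_matrix(query_emb, passage_emb)          # (N, N)
    targets = torch.arange(sim.size(0), device=sim.device)   # gold passage index = row index
    return F.cross_entropy(sim, targets)                     # mean of -log softmax over the batch

# Toy usage with random embeddings standing in for XLM-R encoder outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    q = torch.randn(4, 768)   # 4 query embeddings
    p = torch.randn(4, 768)   # 4 gold passage embeddings, aligned by index
    print(retrieval_loss(q, p).item())
```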
3 Contrastive Learning for Cross-Lingual Generalization
The multilingual dense passage retriever only uses English corpora for training. To improve the model's generalization ability to other languages, we leverage two contrastive losses: a semantic contrastive loss and a language contrastive loss. Figure 2 shows our model framework.
Specifically, the semantic contrastive loss (Chen et al., 2020a) pushes the embedding vectors of a pair of parallel sentences close to each other and, at the same time, away from other in-batch samples that have different semantics. The language contrastive loss targets the scenario in which there are no parallel corpora for some languages; it encourages the distances from a sentence embedding to the two embeddings of a parallel pair to be the same.
3.1 Semantic Contrastive Loss
To learn a language-agnostic IR model, we wish to encode sentences that have the same semantics but come from different languages into the same embeddings. For each parallel corpora batch, we do not limit our samples to one specific language pair; instead, we randomly sample different language pairs for a batch. For example, a sampled batch could contain multiple language pairs such as En-Ar, En-Ru, and En-Zh. This strategy increases the difficulty of our contrastive learning and makes training more stable.
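As an illustration of this batching strategy, the sketch below draws each parallel pair from a randomly chosen language pair, so a single batch mixes En-Ar, En-Ru, En-Zh, and so on; the dictionary-of-pairs data structure and function name are hypothetical stand-ins, not the paper's data pipeline.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical parallel data: language pair -> list of (English sentence, translation).
ParallelData = Dict[str, List[Tuple[str, str]]]

def sample_mixed_batch(parallel_data: ParallelData, num_pairs: int,
                       rng: random.Random) -> List[Tuple[str, str, str]]:
    """Sample `num_pairs` parallel sentence pairs, each from a randomly chosen
    language pair, so one batch mixes several language pairs."""
    lang_pairs = list(parallel_data.keys())
    batch = []
    for _ in range(num_pairs):
        lp = rng.choice(lang_pairs)                  # e.g. "en-ar", "en-ru", "en-zh"
        src, tgt = rng.choice(parallel_data[lp])
        batch.append((lp, src, tgt))
    return batch

if __name__ == "__main__":
    data = {
        "en-ar": [("Hello world.", "مرحبا بالعالم.")],
        "en-ru": [("Hello world.", "Привет, мир.")],
        "en-zh": [("Hello world.", "你好，世界。")],
    }
    print(sample_mixed_batch(data, num_pairs=4, rng=random.Random(0)))
```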
Concretely, we randomly sample a mini-batch of $2N$ data points ($N$ here does not have to be the same value as the $N$ in Section 2). The batch contains $N$ pairs of parallel sentences from multiple different languages. Given a positive pair $z_i$ and $z_j$, the embedding vectors of a pair of parallel sentences $(i, j)$ from two languages, the remaining $2(N-1)$ samples are used as negative samples. The semantic contrastive loss for a batch is:
$$\mathcal{L}_{\mathrm{semaCL}} = -\frac{1}{2N} \sum_{(i,j)} \left[ \log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1, k \neq i}^{2N} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} + \log \frac{\exp\left(\mathrm{sim}(z_j, z_i)/\tau\right)}{\sum_{k=1, k \neq j}^{2N} \exp\left(\mathrm{sim}(z_j, z_k)/\tau\right)} \right] \tag{2}$$
where $\tau$ is a temperature hyperparameter.
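A minimal PyTorch sketch of Equation 2 follows, assuming the batch is arranged so that rows $(2m, 2m+1)$ hold the two sides of each parallel pair and every other row serves as one of the $2(N-1)$ negatives; the batch layout and the temperature default are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(z: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Semantic contrastive loss, Eq. (2).

    `z` holds 2N sentence embeddings arranged so that rows (2m, 2m+1) are a
    parallel pair; every other in-batch sample acts as a negative.
    """
    z = F.normalize(z, dim=-1)
    sim = (z @ z.T) / tau                                 # (2N, 2N) cosine similarity / temperature
    two_n = z.size(0)
    # Exclude self-similarity so the denominator sums over k != i.
    self_mask = torch.eye(two_n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # Row 2m has its positive at row 2m+1 and vice versa (XOR flips the last bit).
    targets = torch.arange(two_n, device=z.device) ^ 1
    # Cross-entropy averages -log softmax of the positive over all k != i across the 2N rows.
    return F.cross_entropy(sim, targets)

# Toy usage: 3 parallel pairs (2N = 6) with random embeddings standing in for XLM-R outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    print(semantic_contrastive_loss(torch.randn(6, 768)).item())
```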
[Figure 2 diagram: a query encoder and a passage encoder (e.g., XLM-R) embed queries, passages, and parallel sentences; cosine similarity between the embeddings feeds the information retrieval task, the semantic contrastive loss, and the language contrastive loss.]
Figure 2: Our model framework contains two parts: the main task (IR) and the parallel corpora task. For the main task, we use a dual-encoder dense passage retrieval module for information retrieval. For the parallel corpora task, we adopt the semantic contrastive loss to improve cross-lingual domain adaptation with parallel corpora. We also use the language contrastive loss, which leverages parallel and non-parallel corpora together.
3.2 Language Contrastive Loss
When training multilingual IR systems, we might not always have parallel corpora for all languages of interest. In a realistic scenario, we have easy access to parallel corpora for a few high-resource languages, but no such availability for many low-resource languages. We propose a language contrastive loss to generalize the model's ability to languages that do not have any parallel corpora.
For a batch $B$ consisting of both parallel corpora $P$ and non-parallel corpora $Q$, we denote $z_i$ and $z_j$ as the embeddings of a pair of parallel sentences $(i, j)$ from two languages. We wish the cosine similarity from any other sentence embedding $z_k$ to the two embeddings of a parallel pair to be the same. Therefore, we minimize the following loss:
$$\mathcal{L}_{\mathrm{langCL}} = -\frac{1}{N(N-2)} \sum_{(i,j) \in P} \; \sum_{k \in (P \cup Q) \setminus \{i,j\}} \left[ \log \frac{\exp\left(\mathrm{sim}(z_i, z_k)\right)}{\exp\left(\mathrm{sim}(z_i, z_k)\right) + \exp\left(\mathrm{sim}(z_j, z_k)\right)} + \log \frac{\exp\left(\mathrm{sim}(z_j, z_k)\right)}{\exp\left(\mathrm{sim}(z_i, z_k)\right) + \exp\left(\mathrm{sim}(z_j, z_k)\right)} \right] \tag{3}$$
The optimum is reached when $\mathrm{sim}(z_i, z_k) = \mathrm{sim}(z_j, z_k)$ for all $i, j, k$. Note that the parallel corpus involved is not the target language's parallel corpus. For example, in Equation 3, $i$ and $j$ are in two languages that are parallel with each other, and $k$ is in a third (target) language that does not have any parallel corpus with other languages.
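Below is a minimal PyTorch sketch of Equation 3 under assumed inputs: `pair_emb` holds the parallel pairs as an (N, 2, d) tensor and `other_emb` holds the remaining in-batch embeddings playing the role of $z_k$ (drawn from parallel or non-parallel corpora, excluding the pair itself); the averaging constant is simplified relative to the $1/(N(N-2))$ normalization.

```python
import torch
import torch.nn.functional as F

def language_contrastive_loss(pair_emb: torch.Tensor, other_emb: torch.Tensor) -> torch.Tensor:
    """Language contrastive loss, Eq. (3), in simplified form.

    pair_emb:  (N, 2, d) embeddings of N parallel sentence pairs (z_i, z_j).
    other_emb: (M, d) embeddings of the remaining in-batch sentences z_k, which may
               come from parallel or non-parallel corpora.
    The loss is minimized when sim(z_i, z_k) == sim(z_j, z_k) for every pair and every k.
    """
    pair = F.normalize(pair_emb, dim=-1)           # (N, 2, d)
    other = F.normalize(other_emb, dim=-1)         # (M, d)
    # Cosine similarities from each side of each pair to every other sentence: (N, 2, M).
    sims = torch.einsum("npd,md->npm", pair, other)
    # Softmax over the pair dimension compares sim(z_i, z_k) against sim(z_j, z_k).
    log_probs = F.log_softmax(sims, dim=1)         # (N, 2, M)
    # Sum the two log terms for every (pair, k) combination, then average.
    return -log_probs.sum(dim=1).mean()

# Toy usage: 3 parallel pairs and 5 other in-batch sentences with random embeddings.
if __name__ == "__main__":
    torch.manual_seed(0)
    print(language_contrastive_loss(torch.randn(3, 2, 768), torch.randn(5, 768)).item())
```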
3.3 Semantic vs. Language Contrastive Losses
While both the semantic contrastive loss and the language contrastive loss serve to align the representations of parallel sentences and remove language bias, they achieve this goal differently: one contrasts against in-batch negative samples, while the other uses in-batch parallel examples to constrain the target language embeddings. Moreover, a key property of the language contrastive loss is that, as long as some parallel corpus is available, we can use this loss to remove language bias from the representations of sentences for which no parallel data exists, which makes it more broadly applicable.
4 Training
The two contrastive losses are applied to the passage encoder only. Experiments show that applying them to both the passage encoder and the query encoder results in unstable optimization, with abrupt jumps in the training loss curves.
The joint loss combining the information retrieval loss, the semantic contrastive loss, and the language contrastive loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{IR}} + w_s \mathcal{L}_{\mathrm{semaCL}} + w_l \mathcal{L}_{\mathrm{langCL}}, \tag{4}$$
where $w_s$ and $w_l$ are the weights of the semantic contrastive loss and the language contrastive loss, which need to be tuned for different tasks.
We train our model on 8 Nvidia Tesla V100 32GB GPUs with a batch size of 48. We use the AdamW optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a learning rate of $10^{-5}$. For the three losses $\mathcal{L}_{\mathrm{IR}}$, $\mathcal{L}_{\mathrm{semaCL}}$, and $\mathcal{L}_{\mathrm{langCL}}$, we sequentially calculate the losses and their gradients. We use $w_s = 0.01$ and $w_l = 0.001$; these hyperparameters are determined through a simple grid search.
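The sketch below shows one way the joint objective in Equation 4 and the sequential gradient computation could be wired together; the loss callables, toy encoder, and update function are hypothetical placeholders rather than the paper's actual training code, and the gradients accumulated by the three backward calls are equivalent to backpropagating the weighted sum once.

```python
import torch
from torch.optim import AdamW

# Training hyperparameters reported in Section 4.
LEARNING_RATE = 1e-5
BETAS = (0.9, 0.999)
W_S, W_L = 0.01, 0.001   # weights for the semantic / language contrastive losses

def joint_update(compute_ir_loss, compute_sema_loss, compute_lang_loss, optimizer):
    """One optimization step for L = L_IR + w_s * L_semaCL + w_l * L_langCL (Eq. 4).

    Each loss is computed and backpropagated in turn, so only one loss graph is
    alive at a time; the accumulated gradients match those of the weighted sum.
    """
    optimizer.zero_grad()
    compute_ir_loss().backward()
    (W_S * compute_sema_loss()).backward()
    (W_L * compute_lang_loss()).backward()
    optimizer.step()

# Toy usage: a linear layer stands in for the encoders, and dummy scalar losses
# stand in for the IR and contrastive objectives.
if __name__ == "__main__":
    torch.manual_seed(0)
    encoder = torch.nn.Linear(16, 8)
    optimizer = AdamW(encoder.parameters(), lr=LEARNING_RATE, betas=BETAS)
    x = torch.randn(4, 16)
    joint_update(
        compute_ir_loss=lambda: encoder(x).pow(2).mean(),
        compute_sema_loss=lambda: encoder(x).abs().mean(),
        compute_lang_loss=lambda: encoder(x).mean().abs(),
        optimizer=optimizer,
    )
    print("one joint update completed")
```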