
Language Agnostic Multilingual Information Retrieval
with Contrastive Learning
Xiyang Hu1, Xinchi Chen2∗, Peng Qi2∗, Deguang Kong2, Kunlun Liu2,
William Yang Wang2, Zhiheng Huang2
1Carnegie Mellon University
2AWS AI Labs
xiyanghu@cmu.edu, {xcc,pengqi,kongdegu,kll}@amazon.com,
wyw@amazon.com, zhiheng@amazon.com
Abstract
Multilingual information retrieval (IR) is challenging since annotated training data is costly to obtain in many languages. We present an effective method for training multilingual IR systems when only English IR training data and some parallel corpora between English and other languages are available. We leverage parallel and non-parallel corpora to improve the cross-lingual transfer ability of pretrained multilingual language models. We design a semantic contrastive loss that aligns representations of parallel sentences that share the same semantics in different languages, and a new language contrastive loss that leverages parallel sentence pairs to remove language-specific information in sentence representations derived from non-parallel corpora. When trained on English IR data with these losses and evaluated zero-shot on non-English data, our model demonstrates significant improvement over prior work in retrieval performance, while requiring much less computational effort. We also demonstrate the value of our model in a practical setting where parallel corpora are available for only a few languages but remain absent for many other low-resource languages. Our model works well even with a small number of parallel sentences, and can be used as an add-on module with any backbone and on other tasks.
1 Introduction
Information retrieval (IR) is an important natural language processing task that helps users efficiently gather information from a large corpus (representative downstream tasks include question answering, summarization, search, and recommendation), but developing effective IR systems for all languages is challenging due to the cost of, and therefore lack of, annotated training data in many languages. While this problem is not unique to IR
1 Work done during an internship at AWS AI Labs.
* These authors contributed equally to this work.
Code: https://github.com/xiyanghu/multilingualIR.
[Figure 1 graphic: legend marks parallel pairs, negative samples, other samples, and pull-closer/push-away/distance arrows.]
(a) Semantic Contrastive Loss (b) Language Contrastive Loss
Figure 1: (a) The semantic contrastive loss encourages the embeddings of parallel pairs, i.e., sentences that have the same semantics but come from different languages, to be close to each other and far from the remaining negative samples, i.e., sentences with different semantics. (b) The language contrastive loss incorporates the non-parallel corpora in addition to the parallel ones. It encourages the distances from a sentence representation, which may be drawn from either the parallel or the non-parallel corpora, to the two embeddings of a parallel pair to be equal.
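To make the two objectives in Figure 1 concrete, the following is a minimal PyTorch sketch of how such losses could be implemented. The cosine-similarity scoring, the InfoNCE-style formulation, the temperature value, and all function names are illustrative assumptions rather than the paper's exact equations.

```python
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    """Pull each parallel pair together and push it away from in-batch
    negatives (an InfoNCE-style sketch). Row i of src_emb is assumed to
    be a translation of row i of tgt_emb; both are (batch, dim) tensors."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.T / temperature  # (batch, batch) cosine similarities
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric cross-entropy: each sentence must retrieve its translation
    # among all in-batch candidates, in both directions.
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

def language_contrastive_loss(anchor_emb, src_emb, tgt_emb):
    """Encourage any sentence (from parallel or non-parallel corpora)
    to be equidistant from the two members of a parallel pair, which
    discourages language-specific information in the embeddings."""
    anchor = F.normalize(anchor_emb, dim=-1)
    d_src = 1 - anchor @ F.normalize(src_emb, dim=-1).T  # cosine distances
    d_tgt = 1 - anchor @ F.normalize(tgt_emb, dim=-1).T
    # Penalize the gap between the anchor's distances to the two
    # embeddings of every parallel pair.
    return (d_src - d_tgt).abs().mean()
```

During training, these terms could simply be added to the English IR ranking objective. The hypothetical functions above take sentence embeddings from any encoder, which is consistent with using the losses as an add-on module to any backbone.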
research (Joshi et al., 2020), constructing IR data is often more costly due to the need to either translate a large text corpus or gather relevance annotations, or both, which makes it difficult to generalize IR models to lower-resource languages.
One solution is to leverage pretrained multilingual language models to encode queries and corpora for multilingual IR tasks (Zhang et al., 2021; Sun and Duh, 2020). One line of work on multilingual representation learning trains a masked language model, sometimes with the next sentence prediction task, on monolingual corpora of many languages; examples include mBERT and XLM-R (Conneau et al., 2020). These models generally do not explicitly learn alignment across languages and do not perform effectively in empirical IR experiments. Other works directly leverage multilingual parallel corpora or translation pairs to explicitly align sentences in two languages, such as InfoXLM (Chi et al., 2021) and LaBSE (Feng et al., 2022).
In this work, we propose a semantic contrastive loss and a language contrastive loss to