Making a MIRACL: Multilingual Information Retrieval
Across a Continuum of Languages
Xinyu Zhang*1, Nandan Thakur*1, Odunayo Ogundepo1, Ehsan Kamalloo2,3,
David Alfonso-Hermelo3, Xiaoguang Li3, Qun Liu3, Mehdi Rezagholizadeh3, Jimmy Lin1
1 David R. Cheriton School of Computer Science, University of Waterloo, Canada
2 Department of Computing Science, University of Alberta, Canada
3 Huawei Noah’s Ark Lab
ABSTRACT
MIRACL (Multilingual Information Retrieval Across a Continuum
of Languages) is a multilingual dataset we have built for the WSDM
2023 Cup challenge that focuses on ad hoc retrieval across 18
different languages, which collectively encompass over three billion
native speakers around the world. These languages have diverse
typologies, originate from many different language families, and are
associated with varying amounts of available resources, including
what researchers typically characterize as high-resource as well
as low-resource languages. Our dataset is designed to support the
creation and evaluation of models for monolingual retrieval, where
the queries and the corpora are in the same language. In total,
we have gathered over 700k high-quality relevance judgments for
around 77k queries over Wikipedia in these 18 languages, where all
assessments have been performed by native speakers hired by our
team. Our goal is to spur research that will improve retrieval across
a continuum of languages, thus enhancing information access
capabilities for diverse populations around the world, particularly those
that have been traditionally underserved. This overview paper
describes the dataset and baselines that we share with the community.
The MIRACL website is live at http://miracl.ai/.
1 INTRODUCTION
Information access is a fundamental human right. Specifically, the
Universal Declaration of Human Rights by the United Nations
articulates that “everyone has the right to freedom of opinion and
expression”, which includes the right “to seek, receive and impart
information and ideas through any media and regardless of frontiers”
(Article 19). Information access capabilities such as search, question
answering, and recommendation are important technologies for
safeguarding these ideals.
With the advent and dominance of deep learning and approaches
based on neural networks (particularly transformer-based large
language models) in natural language processing, information
retrieval, and beyond, the importance of large datasets as drivers of
progress is well understood [15]. For retrieval models in English, the
MS MARCO datasets [2, 5, 13] have had a transformative impact in
advancing the field. Similarly, for question answering (including
the so-called “open-domain” retrieval-based variant), there exist
many resources in English, such as SQuAD [21], TriviaQA [8],
and Natural Questions [12]. We have recently witnessed efforts in
building resources for non-English languages, for example,
CLIRMatrix [24], XTREME [7], MKQA [16], mMARCO [3], TyDi QA [4],
XOR-TyDi [1], and Mr. TyDi [28]. These initiatives complement
cross-lingual retrieval evaluations from TREC, CLEF, NTCIR, and
FIRE that date back many years, largely focused on specific
language pairs. Nevertheless, there remains a paucity of resources for
languages beyond English. Existing datasets are far from sufficient
to fully develop information access capabilities for the 7000+
languages spoken on our planet [9]. Our goal is to take a small step
towards addressing these issues.

* Equal contribution.

[Figure 1, shown here through the English glosses given in the figure]
Query: (When was the Final Fantasy game first released?)
Relevant passage (th.wikipedia): (Final Fantasy, also known as Final
Fantasy I, is a language game or RPG (Role-playing game) created by
Hironobu Sakaguchi, produced and distributed by Square for play on the
Nintendo Entertainment System (NES), also known as Famicom, was first
released in Japan on December 18, 1987.)
Irrelevant passage (th.wikipedia): (In addition, Final Fantasy has also
been recreated for play on a wide range of games such as MSX 2
WonderSwan and mobile phones after being released for the first time
for many years)
Figure 1: Examples of annotated query–passage pairs in
Thai (th) from MIRACL. Queries are generated by native
speakers and passages are prepared from the corresponding
Wikipedia in the same language (e.g., th.wikipedia.org
in this example).
To stimulate further advances in multilingual retrieval, we have
built the MIRACL dataset, comprising human-annotated passage-level
relevance judgments on Wikipedia for 18 languages, totaling
over 700k query–passage pairs on 77k queries. These languages, namely
Arabic (ar), Bengali (bn), English (en), Spanish (es), Farsi (fa),
Finnish (fi), French (fr), Hindi (hi), Indonesian (id), Japanese (ja),
Korean (ko), Russian (ru), Swahili (sw), Telugu (te), Thai (th),
Chinese (zh), and two “surprise” languages to be revealed later, are
written using 11 distinct scripts, originate from 10 different
language families, and collectively encompass over three billion native
speakers around the world. They include what the research
community would typically characterize as high-resource languages as
well as low-resource languages.
arXiv:2210.09984v1 [cs.IR] 18 Oct 2022

                                   Train             Dev          Test-A          Test-B
Language            ISO      # Q     # J     # Q     # J     # Q     # J     # Q     # J  # Passages  # Articles  In Mr. TyDi?
Arabic              ar     3,495  25,382   2,896  29,197     936   9,325   1,405  14,036   2,061,414     656,982  ✓
Bengali             bn     1,631  16,754     411   4,206     102   1,037   1,130  11,286     297,265      63,762  ✓
English             en     2,863  29,416     799   8,350     734   5,617   1,790  18,241  32,893,221   5,758,285  ✓
Spanish             es     2,162  21,531     648   6,443       –       –   1,515  15,074  10,373,953   1,669,181  ×
Persian             fa     2,107  21,844     632   6,571       –       –   1,476  15,313   2,207,172     857,827  ×
Finnish             fi     2,897  20,350   1,271  12,008   1,060  10,586     711   7,100   1,883,509     447,815  ✓
French              fr     1,143  11,426     343   3,429       –       –     801   8,008  14,636,953   2,325,608  ×
Hindi               hi     1,169  11,668     350   3,494       –       –     819   8,169     506,264     148,107  ×
Indonesian          id     4,071  41,358     960   9,668     731   7,430     611   6,098   1,446,315     446,330  ✓
Japanese            ja     3,477  34,387     860   8,354     650   6,922   1,141  11,410   6,953,614   1,133,444  ✓
Korean              ko       868  12,767     213   3,057     263   3,855   1,417  14,161   1,486,752     437,373  ✓
Russian             ru     4,683  33,921   1,252  13,100     911   8,777     718   7,174   9,543,918   1,476,045  ✓
Swahili             sw     1,901   9,359     482   5,092     638   6,615     465   4,620     131,924      47,793  ✓
Telugu              te     3,452  18,608     828   1,606     594   5,948     793   7,920     518,079      66,353  ✓
Thai                th     2,972  21,293     733   7,573     992  10,432     650   6,493     542,166     128,179  ✓
Chinese             zh     1,312  13,113     393   3,928       –       –     920   9,196   4,934,368   1,246,389  ×
Total                      40,203 343,177  13,071 126,076   7,611  76,544  16,362 164,299  90,416,887  16,909,473
Surprise Language 1            ?       ?       ?       ?       ?       ?       ?       ?           ?           ?  ×
Surprise Language 2            ?       ?       ?       ?       ?       ?       ?       ?           ?           ?  ×
Table 1: Descriptive statistics for each (language, split) combination in MIRACL: # Q denotes the number of queries; # J denotes
the number of judgments (both relevant and non-relevant). Statistics of each Wikipedia corpus are also provided: # Passages
denotes the number of passages in each language; # Articles denotes the number of Wikipedia articles from which the passages
were drawn. The final column indicates if the language is contained in Mr. TyDi (✓ yes, × no). MIRACL encompasses 18 languages
in total, 16 of which are known, with 2 “surprise” languages whose identities will be revealed in the future.

Figure 1 shows an example of a query from MIRACL in Thai
along with a relevant and a non-relevant passage. Along with the
MIRACL dataset, our broader efforts (i.e., the “MIRACL project”)
include organizing a WSDM 2023 Cup challenge
(https://www.wsdm-conference.org/2023/program/wsdm-cup) that provides a
common evaluation methodology, a leaderboard, and a venue for a
competition-style event with prizes. To provide starting points that
the community can rapidly build on, we also share reproducible
BM25, mDPR, and hybrid baselines as part of our Pyserini [14] and
Anserini [27] toolkits.
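
To make the BM25 baseline mentioned above more concrete, the following is a minimal sketch of textbook BM25 scoring over pre-tokenized passages. It is not the Pyserini or Anserini implementation: the function name bm25_scores, the toy passages, and the parameter values k1 = 0.9 and b = 0.4 are assumptions chosen for illustration, and whitespace-style tokenization is assumed to have been done already (which would not be adequate for every MIRACL language).

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query_terms: List[str], docs: List[List[str]],
                k1: float = 0.9, b: float = 0.4) -> List[float]:
    """Score every tokenized passage in `docs` against the query (textbook BM25).

    Assumes `docs` is non-empty. k1 and b are illustrative values, not tuned ones.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many passages each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # terms absent from the passage contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with passage-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy passages, already tokenized.
passages = [
    ["final", "fantasy", "was", "released", "in", "1987"],
    ["the", "msx", "port", "came", "later"],
    ["fantasy", "novels"],
]
print(bm25_scores(["final", "fantasy"], passages))
```

A passage that shares no term with the query scores exactly zero, which is one reason sparse baselines of this kind are often combined with dense retrievers in hybrid systems.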
This paper focuses on providing a descriptive overview of the
MIRACL dataset to coincide with our initial data release. It is our
intention to periodically update this document with additional
details about the WSDM 2023 Cup challenge as well as the broader
MIRACL project over time.
2 DATASET OVERVIEW
MIRACL (Multilingual Information Retrieval Across a Continuum
of Languages) is a multilingual retrieval dataset that spans 18
different languages (see Table 1). More precisely, the task we model
is the standard ad hoc retrieval task as defined by the information
retrieval community, where given a corpus C, the system’s task
is to return for a given query 𝑞 an ordered list of the top-𝑘 documents
from C that maximizes some standard metric of quality such as nDCG.
In our formulation, a query 𝑞 is a well-formed natural language
question in some language L_n (one of 18) and the documents are drawn
from a snapshot of Wikipedia in the same language, C_n, that has
been pre-segmented into passages (and thus each passage has a
fixed unique identifier).
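
The task formulation above can be illustrated with a small sketch: a corpus maps fixed passage identifiers to passage text, and a retriever returns the top-𝑘 identifiers ordered by score. Everything here is hypothetical and only for illustration: the function names retrieve and overlap_score, the identifiers p1 to p3, and the naive term-overlap scorer, which merely stands in for a real model such as BM25 or mDPR.

```python
from typing import Callable, Dict, List, Tuple


def retrieve(query: str, corpus: Dict[str, str],
             score_fn: Callable[[str, str], float],
             k: int = 10) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (passage_id, score) pairs, best first."""
    scored = [(pid, score_fn(query, text)) for pid, text in corpus.items()]
    # Sort by descending score; ties are broken by passage id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer: number of distinct lowercased tokens shared by query and passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


# Hypothetical monolingual corpus: queries and passages are in the same language.
corpus = {
    "p1": "final fantasy was first released in 1987",
    "p2": "final fantasy was remade for the msx",
    "p3": "swahili is spoken in east africa",
}
print(retrieve("when was final fantasy first released", corpus, overlap_score, k=2))
```

Keeping the scorer as a parameter mirrors the setting described here: the corpus and its passage identifiers stay fixed, while different retrieval models can be swapped in and compared on the same ranked-list output.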
Thus, our focus is monolingual retrieval across diverse languages,
where the queries and the corpora are in the same language (e.g.,
Thai queries searching Thai documents), as opposed to cross-lingual
retrieval, where the queries and the corpora are in different
languages (e.g., searching a Swahili corpus with Arabic queries). As a
terminological note, consistent with the parlance in information
retrieval, we use the term “document” to refer generically to the
unit of retrieval, even though in this case the “documents” are in
actuality passages from Wikipedia.
In total, we have gathered over 700k manual relevance judgments
(i.e., query–passage pairs) for around 77k queries across Wikipedia
in these 18 languages, where all assessments have been performed
by native speakers. Section 3 describes the annotation process in
more detail. For each language, these relevance judgments are
divided into a training set, a development set, and two test sets (that
we call test-A and test-B); more details below. The MIRACL dataset
is released under an Apache 2.0 License. To evaluate the quality of
the system output, we use standard information retrieval metrics
such as nDCG at a fixed cutoff and recall at a fixed cutoff.
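
As a worked illustration of the two metrics named above, the sketch below computes nDCG and recall at a cutoff 𝑘 for a single query. It assumes the common linear-gain form of DCG with a log2 rank discount and treats unjudged passages as non-relevant; the function names and the toy judgments are hypothetical, and official evaluation tools may differ in such details.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """nDCG@k for one query; passages missing from qrels count as non-relevant."""
    gains = [qrels.get(pid, 0) for pid in ranking[:k]]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranking: List[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of the judged-relevant passages that appear in the top k."""
    relevant = {pid for pid, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)


# Hypothetical judgments for one query: 1 = relevant, 0 = non-relevant.
qrels = {"p1": 1, "p2": 0, "p4": 1}
ranking = ["p1", "p2", "p3", "p4"]
print(ndcg_at_k(ranking, qrels, k=2), recall_at_k(ranking, qrels, k=2))
```

In this toy case only one of the two relevant passages appears in the top 2, so recall at 2 is one half, while nDCG additionally rewards that the retrieved relevant passage sits at rank 1.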
The MIRACL dataset is built on the Mr. TyDi multilingual
retrieval dataset [28], which is in turn built on the TyDi QA dataset [4].
Beyond the 11 languages originally covered by Mr. TyDi and TyDi
QA, we included 7 additional languages. Details below:
Existing (Known) Languages: Mr. TyDi and TyDi QA cover
11 languages: Arabic (ar), Bengali (bn), English (en), Finnish (fi),
Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili (sw),
Telugu (te), and Thai (th). We take advantage of existing
queries in these languages as a starting point and provide “denser”