Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages

Xinyu Zhang∗1, Nandan Thakur∗1, Odunayo Ogundepo1, Ehsan Kamalloo2,3, David Alfonso-Hermelo3, Xiaoguang Li3, Qun Liu3, Mehdi Rezagholizadeh3, Jimmy Lin1

1 David R. Cheriton School of Computer Science, University of Waterloo, Canada
2 Department of Computing Science, University of Alberta, Canada
3 Huawei Noah’s Ark Lab

ABSTRACT

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc retrieval across 18 different languages, which collectively encompass over three billion native speakers around the world. These languages have diverse typologies, originate from many different language families, and are associated with varying amounts of available resources, including what researchers typically characterize as high-resource as well as low-resource languages. Our dataset is designed to support the creation and evaluation of models for monolingual retrieval, where the queries and the corpora are in the same language. In total, we have gathered over 700k high-quality relevance judgments for around 77k queries over Wikipedia in these 18 languages, where all assessments have been performed by native speakers hired by our team. Our goal is to spur research that will improve retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have been traditionally underserved. This overview paper describes the dataset and baselines that we share with the community. The MIRACL website is live at http://miracl.ai/.

1 INTRODUCTION

Information access is a fundamental human right. Specifically, the Universal Declaration of Human Rights by the United Nations articulates that “everyone has the right to freedom of opinion and expression”, which includes the right “to seek, receive and impart information and ideas through any media and regardless of frontiers” (Article 19). Information access capabilities such as search, question answering, and recommendation are important technologies for safeguarding these ideals.

With the advent and dominance of deep learning and approaches based on neural networks (particularly transformer-based large language models) in natural language processing, information retrieval, and beyond, the importance of large datasets as drivers of progress is well understood. For retrieval models in English, the MS MARCO datasets have had a transformative impact in advancing the field. Similarly, for question answering (including the so-called “open-domain” retrieval-based variant), there exist many resources in English, such as SQuAD, TriviaQA, and Natural Questions. We have recently witnessed efforts in building resources for non-English languages, for example, CLIRMatrix, XTREME, MKQA, mMARCO, TyDi QA, XOR-TyDi, and Mr. TyDi. These initiatives complement cross-lingual retrieval evaluations from TREC, CLEF, NTCIR, and FIRE that date back many years, largely focused on specific language pairs. Nevertheless, there remains a paucity of resources for languages beyond English. Existing datasets are far from sufficient to fully develop information access capabilities for the 7000+ languages spoken on our planet. Our goal is to take a small step towards addressing these issues.

∗ Equal contribution.

Figure 1: Examples of annotated query–passage pairs in Thai (th) from MIRACL. Queries are generated by native speakers and passages are prepared from the corresponding Wikipedia in the same language (e.g., th.wikipedia.org in this example).

Query (English gloss): When was the Final Fantasy game first released?

Relevant passage (th.wikipedia): ไฟนอลแฟนตาซี หรือรู้จักกันในนาม ไฟนอลแฟนตาซี I เป็นเกมภาษา หรือ เกมแนว RPG (Role-playing game) ที่สร้างขึ้นโดยฮิโรโนบุ ซากากุจิ ผลิตและจัดจำหน่ายโดย สแควร์ สำหรับเล่นบนเครื่องเกม Nintendo Entertainment System (NES) หรือที่รู้จักกันในนาม แฟมิคอม วางตลาดครั้งแรกใน ญี่ปุ่น เมื่อวันที่ 18 ธันวาคม พ.ศ. 2530
(Final Fantasy, also known as Final Fantasy I, is a language game or RPG (Role-playing game) created by Hironobu Sakaguchi, produced and distributed by Square for play on the Nintendo Entertainment System (NES), also known as Famicom, was first released in Japan on December 18, 1987.)

Irrelevant passage (th.wikipedia): นอกจากนี้ ไฟนอลแฟนตาซี ยังได้ถูกสร้างใหม่ไว้สำหรับเล่นบนเครื่องเกมอีกหลายประเภท เช่น MSX 2 WonderSwan และโทรศัพท์มือถือ หลังจากออกจำหน่ายครั้งแรกมาหลายปี
(In addition, Final Fantasy has also been recreated for play on a wide range of games such as MSX 2 WonderSwan and mobile phones after being released for the first time for many years)

To stimulate further advances in multilingual retrieval, we have built the MIRACL dataset, comprising human-annotated passage-level relevance judgments on Wikipedia for 18 languages, totaling over 700k query–passage pairs on 77k queries. These languages are Arabic (ar), Bengali (bn), English (en), Spanish (es), Farsi (fa), Finnish (fi), French (fr), Hindi (hi), Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili (sw), Telugu (te), Thai (th), Chinese (zh), and two “surprise” languages to be revealed later. They are written using 11 distinct scripts, originate from 10 different language families, and collectively encompass over three billion native speakers around the world. They include what the research community would typically characterize as high-resource languages as well as low-resource languages.

Figure 1 shows an example of a query from MIRACL in Thai along with a relevant and a non-relevant passage. Along with the MIRACL dataset, our broader efforts (i.e., the “MIRACL project”) include organizing a WSDM 2023 Cup challenge¹ that provides a common evaluation methodology, a leaderboard, and a venue for a competition-style event with prizes. To provide starting points that the community can rapidly build on, we also share reproducible BM25, mDPR, and hybrid baselines as part of our Pyserini and Anserini [27] toolkits.

¹ https://www.wsdm-conference.org/2023/program/wsdm-cup

arXiv:2210.09984v1 [cs.IR] 18 Oct 2022

| Language | ISO | Train # Q | Train # J | Dev # Q | Dev # J | Test-A # Q | Test-A # J | Test-B # Q | Test-B # J | # Passages | # Articles | In Mr. TyDi? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arabic | ar | 3,495 | 25,382 | 2,896 | 29,197 | 936 | 9,325 | 1,405 | 14,036 | 2,061,414 | 656,982 | ✓ |
| Bengali | bn | 1,631 | 16,754 | 411 | 4,206 | 102 | 1,037 | 1,130 | 11,286 | 297,265 | 63,762 | ✓ |
| English | en | 2,863 | 29,416 | 799 | 8,350 | 734 | 5,617 | 1,790 | 18,241 | 32,893,221 | 5,758,285 | ✓ |
| Spanish | es | 2,162 | 21,531 | 648 | 6,443 | – | – | 1,515 | 15,074 | 10,373,953 | 1,669,181 | × |
| Persian | fa | 2,107 | 21,844 | 632 | 6,571 | – | – | 1,476 | 15,313 | 2,207,172 | 857,827 | × |
| Finnish | fi | 2,897 | 20,350 | 1,271 | 12,008 | 1,060 | 10,586 | 711 | 7,100 | 1,883,509 | 447,815 | ✓ |
| French | fr | 1,143 | 11,426 | 343 | 3,429 | – | – | 801 | 8,008 | 14,636,953 | 2,325,608 | × |
| Hindi | hi | 1,169 | 11,668 | 350 | 3,494 | – | – | 819 | 8,169 | 506,264 | 148,107 | × |
| Indonesian | id | 4,071 | 41,358 | 960 | 9,668 | 731 | 7,430 | 611 | 6,098 | 1,446,315 | 446,330 | ✓ |
| Japanese | ja | 3,477 | 34,387 | 860 | 8,354 | 650 | 6,922 | 1,141 | 11,410 | 6,953,614 | 1,133,444 | ✓ |
| Korean | ko | 868 | 12,767 | 213 | 3,057 | 263 | 3,855 | 1,417 | 14,161 | 1,486,752 | 437,373 | ✓ |
| Russian | ru | 4,683 | 33,921 | 1,252 | 13,100 | 911 | 8,777 | 718 | 7,174 | 9,543,918 | 1,476,045 | ✓ |
| Swahili | sw | 1,901 | 9,359 | 482 | 5,092 | 638 | 6,615 | 465 | 4,620 | 131,924 | 47,793 | ✓ |
| Telugu | te | 3,452 | 18,608 | 828 | 1,606 | 594 | 5,948 | 793 | 7,920 | 518,079 | 66,353 | ✓ |
| Thai | th | 2,972 | 21,293 | 733 | 7,573 | 992 | 10,432 | 650 | 6,493 | 542,166 | 128,179 | ✓ |
| Chinese | zh | 1,312 | 13,113 | 393 | 3,928 | – | – | 920 | 9,196 | 4,934,368 | 1,246,389 | × |
| Total | | 40,203 | 343,177 | 13,071 | 126,076 | 7,611 | 76,544 | 16,362 | 164,299 | 90,416,887 | 16,909,473 | |
| Surprise Language 1 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | × |
| Surprise Language 2 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | × |

Table 1: Descriptive statistics for each (language, split) combination in MIRACL: # Q denotes the number of queries; # J denotes the number of judgments (both relevant and non-relevant). Statistics of each Wikipedia corpus are also provided: # Passages denotes the number of passages in each language; # Articles denotes the number of Wikipedia articles from which the passages were drawn. The final column indicates if the language is contained in Mr. TyDi. MIRACL encompasses 18 languages in total: 16 of which are known, with 2 “surprise” languages whose identities will be revealed in the future.
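
As a quick sanity check on the headline figures (over 700k judgments for around 77k queries), the per-split values in the Total row of Table 1 can be summed. The sketch below is only an illustration: the dictionary layout and function name are our own choices, and the numbers cover the 16 known languages only, since the two surprise languages are not yet reported.

```python
# Per-split totals copied from the "Total" row of Table 1
# (16 known languages; the surprise languages are not included).
TOTALS = {
    "train": {"queries": 40_203, "judgments": 343_177},
    "dev": {"queries": 13_071, "judgments": 126_076},
    "test-A": {"queries": 7_611, "judgments": 76_544},
    "test-B": {"queries": 16_362, "judgments": 164_299},
}


def overall(totals):
    """Sum queries and judgments over all splits."""
    queries = sum(split["queries"] for split in totals.values())
    judgments = sum(split["judgments"] for split in totals.values())
    return queries, judgments


total_queries, total_judgments = overall(TOTALS)
print(f"queries:   {total_queries:,}")
print(f"judgments: {total_judgments:,}")
print(f"judgments per query: {total_judgments / total_queries:.1f}")
```

Under these numbers, the known languages alone account for 77,247 queries and 710,096 judgments, which is consistent with the rounded figures quoted in the text.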

This paper focuses on providing a descriptive overview of the MIRACL dataset to coincide with our initial data release. It is our intention to periodically update this document with additional details about the WSDM 2023 Cup challenge as well as the broader MIRACL project over time.

2 DATASET OVERVIEW

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that spans 18 different languages (see Table 1). More precisely, the task we model is the standard ad hoc retrieval task as defined by the information retrieval community, where given a corpus $\mathcal{C}$, the system’s task is to return for a given query $q$ an ordered list of top-$k$ documents from $\mathcal{C}$ that maximizes some standard metric of quality such as nDCG. In our formulation, a query $q$ is a well-formed natural language question in some language $\mathcal{L}_n$ (one of 18) and the documents are drawn from a snapshot of Wikipedia in the same language, $\mathcal{C}_n$, that has been pre-segmented into passages (and thus each passage has a fixed unique identifier).
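
To make the task formulation concrete, the following minimal sketch ranks a toy corpus of passages for one query and returns the top-k passage identifiers. It assumes a simplified BM25 scorer with whitespace tokenization; it is not the Pyserini or Anserini baseline, and the passage identifiers, parameter values, and function names are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization is a simplification; many of the MIRACL
    # languages need a language-specific analyzer in practice.
    return text.lower().split()


def bm25_rank(query, corpus, k=10, k1=0.9, b=0.4):
    """Return the top-k (passage_id, score) pairs for `query` over `corpus`.

    `corpus` maps a fixed unique passage identifier to the passage text.
    """
    docs = {pid: tokenize(text) for pid, text in corpus.items()}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: number of passages containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for pid, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[pid] = score

    # Highest score first; ties are broken by passage id for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical toy corpus in the spirit of the Figure 1 example.
corpus = {
    "doc1#0": "final fantasy was first released in japan in 1987",
    "doc1#1": "the game was later remade for other platforms",
    "doc2#0": "swahili is spoken in east africa",
}
top = bm25_rank("when was final fantasy first released", corpus, k=2)
for pid, score in top:
    print(pid, round(score, 3))
```

The output is an ordered list of passage identifiers, which is the form a system run takes before it is scored against the relevance judgments.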

Thus, our focus is monolingual retrieval across diverse languages, where the queries and the corpora are in the same language (e.g., Thai queries searching Thai documents), as opposed to cross-lingual retrieval, where the queries and the corpora are in different languages (e.g., searching a Swahili corpus with Arabic queries). As a terminological note, consistent with the parlance in information retrieval, we use the term “document” to refer generically to the unit of retrieval, even though in this case the “documents” are in actuality passages from Wikipedia.

In total, we have gathered over 700k manual relevance judgments (i.e., query–passage pairs) for around 77k queries across Wikipedia in these 18 languages, where all assessments have been performed by native speakers. Section 3 describes the annotation process in more detail. For each language, these relevance judgments are divided into a training set, a development set, and two test sets (that we call test-A and test-B); more details below. The MIRACL dataset is released under an Apache 2.0 License. To evaluate the quality of the system output, we use standard information retrieval metrics such as nDCG at a fixed cutoff and recall at a fixed cutoff.
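
The two metrics mentioned above can be sketched as follows for a single query. This is an illustrative implementation under stated assumptions: judgments are a mapping from passage identifier to a relevance grade (1 for relevant, 0 for non-relevant), unjudged passages count as non-relevant, and the default cutoffs of 10 and 100 are placeholders, since the text here only says "a fixed cutoff".

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (ranks start at 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, judgments, k=10):
    """nDCG@k for one query; unjudged passages are treated as non-relevant."""
    gains = [judgments.get(pid, 0) for pid in ranked_ids]
    ideal = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, judgments, k=100):
    """Fraction of the relevant passages that appear in the top k."""
    relevant = {pid for pid, rel in judgments.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_ids[:k])) / len(relevant)


# Hypothetical judgments and system ranking for one query.
judgments = {"p1": 1, "p2": 0, "p3": 1}
ranking = ["p2", "p1", "p4", "p3"]

print("nDCG@10:", round(ndcg_at_k(ranking, judgments, k=10), 4))
print("Recall@2:", recall_at_k(ranking, judgments, k=2))
```

In an evaluation over a full split, these per-query values would be averaged over all queries of a language; established tools such as trec_eval are the usual choice for reported numbers.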

The MIRACL dataset is built on the Mr. TyDi multilingual retrieval dataset, which is in turn built on the TyDi QA dataset. Beyond the 11 languages originally covered by Mr. TyDi and TyDi QA, we included 7 additional languages. Details below:

• Existing (Known) Languages: Mr. TyDi and TyDi QA cover 11 languages: Arabic (ar), Bengali (bn), English (en), Finnish (fi), Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili (sw), Telugu (te), and Thai (th). We take advantage of existing queries in these languages as a starting point and provide “denser”
