
Language      ISO   Train #Q / #J      Dev #Q / #J       Test-A #Q / #J   Test-B #Q / #J    # Passages    # Articles   In Mr. TyDi?
Arabic        ar    3,495 / 25,382     2,896 / 29,197    936 / 9,325      1,405 / 14,036    2,061,414     656,982      ✓
Bengali       bn    1,631 / 16,754     411 / 4,206       102 / 1,037      1,130 / 11,286    297,265       63,762       ✓
English       en    2,863 / 29,416     799 / 8,350       734 / 5,617      1,790 / 18,241    32,893,221    5,758,285    ✓
Spanish       es    2,162 / 21,531     648 / 6,443       – / –            1,515 / 15,074    10,373,953    1,669,181    ×
Persian       fa    2,107 / 21,844     632 / 6,571       – / –            1,476 / 15,313    2,207,172     857,827      ×
Finnish       fi    2,897 / 20,350     1,271 / 12,008    1,060 / 10,586   711 / 7,100       1,883,509     447,815      ✓
French        fr    1,143 / 11,426     343 / 3,429       – / –            801 / 8,008       14,636,953    2,325,608    ×
Hindi         hi    1,169 / 11,668     350 / 3,494       – / –            819 / 8,169       506,264       148,107      ×
Indonesian    id    4,071 / 41,358     960 / 9,668       731 / 7,430      611 / 6,098       1,446,315     446,330      ✓
Japanese      ja    3,477 / 34,387     860 / 8,354       650 / 6,922      1,141 / 11,410    6,953,614     1,133,444    ✓
Korean        ko    868 / 12,767       213 / 3,057       263 / 3,855      1,417 / 14,161    1,486,752     437,373      ✓
Russian       ru    4,683 / 33,921     1,252 / 13,100    911 / 8,777      718 / 7,174       9,543,918     1,476,045    ✓
Swahili       sw    1,901 / 9,359      482 / 5,092       638 / 6,615      465 / 4,620       131,924       47,793       ✓
Telugu        te    3,452 / 18,608     828 / 1,606       594 / 5,948      793 / 7,920       518,079       66,353       ✓
Thai          th    2,972 / 21,293     733 / 7,573       992 / 10,432     650 / 6,493       542,166       128,179      ✓
Chinese       zh    1,312 / 13,113     393 / 3,928       – / –            920 / 9,196       4,934,368     1,246,389    ×
Total               40,203 / 343,177   13,071 / 126,076  7,611 / 76,544   16,362 / 164,299  90,416,887    16,909,473
Surprise Language 1 ? / ?              ? / ?             ? / ?            ? / ?             ?             ?            ×
Surprise Language 2 ? / ?              ? / ?             ? / ?            ? / ?             ?             ?            ×
Table 1: Descriptive statistics for each language/split combination in MIRACL: # Q denotes the number of queries; # J denotes
the number of judgments (both relevant and non-relevant). Statistics of each Wikipedia corpus are also provided: # Passages
denotes the number of passages in each language; # Articles denotes the number of Wikipedia articles from which the passages
were drawn. The final column indicates whether the language is contained in Mr. TyDi. MIRACL encompasses 18 languages in
total: 16 are known, and the identities of the 2 “surprise” languages will be revealed in the future.
common evaluation methodology, a leaderboard, and a venue for a
competition-style event with prizes. To provide starting points that
the community can rapidly build on, we also share reproducible
BM25, mDPR, and hybrid baselines as part of our Pyserini [14] and
Anserini [27] toolkits.
This paper focuses on providing a descriptive overview of the
MIRACL dataset to coincide with our initial data release. It is our
intention to periodically update this document with additional
details about the WSDM 2023 Cup challenge as well as the broader
MIRACL project over time.
2 DATASET OVERVIEW
MIRACL (Multilingual Information Retrieval Across a Continuum
of Languages) is a multilingual retrieval dataset that spans 18 different
languages (see Table 1). More precisely, the task we model is the
standard ad hoc retrieval task as defined by the information retrieval
community: given a corpus C, the system's task is to return for a
given query 𝑞 an ordered list of the top-𝑘 documents from C that
maximizes some standard metric of quality such as nDCG.
In our formulation, a query 𝑞 is a well-formed natural language
question in some language L𝑛 (one of the 18), and the documents
are drawn from a snapshot of Wikipedia in the same language, C𝑛,
which has been pre-segmented into passages (and thus each passage
has a fixed unique identifier).
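As a toy illustration of this formulation (and not one of the MIRACL baselines), the sketch below ranks passages by simple query-term overlap, a stand-in for BM25 or a dense retriever; the miniature corpus, its passage identifiers, and the scoring function are all invented for the example:

```python
def search(corpus, query, k):
    """Return the top-k passage ids from `corpus`, ranked by a toy
    term-overlap score (a stand-in for BM25 or a dense retriever).
    Ties are broken by passage id for determinism."""
    q_terms = set(query.lower().split())
    scores = {
        pid: len(q_terms & set(text.lower().split()))
        for pid, text in corpus.items()
    }
    return sorted(scores, key=lambda pid: (-scores[pid], pid))[:k]

# A miniature "corpus" of pre-segmented passages with fixed identifiers.
corpus = {
    "doc1#0": "the capital of finland is helsinki",
    "doc1#1": "helsinki lies on the gulf of finland",
    "doc2#0": "swahili is spoken in east africa",
}

top2 = search(corpus, "what is the capital of finland", 2)
```

In the real task, 𝑞 would be a natural language question in language L𝑛 and the corpus a Wikipedia snapshot in that same language; only the ranking interface (query in, ordered list of passage ids out) is faithful here.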
Thus, our focus is monolingual retrieval across diverse languages,
where the queries and the corpora are in the same language (e.g.,
Thai queries searching Thai documents), as opposed to cross-lingual
retrieval, where the queries and the corpora are in different languages
(e.g., searching a Swahili corpus with Arabic queries). As a
terminological note, consistent with the parlance in information
retrieval, we use the term “document” to refer generically to the
unit of retrieval, even though in this case the “documents” are in
actuality passages from Wikipedia.
In total, we have gathered over 700k manual relevance judgments
(i.e., query–passage pairs) for around 77k queries across Wikipedia
in these 18 languages, where all assessments have been performed
by native speakers. Section 3 describes the annotation process in
more detail. For each language, these relevance judgments are di-
vided into a training set, a development set, and two test sets (that
we call test-A and test-B); more details below. The MIRACL dataset
is released under an Apache 2.0 License. To evaluate the quality of
the system output, we use standard information retrieval metrics
such as nDCG at a fixed cutoff and recall at a fixed cutoff.
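For concreteness, a minimal implementation of these two metrics is sketched below (the function names and the toy inputs are our own; graded relevance judgments are passed as lists of gain values):

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, all_rels, k):
    """nDCG@k: DCG of the system ranking, normalized by the DCG of
    an ideal ranking built from all judged relevance grades."""
    idcg = dcg_at_k(sorted(all_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages retrieved in the top k."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)
```

For example, a ranking that places one of three relevant passages outside the top-𝑘 cutoff scores 2/3 on recall at that cutoff, while nDCG additionally rewards placing the relevant passages earlier in the list.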
The MIRACL dataset is built on the Mr. TyDi multilingual re-
trieval dataset [28], which is in turn built on the TyDi QA dataset [4].
Beyond the 11 languages originally covered by Mr. TyDi and TyDi
QA, we included 7 additional languages. Details below:
• Existing (Known) Languages: Mr. TyDi and TyDi QA cover
11 languages: Arabic (ar), Bengali (bn), English (en), Finnish (fi),
Indonesian (id), Japanese (ja), Korean (ko), Russian (ru), Swahili
(sw), Telugu (te), and Thai (th). We take advantage of existing
queries in these languages as a starting point and provide “denser”