
Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers
Odunayo Ogundepo, Xinyu Zhang, and Jimmy Lin
David R. Cheriton School of Computer Science
University of Waterloo
Abstract
Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique properties, the design of the tokenization algorithm is usually language-specific and requires at least some linguistic knowledge. However, only a handful of the 7000+ languages on the planet benefit from specialized, custom-built tokenization algorithms, while the other languages are stuck with a “default” whitespace tokenizer, which cannot capture the intricacies of different languages. To address this challenge, we propose a different approach to tokenization for lexical matching retrieval algorithms (e.g., BM25): using the WordPiece tokenizer, which can be built automatically from unsupervised data. We test the approach on 11 typologically diverse languages in the Mr. TyDi collection: results show that the mBERT tokenizer provides strong relevance signals for retrieval “out of the box”, outperforming whitespace tokenization on most languages. In many cases, our approach also improves retrieval effectiveness when combined with existing custom-built tokenizers.
1 Introduction
A fundamental assumption in information retrieval (IR) is the existence of some mechanism that converts documents into sequences of tokens, typically referred to as tokenization. These tokens comprise the index terms that are used to compute query–document scores when matching search queries to relevant documents in lexical matching techniques such as BM25 (Robertson and Zaragoza, 2009). Some of the operations involved in tokenization for the purposes of IR include case folding, normalization, stemming, lemmatization, stopword removal, etc. The algorithms used to perform these operations do not generalize across languages because each language has its own unique features and differs from the others in its lexical, semantic, and morphological complexity. While there has been work on data-driven and machine-learned techniques, for example for stemming (Majumder et al., 2007; Hadni et al., 2012; Jonker et al., 2020), for the most part researchers and practitioners have converged on relatively simple and lightweight tokenization pipelines. For example, in English, the Porter stemmer is widely used, and many systems share common stopword lists.
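To make these operations concrete, the following sketch (not taken from the paper; it assumes Python with the NLTK library installed and its English stopword list downloaded via nltk.download("stopwords")) illustrates a typical lightweight English analysis chain of case folding, stopword removal, and Porter stemming:

# Minimal sketch of a conventional English analysis chain; names and the
# regex-based segmentation are illustrative choices, not the paper's code.
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def analyze(text: str) -> list[str]:
    """Convert raw text into normalized index terms."""
    tokens = re.findall(r"\w+", text.lower())           # case folding + simple segmentation
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [stemmer.stem(t) for t in tokens]            # stemming

print(analyze("The retrieved documents were ranked by relevance."))
# e.g., ['retriev', 'document', 'rank', 'relev']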
This paper tackles information retrieval in low-resource languages that lack even the most basic language-specific tokenizer. In this case, the “default” and usually the only option would be to simply segment strings into tokens using whitespace. This is obviously suboptimal: whitespace-delimited tokens fail to normalize minor morphological variations that are immaterial from the perspective of search; typically, stemming algorithms would perform this normalization.
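As a hypothetical illustration (not an example from the paper), whitespace tokenization keeps related surface forms as distinct index terms, so a query and a document about the same topic may share no terms at all:

# Hypothetical illustration: with whitespace tokenization, morphological
# variants remain distinct index terms, so there is no lexical overlap here.
doc_terms = set("The system retrieves relevant documents".lower().split())
query_terms = set("document retrieval".lower().split())
print(doc_terms & query_terms)  # set() -- no matching terms despite related content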
How common is this scenario? One way to characterize the extent of this challenge is to count the number of language-specific tokenizers (called “analyzers”¹) in the Lucene open-source search library, which underlies search platforms such as Elasticsearch, OpenSearch, and Solr. As of Lucene 9.3.0, the library provides 42 different language-specific analyzers,² which cover only a tiny fraction of the commonly cited figure of 7000+ languages that exist on this planet. It is clear that for most languages, language-specific analyzers don’t even exist.
Subword algorithms are actively studied in the context of pretrained language models to alleviate the out-of-vocabulary issue in NLP model training. Representatives include WordPiece (Wu et al., 2016) and SentencePiece (Kudo and Richardson, 2018).
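As a rough sketch of the alternative explored in this paper (assuming the Hugging Face transformers library, which the text does not prescribe), the pretrained mBERT WordPiece tokenizer can be applied to raw text “out of the box” and contrasted with whitespace splitting:

# Sketch (assumes the `transformers` package): tokenize text with the
# mBERT WordPiece tokenizer and compare against whitespace tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "Tokenization directly impacts retrieval effectiveness."
print(text.split())              # whitespace tokens
print(tokenizer.tokenize(text))  # WordPiece subword tokens (exact pieces depend on the vocabulary)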
¹ In the remainder of this paper, we use analyzer to refer to current human-designed tokenization approaches that are language-specific and heuristic-based, to distinguish from the WordPiece tokenizer discussed throughout this paper.
² https://lucene.apache.org/core/9_3_0/analysis/common/index.html