Better Than Whitespace: Information Retrieval for Languages
without Custom Tokenizers
Odunayo Ogundepo, Xinyu Zhang, and Jimmy Lin
David R. Cheriton School of Computer Science
University of Waterloo
Abstract
Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique properties, the design of the tokenization algorithm is usually language-specific and requires at least some linguistic knowledge. However, only a handful of the 7000+ languages on the planet benefit from specialized, custom-built tokenization algorithms, while the other languages are stuck with a “default” whitespace tokenizer, which cannot capture the intricacies of different languages. To address this challenge, we propose a different approach to tokenization for lexical matching retrieval algorithms (e.g., BM25): using the WordPiece tokenizer, which can be built automatically from unsupervised data. We test the approach on 11 typologically diverse languages in the Mr. TyDi collection: results show that the mBERT tokenizer provides strong relevance signals for retrieval “out of the box”, outperforming whitespace tokenization on most languages. In many cases, our approach also improves retrieval effectiveness when combined with existing custom-built tokenizers.
1 Introduction
A fundamental assumption in information retrieval (IR) is the existence of some mechanism that converts documents into sequences of tokens, typically referred to as tokenization. These tokens comprise the index terms that are used to compute query–document scores when matching search queries to relevant documents in lexical matching techniques such as BM25 (Robertson and Zaragoza, 2009). Some of the operations involved in tokenization for the purposes of IR include case folding, normalization, stemming, lemmatization, stopword removal, etc. The algorithms used to perform these operations do not generalize across languages, because each language has its own unique features and differs from the others in terms of its lexical, semantic, and morphological complexity. While there has been work on data-driven and machine-learned techniques, for example for stemming (Majumder et al., 2007; Hadni et al., 2012; Jonker et al., 2020), for the most part researchers and practitioners have converged on relatively simple and lightweight tokenization pipelines. For example, in English, the Porter stemmer is widely used, and many systems share stopword lists.
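
As a concrete illustration, the sketch below chains case folding, stopword removal, and Porter stemming into a lightweight English analysis pipeline of the kind described above. It assumes the NLTK library (with its stopwords corpus downloaded) and is meant only as an example, not the exact configuration of any particular system.

```python
# A minimal sketch of a lightweight English analysis chain: case folding,
# stopword removal, Porter stemming. Assumes NLTK is installed and the
# 'stopwords' corpus has been fetched via nltk.download("stopwords").
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))

def analyze(text):
    tokens = text.lower().split()                        # case folding + whitespace split
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [STEMMER.stem(t) for t in tokens]             # stemming

print(analyze("The quality of indexable tokens impacts retrieval effectiveness"))
# e.g., ['qualiti', 'index', 'token', 'impact', 'retriev', 'effect']
```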
This paper tackles information retrieval in low-resource languages that lack even the most basic language-specific tokenizer. In this case, the “default” and usually the only option would be to simply segment strings into tokens using whitespace. This is obviously suboptimal, as whitespace-delimited tokens do not capture minor morphological variations that are immaterial from the perspective of search; typically, stemming algorithms would perform this normalization.
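
To make the limitation concrete, the toy example below (plain Python, illustrative only) shows how whitespace tokenization fails to match near-identical morphological variants that a stemmer would conflate into the same index term.

```python
# Illustrative only: under whitespace tokenization, morphological variants
# of the same word become entirely different index terms, so they never
# match at query time.
query = "retrieval of tokenized documents"
document = "tokenizing a document for retrieval"

query_terms = set(query.lower().split())
doc_terms = set(document.lower().split())

# Only the exact string "retrieval" overlaps; "tokenized"/"tokenizing" and
# "documents"/"document" contribute no matching terms, even though a
# stemmer would normalize each pair to a shared stem.
print(query_terms & doc_terms)  # {'retrieval'}
```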
How common is this scenario? One way to characterize the extent of this challenge is to count the number of language-specific tokenizers (called “analyzers”¹) in the Lucene open-source search library, which underlies search platforms such as Elasticsearch, OpenSearch, and Solr. As of Lucene 9.3.0, the library provides 42 different language-specific analyzers,² which cover only a tiny fraction of the commonly cited figure of 7000+ languages that exist on this planet. It is clear that for most languages, language-specific analyzers don’t even exist.
Subword algorithms are actively studied in the context of pretrained language models to alleviate the out-of-vocabulary issue in NLP model training. Representatives include WordPiece (Wu et al., 2016) and SentencePiece (Kudo and Richardson, 2018).
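
As a sketch of how such a subword tokenizer can be repurposed to produce index terms for lexical matching (the approach this paper explores with the mBERT tokenizer), the snippet below tokenizes text with the mBERT WordPiece vocabulary. The use of the Hugging Face transformers library is an assumption of this illustration, and the downstream indexing and BM25 scoring backend is not shown.

```python
# A minimal sketch of producing index terms with the mBERT WordPiece
# tokenizer. Assumes the Hugging Face transformers library; the BM25
# indexing backend (e.g., Lucene) is omitted here.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def wordpiece_terms(text):
    # tokenize() returns subword strings such as 'token' and '##izer';
    # these strings can be indexed directly as lexical matching terms.
    return tokenizer.tokenize(text)

print(wordpiece_terms("Information retrieval without custom tokenizers"))
```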
¹ In the remainder of this paper, we use analyzer to refer to current human-designed tokenization approaches that are language-specific and heuristic-based, to distinguish from the WordPiece tokenizer discussed throughout this paper.
² https://lucene.apache.org/core/9_3_0/analysis/common/index.html