
Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers
Odunayo Ogundepo, Xinyu Zhang, and Jimmy Lin
David R. Cheriton School of Computer Science
University of Waterloo
Abstract
Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique properties, the design of the tokenization algorithm is usually language-specific and requires at least some linguistic knowledge. However, only a handful of the 7000+ languages on the planet benefit from specialized, custom-built tokenization algorithms, while the other languages are stuck with a “default” whitespace tokenizer, which cannot capture the intricacies of different languages. To address this challenge, we propose a different approach to tokenization for lexical matching retrieval algorithms (e.g., BM25): using the WordPiece tokenizer, which can be built automatically from unsupervised data. We test the approach on 11 typologically diverse languages in the Mr. TyDi collection: results show that the mBERT tokenizer provides strong relevance signals for retrieval “out of the box”, outperforming whitespace tokenization on most languages. In many cases, our approach also improves retrieval effectiveness when combined with existing custom-built tokenizers.
1 Introduction
A fundamental assumption in information retrieval (IR) is the existence of some mechanism that converts documents into sequences of tokens, typically referred to as tokenization. These tokens comprise the index terms that are used to compute query–document scores when matching search queries to relevant documents in lexical matching techniques such as BM25 (Robertson and Zaragoza, 2009). Some of the operations involved in tokenization for the purposes of IR include case folding, normalization, stemming, lemmatization, stopword removal, etc. The algorithms used to perform these operations do not generalize across languages because each language has its own unique features and differs from the others in its lexical, semantic, and morphological complexity. While there has been work on data-driven and machine-learned techniques, for example for stemming (Majumder et al., 2007; Hadni et al., 2012; Jonker et al., 2020), for the most part researchers and practitioners have converged on relatively simple and lightweight tokenization pipelines. For example, in English, the Porter stemmer is widely used, and many systems share common stopword lists.
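To make these operations concrete, the following sketch (not taken from the paper; it assumes Python with the NLTK library installed and its English stopword list downloaded via nltk.download("stopwords")) illustrates a typical lightweight English analysis chain of case folding, stopword removal, and Porter stemming:

# Minimal sketch of a conventional English analysis chain; names and the
# regex-based segmentation are illustrative choices, not the paper's code.
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def analyze(text: str) -> list[str]:
    """Convert raw text into normalized index terms."""
    tokens = re.findall(r"\w+", text.lower())           # case folding + simple segmentation
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [stemmer.stem(t) for t in tokens]            # stemming

print(analyze("The retrieved documents were ranked by relevance."))
# e.g., ['retriev', 'document', 'rank', 'relev']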
This paper tackles information retrieval in low-resource languages that lack even the most basic language-specific tokenizer. In this case, the “default” and usually the only option would be to simply segment strings into tokens using whitespace. This is obviously suboptimal: whitespace-delimited tokens fail to normalize minor morphological variations that are immaterial from the perspective of search; typically, stemming algorithms would perform this normalization.
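As a hypothetical illustration (not an example from the paper), whitespace tokenization keeps related surface forms as distinct index terms, so a query and a document about the same topic may share no terms at all:

# Hypothetical illustration: with whitespace tokenization, morphological
# variants remain distinct index terms, so there is no lexical overlap here.
doc_terms = set("The system retrieves relevant documents".lower().split())
query_terms = set("document retrieval".lower().split())
print(doc_terms & query_terms)  # set() -- no matching terms despite related content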
How common is this scenario? One way to characterize the extent of this challenge is to count the number of language-specific tokenizers (called “analyzers”¹) in the Lucene open-source search library, which underlies search platforms such as Elasticsearch, OpenSearch, and Solr. As of Lucene 9.3.0, the library provides 42 different language-specific analyzers,² which cover only a tiny fraction of the commonly cited figure of 7000+ languages that exist on this planet. It is clear that for most languages, language-specific analyzers don’t even exist.
Subword algorithms are actively studied in the context of pretrained language models to alleviate the out-of-vocabulary issue in NLP model training. Representatives include WordPiece (Wu et al., 2016) and SentencePiece (Kudo and Richardson, 2018).
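As a rough sketch of the alternative explored in this paper (assuming the Hugging Face transformers library, which the text does not prescribe), the pretrained mBERT WordPiece tokenizer can be applied to raw text “out of the box” and contrasted with whitespace splitting:

# Sketch (assumes the `transformers` package): tokenize text with the
# mBERT WordPiece tokenizer and compare against whitespace tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "Tokenization directly impacts retrieval effectiveness."
print(text.split())              # whitespace tokens
print(tokenizer.tokenize(text))  # WordPiece subword tokens (exact pieces depend on the vocabulary)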
¹ In the remainder of this paper, we use analyzer to refer to current human-designed tokenization approaches that are language-specific and heuristic-based, to distinguish from the WordPiece tokenizer discussed throughout this paper.
² https://lucene.apache.org/core/9_3_0/analysis/common/index.html