Unsupervised Search Algorithm Configuration using Query Performance Prediction

Haggai Roitman*
eBay Research, Netanya, Israel
Abstract
Search engine configuration can be quite difficult for inexpert developers. Instead, an auto-configuration approach can be used to speed up development time. Yet, such an automatic process usually requires relevance labels to train a supervised model. In this work, we suggest a simple solution based on query performance prediction that requires no relevance labels but only a sample of queries in a given domain. Using two example use cases we demonstrate the merits of our solution.
Keywords
Search engine configuration, Query performance prediction, Evaluation
1. Introduction
Search engine solutions such as the (Apache) Lucene¹ library and Elasticsearch² have become a ubiquitous utility for developers of various discovery and data mining applications. Yet, configuring such solutions for a specific application in mind can be quite challenging and time-consuming. Commonly, inexpert developers may find it hard to determine which configuration would best fit their needs. Therefore, many such developers usually prefer to use the "out-of-the-box" default configuration (e.g., BM25 similarity in Elasticsearch).

Nowadays, contemporary search engine solutions offer a variety of alternative configurations that can be utilized to improve search effectiveness. As a motivating use case (which is also studied later on in our evaluation), the Divergence From Randomness [1] (DFR) similarity, one of the many similarities implemented within the Lucene search library, has more than 100 different configurations. Therefore, even for a single search algorithm option, it is highly difficult for developers to determine in advance which DFR configuration is best for their application.

Configuring the search algorithm of a search engine so as to optimize its search effectiveness for a given application usually requires some sort of domain expertise. In addition, choosing the "best" configuration usually requires enough relevance labels to train a supervised auto-configuration model [2], which are not always at developers' disposal.
In this work, we propose an unsupervised automatic search engine search algorithm configuration solution based on query performance prediction [3] (QPP). We formally define the
*Work was done while the author was at IBM Research.
hroitman@ebay.com (H. Roitman)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
¹ https://lucene.apache.org/
² https://www.elastic.co/elasticsearch/
arXiv:2210.00767v1 [cs.IR] 3 Oct 2022
search algorithm conguration task as a utility maximization task, where the goal is to maximize
the relevance of documents retrieved from the search engine for a given set of “representative
user-queries in the domain of interest. To estimate the expected relevance that may be obtained
by a given candidate (search algorithm) conguration, we propose a simple, yet highly eective,
extension to the probabilistic QPP framework [4].
Our solution only requires a sample of queries in the domain (e.g., query log) and a set
of candidate search algorithm congurations to choose from; while no other input such as
relevance labels is required. As a proof of concept, we evaluate our proposed solution on two
common search algorithm conguration usecases in Lucene, namely: Similarity model selection
and Similarity parameter auto-tuning.
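To give a feel for the two use cases, candidate configurations can be enumerated as a simple set; the following sketch is purely illustrative (the model names and parameter grids are our assumptions, not the exact candidate set evaluated in this work):

```python
# Illustrative enumeration of candidate search-algorithm configurations
# for the two use cases: (1) similarity model selection and
# (2) similarity parameter auto-tuning. The models and grids below are
# assumptions for illustration only.
from itertools import product

# Use case 1: pick one of several Lucene-style similarity model families.
similarity_models = ["BM25", "LMDirichlet", "LMJelinekMercer", "DFR"]

# Use case 2: tune parameters within one family, e.g., a BM25 (k1, b) grid.
bm25_grid = [
    {"model": "BM25", "k1": k1, "b": b}
    for k1, b in product([0.6, 0.9, 1.2, 1.5], [0.3, 0.5, 0.75, 0.9])
]

candidate_configs = [{"model": m} for m in similarity_models] + bm25_grid
print(len(candidate_configs))  # 4 model choices + 16 BM25 settings = 20
```

Even this small toy grid yields 20 candidates, which illustrates why exhaustive manual trial (let alone over DFR's 100+ configurations) is impractical for inexpert developers.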
2. Automatic Configuration Solution
We now describe our automatic-configuration solution. To this end, we assume a corpus of documents $C$ which is searchable through a given search engine interface (e.g., Lucene library or Elasticsearch service). To derive an "optimal" search algorithm configuration over $C$, all we require is a sample of queries $Q$ (hereinafter termed the "query benchmark") and a set of (search algorithm) configurations $\mathcal{S} = \{S_1, \ldots, S_m\}$ to be evaluated. Our goal is, therefore, to find a configuration $S \in \mathcal{S}$ that would "maximize" the search engine's effectiveness. For each query $q \in Q$, the retrieval quality is expected to be measured relative to the relevance of documents in $C$ retrieved (and ranked) according to $S$. Yet, in this work, we assume that relevance labels are unavailable a priori; instead, we aim to estimate such relevance using a QPP approach.
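The overall procedure can be sketched as an argmax loop over the candidate set. In the sketch below, `search` and `predict_quality` are hypothetical placeholders for the engine's retrieval interface and a QPP-based quality estimator; they are not part of the paper's notation:

```python
# Sketch of the selection procedure: pick the configuration S in the
# candidate set that maximizes the estimated benchmark effectiveness.
# `search` and `predict_quality` are hypothetical stand-ins for the real
# search engine interface and the QPP estimator.
def select_configuration(corpus, query_benchmark, candidate_configs,
                         search, predict_quality):
    best_config, best_score = None, float("-inf")
    for config in candidate_configs:
        # Average predicted per-query quality over the benchmark Q.
        score = sum(
            predict_quality(q, search(corpus, q, config))
            for q in query_benchmark
        ) / len(query_benchmark)
        if score > best_score:
            best_config, best_score = config, score
    return best_config

# Toy usage with stub functions standing in for a real engine/predictor:
corpus = ["d1", "d2", "d3"]
queries = ["q1", "q2"]
search = lambda c, q, cfg: [(doc, cfg) for doc in c]  # fake result list
predict = lambda q, results: {"A": 0.2, "B": 0.7}[results[0][1]]
print(select_configuration(corpus, queries, ["A", "B"], search, predict))
```

Note that the loop requires only the query sample and the candidate set, matching the unsupervised setting: no relevance labels appear anywhere in the procedure.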
Our solution builds on top of the probabilistic QPP framework [4], introducing a simple, yet highly effective, extension in which, instead of considering a single query $q \in Q$ whose effectiveness is to be predicted, we wish to derive a prediction for the whole benchmark $Q$. Formally, let $r$ denote the relevance event. For a given configuration $S$, our goal is to estimate³ $p(r|Q, S, C)$: the relevance likelihood given that we retrieve documents from $C$ for each query in $Q$ using $S$.
We now derive $p(r|Q, S, C)$ by first conditioning over the various queries $q$ in the benchmark:

$$p(r|Q, S, C) \stackrel{def}{=} \sum_{q \in Q} p(r|q, S, C)\, p(q|Q), \qquad (1)$$
where $p(r|q, S, C)$ denotes the relevance likelihood for a specific query $q \in Q$ and $p(q|Q)$ denotes the likelihood of observing query $q$. We next estimate these two likelihood terms.
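Once the per-query estimates are in hand, Eq. (1) is a direct weighted sum. The following minimal sketch assumes a uniform query likelihood $p(q|Q) = 1/|Q|$, which is one natural (illustrative) choice when all that is given is a raw query sample:

```python
# Direct computation of Eq. (1): p(r|Q,S,C) = sum_q p(r|q,S,C) * p(q|Q),
# assuming a uniform query likelihood p(q|Q) = 1/|Q| (illustrative choice).
def benchmark_relevance(per_query_relevance):
    # per_query_relevance: dict mapping each benchmark query q to its
    # estimated relevance likelihood p(r|q,S,C).
    n = len(per_query_relevance)
    return sum(per_query_relevance.values()) / n  # uniform p(q|Q) = 1/n

# Hypothetical per-query estimates for one candidate configuration S:
estimates = {"laptop bag": 0.8, "usb c cable": 0.6, "phone case": 0.4}
print(round(benchmark_relevance(estimates), 2))  # -> 0.6
```

A non-uniform $p(q|Q)$ (e.g., query-log frequencies) slots in by replacing the uniform weight with a per-query weight in the sum.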
2.1. Estimating $p(r|q, S, C)$
Estimating the performance of a specific configuration $S$ can be done by evaluating that configuration over $C$ using query $q$ and obtaining the response (result-list) $D_S^q \subseteq C$. Hence, in this work, we simply estimate $\hat{p}(r|q, S, C) \stackrel{def}{=} p(r|q, D_S^q, C)$. To estimate $p(r|q, D_S^q, C)$, we first note that: $p(r|q, D_S^q, C) \propto p(r|q, D_S^q)\, p(r|D_S^q, C)$. Here we further note that, while the left term $p(r|q, D_S^q)$ aims to capture the query-dependent quality effects of documents in $D_S^q$,
³We denote by $p(\cdot)$ a likelihood term and by $\hat{p}(\cdot)$ its corresponding estimate, which is not always normalized.