Evaluating Search System Explainability with Psychometrics and Crowdsourcing

Evaluating Search System Explainability with Psychometrics and
Crowdsourcing
Catherine Chen
catherine_s_chen@brown.edu
Brown University
Providence, Rhode Island, USA
Carsten Eickhoff
carsten.eickhoff@uni-tuebingen.de
University of Tübingen
Tübingen, Germany
ABSTRACT
As information retrieval (IR) systems, such as search engines and
conversational agents, become ubiquitous in various domains, the
need for transparent and explainable systems grows to ensure ac-
countability, fairness, and unbiased results. Despite recent advances
in explainable AI and IR techniques, there is no consensus on the
denition of explainability. Existing approaches often treat it as
a singular notion, disregarding the multidimensional denition
postulated in the literature. In this paper, we use psychometrics
and crowdsourcing to identify human-centered factors of explain-
ability in Web search systems and introduce SSE (Search System
Explainability), an evaluation metric for explainable IR (XIR) search
systems. In a crowdsourced user study, we demonstrate SSE’s ability
to distinguish between explainable and non-explainable systems,
showing that systems with higher scores indeed indicate greater in-
terpretability. We hope that aside from these concrete contributions
to XIR, this line of work will serve as a blueprint for similar explain-
ability evaluation efforts in other domains of machine learning and
natural language processing.
CCS CONCEPTS
• Information systems → Evaluation of retrieval results;
Search interfaces; Web search engines.
KEYWORDS
explainability; search; crowdsourcing; psychometrics
ACM Reference Format:
Catherine Chen and Carsten Eickhoff. 2024. Evaluating Search System
Explainability with Psychometrics and Crowdsourcing. In Proceedings of
the 47th International ACM SIGIR Conference on Research and Development
in Information Retrieval (SIGIR ’24), July 14–18, 2024, Washington, DC, USA.
ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3626772.3657796
1 INTRODUCTION
Explainable information retrieval (XIR) research aims to develop
methods that increase the transparency and reliability of infor-
mation retrieval systems. XIR systems are designed to provide
end-users with a deeper understanding of the rationale underlying
ranking decisions. Besides casual Web search, these systems hold
promising potential for impactful real-world information needs,
such as matching patients to clinical trials, retrieving case law for
legal research, and detecting misinformation in news and media
analysis. Despite several advancements in their development, there
is a lack of empirical, standardized techniques for evaluating the
ecacy of XIR systems.
Current approaches toward evaluating XIR systems are limited
by a lack of consensus in the broader explainable artificial intelli-
gence (XAI) community on a definition of explainability. Explain-
ability has often been treated as a monolithic concept although
recent literature suggests it to be an amalgamation of several sub-
factors [19, 38, 42]. As a result, evaluation has occurred on a binary
scale, considering systems as either explainable or black-box, hin-
dering direct system comparison. Furthermore, explainability is
often declared with anecdotal evidence rather than measured quan-
titatively. To address these shortcomings, we (1) identify individual
factors of explainability and integrate them into a multidimensional
model, and (2) provide a continuous-scale evaluation metric for ex-
plainable search systems.
Inspired by previous work on multidimensional relevance mod-
eling [64], we leverage psychometrics [21] and crowdsourcing to do
so. Psychometrics is a well-established field in psychology used to
develop measurement models for cognitive constructs that cannot
be directly measured. Our approach involves several phases. First,
we identified an exhaustive list of well-discussed explainability as-
pects in the community to quantitatively test. Next, we designed a
user study to confirm a multidimensional model, utilizing crowd-
sourcing as a data-driven and efficient means of collecting diverse
results from laypeople, consistent with the assumption of Web
search not requiring domain-specific knowledge. Finally, we used
the outcomes from our crowdsourced study to establish a metric
via structural equation modeling.
This paper empowers users to understand the search systems that
cater to their daily information needs in an environment potentially
fraught with biases and misinformation. Specifically, we contribute
the following:
• Leverage psychometrics and crowdsourcing to test well-
discussed aspects of explainable Web search systems from
the literature
• Introduce SSE¹, a quantitative evaluation metric for measur-
ing Search System Explainability on a continuous scale
• Conduct a crowdsourced user study to validate SSE, gain
practical insights into implementing human-centered evalu-
ation tools, and assess the impact such tools have on human
annotators
¹ Pronounced ‘es-es-e’.
The remainder of this paper is structured as follows: Section 2
presents background and related work on psychometric studies,
the multidimensionality of explainability, and previous attempts
to evaluate explainable systems. Section 3 outlines steps taken to
develop our measuring instrument and crowdsourcing task setup.
Section 4 presents the results of our data collection and model cre-
ation eorts. Section 5 proposes SSE and examines its eectiveness
in evaluating search system explainability. Finally, Section 6 ana-
lyzes the dimensions of explainability users found important and
discusses implications for future XIR system design and evaluation.
Section 7 concludes with an overview of future work.
2 RELATED WORK
2.1 Psychometrics, SEM, and Crowdsourcing
Psychometrics uses Structural Equation Modeling (SEM) to con-
struct models from observed data by measuring the presence of
latent variables (factors) through observed variables (questionnaire
items) [21, 58]. SEM consists of two parts: (1) Exploratory Factor
Analysis (EFA) to produce a hypothesized model structure and (2)
Confirmatory Factor Analysis (CFA) to confirm the EFA-derived
model fit on a held-out dataset. EFA identifies the number of latent
factors and which items load on the discovered dimensions using
a statistical technique that iteratively groups and prunes items to
reach a high-quality estimate of covariance in the observed data set.
CFA re-estimates model parameters using maximum likelihood on
a held-out set of observed data and assesses model fit via statistical
significance testing of multiple alternative models.
Since SEM requires large amounts of user response data, crowd-
sourcing is often used for data collection due to its convenience and
eciency in quickly recruiting a large number of participants. How-
ever, ensuring data quality in crowdsourcing is challenging, since
the payout may be the main motivator for workers to complete
tasks and platforms become more saturated with low-quality work-
ers. To mitigate these issues, preventative measures can be taken
by setting high worker qualications, enabling rigorous quality
control checks, and post-processing data for inattentive responses
to verify quality work [
5
,
23
,
32
,
41
]. We describe the quality control
checks we employ in our study in Section 3.
2.2 Evaluation of Explainable Systems
Explainability is still often considered to be a binary concept despite
recent literature that suggests that it may be best measured as a
combination of several factors [19, 38, 42]. Lipton [38] and Doshi-
Velez and Kim [18] suggest that explainability is (1) ill-defined with
no consensus and (2) an amalgamation of several factors rather
than a monolithic concept. Specifically, both recognize the need to
ground explainability in the context of certain desiderata, such
as trustworthiness or causality. Nauta et al. [42] additionally identify
12 such conceptual properties for the systematic, multidimensional
evaluation of explainability.
Due to the lack of consensus and recognition of explainability as
a multi-faceted concept, evaluation falls short in two ways. Firstly,
current methods do not consider a multidimensional definition.
Authors of frameworks such as LIME and SHAP demonstrate the
utility of their methods through user evaluations but fail to quan-
tify the degree of explainability provided by their methods [39, 51].
Relying solely on declarations of explainability without measure-
ment hinders targeted system improvements, and a more holistic
definition is needed for robustness. Secondly, while there are many
ML evaluation metrics for system performance comparison, there
is no standard metric for evaluating explainability. Several authors
acknowledge the importance of quantitative evaluation metrics
[1, 15, 18, 42], and while Nguyen and Martínez [43] come close by
introducing a suite of metrics to quantify interpretability, they fail
to quantify interactions between facets and measure the importance
of each individual facet.
Existing approaches for explainability in IR often focus on a
singular aspect of explainability or lack evaluation of explanation
quality, relying on anecdotal evidence [47, 48, 53, 63]. In this paper,
we propose a data-driven approach for a more fine-grained rep-
resentation of explainability and develop a user study evaluation
instrument to create a metric that models explainability as a func-
tion of several sub-factors, enabling direct comparison between
systems and targeted improvements.
3 STUDY DESIGN
3.1 Questionnaire Design
First, to compile a list of candidate aspects that may potentially
contribute to the composite notion of explainability, we conducted
a comprehensive structured literature review and included the most
commonly discussed aspects of explainability. We included the
proceedings of ML, IR, natural language processing (NLP), and
human-computer interaction (HCI) venues (i.e., ACL, CHI, ICML,
NeurIPS, SIGIR) and noted papers for further review if titles in-
cluded the keywords interpretability, explainability, or transparency,
and cross-referenced papers using connectedpapers.com to find sim-
ilar papers, resulting in 44 papers (37 of which were published
within the last 7 years). We then read abstracts and conclusions for
this pool to retain only those papers that examined some concrete
element or aspect of explainability/interpretability, leaving us with
14 papers covering 26 unique aspects of explainability (e.g., trust-
worthiness, uncertainty, faithfulness) (full list in Table 1). Our final
number of candidate aspects is consistent with, and perhaps more
encompassing than, other survey papers such as Nauta et al. [42],
who find 12 explainability factors from the literature. Given the
flexibility of our framework, future work could easily investigate
additional aspects from broader literature.
Next, these aspects were turned into a set of concrete questions
(referred to as “items” in psychometrics) to be included in the ques-
tionnaire. We recorded responses on a 7-point Likert scale ranging
from 1 (Strongly Disagree), via 4 (Neutral), to 7 (Strongly Agree).
Our questionnaire was created using the following guidelines [20]:
(1) items should use clear language and avoid complex words, (2)
items should not be leading or presumptuous, and (3) the instrument
should include both positively and negatively keyed items.
Additionally, as explainability relies on both system and expla-
nation perception, we created items taking both into account, so
our nal evaluation would reect these desiderata. To combat fa-
tigue eects, we chose to create 2 items per aspect (one positively
and one negatively worded), for a total of 52 items presented in
fully randomized order, with the expectation that the discovery of
latent factor representations during factor analysis would establish
groupings of multiple related items. Eleven doctoral and post-doctoral
researchers reviewed our questionnaire for clarity and accuracy
given aspect definitions. From this evaluation, we were able to iden-
tify and correct potential inconsistencies before our pilot study.
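As an illustration of how such an instrument is typically scored, the snippet below reverse-codes negatively keyed items on the 7-point scale and draws a fully randomized presentation order. The file name, column names, and the "_neg" suffix convention are hypothetical and do not reflect our actual item wording.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical layout: one column per item, negatively keyed items end in "_neg".
responses = pd.read_csv("questionnaire_responses.csv")
neg_items = [c for c in responses.columns if c.endswith("_neg")]

# Reverse-score negatively keyed items on the 1-7 Likert scale (1<->7, 2<->6, ...).
responses[neg_items] = 8 - responses[neg_items]

# Present all items in a fully randomized order, drawn once per participant.
item_order = rng.permutation(responses.columns.to_numpy())
print(item_order[:5])
```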
3.2 Task Setup
Participants performed 3 search tasks distributed across 3 topics.
To motivate and guide their search, users were asked to answer a
multiple-choice question for each topic. We provided users with a
mock search engine based on current, non-transparent, commercial
search engines that displayed a query and a list of results. We hosted
our site on Netlify and displayed the task on MTurk.
Topics and questions were selected from the TREC 2004 Robust
Track Dataset [59], comprising documents from the Federal Register,
Financial Times, Foreign Broadcast Information Service, and LA
Times. Multiple choice answers were created by the authors to
ensure they were not on the first page of results and required
clicking into documents. To ensure there would be enough relevant
documents to populate the results page, we randomly sampled 9
topics that contained at least 50 relevant documents. Topics were
grouped as follows: (A) industrial espionage; income tax evasion;
in vitro fertilization, (B) radioactive waste; behavioral genetics;
drugs in the Golden Triangle, (C) law enforcement, dogs; non-US
media bias; gasoline tax in US. For each topic, we presented 100
pre-selected documents with a 50/50 random sample of relevant and
non-relevant documents in a randomized ranking order, requiring
workers to interact with the search system in order to successfully
complete the multiple-choice quiz.
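A rough sketch of this sampling procedure is shown below; the qrels file name, column names, and use of pandas are assumptions about tooling rather than a description of our actual pipeline.

```python
import random
import pandas as pd

random.seed(7)

# Hypothetical TREC-style qrels: one row per (topic, docno) with a relevance label.
qrels = pd.read_csv("robust04_qrels.csv")   # columns: topic, docno, relevance

# Keep only topics with at least 50 relevant documents, then sample 9 of them.
eligible = [t for t, g in qrels.groupby("topic") if (g["relevance"] > 0).sum() >= 50]
topics = random.sample(eligible, 9)

def build_result_list(topic: int, k: int = 100) -> list[str]:
    """50/50 sample of relevant and non-relevant docs, shuffled into a random ranking."""
    g = qrels[qrels["topic"] == topic]
    rel = g.loc[g["relevance"] > 0, "docno"].sample(k // 2).tolist()
    nonrel = g.loc[g["relevance"] == 0, "docno"].sample(k // 2).tolist()
    docs = rel + nonrel
    random.shuffle(docs)                     # randomized ranking order
    return docs
```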
Figure 1: Search interface shown to Group A (modeled on
the basis of the system presented by Ramos and Eickhoff
[50]). On the left-hand side, the stacked bar graphs depict
hypothetical scores of each keyword in the query for each
respective search result. The larger the stacked bar graph,
the more relevant that result is to the query.
Participants were randomly assigned a topic grouping. After
completing the search task, we presented each group with another
mock search interface based on existing transparent search systems
from the literature [11, 50], designed to test varying degrees of
explainability, and participants were asked to complete our ques-
tionnaire for this second system. To avoid priming eects and other
potential biases, we employed a between-subjects study design.
Group A participants received an interface that provided visual
explanation aids, with stacked bar graphs displayed next to each
search result that informed users how much each query term in-
fluenced the corresponding document ranking (Figure 1). Group
B participants received an interface that displayed relevance and
confidence scores for each result, where confidence was modeled
as a function of uncertainty (i.e., confidence and relevance percent-
ages shown next to each result). Group C participants received
the same non-transparent system they interacted with during the
screening search task. Each condition was accompanied by brief
usage instructions explaining the novel (if any) interface features.
We included multiple quality control checks to verify worker
attentiveness. Specically, we monitored site interactions (number
of clicks, documents viewed, time spent), employed a multiple-
choice quiz, and provided a unique code for the worker to submit
on MTurk for task completion verication. In addition to serving
as a form of quality control, the multiple choice quiz was used to
help guide the workers through the search task and foster a search
mindset, enhancing the accuracy of the questionnaire completion.
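The sketch below illustrates one way such checks can be combined at submission time; the interaction thresholds, field names, and completion-code format are hypothetical examples rather than the exact criteria we enforced.

```python
import secrets

def make_completion_code() -> str:
    # Unique code shown after the questionnaire; the worker pastes it into MTurk.
    return secrets.token_hex(4).upper()

def passes_quality_checks(log: dict, quiz: dict, answer_key: dict) -> bool:
    """Attentiveness check combining interaction logs with the multiple-choice quiz."""
    enough_interaction = (
        log.get("clicks", 0) >= 3            # clicked into results
        and log.get("docs_viewed", 0) >= 2   # opened at least two documents
        and log.get("seconds_on_task", 0) >= 120
    )
    quiz_correct = all(quiz.get(q) == a for q, a in answer_key.items())
    return enough_interaction and quiz_correct
```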
While users’ familiarity with topics might impact their expe-
rience during the search task, the main results of this study are
drawn from the questionnaire experience, which (1) was separate
from the search task where the topics were presented and (2) asked
users to comment on the nature of a system, not the search
task they previously performed. The questionnaire was intended
to capture the extent of perceived system explainability.
3.3 Pilot Study
We collected 62 responses from MTurk workers over a two-month
period during our pilot study. Observations during this phase influ-
enced changes in our task design. We added pop-up conrmations
to remind workers of experiment rules due to some misreading or
skipping of instructions. Additionally, we implemented an early
exit in the workow and treated the search task and multiple choice
quiz as a prerequisite for our survey to lter out workers who did
not faithfully attempt our task. Finally, we added a uniqueness con-
straint to block workers from attempting our task multiple times.
Additionally, we received feedback from workers that our initial
time limit (45 min) felt too rushed, leading us to increase the timer
to 1 hour for our full study. However, we found that most workers
spent less than the initial time limit on our task (averaging approx-
imately 30 min). We paid workers $9.20 for the original expected
work time of 45 minutes, the equivalent of the legal minimum wage
in our location. We required that workers have more than 10,000
prior approved HITs with an approval rate greater than 98%.
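Such qualification requirements can be attached to a HIT programmatically. The boto3 sketch below is an assumed setup rather than our actual task configuration: the external question URL, reward string, lifetimes, and assignment count are placeholders, and the built-in qualification type IDs are taken from MTurk's documented system qualifications.

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Built-in MTurk qualification types: number of approved HITs and approval rate.
QUALS = [
    {"QualificationTypeId": "00000000000000000040",   # NumberHITsApproved
     "Comparator": "GreaterThan", "IntegerValues": [10000]},
    {"QualificationTypeId": "000000000000000000L0",   # PercentAssignmentsApproved
     "Comparator": "GreaterThan", "IntegerValues": [98]},
]

# ExternalQuestion pointing at the hosted search task (placeholder URL).
question_xml = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.netlify.app/task</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>
"""

hit = mturk.create_hit(
    Title="Web search and questionnaire study",
    Description="Complete a short search task and questionnaire.",
    Reward="9.20",
    AssignmentDurationInSeconds=60 * 60,   # 1-hour timer used in the full study
    LifetimeInSeconds=7 * 24 * 60 * 60,
    MaxAssignments=9,
    Question=question_xml,
    QualificationRequirements=QUALS,
)
```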
4 DATA ANALYSIS
We collected a total of 540 responses from our main study (Group
A: 202, Group B: 134, Group C: 201).² We filtered out 81 responses
(15%) during our preprocessing stage to account for workers who
passed our initial quality control checks during the search task but
recorded inattentive or careless responses in the subsequent survey.
Following guidelines for identifying careless responses [5, 23, 32, 41], we analyzed response patterns and self-consistency.
Concretely, we ltered out responses that had (1) abnormally
long unbroken strings (i.e., length
>
8) of identical responses (e.g., a
respondent answering a series of 18 consecutive questions with the
same Likert-scale rating), (2) high overall numbers of inconsistent
² There is a slight imbalance despite conditions being randomly assigned, but distribu-
tions are roughly preserved across groups before and after preprocessing.
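A minimal sketch of the straight-lining filter described above is given below; the file name and column layout are hypothetical, and the self-consistency criterion is not shown.

```python
import pandas as pd

MAX_RUN = 8   # longest permissible run of identical Likert ratings

def longest_run(values) -> int:
    """Length of the longest unbroken string of identical responses."""
    best = run = 1
    for prev, cur in zip(values, values[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

survey = pd.read_csv("survey_responses.csv")        # one column per Likert item
runs = survey.apply(lambda row: longest_run(row.tolist()), axis=1)
clean = survey[runs <= MAX_RUN]                     # drop straight-lining respondents
print(f"kept {len(clean)} of {len(survey)} responses")
```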