Evaluating Search System Explainability with Psychometrics and Crowdsourcing

Evaluating Search System Explainability with Psychometrics and
Crowdsourcing
Catherine Chen
catherine_s_chen@brown.edu
Brown University
Providence, Rhode Island, USA
Carsten Eickhoff
carsten.eickhoff@uni-tuebingen.de
University of Tübingen
Tübingen, Germany
ABSTRACT
As information retrieval (IR) systems, such as search engines and
conversational agents, become ubiquitous in various domains, the
need for transparent and explainable systems grows to ensure ac-
countability, fairness, and unbiased results. Despite recent advances
in explainable AI and IR techniques, there is no consensus on the
denition of explainability. Existing approaches often treat it as
a singular notion, disregarding the multidimensional denition
postulated in the literature. In this paper, we use psychometrics
and crowdsourcing to identify human-centered factors of explain-
ability in Web search systems and introduce SSE (Search System
Explainability), an evaluation metric for explainable IR (XIR) search
systems. In a crowdsourced user study, we demonstrate SSE’s ability
to distinguish between explainable and non-explainable systems,
showing that systems with higher scores indeed indicate greater in-
terpretability. We hope that aside from these concrete contributions
to XIR, this line of work will serve as a blueprint for similar explain-
ability evaluation efforts in other domains of machine learning and
natural language processing.
CCS CONCEPTS
• Information systems → Evaluation of retrieval results;
Search interfaces; Web search engines.
KEYWORDS
explainability; search; crowdsourcing; psychometrics
ACM Reference Format:
Catherine Chen and Carsten Eickhoff. 2024. Evaluating Search System
Explainability with Psychometrics and Crowdsourcing. In Proceedings of
the 47th International ACM SIGIR Conference on Research and Development
in Information Retrieval (SIGIR ’24), July 14–18, 2024, Washington, DC, USA.
ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3626772.3657796
1 INTRODUCTION
Explainable information retrieval (XIR) research aims to develop
methods that increase the transparency and reliability of infor-
mation retrieval systems. XIR systems are designed to provide
end-users with a deeper understanding of the rationale underlying
ranking decisions. Besides casual Web search, these systems hold
promising potential for impactful real-world information needs,
such as matching patients to clinical trials, retrieving case law for
legal research, and detecting misinformation in news and media
analysis. Despite several advancements in their development, there
is a lack of empirical, standardized techniques for evaluating the
ecacy of XIR systems.
Current approaches toward evaluating XIR systems are limited
by a lack of consensus in the broader explainable artificial intelli-
gence (XAI) community on a definition of explainability. Explain-
ability has often been treated as a monolithic concept although
recent literature suggests it to be an amalgamation of several sub-
factors [19, 38, 42]. As a result, evaluation has occurred on a binary
scale, considering systems as either explainable or black-box, hin-
dering direct system comparison. Furthermore, explainability is
often declared with anecdotal evidence rather than measured quan-
titatively. To address these shortcomings, we (1) identify individual
factors of explainability and integrate them into a multidimensional
model, and (2) provide a continuous-scale evaluation metric for ex-
plainable search systems.
Inspired by previous work on multidimensional relevance mod-
eling [64], we leverage psychometrics [21] and crowdsourcing to do
so. Psychometrics is a well-established field in psychology used to
develop measurement models for cognitive constructs that cannot
be directly measured. Our approach involves several phases. First,
we identified an exhaustive list of well-discussed explainability as-
pects in the community to quantitatively test. Next, we designed a
user study to confirm a multidimensional model, utilizing crowd-
sourcing as a data-driven and efficient means of collecting diverse
results from laypeople, consistent with the assumption of Web
search not requiring domain-specific knowledge. Finally, we used
the outcomes from our crowdsourced study to establish a metric
via structural equation modeling.
This paper empowers users to understand the search systems that
cater to their daily information needs in an environment potentially
fraught with biases and misinformation. Specifically, we contribute
the following:
• Leverage psychometrics and crowdsourcing to test well-
discussed aspects of explainable Web search systems from
the literature
• Introduce SSE¹, a quantitative evaluation metric for measur-
ing Search System Explainability on a continuous scale
• Conduct a crowdsourced user study to validate SSE, gain
practical insights into implementing human-centered evalu-
ation tools, and assess the impact such tools have on human
annotators
¹ Pronounced ‘es-es-e’.
The remainder of this paper is structured as follows: Section 2
presents background and related work on psychometric studies,
the multidimensionality of explainability, and previous attempts
to evaluate explainable systems. Section 3 outlines steps taken to
develop our measuring instrument and crowdsourcing task setup.
Section 4 presents the results of our data collection and model cre-
ation eorts. Section 5 proposes SSE and examines its eectiveness
in evaluating search system explainability. Finally, Section 6 ana-
lyzes the dimensions of explainability users found important and
discusses implications for future XIR system design and evaluation.
Section 7 concludes with an overview of future work.
2 RELATED WORK
2.1 Psychometrics, SEM, and Crowdsourcing
Psychometrics uses Structural Equation Modeling (SEM) to con-
struct models from observed data by measuring the presence of
latent variables (factors) through observed variables (questionnaire
items) [21, 58]. SEM consists of two parts: (1) Exploratory Factor
Analysis (EFA) to produce a hypothesized model structure and (2)
Confirmatory Factor Analysis (CFA) to confirm the EFA-derived
model fit on a held-out dataset. EFA identifies the number of latent
factors and which items load on the discovered dimensions using
a statistical technique that iteratively groups and prunes items to
reach a high-quality estimate of covariance in the observed data set.
CFA re-estimates model parameters using maximum likelihood on
a held-out set of observed data and assesses model fit via statistical
significance testing of multiple alternative models.
Since SEM requires large amounts of user response data, crowd-
sourcing is often used for data collection due to its convenience and
eciency in quickly recruiting a large number of participants. How-
ever, ensuring data quality in crowdsourcing is challenging, since
the payout may be the main motivator for workers to complete
tasks and platforms become more saturated with low-quality work-
ers. To mitigate these issues, preventative measures can be taken
by setting high worker qualications, enabling rigorous quality
control checks, and post-processing data for inattentive responses
to verify quality work [
5
,
23
,
32
,
41
]. We describe the quality control
checks we employ in our study in Section 3.
2.2 Evaluation of Explainable Systems
Explainability is still often considered to be a binary concept despite
recent literature that suggests that it may be best measured as a
combination of several factors [19, 38, 42]. Lipton [38] and Doshi-
Velez and Kim [18] suggest that explainability is (1) ill-defined with
no consensus and (2) an amalgamation of several factors rather
than a monolithic concept. Specifically, both recognize the need to
ground explainability in the context of certain desiderata, such
as trustworthiness or causality. Nauta et al. [42] additionally identify
12 such conceptual properties for the systematic, multidimensional
evaluation of explainability.
Due to the lack of consensus and recognition of explainability as
a multi-faceted concept, evaluation falls short in two ways. Firstly,
current methods do not consider a multidimensional definition.
Authors of frameworks such as LIME and SHAP demonstrate the
utility of their methods through user evaluations but fail to quan-
tify the degree of explainability provided by their methods [39, 51].
Relying solely on declarations of explainability without measure-
ment hinders targeted system improvements, and a more holistic
definition is needed for robustness. Secondly, while there are many
ML evaluation metrics for system performance comparison, there
is no standard metric for evaluating explainability. Several authors
acknowledge the importance of quantitative evaluation metrics
[1, 15, 18, 42], and while Nguyen and Martínez [43] come close by
introducing a suite of metrics to quantify interpretability, they fail
to quantify interactions between facets and measure the importance
of each individual facet.
Existing approaches for explainability in IR often focus on a
singular aspect of explainability or lack evaluation of explanation
quality, relying on anecdotal evidence [47, 48, 53, 63]. In this paper,
we propose a data-driven approach for a more fine-grained rep-
resentation of explainability and develop a user study evaluation
instrument to create a metric that models explainability as a func-
tion of several sub-factors, enabling direct comparison between
systems and targeted improvements.
3 STUDY DESIGN
3.1 Questionnaire Design
First, to compile a list of candidate aspects that may potentially
contribute to the composite notion of explainability, we conducted
a comprehensive structured literature review and included the most
commonly discussed aspects of explainability. We included the
proceedings of ML, IR, natural language processing (NLP), and
human-computer interaction (HCI) venues (i.e., ACL, CHI, ICML,
NeurIPS, SIGIR) and noted papers for further review if titles in-
cluded the keywords interpretability, explainability, or transparency,
and cross-referenced papers using connectedpapers.com to find sim-
ilar papers, resulting in 44 papers (37 of which were published
within the last 7 years). We then read abstracts and conclusions for
this pool to retain only those papers that examined some concrete
element or aspect of explainability/interpretability, leaving us with
14 papers covering 26 unique aspects of explainability (e.g., trust-
worthiness, uncertainty, faithfulness) (full list in Table 1). Our final
number of candidate aspects is consistent with, and perhaps more
encompassing than, other survey papers such as Nauta et al. [42],
who find 12 explainability factors from the literature. Given the
flexibility of our framework, future work could easily investigate
additional aspects from broader literature.
Next, these aspects were turned into a set of concrete questions
(referred to as “items” in psychometrics) to be included in the ques-
tionnaire. We recorded responses on a 7-point Likert scale ranging
from 1 (Strongly Disagree), via 4 (Neutral), to 7 (Strongly Agree).
Our questionnaire was created using the following guidelines [20]:
(1) items should use clear language and avoid complex words, (2)
items should not be leading or presumptuous, and (3) the instrument
should include both positively and negatively keyed items.
Additionally, as explainability relies on both system and expla-
nation perception, we created items taking both into account, so
our nal evaluation would reect these desiderata. To combat fa-
tigue eects, we chose to create 2 items per aspect (one positively
and one negatively worded), for a total of 52 items presented in
fully randomized order, with the expectation that the discovery of
latent factor representations during factor analysis would establish
groupings of multiple related items. Eleven doctoral and post-doctoral
researchers reviewed our questionnaire for clarity and accuracy
given aspect definitions. From this evaluation, we were able to iden-
tify and correct potential inconsistencies before our pilot study.
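As an illustration of how such an instrument is typically scored, the snippet below reverse-codes negatively keyed items on the 7-point scale and draws a fully randomized presentation order. The file name, column names, and the "_neg" suffix convention are hypothetical and do not reflect our actual item wording.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical layout: one column per item, negatively keyed items end in "_neg".
responses = pd.read_csv("questionnaire_responses.csv")
neg_items = [c for c in responses.columns if c.endswith("_neg")]

# Reverse-score negatively keyed items on the 1-7 Likert scale (1<->7, 2<->6, ...).
responses[neg_items] = 8 - responses[neg_items]

# Present all items in a fully randomized order, drawn once per participant.
item_order = rng.permutation(responses.columns.to_numpy())
print(item_order[:5])
```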
3.2 Task Setup
Participants performed 3 search tasks distributed across 3 topics.
To motivate and guide their search, users were asked to answer a
multiple-choice question for each topic. We provided users with a
mock search engine based on current, non-transparent, commercial
search engines that displayed a query and a list of results. We hosted
our site on Netlify and displayed the task on MTurk.
Topics and questions were selected from the TREC 2004 Robust
Track Dataset [59], comprising documents from the Federal Register,
Financial Times, Foreign Broadcast Information Service, and LA
Times. Multiple choice answers were created by the authors to
ensure they were not on the first page of results and required
clicking into documents. To ensure there would be enough relevant
documents to populate the results page, we randomly sampled 9
topics that contained at least 50 relevant documents. Topics were
grouped as follows: (A) industrial espionage; income tax evasion;
in vitro fertilization, (B) radioactive waste; behavioral genetics;
drugs in the Golden Triangle, (C) law enforcement, dogs; non-US
media bias; gasoline tax in US. For each topic, we presented 100
pre-selected documents with a 50/50 random sample of relevant and
non-relevant documents in a randomized ranking order, requiring
workers to interact with the search system in order to successfully
complete the multiple-choice quiz.
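A rough sketch of this sampling procedure is shown below; the qrels file name, column names, and use of pandas are assumptions about tooling rather than a description of our actual pipeline.

```python
import random
import pandas as pd

random.seed(7)

# Hypothetical TREC-style qrels: one row per (topic, docno) with a relevance label.
qrels = pd.read_csv("robust04_qrels.csv")   # columns: topic, docno, relevance

# Keep only topics with at least 50 relevant documents, then sample 9 of them.
eligible = [t for t, g in qrels.groupby("topic") if (g["relevance"] > 0).sum() >= 50]
topics = random.sample(eligible, 9)

def build_result_list(topic: int, k: int = 100) -> list[str]:
    """50/50 sample of relevant and non-relevant docs, shuffled into a random ranking."""
    g = qrels[qrels["topic"] == topic]
    rel = g.loc[g["relevance"] > 0, "docno"].sample(k // 2).tolist()
    nonrel = g.loc[g["relevance"] == 0, "docno"].sample(k // 2).tolist()
    docs = rel + nonrel
    random.shuffle(docs)                     # randomized ranking order
    return docs
```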
Figure 1: Search interface shown to Group A (modeled on
the basis of the system presented by Ramos and Eickhoff
[50]). On the left-hand side, the stacked bar graphs depict
hypothetical scores of each keyword in the query for each
respective search result. The larger the stacked bar graph,
the more relevant that result is to the query.
Participants were randomly assigned a topic grouping. After
completing the search task, we presented each group with another
mock search interface based on existing transparent search systems
from the literature [11, 50], designed to test varying degrees of
explainability, and participants were asked to complete our ques-
tionnaire for this second system. To avoid priming eects and other
potential biases, we employed a between-subjects study design.
Group A participants received an interface that provided visual
explanation aids, with stacked bar graphs displayed next to each
search result that informed users how much each query term in-
fluenced the corresponding document ranking (Figure 1). Group
B participants received an interface that displayed relevance and
confidence scores for each result, where confidence was modeled
as a function of uncertainty (i.e., confidence and relevance percent-
ages shown next to each result). Group C participants received
the same non-transparent system they interacted with during the
screening search task. Each condition was accompanied by brief
usage instructions explaining the novel (if any) interface features.
We included multiple quality control checks to verify worker
attentiveness. Specically, we monitored site interactions (number
of clicks, documents viewed, time spent), employed a multiple-
choice quiz, and provided a unique code for the worker to submit
on MTurk for task completion verication. In addition to serving
as a form of quality control, the multiple choice quiz was used to
help guide the workers through the search task and foster a search
mindset, enhancing the accuracy of the questionnaire completion.
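The sketch below illustrates one way such checks can be combined at submission time; the interaction thresholds, field names, and completion-code format are hypothetical examples rather than the exact criteria we enforced.

```python
import secrets

def make_completion_code() -> str:
    # Unique code shown after the questionnaire; the worker pastes it into MTurk.
    return secrets.token_hex(4).upper()

def passes_quality_checks(log: dict, quiz: dict, answer_key: dict) -> bool:
    """Attentiveness check combining interaction logs with the multiple-choice quiz."""
    enough_interaction = (
        log.get("clicks", 0) >= 3            # clicked into results
        and log.get("docs_viewed", 0) >= 2   # opened at least two documents
        and log.get("seconds_on_task", 0) >= 120
    )
    quiz_correct = all(quiz.get(q) == a for q, a in answer_key.items())
    return enough_interaction and quiz_correct
```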
While users’ familiarity with topics might impact their expe-
rience during the search task, the main results of this study are
drawn from the questionnaire experience, which (1) was separate
from the search task where the topics were presented and (2) asked
users to comment on the nature of a system, not the search
task they previously performed. The questionnaire was intended
to capture the extent of perceived system explainability.
3.3 Pilot Study
We collected 62 responses from MTurk workers over a two-month
period during our pilot study. Observations during this phase influ-
enced changes in our task design. We added pop-up conrmations
to remind workers of experiment rules due to some misreading or
skipping of instructions. Additionally, we implemented an early
exit in the workow and treated the search task and multiple choice
quiz as a prerequisite for our survey to lter out workers who did
not faithfully attempt our task. Finally, we added a uniqueness con-
straint to block workers from attempting our task multiple times.
Additionally, we received feedback from workers that our initial
time limit (45 min) felt too rushed, leading us to increase the timer
to 1 hour for our full study. However, we found that most workers
spent less than the initial time limit on our task (averaging approx-
imately 30 min). We paid workers $9.20 for the original expected
work time of 45 minutes, the equivalent of the legal minimum wage
in our location. We required that workers have more than 10,000
prior approved HITs with an approval rate greater than 98%.
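Such qualification requirements can be attached to a HIT programmatically. The boto3 sketch below is an assumed setup rather than our actual task configuration: the external question URL, reward string, lifetimes, and assignment count are placeholders, and the built-in qualification type IDs are taken from MTurk's documented system qualifications.

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# Built-in MTurk qualification types: number of approved HITs and approval rate.
QUALS = [
    {"QualificationTypeId": "00000000000000000040",   # NumberHITsApproved
     "Comparator": "GreaterThan", "IntegerValues": [10000]},
    {"QualificationTypeId": "000000000000000000L0",   # PercentAssignmentsApproved
     "Comparator": "GreaterThan", "IntegerValues": [98]},
]

# ExternalQuestion pointing at the hosted search task (placeholder URL).
question_xml = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.netlify.app/task</ExternalURL>
  <FrameHeight>800</FrameHeight>
</ExternalQuestion>
"""

hit = mturk.create_hit(
    Title="Web search and questionnaire study",
    Description="Complete a short search task and questionnaire.",
    Reward="9.20",
    AssignmentDurationInSeconds=60 * 60,   # 1-hour timer used in the full study
    LifetimeInSeconds=7 * 24 * 60 * 60,
    MaxAssignments=9,
    Question=question_xml,
    QualificationRequirements=QUALS,
)
```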
4 DATA ANALYSIS
We collected a total of 540 responses from our main study (Group
A: 202, Group B: 134, Group C: 201).² We filtered out 81 responses
(15%) during our preprocessing stage to account for workers who
passed our initial quality control checks during the search task but
recorded inattentive or careless responses in the subsequent survey.
Following guidelines for identifying careless responses [5, 23, 32, 41], we analyzed response patterns and self-consistency.
Concretely, we ltered out responses that had (1) abnormally
long unbroken strings (i.e., length
>
8) of identical responses (e.g., a
respondent answering a series of 18 consecutive questions with the
same Likert-scale rating), (2) high overall numbers of inconsistent
² There is a slight imbalance despite conditions being randomly assigned, but distribu-
tions are roughly preserved across groups before and after preprocessing.
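A minimal sketch of the straight-lining filter described above is given below; the file name and column layout are hypothetical, and the self-consistency criterion is not shown.

```python
import pandas as pd

MAX_RUN = 8   # longest permissible run of identical Likert ratings

def longest_run(values) -> int:
    """Length of the longest unbroken string of identical responses."""
    best = run = 1
    for prev, cur in zip(values, values[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

survey = pd.read_csv("survey_responses.csv")        # one column per Likert item
runs = survey.apply(lambda row: longest_run(row.tolist()), axis=1)
clean = survey[runs <= MAX_RUN]                     # drop straight-lining respondents
print(f"kept {len(clean)} of {len(survey)} responses")
```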