The remainder of this paper is structured as follows: Section 2 presents background and related work on psychometric studies, the multidimensionality of explainability, and previous attempts to evaluate explainable systems. Section 3 outlines the steps taken to develop our measuring instrument and crowdsourcing task setup. Section 4 presents the results of our data collection and model creation efforts. Section 5 proposes SSE and examines its effectiveness in evaluating search system explainability. Finally, Section 6 analyzes the dimensions of explainability users found important and discusses implications for future XIR system design and evaluation. Section 7 concludes with an overview of future work.
2 RELATED WORK
2.1 Psychometrics, SEM, and Crowdsourcing
Psychometrics uses Structural Equation Modeling (SEM) to construct models from observed data by measuring the presence of latent variables (factors) through observed variables (questionnaire items) [21, 58]. SEM consists of two parts: (1) Exploratory Factor Analysis (EFA) to produce a hypothesized model structure and (2) Confirmatory Factor Analysis (CFA) to confirm the fit of the EFA-derived model on a held-out dataset. EFA identifies the number of latent factors and which items load on the discovered dimensions, using a statistical technique that iteratively groups and prunes items to reach a high-quality estimate of the covariance in the observed data set. CFA re-estimates model parameters using maximum likelihood on a held-out set of observed data and assesses model fit via statistical significance testing of multiple alternative models.
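For concreteness, the following is a minimal sketch of this two-stage pipeline, assuming survey responses in a pandas DataFrame and the third-party factor_analyzer and semopy packages; the file name, factor count, and item names are hypothetical and do not correspond to our study.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer
from semopy import Model, calc_stats

# Hypothetical split: half of the respondents for EFA, the rest held out for CFA.
responses = pd.read_csv("likert_responses.csv")   # columns = questionnaire items
efa_half = responses.sample(frac=0.5, random_state=0)
cfa_half = responses.drop(efa_half.index)

# (1) EFA: estimate how items load on a hypothesized number of latent factors.
efa = FactorAnalyzer(n_factors=4, rotation="oblimin")
efa.fit(efa_half)
print(efa.loadings_)          # items with strong loadings define each factor

# (2) CFA: fix the EFA-derived structure and re-fit it on the held-out half.
model_desc = """
Utility      =~ item_trust_pos + item_trust_neg + item_useful_pos
Transparency =~ item_transp_pos + item_transp_neg + item_faithful_pos
"""
cfa = Model(model_desc)
cfa.fit(cfa_half)
print(calc_stats(cfa))        # fit indices such as CFI, TLI, and RMSEA
```

In practice, the number of factors is itself determined during EFA (e.g., via eigenvalue criteria or parallel analysis) rather than fixed up front as in this sketch.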
Since SEM requires large amounts of user response data, crowdsourcing is often used for data collection due to its convenience and efficiency in quickly recruiting a large number of participants. However, ensuring data quality in crowdsourcing is challenging, since payout may be workers' main motivation for completing tasks and platforms have become increasingly saturated with low-quality workers. To mitigate these issues, preventative measures can be taken by setting high worker qualifications, enabling rigorous quality control checks, and post-processing data for inattentive responses to verify quality work [5, 23, 32, 41]. We describe the quality control checks we employ in our study in Section 3.
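As an illustration of the post-processing step, here is a minimal sketch of two common screening heuristics, failed attention checks and straight-lining (near-zero response variance); the column names and thresholds are hypothetical and are not the specific checks used in our study.

```python
import pandas as pd

def screen_inattentive(df: pd.DataFrame, item_cols: list[str]) -> pd.DataFrame:
    """Drop respondents who fail an attention check or straight-line the scale."""
    # Hypothetical attention-check item with a single instructed answer.
    passed_check = df["attn_check"] == 7

    # Straight-lining: answering (almost) every item identically.
    straight_lined = df[item_cols].std(axis=1) < 0.5

    return df[passed_check & ~straight_lined]

# Usage: keep only attentive respondents before running factor analysis.
# clean = screen_inattentive(responses,
#                            item_cols=[c for c in responses if c.startswith("item_")])
```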
2.2 Evaluation of Explainable Systems
Explainability is still often considered to be a binary concept despite recent literature suggesting that it may be best measured as a combination of several factors [19, 38, 42]. Lipton [38] and Doshi-Velez and Kim [18] suggest that explainability is (1) ill-defined with no consensus and (2) an amalgamation of several factors rather than a monolithic concept. Specifically, both recognize the need to ground explainability in the context of certain desiderata, such as trustworthiness or causality. Nauta et al. [42] additionally identify 12 such conceptual properties for the systematic, multidimensional evaluation of explainability.
Due to this lack of consensus and the recognition of explainability as a multi-faceted concept, evaluation falls short in two ways. Firstly, current methods do not consider a multidimensional definition. Authors of frameworks such as LIME and SHAP demonstrate the utility of their methods through user evaluations but fail to quantify the degree of explainability those methods provide [39, 51]. Relying solely on declarations of explainability without measurement hinders targeted system improvements, and a more holistic definition is needed for robustness. Secondly, while there are many ML evaluation metrics for comparing system performance, there is no standard metric for evaluating explainability. Several authors acknowledge the importance of quantitative evaluation metrics [1, 15, 18, 42], and while Nguyen and Martínez [43] come close by introducing a suite of metrics to quantify interpretability, they do not quantify interactions between facets or measure the importance of each individual facet.
Existing approaches for explainability in IR often focus on a single aspect of explainability or lack evaluation of explanation quality, relying on anecdotal evidence [47, 48, 53, 63]. In this paper, we propose a data-driven approach for a more fine-grained representation of explainability and introduce a user study evaluation instrument to create a metric that models explainability as a function of several sub-factors, enabling direct comparison between systems and targeted improvements.
3 STUDY DESIGN
3.1 Questionnaire Design
First, to compile a list of candidate aspects that may potentially contribute to the composite notion of explainability, we conducted a comprehensive structured literature review and included the most commonly discussed aspects of explainability. We included the proceedings of ML, IR, natural language processing (NLP), and human-computer interaction (HCI) venues (i.e., ACL, CHI, ICML, NeurIPS, SIGIR), noted papers for further review if their titles included the keywords interpretability, explainability, or transparency, and cross-referenced papers using connectedpapers.com to find similar papers, resulting in 44 papers (37 of which were published within the last 7 years). We then read the abstracts and conclusions of this pool to retain only those papers that examined some concrete element or aspect of explainability/interpretability, leaving us with 14 papers covering 26 unique aspects of explainability (e.g., trustworthiness, uncertainty, faithfulness) (full list in Table 1). Our final number of candidate aspects is consistent with, and perhaps more encompassing than, other survey papers such as Nauta et al. [42], who find 12 explainability factors in the literature. Given the flexibility of our framework, future work could easily investigate additional aspects from broader literature.
Next, these aspects were turned into a set of concrete questions (referred to as "items" in psychometrics) to be included in the questionnaire. We recorded responses on a 7-point Likert scale ranging from 1 (Strongly Disagree), via 4 (Neutral), to 7 (Strongly Agree). Our questionnaire was created using the following guidelines [20]: (1) items should use clear language and avoid complex words, (2) items should not be leading or presumptuous, and (3) the instrument should include both positively and negatively keyed items.
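Negatively keyed items are typically reverse-coded before analysis so that higher scores consistently indicate more of the underlying construct. A minimal sketch for a 7-point scale, with hypothetical column names:

```python
import pandas as pd

# On a 1-7 Likert scale, a negatively keyed response r maps to 8 - r, so that
# "Strongly Agree" (7) with a negative statement scores like "Strongly
# Disagree" (1) with its positive counterpart.
def reverse_code(df: pd.DataFrame, negative_items: list[str]) -> pd.DataFrame:
    out = df.copy()
    out[negative_items] = 8 - out[negative_items]
    return out

# Usage with hypothetical item columns:
# responses = reverse_code(responses,
#                          negative_items=["item_trust_neg", "item_transp_neg"])
```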
Additionally, as explainability relies on both system and explanation perception, we created items taking both into account, so our final evaluation would reflect these desiderata. To combat fatigue effects, we chose to create 2 items per aspect (one positively and one negatively worded), for a total of 52 items presented in fully randomized order, with the expectation that the discovery of latent factor representations during factor analysis would establish