
1 Introduction
Sentiment Analysis (SA) is instrumental to the financial services industry [28, 29] as it develops techniques to interpret customer feedback, monitor product reputations, understand customers' needs, and conduct market research. By harnessing the power of Deep Learning (DL) in understanding general contexts, SA models have achieved considerable performance gains [12, 11, 23]. However, the non-linearity and the black-box nature of such models hinder the interpretation of their predictions [20, 32]. Besides providing guarantees on reliability, generalization, robustness, and fairness, the interpretability of SA models can be of service to behavioral marketing and personalized advertisement.
Recently, Explainable Artificial Intelligence (ExAI) algorithms have been bringing new flexibility to general AI applications by developing methods to explain model predictions [5, 41, 30]. Numerical data frameworks and computer vision applications have witnessed an explosive growth of ExAI, nurtured by the ease of expressing features as interpretable components [26, 22, 10]. However, only a few ExAI methods have been applied to textual classifiers, embeddings, and language models [7]. In the SA framework, researchers have integrated data augmentation techniques to improve the interpretability of SA models [6], studied attention mechanisms in SA through an explainability lens [3], and applied ExAI to aspect-based SA models [37]. To date, ExAI methods on Natural Language Processing (NLP) tasks have not been evaluated on standardized benchmarking datasets through common metrics, which hinders the progress and adoption of such methods in the NLP field. Evaluating explainability methods serves two purposes. First, it helps assess the extent to which a deep model can be made explainable. Second, it provides a common ground for measuring the contrast between explanations produced by diverse ExAI approaches.
In this work, we inspect two human-centered aspects of explainability methods: (1) faithfulness to the model being explained and (2) plausibility from a human perspective. For this purpose, we select eight state-of-the-art SA models built on recurrent, convolutional, and attention layers. We generate explanations of the predictions of these models with three ExAI methods applicable to NLP, namely LIME [30], Anchors [31], and SHAP [18]. The generated explanations are then evaluated through two procedures. First, faithfulness (which hereafter refers to the metric) is evaluated by examining the degradation in the model's performance when only the extracted rationales are fed to the model. Second, the plausibility of the extracted rationales is evaluated via comparison to the human judgment of what a sufficient explanation is. This experiment entails a homegrown dataset of manually labeled explanations on SA data, aggregated through conjunction and disjunction. The comparison is carried out with six proposed metrics, inspired by information retrieval, that evaluate the precision and fallout of ExAI methods on the SA models. Hence, our evaluation spans four different dimensions: (1) SA model, (2) ExAI method, (3) reasoning complexity, and (4) human judgment homogeneity.
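
To make the faithfulness procedure concrete, the following is a minimal Python sketch rather than the paper's implementation: it assumes a hypothetical `predict` wrapper around an SA model and precomputed rationale tokens (e.g., the top-weighted tokens returned by LIME or SHAP), and measures the accuracy drop when the model only sees those rationales.

```python
from typing import Callable, List, Sequence


def rationale_only(text: str, rationale: Sequence[str]) -> str:
    """Keep only the tokens that the ExAI method marked as a rationale."""
    keep = {tok.lower() for tok in rationale}
    return " ".join(tok for tok in text.split() if tok.lower() in keep)


def faithfulness_degradation(
    predict: Callable[[List[str]], List[int]],   # hypothetical SA model wrapper
    texts: List[str],
    rationales: List[Sequence[str]],             # one rationale token list per text
    labels: List[int],
) -> float:
    """Accuracy drop when the model sees rationales instead of full inputs."""
    def accuracy(preds: List[int]) -> float:
        return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

    full_acc = accuracy(predict(texts))
    reduced_acc = accuracy(predict([rationale_only(t, r)
                                    for t, r in zip(texts, rationales)]))
    return full_acc - reduced_acc


if __name__ == "__main__":
    # Toy stand-in for a trained SA model: predicts positive iff "great" appears.
    toy_predict = lambda batch: [int("great" in t.lower()) for t in batch]
    texts = ["the service was great", "the fees are terrible"]
    rationales = [["service"], ["terrible"]]     # deliberately weak rationales
    labels = [1, 0]
    print(faithfulness_degradation(toy_predict, texts, rationales, labels))  # 0.5
```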
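
Similarly, the plausibility comparison can be illustrated with token-level precision, recall, and fallout of a machine rationale against a human-annotated one. These are generic information-retrieval scores shown for illustration only, not the six metrics proposed in this work, and the names `plausibility_scores`, `human_rationale`, and `machine_rationale` are hypothetical.

```python
from typing import Dict, Sequence, Set


def _as_set(tokens: Sequence[str]) -> Set[str]:
    return {t.lower() for t in tokens}


def plausibility_scores(predicted: Sequence[str],
                        human: Sequence[str],
                        all_tokens: Sequence[str]) -> Dict[str, float]:
    """IR-style agreement between a machine rationale and a human rationale."""
    pred, gold, vocab = _as_set(predicted), _as_set(human), _as_set(all_tokens)
    tp = len(pred & gold)            # tokens flagged by both the ExAI method and humans
    fp = len(pred - gold)            # tokens flagged by the ExAI method only
    negatives = len(vocab - gold)    # tokens humans did not flag
    return {
        "precision": tp / len(pred) if pred else 0.0,
        "recall": tp / len(gold) if gold else 0.0,
        "fallout": fp / negatives if negatives else 0.0,
    }


if __name__ == "__main__":
    sentence = "the new mobile app is great but support is slow".split()
    human_rationale = ["great", "slow"]          # human-judged sufficient explanation
    machine_rationale = ["great", "app"]         # e.g. the top-2 tokens from LIME
    print(plausibility_scores(machine_rationale, human_rationale, sentence))
```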
The contributions of this work are: (1) a dataset for SA explainability labeled along different dimensions, (2) the first faithfulness and plausibility evaluation inspired by information retrieval, and (3) a thorough four-dimensional ExAI evaluation of SA models.