On the Impact of Temporal Concept Drift on Model Explanations
Zhixue Zhao George Chrysostomou Kalina Bontcheva Nikolaos Aletras
Department of Computer Science, University of Sheffield
United Kingdom
{zhixue.zhao, gchrysostomou1, k.bontcheva, n.aletras}@sheffield.ac.uk
Abstract

Explanation faithfulness of model predictions in natural language processing (NLP) is typically evaluated on held-out data from the same temporal distribution as the training data (i.e. synchronous settings). While model performance often deteriorates due to temporal variation (i.e. temporal concept drift), it is currently unknown how explanation faithfulness is impacted when the time span of the target data differs from that of the data used to train the model (i.e. asynchronous settings). For this purpose, we examine the impact of temporal variation on model explanations extracted by eight feature attribution methods and three select-then-predict models across six text classification tasks. Our experiments show that (i) faithfulness is not consistent under temporal variation across feature attribution methods (e.g. it decreases or increases depending on the method), with an attention-based method demonstrating the most robust faithfulness scores across datasets; and (ii) select-then-predict models are mostly robust in asynchronous settings, with only small degradation in predictive performance. Finally, feature attribution methods show conflicting behavior when used in FRESH (i.e. a select-then-predict model) and when used to measure sufficiency/comprehensiveness (i.e. as post-hoc methods), suggesting that we need more robust metrics to evaluate post-hoc explanation faithfulness.[1]

[1] Code for replicating the experiments in this study: https://github.com/casszhao/temporal-drift-on-explanation
1 Introduction
One way of improving the transparency of deep learning models in natural language processing (NLP) is by extracting explanations that justify model predictions (Lipton, 2018; Guidotti et al., 2018). An explanation (i.e. rationale) consists of a subset of the input and is considered faithful when it accurately shows the reasoning behind a model's prediction (Zaidan et al., 2007; Ribeiro et al., 2016a; DeYoung et al., 2019; Jacovi and Goldberg, 2020). For example, removing a faithful rationale from the input should result in a prediction change. Two widely used methods for extracting rationales are (i) feature attribution methods, which produce a distribution over the input tokens indicating their contribution (i.e. importance) to the model's prediction (Ribeiro et al., 2016b; Wiegreffe and Pinter, 2019); and (ii) select-then-predict models, which consist of two components, a rationale extractor and a predictor. The rationale extractor extracts rationales, and the predictor is trained on the extracted rationales so that its predictions are inherently faithful (Lei et al., 2016; Jain et al., 2020).
Previous work has focused on evaluating explanation faithfulness in synchronous settings, where the training and testing data come from the same temporal distribution (Serrano and Smith, 2019a; Jain and Wallace, 2019; Atanasova et al., 2020; Guerreiro and Martins, 2021), or in out-of-domain settings (Chrysostomou and Aletras, 2022a), where the training and testing data come from different domains regardless of temporal drift in the testing data. However, human languages evolve (Weinreich et al., 1968; Kim et al., 2014; Carrier, 2019), as manifested by novel usages developed for existing words (e.g. mouse as a mammal or a computer accessory) and by new words and topics that appear over time (e.g. covidiot during the COVID-19 pandemic). Language evolution leads to temporal concept drift and a diachronic degradation of model performance in many NLP tasks when these are evaluated in asynchronous settings, i.e. when training and testing data come from different time periods (Jaidka et al., 2018; Agarwal and Nenkova, 2021; Lazaridou et al., 2021; Søgaard et al., 2021; Chalkidis and Søgaard, 2022).
In this paper, for the first time, we extensively analyze the impact of temporal concept drift on model explanations. We evaluate the faithfulness of rationales extracted using eight feature attribution approaches and three select-then-predict models over six text classification tasks with chronological data splits. Our contributions are as follows:
• We find that faithfulness is not consistent under temporal concept drift for rationales extracted with feature attribution methods (e.g. it decreases or increases depending on the method), with an attention-based method demonstrating the most robust faithfulness scores across datasets;

• We empirically show that select-then-predict models can be used in asynchronous settings when they achieve predictive performance comparable to the full-text model;

• We demonstrate that sufficiency is not a trustworthy evaluation metric for explanation faithfulness, in either synchronous or asynchronous settings.
2 Related Work
2.1 Temporal Concept Drift in NLP
Temporal model deterioration describes the decline in system performance when a system is evaluated on chronologically newer data (Jaidka et al., 2018; Gorman and Bedrick, 2019). This has been linked to changes in the data distribution, also known as concept drift in early studies (Schlimmer and Granger, 1986; Widmer and Kubat, 1993). Previous work has demonstrated the impact of temporal concept drift on model performance by assessing temporal generalization (Lazaridou et al., 2021; Søgaard et al., 2021; Agarwal and Nenkova, 2021; Röttger and Pierrehumbert, 2021). Søgaard et al. (2021) studied several factors that affect the true difference in system performance, such as temporal drift, variations in text length and adversarial data distributions. They found that temporal variation is the most important factor for performance degradation and suggest including chronological data splits in model evaluation. Chalkidis and Søgaard (2022) also noted that evaluating on random splits with the same temporal distribution as the training data consistently over-estimates model performance at test time in multi-label classification problems.
Previous work on mitigating temporal concept drift includes automatically identifying the semantic drift of words over time (Tsakalidis et al., 2019; Giulianelli et al., 2020; Rosin and Radinsky, 2022; Montariol et al., 2021). Efforts have also been made to mitigate the impact of temporal concept drift on model prediction performance (Lukes and Søgaard, 2018; Röttger and Pierrehumbert, 2021; Loureiro et al., 2022; Chalkidis and Søgaard, 2022) and to develop time-aware models (Dhingra et al., 2022; Rijhwani and Preotiuc-Pietro, 2020; Dhingra et al., 2021; Rosin and Radinsky, 2022). For example, both Röttger and Pierrehumbert (2021) and Loureiro et al. (2022) observed performance improvements when continuing to fine-tune their models on chronologically newer data. While the impact of temporal concept drift on model performance has received particular attention, to the best of our knowledge, no previous work has examined its impact on model explanations.
2.2 Concept Drift and Model Explanations
Poerner et al. (2018) compared explanation quality between tasks that contain short and long textual contexts. More recently, Chrysostomou and Aletras (2022a) studied model explanations in out-of-domain settings (i.e. under concept drift), using training and test data from different domains. Their results showed that the faithfulness of out-of-domain explanations unexpectedly increases, i.e. it outperforms the faithfulness of in-domain explanations. This is interesting given that performance degradation due to concept drift is often expected in domain adaptation (Schlimmer and Granger, 1986; Widmer and Kubat, 1993; Chan and Ng, 2006; Gama et al., 2014).
3 Extracting Explanations
We extract explanations using two standard approaches: (i) post-hoc methods; and (ii) select-then-predict models.
3.1 Post-hoc Explanation Methods
For post-hoc explanations, we fine-tune a BERT-base model on the synchronous training set of each task and extract explanations using post-hoc feature attribution methods for all synchronous and asynchronous test sets. We use eight widely used feature attribution methods, following Chrysostomou and Aletras (2021a,b):
• Attention (α): Token importance is computed using the corresponding normalized attention scores (Jain et al., 2020).

• Scaled attention (α∇α): Attention scores scaled by their corresponding gradients (Serrano and Smith, 2019a).

• InputXGrad (x∇x): Attributes importance by multiplying the input with its gradient, computed with respect to the predicted class (Kindermans et al., 2016; Atanasova et al., 2020).

• Integrated Gradients (IG): Ranks input tokens by computing the integral of the gradients taken along a straight path from a baseline input (zero embedding vector) to the original input (Sundararajan et al., 2017).

• GradientSHAP (Gsp): A gradient-based method to compute SHapley Additive exPlanations (SHAP) values for assigning token importance (Lundberg and Lee, 2017). Gsp computes the gradient of the outputs with respect to randomly selected points between the inputs and a baseline distribution.

• LIME: Ranks input tokens by learning a linear surrogate model using data points randomly sampled locally around the prediction (Ribeiro et al., 2016b).

• DeepLift (DL): Computes token importance according to the difference between the activation of each neuron and a reference activation (i.e. zero embedding vector) (Shrikumar et al., 2017).

• DeepLiftSHAP (DLsp): Similar to Gsp, DLsp computes the expected value of DL attributions across all input-baseline pairs, considering a baseline distribution (Lundberg and Lee, 2017).
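To make the extraction concrete, the sketch below computes two of these attributions with the Captum library on a BERT classifier, assumed to be fine-tuned as described above; the checkpoint name, example input and forward wrapper are illustrative assumptions, not the paper's code.

```python
import torch
from captum.attr import InputXGradient, IntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

enc = tok("the movie was surprisingly good", return_tensors="pt")
embeds = model.bert.embeddings(enc["input_ids"])    # start from the embedding layer
with torch.no_grad():
    target = model(**enc).logits.argmax(-1).item()  # explain the predicted class

def forward_fn(inputs_embeds, attention_mask):
    # Forward pass from embeddings so gradients reach the per-token inputs.
    return model(inputs_embeds=inputs_embeds, attention_mask=attention_mask).logits

# InputXGrad (x * grad): summing over the hidden dimension gives one score per token.
ixg_scores = InputXGradient(forward_fn).attribute(
    embeds, target=target, additional_forward_args=(enc["attention_mask"],)
).sum(-1)

# Integrated Gradients with a zero-embedding baseline, as described above.
ig_scores = IntegratedGradients(forward_fn).attribute(
    embeds, baselines=torch.zeros_like(embeds), target=target,
    additional_forward_args=(enc["attention_mask"],)
).sum(-1)
```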
3.2 Select-then-predict Models
We also use three state-of-the-art select-then-
predict models. Two are trained end-to-end (Bast-
ings et al.,2019;Guerreiro and Martins,2021)
while the other one uses a feature attribution
method as the rationale extractor (Jain et al.,2020)
with a separate predictor component, trained on the
extracted rationales.
HardKUMA: Bastings et al. (2019) proposed a modified version of the end-to-end rationale extraction model introduced by Lei et al. (2016). Choosing rationales in a binary fashion by sampling from a Bernoulli distribution is replaced with sampling from a Kumaraswamy distribution (Kumaraswamy, 1980), which supports continuous random variables. This way, the model is differentiable and easier to train.
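For intuition, here is a minimal sketch of the reparameterized sampling behind HardKUMA: a Kumaraswamy variable is drawn via its inverse CDF, then stretched and rectified so that exact 0/1 gates receive non-zero probability. The stretch bounds below are illustrative defaults, not necessarily the paper's exact configuration.

```python
import torch

def hard_kuma_sample(a, b, l=-0.1, r=1.1, eps=1e-6):
    """Differentiable sample from a stretched-and-rectified Kumaraswamy(a, b)."""
    u = torch.rand_like(a).clamp(eps, 1 - eps)  # u ~ Uniform(0, 1)
    k = (1 - (1 - u) ** (1 / b)) ** (1 / a)     # inverse CDF of Kumaraswamy(a, b)
    z = l + (r - l) * k                         # stretch the support beyond [0, 1]
    return z.clamp(0.0, 1.0)                    # rectify; gradient is zero where clamped

# One soft "keep this token?" gate per position; in HardKUMA the parameters
# a and b are produced by the rationale extractor, so gradients flow through z.
a = torch.rand(1, 8).clamp(min=0.1).requires_grad_()
b = torch.rand(1, 8).clamp(min=0.1).requires_grad_()
z = hard_kuma_sample(a, b)
```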
SPECTRA: HardKUMA provides stochastic rationales due to the marginalization over all possible rationales and the sampling process. Guerreiro and Martins (2021) proposed SPECTRA, a model that uses LP-SparseMAP (Niculae and Martins, 2020) to obtain a deterministic rationale extraction process. Niculae and Martins (2020) experimented with three different factor graphs, showing that XorAtMostOne outperforms the other two (i.e. Budget and AtMostOne). We use SPECTRA with XorAtMostOne in our experiments. For HardKUMA and SPECTRA, we use a Bi-LSTM (Hochreiter and Schmidhuber, 1997) because it has been shown to outperform BERT-based models (Guerreiro and Martins, 2021).
FRESH: Jain et al. (2020) proposed FRESH, a model that first extracts rationales from a trained model (e.g. using a feature attribution method) and subsequently trains a classifier on the extracted rationales. We extract the top 20% of tokens as rationales using α∇α, which achieved the best performance in early experimentation. We also use BERT-base for the extractor and predictor components, following Jain et al. (2020).
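A minimal sketch of this two-step pipeline is shown below. The helper `score_tokens` stands in for any trained attribution extractor (e.g. α∇α) and, like `train_set`, is an assumption for illustration rather than the paper's implementation.

```python
import math

def extract_rationale(tokens, scores, ratio=0.2):
    """Keep the top `ratio` of tokens by attribution score, in original order."""
    k = max(1, math.ceil(ratio * len(tokens)))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

# Step 1: build a rationale-only training set from attribution scores.
rationale_train = [
    (" ".join(extract_rationale(tokens, score_tokens(tokens))), label)
    for tokens, label in train_set
]
# Step 2: fine-tune a fresh BERT-base classifier on `rationale_train` only;
# since the predictor never sees the rest of the input, its predictions are
# faithful to the extracted rationale by construction.
```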
4 Experimental Setup
4.1 Tasks and Data
Tasks: We evaluate all methods on three diverse text classification tasks across six different datasets: (1) topic classification; (2) misinformation detection; and (3) sentiment analysis:
• AGNews: Topic classification across four topics (Business, Sports, Science/Technology and World) from AG News (Del Corso et al., 2005);

• X-FACT: Factual correctness classification of short statements into five classes (Gupta and Srikumar, 2021): True, Mostly-True, Partly-True, Mostly-False and False;

• FactCheck: Binary classification of potential misinformation stories as truthful or misinformation (Jiang and Wilson, 2021);
Figure 1: Density curves of the time distribution across temporal splits and the original full-size dataset for each task.
Task        #Classes  Split      Start Date  End Date    Span (Days)  #Data
AGNews          4     Train      2004-08-18  2006-12-20      854        9358
                      Syn Test   2004-08-18  2006-12-20      854        9358
                      Asy1 Test  2007-01-30  2007-12-31      335        9358
                      Asy2 Test  2008-01-01  2008-02-20       50        9358
X-FACT          6     Train      1995-04-01  2016-08-31     7823        7232
                      Syn Test   2007-01-04  2016-08-31     3527        1204
                      Asy1 Test  2016-08-31  2017-09-30      395        1205
                      Asy2 Test  2017-09-30  2018-11-12      408        1204
FactCheck       2     Train      1995-09-25  2019-05-01     8619        7446
                      Syn Test   1996-08-02  2019-05-01     8307        1241
                      Asy1 Test  2019-05-02  2020-05-15      379        1368
                      Asy2 Test  2020-05-15  2021-07-19      430        1368
AmazDigiMu      3     Train      1998-08-21  2016-05-07     6469      101774
                      Syn Test   1998-12-20  2016-05-07     6351       16963
                      Asy1 Test  2016-05-07  2016-12-30      237       16962
                      Asy2 Test  2016-12-30  2018-09-26      635       16962
AmazPantry      3     Train      2006-04-28  2017-07-30     4111       82566
                      Syn Test   2006-12-22  2017-07-30     3873       13762
                      Asy1 Test  2017-07-30  2018-01-21      175       13761
                      Asy2 Test  2018-01-21  2018-10-04      256       13761
Yelp            5     Train      2005-02-16  2018-12-31     5066        8540
                      Syn Test   2005-02-16  2018-12-24     5059        1708
                      Asy1 Test  2019-01-01  2020-12-31      730        1708
                      Asy2 Test  2021-01-01  2022-01-19      383        1708

Table 1: Data statistics and temporal splits for each task.
• Amazon Reviews: Sentiment prediction (negative, neutral, positive) for Amazon product reviews from the digital music (AmazDigiMu) and pantry (AmazPantry) categories, following Ni et al. (2019);

• Yelp: Multi-class sentiment classification of reviews into five classes, following Zhang et al. (2015).
Data Splits: To simulate temporal concept drift, we create chronological splits according to the timestamps of the data points in each dataset. We split each dataset into a training set and three different test sets. The time spans of the three test sets follow chronological order without any overlap. The test set with the earliest time span (Syn) has the exact same time span as the training data (i.e. a synchronous setting). The other two splits, denoted Asy1 and Asy2, are chronologically newer and correspond to asynchronous settings. Figure 1 shows the temporal distribution of each data split compared to the original data. Table 1 summarizes the key statistics for each split. More details on the data and tasks can be found in Appendix A. We also provide results of all models on the original (synchronous) test set (OSyn).
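A minimal sketch of how such chronological splits can be constructed follows, assuming a pandas DataFrame `df` with a `date` column; the cut-off date, sampling fraction and half-split of the newer data are illustrative choices, not the exact procedure behind Table 1.

```python
import pandas as pd

df["date"] = pd.to_datetime(df["date"])
cutoff = pd.Timestamp("2016-08-31")  # illustrative training-span boundary

in_span = df[df["date"] <= cutoff]                   # same time span as training
train = in_span.sample(frac=0.8, random_state=42)
syn_test = in_span.drop(train.index)                 # Syn: synchronous test set

newer = df[df["date"] > cutoff].sort_values("date")  # strictly newer data
half = len(newer) // 2
asy1_test = newer.iloc[:half]                        # Asy1: chronologically newer
asy2_test = newer.iloc[half:]                        # Asy2: newest, non-overlapping span
```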
4.2 Evaluation
For each task, we train a model on the training set and then evaluate post-hoc explanations and select-then-predict performance on our three chronological splits, namely Syn, Asy1 and Asy2.
Post-hoc Explanations: We evaluate the faithfulness of post-hoc explanations using two popular metrics (DeYoung et al., 2019; Carton et al., 2020):
• Normalized Sufficiency quantifies how sufficient a rationale R is for making the same prediction p(ŷ|R) as the prediction of the full-text model p(ŷ|x). We use the normalized version to allow a fairer comparison across models and tasks:

$\mathrm{Suff}(x, \hat{y}, R) = 1 - \max(0,\, p(\hat{y}|x) - p(\hat{y}|R))$

$\mathrm{NormSuff}(x, \hat{y}, R) = \dfrac{\mathrm{Suff}(x, \hat{y}, R) - \mathrm{Suff}(x, \hat{y}, 0)}{1 - \mathrm{Suff}(x, \hat{y}, 0)} \quad (1)$

where $\mathrm{Suff}(x, \hat{y}, 0)$ denotes the sufficiency of the empty (null) rationale.
• Normalized Comprehensiveness assesses how much information the rationale holds by measuring the change in prediction when masking the rationale, p(ŷ|x\R). Similar to sufficiency, we use the normalized version:

$\mathrm{Comp}(x, \hat{y}, R) = \max(0,\, p(\hat{y}|x) - p(\hat{y}|x \setminus R))$

$\mathrm{NormComp}(x, \hat{y}, R) = \dfrac{\mathrm{Comp}(x, \hat{y}, R)}{1 - \mathrm{Suff}(x, \hat{y}, 0)} \quad (2)$
Further, we evaluate explanations of different lengths (top 2%, 10%, 20% and 50% of tokens extracted) and report the "Area Over the Perturbation Curve" (AOPC), i.e. each metric averaged over these rationale lengths.
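A minimal sketch of these faithfulness computations follows, assuming a helper `predict_prob(tokens)` that returns p(ŷ|·) for the class predicted on the full text; all names are illustrative, not the paper's implementation.

```python
import math

def normalized_metrics(predict_prob, tokens, keep_mask):
    """Eqs. (1)-(2): normalized sufficiency and comprehensiveness of a rationale."""
    p_full = predict_prob(tokens)                                 # p(y|x)
    rationale = [t for t, m in zip(tokens, keep_mask) if m]       # R
    remainder = [t for t, m in zip(tokens, keep_mask) if not m]   # x \ R

    suff = 1 - max(0.0, p_full - predict_prob(rationale))         # Suff(x, y, R)
    suff0 = 1 - max(0.0, p_full - predict_prob([]))               # Suff(x, y, 0): empty rationale
    norm_suff = (suff - suff0) / max(1e-8, 1 - suff0)             # Eq. (1)

    comp = max(0.0, p_full - predict_prob(remainder))             # Comp(x, y, R)
    norm_comp = comp / max(1e-8, 1 - suff0)                       # Eq. (2)
    return norm_suff, norm_comp

def aopc(predict_prob, tokens, scores, ratios=(0.02, 0.10, 0.20, 0.50)):
    """Average each metric over rationales of increasing length."""
    per_length = []
    for r in ratios:
        k = max(1, math.ceil(r * len(tokens)))
        top = set(sorted(range(len(tokens)), key=lambda i: scores[i],
                         reverse=True)[:k])
        keep_mask = [i in top for i in range(len(tokens))]
        per_length.append(normalized_metrics(predict_prob, tokens, keep_mask))
    n = len(per_length)
    return (sum(s for s, _ in per_length) / n,   # AOPC normalized sufficiency
            sum(c for _, c in per_length) / n)   # AOPC normalized comprehensiveness
```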