On the Impact of Temporal Concept Drift on Model Explanations
Zhixue Zhao George Chrysostomou Kalina Bontcheva Nikolaos Aletras
Department of Computer Science, University of Sheffield
United Kingdom
{zhixue.zhao, gchrysostomou1, k.bontcheva, n.aletras}@sheffield.ac.uk
Abstract

Explanation faithfulness of model predictions in natural language processing (NLP) is typically evaluated on held-out data from the same temporal distribution as the training data (i.e. synchronous settings). While model performance often deteriorates due to temporal variation (i.e. temporal concept drift), it is currently unknown how explanation faithfulness is impacted when the time span of the target data differs from that of the data used to train the model (i.e. asynchronous settings). For this purpose, we examine the impact of temporal variation on model explanations extracted by eight feature attribution methods and three select-then-predict models across six text classification tasks. Our experiments show that (i) faithfulness is not consistent under temporal variation across feature attribution methods (e.g. it decreases or increases depending on the method), with an attention-based method demonstrating the most robust faithfulness scores across datasets; and (ii) select-then-predict models are mostly robust in asynchronous settings, with only small degradation in predictive performance. Finally, feature attribution methods show conflicting behavior when used in FRESH (i.e. a select-then-predict model) and when used to measure sufficiency/comprehensiveness (i.e. as post-hoc methods), suggesting that we need more robust metrics to evaluate post-hoc explanation faithfulness.[1]

[1] Code for replicating the experiments in this study: https://github.com/casszhao/temporal-drift-on-explanation
1 Introduction
One way of improving the transparency of deep learning models in natural language processing (NLP) is by extracting explanations that justify model predictions (Lipton, 2018; Guidotti et al., 2018). An explanation (i.e. rationale) consists of a subset of the input and is considered faithful when it accurately shows the reasoning behind a model's prediction (Zaidan et al., 2007; Ribeiro et al., 2016a; DeYoung et al., 2019; Jacovi and Goldberg, 2020). For example, removing a faithful rationale from the input should result in a prediction change. Two widely used methods for extracting rationales are (i) feature attribution methods, which produce a distribution over the input tokens indicating their contribution (i.e. importance) to the model's prediction (Ribeiro et al., 2016b; Wiegreffe and Pinter, 2019); and (ii) select-then-predict models, which consist of two components, a rationale extractor and a predictor. The rationale extractor extracts rationales, and the predictor is trained on the extracted rationales so that its predictions are inherently faithful (Lei et al., 2016; Jain et al., 2020).
Previous work has focused on evaluating explanation faithfulness in synchronous settings, where the training and testing data come from the same temporal distribution (Serrano and Smith, 2019a; Jain and Wallace, 2019; Atanasova et al., 2020; Guerreiro and Martins, 2021), or in out-of-domain settings (Chrysostomou and Aletras, 2022a), where the training and testing data come from different domains regardless of temporal drift in the testing data. However, human languages evolve (Weinreich et al., 1968; Kim et al., 2014; Carrier, 2019), as manifested by novel usages developed for existing words (e.g. mouse as a mammal or a computer accessory) and by new words and topics that appear over time (e.g. covidiot during the COVID-19 pandemic). Language evolution leads to temporal concept drift and a diachronic degradation of model performance in many NLP tasks when these are evaluated in asynchronous settings, i.e. when training and testing data come from different time periods (Jaidka et al., 2018; Agarwal and Nenkova, 2021; Lazaridou et al., 2021; Søgaard et al., 2021; Chalkidis and Søgaard, 2022).
In this paper, for the first time, we extensively analyze the impact of temporal concept drift on model explanations. We evaluate the faithfulness of rationales extracted using eight feature attribution approaches and three select-then-predict models over six text classification tasks with chronological data splits. Our contributions are as follows:
• We find that faithfulness is not consistent under temporal concept drift for rationales extracted with feature attribution methods (e.g. it decreases or increases depending on the method), with an attention-based method demonstrating the most robust faithfulness scores across datasets;

• We empirically show that select-then-predict models can be used in asynchronous settings when they achieve predictive performance comparable to the full-text model;

• We demonstrate that sufficiency is not a trustworthy evaluation metric for explanation faithfulness, in either synchronous or asynchronous settings.
2 Related Work
2.1 Temporal Concept Drift in NLP
Temporal model deterioration describes the decline in system performance when a system is evaluated on chronologically newer data (Jaidka et al., 2018; Gorman and Bedrick, 2019). This has been linked to changes in the data distribution, also known as concept drift in early studies (Schlimmer and Granger, 1986; Widmer and Kubat, 1993). Previous work has demonstrated the impact of temporal concept drift on model performance by assessing temporal generalization (Lazaridou et al., 2021; Søgaard et al., 2021; Agarwal and Nenkova, 2021; Röttger and Pierrehumbert, 2021). Søgaard et al. (2021) studied several factors that affect the true difference in system performance, such as temporal drift, variations in text length and adversarial data distributions. They found that temporal variation is the most important factor for performance degradation and suggest including chronological data splits in model evaluation. Chalkidis and Søgaard (2022) also noted that evaluating on random splits with the same temporal distribution as the training data consistently over-estimates model performance at test time in multi-label classification problems.
Previous work on mitigating temporal concept drift includes automatically identifying the semantic drift of words over time (Tsakalidis et al., 2019; Giulianelli et al., 2020; Rosin and Radinsky, 2022; Montariol et al., 2021). Efforts have also been made to mitigate the impact of temporal concept drift on model prediction performance (Lukes and Søgaard, 2018; Röttger and Pierrehumbert, 2021; Loureiro et al., 2022; Chalkidis and Søgaard, 2022) and to develop time-aware models (Dhingra et al., 2022; Rijhwani and Preotiuc-Pietro, 2020; Dhingra et al., 2021; Rosin and Radinsky, 2022). For example, both Röttger and Pierrehumbert (2021) and Loureiro et al. (2022) observed performance improvements when continuing to fine-tune their models on chronologically newer data. While the impact of temporal concept drift on model performance has received particular attention, to the best of our knowledge, no previous work has examined its impact on model explanations.
2.2 Concept Drift and Model Explanations
Poerner et al. (2018) compared explanation quality between tasks that contain short and long textual contexts. More recently, Chrysostomou and Aletras (2022a) studied model explanations in out-of-domain settings (i.e. under concept drift), using training and test data from different domains. Their results showed that the faithfulness of out-of-domain explanations unexpectedly increases, i.e. it outperforms the faithfulness of in-domain explanations. This is interesting given that performance degradation due to concept drift is often expected in domain adaptation (Schlimmer and Granger, 1986; Widmer and Kubat, 1993; Chan and Ng, 2006; Gama et al., 2014).
3 Extracting Explanations
We extract explanations using two standard approaches: (i) post-hoc methods; and (ii) select-then-predict models.
3.1 Post-hoc Explanation Methods
For post-hoc explanations, we fine-tune a BERT-base model on the synchronous training set of each task and extract explanations using post-hoc feature attribution methods for all synchronous and asynchronous test sets. We use eight widely used feature attribution methods, following Chrysostomou and Aletras (2021a,b):
• Attention (α): Token importance is computed using the corresponding normalized attention scores (Jain et al., 2020).

• Scaled attention (α∇α): Attention scores scaled by their corresponding gradients (Serrano and Smith, 2019a).

• InputXGrad (x∇x): Attributes importance by multiplying the input with its gradient, computed with respect to the predicted class (Kindermans et al., 2016; Atanasova et al., 2020).

• Integrated Gradients (IG): Ranks input tokens by computing the integral of the gradients taken along a straight path from a baseline input (zero embedding vector) to the original input (Sundararajan et al., 2017).

• GradientSHAP (Gsp): A gradient-based method to compute SHapley Additive exPlanations (SHAP) values for assigning token importance (Lundberg and Lee, 2017). Gsp computes the gradient of the outputs with respect to randomly selected points between the inputs and a baseline distribution.

• LIME: Ranks input tokens by learning a linear surrogate model using data points randomly sampled locally around the prediction (Ribeiro et al., 2016b).

• DeepLift (DL): Computes token importance according to the difference between the activation of each neuron and a reference activation (i.e. zero embedding vector) (Shrikumar et al., 2017).

• DeepLiftSHAP (DLsp): Similar to Gsp, DLsp computes the expected value of DL attributions across all input-baseline pairs, considering a baseline distribution (Lundberg and Lee, 2017).
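To make the extraction concrete, the sketch below computes two of these attributions with the Captum library on a BERT classifier, assumed to be fine-tuned as described above; the checkpoint name, example input and forward wrapper are illustrative assumptions, not the paper's code.

```python
import torch
from captum.attr import InputXGradient, IntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

enc = tok("the movie was surprisingly good", return_tensors="pt")
embeds = model.bert.embeddings(enc["input_ids"])    # start from the embedding layer
with torch.no_grad():
    target = model(**enc).logits.argmax(-1).item()  # explain the predicted class

def forward_fn(inputs_embeds, attention_mask):
    # Forward pass from embeddings so gradients reach the per-token inputs.
    return model(inputs_embeds=inputs_embeds, attention_mask=attention_mask).logits

# InputXGrad (x * grad): summing over the hidden dimension gives one score per token.
ixg_scores = InputXGradient(forward_fn).attribute(
    embeds, target=target, additional_forward_args=(enc["attention_mask"],)
).sum(-1)

# Integrated Gradients with a zero-embedding baseline, as described above.
ig_scores = IntegratedGradients(forward_fn).attribute(
    embeds, baselines=torch.zeros_like(embeds), target=target,
    additional_forward_args=(enc["attention_mask"],)
).sum(-1)
```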
3.2 Select-then-predict Models
We also use three state-of-the-art select-then-
predict models. Two are trained end-to-end (Bast-
ings et al.,2019;Guerreiro and Martins,2021)
while the other one uses a feature attribution
method as the rationale extractor (Jain et al.,2020)
with a separate predictor component, trained on the
extracted rationales.
HardKUMA: Bastings et al. (2019) proposed a modified version of the end-to-end rationale extraction model introduced by Lei et al. (2016). Choosing rationales in a binary fashion by sampling from a Bernoulli distribution is replaced with sampling from a Kumaraswamy distribution (Kumaraswamy, 1980), which supports continuous random variables. This way, the model is differentiable and easier to train.
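For intuition, here is a minimal sketch of the reparameterized sampling behind HardKUMA: a Kumaraswamy variable is drawn via its inverse CDF, then stretched and rectified so that exact 0/1 gates receive non-zero probability. The stretch bounds below are illustrative defaults, not necessarily the paper's exact configuration.

```python
import torch

def hard_kuma_sample(a, b, l=-0.1, r=1.1, eps=1e-6):
    """Differentiable sample from a stretched-and-rectified Kumaraswamy(a, b)."""
    u = torch.rand_like(a).clamp(eps, 1 - eps)  # u ~ Uniform(0, 1)
    k = (1 - (1 - u) ** (1 / b)) ** (1 / a)     # inverse CDF of Kumaraswamy(a, b)
    z = l + (r - l) * k                         # stretch the support beyond [0, 1]
    return z.clamp(0.0, 1.0)                    # rectify; gradient is zero where clamped

# One soft "keep this token?" gate per position; in HardKUMA the parameters
# a and b are produced by the rationale extractor, so gradients flow through z.
a = torch.rand(1, 8).clamp(min=0.1).requires_grad_()
b = torch.rand(1, 8).clamp(min=0.1).requires_grad_()
z = hard_kuma_sample(a, b)
```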
SPECTRA: HardKUMA provides stochastic rationales due to the marginalization over all possible rationales and the sampling process. Guerreiro and Martins (2021) proposed SPECTRA, a model that uses LP-SparseMAP (Niculae and Martins, 2020) to obtain a deterministic rationale extraction process. Niculae and Martins (2020) experimented with three different factor graphs, showing that XorAtMostOne outperforms the other two (i.e. Budget and AtMostOne). We use SPECTRA with XorAtMostOne in our experiments. For HardKUMA and SPECTRA, we use a Bi-LSTM (Hochreiter and Schmidhuber, 1997) because it has been shown to outperform BERT-based models (Guerreiro and Martins, 2021).
FRESH: Jain et al. (2020) proposed FRESH, a model that first extracts rationales from a trained model (e.g. using a feature attribution method) and subsequently trains a classifier on the extracted rationales. We extract the top 20% of tokens as rationales using α∇α, which achieved the best performance in early experimentation. We also use BERT-base for the extractor and predictor components, following Jain et al. (2020).
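A minimal sketch of this two-step pipeline is shown below. The helper `score_tokens` stands in for any trained attribution extractor (e.g. α∇α) and, like `train_set`, is an assumption for illustration rather than the paper's implementation.

```python
import math

def extract_rationale(tokens, scores, ratio=0.2):
    """Keep the top `ratio` of tokens by attribution score, in original order."""
    k = max(1, math.ceil(ratio * len(tokens)))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

# Step 1: build a rationale-only training set from attribution scores.
rationale_train = [
    (" ".join(extract_rationale(tokens, score_tokens(tokens))), label)
    for tokens, label in train_set
]
# Step 2: fine-tune a fresh BERT-base classifier on `rationale_train` only;
# since the predictor never sees the rest of the input, its predictions are
# faithful to the extracted rationale by construction.
```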
4 Experimental Setup
4.1 Tasks and Data
Tasks: We evaluate all methods on three diverse text classification tasks across six different datasets: (1) topic classification; (2) misinformation detection; and (3) sentiment analysis:
• AGNews: Topic classification across four topics (Business, Sports, Science/Technology and World) from AG News (Del Corso et al., 2005);

• X-FACT: Factual correctness classification of short statements into five classes (Gupta and Srikumar, 2021): True, Mostly-True, Partly-True, Mostly-False and False;

• FactCheck: Binary classification of potential misinformation stories as truthful or misinformation (Jiang and Wilson, 2021);
Figure 1: Density curves of the time distribution across temporal splits and the original full-size dataset for each task.
Task        #Classes  Split      Start Date  End Date    Span (Days)  #Data
AGNews          4     Train      2004-08-18  2006-12-20      854        9358
                      Syn Test   2004-08-18  2006-12-20      854        9358
                      Asy1 Test  2007-01-30  2007-12-31      335        9358
                      Asy2 Test  2008-01-01  2008-02-20       50        9358
X-FACT          6     Train      1995-04-01  2016-08-31     7823        7232
                      Syn Test   2007-01-04  2016-08-31     3527        1204
                      Asy1 Test  2016-08-31  2017-09-30      395        1205
                      Asy2 Test  2017-09-30  2018-11-12      408        1204
FactCheck       2     Train      1995-09-25  2019-05-01     8619        7446
                      Syn Test   1996-08-02  2019-05-01     8307        1241
                      Asy1 Test  2019-05-02  2020-05-15      379        1368
                      Asy2 Test  2020-05-15  2021-07-19      430        1368
AmazDigiMu      3     Train      1998-08-21  2016-05-07     6469      101774
                      Syn Test   1998-12-20  2016-05-07     6351       16963
                      Asy1 Test  2016-05-07  2016-12-30      237       16962
                      Asy2 Test  2016-12-30  2018-09-26      635       16962
AmazPantry      3     Train      2006-04-28  2017-07-30     4111       82566
                      Syn Test   2006-12-22  2017-07-30     3873       13762
                      Asy1 Test  2017-07-30  2018-01-21      175       13761
                      Asy2 Test  2018-01-21  2018-10-04      256       13761
Yelp            5     Train      2005-02-16  2018-12-31     5066        8540
                      Syn Test   2005-02-16  2018-12-24     5059        1708
                      Asy1 Test  2019-01-01  2020-12-31      730        1708
                      Asy2 Test  2021-01-01  2022-01-19      383        1708

Table 1: Data statistics and temporal splits for each task.
• Amazon Reviews: Sentiment prediction (negative, neutral, positive) for Amazon product reviews from the digital music (AmazDigiMu) and pantry (AmazPantry) categories, following Ni et al. (2019);

• Yelp: Multi-class sentiment classification of reviews into five classes, following Zhang et al. (2015).
Data Splits: To simulate temporal concept drift, we create chronological splits according to the timestamps of the data points in each dataset. We split each dataset into a training set and three different test sets. The time spans of the three test sets follow chronological order without any overlap. The test set with the earliest time span (Syn) has the exact same time span as the training data (i.e. a synchronous setting). The other two splits, denoted Asy1 and Asy2, are chronologically newer and correspond to asynchronous settings. Figure 1 shows the temporal distribution of each data split compared to the original data. Table 1 summarizes the key statistics for each split. More details on the data and tasks can be found in Appendix A. We also provide results of all models on the original (synchronous) test set (OSyn).
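A minimal sketch of how such chronological splits can be constructed follows, assuming a pandas DataFrame `df` with a `date` column; the cut-off date, sampling fraction and half-split of the newer data are illustrative choices, not the exact procedure behind Table 1.

```python
import pandas as pd

df["date"] = pd.to_datetime(df["date"])
cutoff = pd.Timestamp("2016-08-31")  # illustrative training-span boundary

in_span = df[df["date"] <= cutoff]                   # same time span as training
train = in_span.sample(frac=0.8, random_state=42)
syn_test = in_span.drop(train.index)                 # Syn: synchronous test set

newer = df[df["date"] > cutoff].sort_values("date")  # strictly newer data
half = len(newer) // 2
asy1_test = newer.iloc[:half]                        # Asy1: chronologically newer
asy2_test = newer.iloc[half:]                        # Asy2: newest, non-overlapping span
```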
4.2 Evaluation
For each task, we train a model on the training set and then evaluate post-hoc explanations and select-then-predict performance on our three chronological splits, namely Syn, Asy1 and Asy2.
Post-hoc Explanations: We evaluate the faithfulness of post-hoc explanations using two popular metrics (DeYoung et al., 2019; Carton et al., 2020):
• Normalized Sufficiency quantifies how sufficient a rationale R is for making the same prediction p(ŷ|R) as the prediction of the full-text model p(ŷ|x). We use the normalized version to allow a fairer comparison across models and tasks:

$\mathrm{Suff}(x, \hat{y}, R) = 1 - \max(0,\, p(\hat{y}|x) - p(\hat{y}|R))$

$\mathrm{NormSuff}(x, \hat{y}, R) = \dfrac{\mathrm{Suff}(x, \hat{y}, R) - \mathrm{Suff}(x, \hat{y}, 0)}{1 - \mathrm{Suff}(x, \hat{y}, 0)} \quad (1)$

where $\mathrm{Suff}(x, \hat{y}, 0)$ denotes the sufficiency of the empty (null) rationale.
• Normalized Comprehensiveness assesses how much information the rationale holds by measuring the change in prediction when masking the rationale, p(ŷ|x\R). Similar to sufficiency, we use the normalized version:

$\mathrm{Comp}(x, \hat{y}, R) = \max(0,\, p(\hat{y}|x) - p(\hat{y}|x \setminus R))$

$\mathrm{NormComp}(x, \hat{y}, R) = \dfrac{\mathrm{Comp}(x, \hat{y}, R)}{1 - \mathrm{Suff}(x, \hat{y}, 0)} \quad (2)$
Further, we evaluate explanations of different lengths (top 2%, 10%, 20% and 50% of tokens extracted) and report the "Area Over the Perturbation Curve" (AOPC), i.e. each metric averaged over these rationale lengths.
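A minimal sketch of these faithfulness computations follows, assuming a helper `predict_prob(tokens)` that returns p(ŷ|·) for the class predicted on the full text; all names are illustrative, not the paper's implementation.

```python
import math

def normalized_metrics(predict_prob, tokens, keep_mask):
    """Eqs. (1)-(2): normalized sufficiency and comprehensiveness of a rationale."""
    p_full = predict_prob(tokens)                                 # p(y|x)
    rationale = [t for t, m in zip(tokens, keep_mask) if m]       # R
    remainder = [t for t, m in zip(tokens, keep_mask) if not m]   # x \ R

    suff = 1 - max(0.0, p_full - predict_prob(rationale))         # Suff(x, y, R)
    suff0 = 1 - max(0.0, p_full - predict_prob([]))               # Suff(x, y, 0): empty rationale
    norm_suff = (suff - suff0) / max(1e-8, 1 - suff0)             # Eq. (1)

    comp = max(0.0, p_full - predict_prob(remainder))             # Comp(x, y, R)
    norm_comp = comp / max(1e-8, 1 - suff0)                       # Eq. (2)
    return norm_suff, norm_comp

def aopc(predict_prob, tokens, scores, ratios=(0.02, 0.10, 0.20, 0.50)):
    """Average each metric over rationales of increasing length."""
    per_length = []
    for r in ratios:
        k = max(1, math.ceil(r * len(tokens)))
        top = set(sorted(range(len(tokens)), key=lambda i: scores[i],
                         reverse=True)[:k])
        keep_mask = [i in top for i in range(len(tokens))]
        per_length.append(normalized_metrics(predict_prob, tokens, keep_mask))
    n = len(per_length)
    return (sum(s for s, _ in per_length) / n,   # AOPC normalized sufficiency
            sum(c for _, c in per_length) / n)   # AOPC normalized comprehensiveness
```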