Investigating the Failure Modes of the AUC metric and Exploring
Alternatives for Evaluating Systems in Safety Critical Applications
Swaroop Mishra Anjana Arunkumar Chitta Baral
Arizona State University
Abstract
With the increasing importance of safety requirements associated with the use of black-box models, evaluating the selective answering capability of models has become critical. Area under the curve (AUC) is used as a metric for this purpose. We find limitations in AUC; e.g., a model with a higher AUC is not always better at selective answering. We propose three alternative metrics that fix the identified limitations. Experimenting with ten models, our results under the new metrics show that newer and larger pre-trained models do not necessarily perform better at selective answering. We hope our insights will help develop better models tailored for safety-critical applications.
1 Introduction
Humans have the capability to gauge how much they know; this leads them to abstain from answering whenever they are not confident about an answer. Such a selective answering capability (Kamath et al., 2020; Varshney et al., 2020; Garg and Moschitti, 2021) is also essential for machine learning systems, especially in safety-critical applications like healthcare, where incorrect answering can result in critically negative consequences.
Area under the curve (AUC) has been used as a metric to evaluate models on their selective answering capability. Computing AUC involves finding model coverage and accuracy at various confidence values. MaxProb, the maximum softmax probability of a model's prediction, has been used as a strong baseline for deciding whether to answer a question or abstain (Hendrycks and Gimpel, 2016a; Lakshminarayanan et al., 2017; Varshney et al., 2022). However, if Model A has a higher AUC than Model B, can we always say that A is better at selective answering, and thus better suited for safety-critical applications, than B?
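To make this concrete, the following is a minimal sketch (our illustration, not the authors' released code) of MaxProb-based selective answering and the coverage-accuracy AUC; `probs` and `labels` stand in for a model's softmax outputs and the gold labels.

```python
import numpy as np

def coverage_accuracy_curve(probs, labels):
    """Sweep the MaxProb answering threshold: answer the k most
    confident questions for k = 1..n, yielding coverage (fraction
    answered) and accuracy (fraction of answered that are correct)."""
    maxprob = probs.max(axis=1)               # confidence of predicted class
    preds = probs.argmax(axis=1)
    order = np.argsort(-maxprob)              # most confident first
    correct = (preds[order] == labels[order]).astype(float)
    n = len(labels)
    coverage = np.arange(1, n + 1) / n
    accuracy = np.cumsum(correct) / np.arange(1, n + 1)
    return coverage, accuracy, maxprob[order]

def auc(coverage, accuracy):
    """Area under the coverage-accuracy curve (trapezoidal rule)."""
    return np.trapz(accuracy, coverage)

# Toy usage with fake softmax outputs; with real models, A can have a
# higher AUC than B yet be worse in the low-coverage, high-accuracy
# region where safety-critical systems actually operate.
rng = np.random.default_rng(0)
probs = rng.dirichlet([1.0, 1.0], size=1000)
labels = rng.integers(0, 2, size=1000)
cov, acc, _ = coverage_accuracy_curve(probs, labels)
print(f"AUC = {auc(cov, acc):.3f}")
```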
We experiment across ten models, ranging from bag-of-words models to pre-trained transformers, and find that a model with a higher AUC does not necessarily have a higher selective answering capability. We adversarially attack AUC and find several limitations. These limitations prevent AUC from reliably evaluating the efficacy of models in safety-critical applications.
In pursuit of fixing these limitations, we propose an evaluation metric, the 'DiSCA Score' (Deployment in Safety Critical Applications). Disaster management is a high-impact and regularly occurring safety-critical application. Machine learning systems in disaster management will often be deployed through hand-held devices, such as smartphones, and hence require light-weight models. We therefore incorporate computation requirements into the DiSCA Score and propose the 'DiDMA Score' (Deployment in Disaster Management Applications).
NLP applications in disaster management often involve interaction with users (Phengsuwan et al., 2021; Mishra et al., 2022a) whose questions may diverge from the training set due to the rich variation in natural language. We therefore further add the evaluation of abstaining capability on Out-of-Distribution (OOD) datasets to the metric and propose the 'NiDMA Score' (NLP in Disaster Management Applications). In summary, (i) NiDMA covers interactive NLP applications, (ii) DiDMA is specifically tailored to disaster management but can also cover non-interactive NLP applications where OOD data is infrequent, and (iii) DiSCA can be used in any safety-critical domain.
Our analysis across ten models sheds light on the strengths and weaknesses of various language models. For example, we observe that newer and larger pre-trained models are not necessarily better at selective answering. We also observe that model rankings based on the accuracy metric do not match rankings based on selective answering capability, similar to the observations of Mishra and Arunkumar (2021). We hope our insights will bring more attention to developing better models and evaluation metrics for safety-critical applications.

Table 1: In each case, AUC ranks Model A as having higher performance than Model B. However, B has the higher selective answering capability, as explained in Section 2.1. X axis: coverage; Y axis: accuracy.
2 Investigating AUC
Experiment Details: Our experiments span ten different models: Bag-of-Words Sum (BOW-SUM) (Harris, 1954), Word2Vec Sum (W2V-SUM) (Mikolov et al., 2013), GloVe Sum (GLOVE-SUM) (Pennington et al., 2014), Word2Vec CNN (W2V-CNN) (LeCun et al., 1995), GloVe CNN (GLOVE-CNN), Word2Vec LSTM (W2V-LSTM) (Hochreiter and Schmidhuber, 1997), GloVe LSTM (GLOVE-LSTM), BERT Base (BERT-BASE) (Devlin et al., 2018), BERT Large (BERT-LARGE) with GELU (Hendrycks and Gimpel, 2016b), and RoBERTa Large (ROBERTA-LARGE) (Liu et al., 2019), in line with recent work (Hendrycks et al., 2020; Mishra et al., 2020; Mishra and Arunkumar, 2021). We analyze these models on two movie review datasets: (i) SST-2 (Socher et al., 2013), which contains short expert movie reviews, and (ii) IMDb (Maas et al., 2011), which consists of full-length inexpert movie reviews. We train models on IMDb and evaluate on both SST-2 and IMDb, ensuring both IID (IMDb test set) and OOD (SST-2 test set) evaluation.
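As a concrete sketch of this setup (the paper does not specify its data-loading code), the splits could be assembled with the Hugging Face datasets library; the hub identifiers `imdb` and `glue/sst2`, and the use of the SST-2 validation split (GLUE test labels are hidden), are our assumptions.

```python
from datasets import load_dataset

# Training data: full-length IMDb reviews; all models train here only.
imdb = load_dataset("imdb")
train_texts = imdb["train"]["text"]
train_labels = imdb["train"]["label"]

# IID evaluation: the IMDb test split (same distribution as training).
iid_texts = imdb["test"]["text"]
iid_labels = imdb["test"]["label"]

# OOD evaluation: short expert reviews from SST-2. We use the GLUE
# validation split because GLUE test labels are not public.
sst2 = load_dataset("glue", "sst2")
ood_texts = sst2["validation"]["sentence"]
ood_labels = sst2["validation"]["label"]
```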
2.1 Adversarial Attack on AUC

AUC Tail: Consider a case where A has a higher overall AUC than B, but lower accuracy in the region of higher accuracy and lower coverage (Table 1, Case 1). Then B is better, because safety-critical applications have a low tolerance for incorrect answering and will most often operate in the high-accuracy region. This is seen with the AUCs of BERT-BASE (A) and BERT-LARGE (B) on the SST-2 dataset.¹
Curve Fluctuations: Another case is when accuracy does not vary in a monotonically decreasing fashion. Even though such a model may have a higher AUC, a non-monotonically decreasing curve shows that confidence and correctness are not correlated, making the corresponding model comparatively undesirable (Table 1, Cases 2-6). This is seen across all models on both datasets, especially in regions of low coverage. The fluctuations on the OOD dataset are more frequent and of higher magnitude for most models.¹
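A minimal sketch of how such fluctuations could be counted, based on our reading of the definitions introduced later in Section 3.1; the inputs are assumed to come from a coverage sweep like the one sketched in the introduction.

```python
def count_fluctuations(accuracy, maxprob_sorted):
    """Record every point where accuracy *rises* as coverage grows,
    i.e., where confidence and correctness disagree. `accuracy` is
    the running accuracy of a coverage sweep (most confident answers
    first) and `maxprob_sorted` holds the matching confidences. Each
    fluctuation stores accuracies d1 < d2 and maxprobs c1 >= c2,
    anticipating the DiSCA definitions in Section 3.1."""
    fluctuations = []
    for i in range(1, len(accuracy)):
        if accuracy[i] > accuracy[i - 1]:
            fluctuations.append({
                "d1": accuracy[i - 1], "d2": accuracy[i],
                "c1": maxprob_sorted[i - 1], "c2": maxprob_sorted[i],
            })
    return fluctuations
```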
Plateau Formation: The range of maximum softmax probability values that models assign to predictions varies. For example, we see that LSTM models have a wide maxprob range, while transformer models (BERT-LARGE, ROBERTA-LARGE) answer all questions with high maxprob values. In the latter case, model accuracy stays above 90% in regions of high coverage; this limited accuracy range forms a plateau in the AUC curve. The plateau is not indicative of model performance, as the maxprob values of incorrect answers in this region are high and relatively unvarying compared to other models; it is therefore undesirable. Such a plateau also makes it difficult to decide which portion of the AUC curve to ignore when finding an operating point for deployment in disaster management applications (where the tolerance for incorrect answering is low). Plateau formation is acceptable when models answer with low maxprob, either always or in regions of high coverage, irrespective of the level of accuracy (though the range should be limited). However, this acceptable curve condition is not observed in any of the models examined, on either dataset.

¹ See Supplementary Material (DiSCA, NiDMA) for more details.
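A rough illustration of how such an undesirable plateau might be flagged automatically; the thresholds `cov_from`, `spread_eps`, and `high_conf` are hypothetical choices of ours, not values from the paper.

```python
import numpy as np

def flags_plateau(maxprob_sorted, coverage,
                  cov_from=0.5, spread_eps=0.05, high_conf=0.9):
    """Flag the undesirable plateau described above: over the
    high-coverage region, maxprob values are uniformly high (a
    narrow spread close to 1.0), so confidence carries little
    information for choosing an abstention threshold."""
    region = np.asarray(coverage) >= cov_from
    mp = np.asarray(maxprob_sorted)[region]
    return (mp.max() - mp.min()) < spread_eps and mp.min() > high_conf
```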
3 Alternative Metrics
3.1 DiSCA Score
Let a be the maxprob value at which accuracy first drops below 100%. Let b be the lower-bound maxprob value which, when used as a cutoff for answering questions, results in the worst accuracy admissible by the tolerance level of the domain. Let n be the number of times the slope of the curve increases with increasing coverage, and let d1, d2 and c1, c2 respectively denote the accuracy and maxprob values at the two points where each such increase in slope occurs, such that d1 < d2 and c1 >= c2. Let x, y, z be weights that flexibly define the region of interest depending on application requirements, such that x + y + z = 1. b is a hyperparameter; we experiment with a range of b values, where the worst possible accuracy varies from 100% to 50%.
The DiSCA Score is defined as:

$$\text{DiSCA} = x \cdot a + y \cdot b - z \cdot \frac{\sum_{i=1}^{n} (n-i+1) \cdot \frac{d_2 - d_1}{c}}{\sum_{i=1}^{n} (n-i+1)} \qquad (1)$$

where

$$c = \begin{cases} c_1 - c_2 & \text{if } (c_1 - c_2) > 0.001 \\ 0.001 & \text{otherwise} \end{cases}$$
Our definition of c, with its 0.001 floor, is based on the observation that differences in maxprob values are of very low order for CNN and transformer models.
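A sketch of Equation (1) in code, under our reading of the definitions above; in particular, indexing the fluctuations so that i = 1 is the highest-confidence fluctuation (receiving the largest weight n - i + 1), and evaluating d1, d2, c1, c2 per fluctuation, are our assumptions.

```python
def disca(a, b, fluctuations, x=0.33, y=0.33, z=0.33):
    """DiSCA Score (Eq. 1). `fluctuations` is a list of dicts with
    keys d1, d2, c1, c2, one per point where the curve's slope
    increases with coverage (e.g., from count_fluctuations above),
    assumed ordered from highest to lowest confidence."""
    n = len(fluctuations)
    if n == 0:
        return x * a + y * b                  # no fluctuation penalty
    num, den = 0.0, 0.0
    for i, f in enumerate(fluctuations, start=1):
        c = f["c1"] - f["c2"]
        c = c if c > 0.001 else 0.001         # floor for near-equal maxprobs
        w = n - i + 1                         # earlier fluctuations weigh more
        num += w * (f["d2"] - f["d1"]) / c
        den += w
    return x * a + y * b - z * num / den
```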
3.1.1 Observations

First Term: From Figure 2(A), we see that the a values are lower for the BERT-BASE, W2V-LSTM, and ROBERTA-LARGE models, indicating that their first incorrect classification occurs at lower maxprob values than in other models. Based on Equation 1, BERT-BASE, W2V-LSTM, and ROBERTA-LARGE are therefore the top three models by the first term of the DiSCA Score.

Figure 2: a values for (A) the IMDb and (B) the SST-2 datasets. IMDb, the IID dataset, is used in the DiSCA Score, while SST-2, the OOD dataset, is used in the NiDMA Score.
Second Term: Figure 1(A) illustrates the b values of the various models over a range of worst possible accuracies. For ROBERTA-LARGE, when accuracy drops below 95%, the maxprob value is 0.53; it is above 0.8 for the other models. This shows that ROBERTA-LARGE is better than the other models for the 90-95% accuracy bin. GLOVE-SUM is relatively better overall than the other models, as its maxprob is a better indicator of accuracy (seen from the sharper decrease in the figure). BOW-SUM is the worst model, as a significant number of samples with the highest confidence values are classified incorrectly, causing b for all accuracy bins to be extremely high and relatively uniform.
Third Term: The number of times accuracy increases with increasing coverage is highest for the BOW-SUM model, making it the worst model on this term (Figure 3). Word-averaging models (W2V-SUM, BOW-SUM, and GLOVE-SUM) have a higher number of fluctuations on average; other models have near-zero fluctuations, mostly occurring at the highest-maxprob samples.¹ The magnitude-to-number ratio of fluctuations in ROBERTA-LARGE is high in comparison to other models.
Figure 1: b values for (A) the IMDb and (B) the SST-2 datasets. Values from 1 to 0.5 on the x-axis represent the accuracy thresholds considered, while 'Lowest Maxprob' represents the lowest maxprob value a model assigns at 100% coverage.
MODEL <0.95 <0.90 <0.85 <0.80 <0.75 <0.70 <0.65 <0.60 <0.55
BERT-BASE 0.69 0.69 0.69 0.69 0.69 0.69 0.69 0.69 0.69
W2V-LSTM 0.63 0.65 0.65 0.65 0.65 0.65 0.65 0.65 0.65
GLOVE-CNN 0.54 0.54 0.54 0.54 0.54 0.54 0.54 0.54 0.54
GLOVE-SUM 0.52 0.54 0.56 0.56 0.56 0.56 0.56 0.56 0.56
BERT-LARGE 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48
ROBERTA-LARGE 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45
GLOVE-LSTM 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37
W2V-SUM 0.35 0.37 0.39 0.39 0.39 0.39 0.39 0.39 0.39
W2V-CNN 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36
BOW-SUM -6.71 -6.71 -6.71 -6.71 -6.71 -6.71 -6.71 -6.71 -6.71
MODEL <0.95 <0.90 <0.85 <0.80 <0.75 <0.70 <0.65 <0.60 <0.55
W2V-SUM 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71
GLOVE-LSTM 0.65 0.67 0.70 0.70 0.70 0.70 0.70 0.70 0.70
GLOVE-SUM 0.60 0.61 0.63 0.65 0.68 0.68 0.68 0.68 0.68
BERT-LARGE 0.60 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62
BERT-BASE 0.43 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45
ROBERTA-LARGE 0.22 0.24 0.24 0.24 0.24 0.24 0.24 0.24 0.24
W2V-CNN -0.68 -0.66 -0.64 -0.64 -0.64 -0.64 -0.64 -0.64 -0.64
BOW-SUM -0.71 -0.69 -0.68 -0.66 -0.63 -0.63 -0.63 -0.63 -0.63
GLOVE-CNN -2.43 -2.43 -2.42 -2.42 -2.42 -2.42 -2.42 -2.42 -2.42
W2V-LSTM -3.53 -3.53 -3.53 -3.53 -3.53 -3.53 -3.53 -3.53 -3.53
Table 2: (a) LHS: DiSCA (IID) on IMDb; (b) RHS: DiSCA (OOD) on SST-2; x = y = z = 0.33. The column header indicates the accuracy threshold used to calculate b. Green: model accuracy falls below the threshold indicated by the column head but stays above the next column's threshold. Yellow: model accuracy also falls below the next column's threshold. White: the model's overall accuracy is higher than that column's threshold and therefore never falls below it.
Overall Ranking: In Table 2(a), BERT-BASE is ranked highest, as it has no fluctuation penalty and also has a lower maxprob value a at the point where accuracy first drops below 100%. Since the magnitude-to-number fluctuation ratios of BERT-LARGE and ROBERTA-LARGE are higher than those of W2V-LSTM, GLOVE-CNN, and GLOVE-SUM, the former are ranked lower. Despite GLOVE-SUM having the overall best maxprob-accuracy correlation, its higher fluctuation count lowers its score. BOW-SUM is the worst (and the only negatively scoring) model, in line with the observations made across all individual terms.
3.2 DiDMA Score

$$\text{DiDMA} = p \cdot \text{DiSCA} + q \cdot \text{Computation\_Score} \qquad (2)$$

where

$$\text{Computation\_Score} = \frac{1}{\text{Energy}}$$

The DiDMA Score is obtained by summing the weighted DiSCA Score (weight p) and Computation Score (weight q) based on application requirements, such that p + q = 1. In the Computation Score, models are scored on their energy usage: higher energy usage implies a lower Computation Score. We suggest using the equations of a recent work (Henderson et al., 2020) for the energy calculation.² Since energy usage is hardware dependent, we do not calculate the term here; it should be calculated based on the device of deployment in disaster management.
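A minimal sketch of Equation (2); the energy value is a placeholder that, per the paper, should be measured on the actual deployment device (e.g., using the tooling of Henderson et al., 2020).

```python
def didma(disca_score, energy, p=0.5, q=0.5):
    """DiDMA Score (Eq. 2): a weighted sum of DiSCA and a
    computation score defined as 1/Energy. `energy` must come
    from a hardware-specific measurement; the equal weights
    p = q = 0.5 are an illustrative choice, not the paper's."""
    assert abs(p + q - 1.0) < 1e-9, "weights must sum to 1"
    return p * disca_score + q * (1.0 / energy)
```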
3.3 NiDMA Score

$$\text{NiDMA} = u \cdot \text{DiDMA} + v \cdot \text{DiSCA(OOD)} \qquad (3)$$

The NiDMA Score is obtained by summing the weighted DiDMA Score (weight u) and the DiSCA Score on OOD data (weight v) based on application requirements, where u + v = 1. Figures 2(B), 1(B), and 3 (SST-2) illustrate the DiSCA Score on OOD datasets.
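And correspondingly for Equation (3), a small sketch combining the two scores; the equal weights are again illustrative.

```python
def nidma(didma_score, disca_ood_score, u=0.5, v=0.5):
    """NiDMA Score (Eq. 3): a weighted sum of DiDMA and the DiSCA
    Score computed on an OOD evaluation set (here, SST-2 when
    models are trained on IMDb)."""
    assert abs(u + v - 1.0) < 1e-9, "weights must sum to 1"
    return u * didma_score + v * disca_ood_score
```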
From Figure 2(B), we see that the first-term-based ranking of the DiSCA (OOD) Score is not preserved with respect to the DiSCA Score in Figure 2(A); the a values are lowest for the GLOVE-SUM and
² See Supplementary: DiDMA for more details.