Investigating the Failure Modes of the AUC metric and Exploring Alternatives for Evaluating Systems in Safety Critical Applications Swaroop Mishra Anjana Arunkumar Chitta Baral

2025-05-03 0 0 7.69MB 20 页 10玖币
侵权投诉
Investigating the Failure Modes of the AUC metric and Exploring
Alternatives for Evaluating Systems in Safety Critical Applications
Swaroop Mishra Anjana Arunkumar Chitta Baral
Arizona State University
Abstract
With the increasing importance of safety re-
quirements associated with the use of black
box models, evaluation of selective answering
capability of models has been critical. Area
under the curve (AUC) is used as a metric for
this purpose. We find limitations in AUC; e.g.,
a model having higher AUC is not always bet-
ter in performing selective answering. We pro-
pose three alternate metrics that fix the iden-
tified limitations. On experimenting with ten
models, our results using the new metrics show
that newer and larger pre-trained models do
not necessarily show better performance in se-
lective answering. We hope our insights will
help develop better models tailored for safety-
critical applications.
1 Introduction
Humans have the capability to gauge how much
they know; this leads them to abstain from
answering whenever they are not confident about
an answer. Such a selective answering capability
(Kamath et al.,2020;Varshney et al.,2020;
Garg and Moschitti,2021) is also essential for
machine learning systems, especially in the case of
safety critical applications like healthcare, where
incorrect answering can result in critically negative
consequences.
Area under the curve (AUC) has been used
as a metric to evaluate models based on their
selective answering capability. AUC involves
finding model coverage and accuracy, at various
confidence values. MaxProb– maximum softmax
probability of models’ predictions– has been used
as a strong baseline to decide whether to answer
a question or abstain (Hendrycks and Gimpel,
2016a;Lakshminarayanan et al.,2017;Varshney
et al.,2022). However, if Model Ahas higher AUC
than Model B, can we always say that Ais better
at selective answering and thus better suited for
safety critical applications than B?
We experiment across 10 models– ranging from
bag-of-words models to pre-trained transformers–
and find that a model having higher AUC does
not necessarily have a higher selective answering
capability. We adversarially attack AUC and find
several limitations. These limitations prevent AUC
from evaluating the efficacy of models in safety
critical applications.
In pursuit of fixing the limitations, we propose
an evaluation metric– ‘DiSCA Score’ i.e. Deploy-
ment in Safety Critical Applications. Disaster
management is a high-impact and regularly oc-
curring safety critical application. Deployment of
Machine Learning systems in disaster management
will often be through hand-held devices, such
as smartphones, hence requiring light-weight
models. Subsequently, we incorporate computation
requirements in ‘DiSCA Score’, and propose
‘DiDMA Score’ i.e. Deployment in Disaster
Management Applications.
NLP applications in disaster management often
involve interaction with users (Phengsuwan et al.,
2021;Mishra et al.,2022a) whose questions
may diverge from a training set, due to rich
variations in natural language. We therefore further
add the evaluation of abstaining capability on
Out-of-Distribution (OOD) datasets as part of the
metric and propose ‘NiDMA Score’ i.e. NLP in
Disaster Management Applications. Summarily,
(i) NiDMA covers interactive NLP applications,
(ii) DiDMA is specifically tailored to disaster
management but can involve non-interactive NLP
applications where OOD data is not frequent,
and (iii) DiSCA can be used in any safety critical
domain.
Our analysis across ten models sheds light on
arXiv:2210.04466v1 [cs.CL] 10 Oct 2022
Table 1: In each case, AUC ranks Model A as having higher performance than Model B. However, B has higher
selective answering capability, as explained in Section 2.1. X axis: Coverage and Y axis: Accuracy.
the strengths and weaknesses of various language
models. For example, we observe that newer and
larger pre-trained models are not necessarily better
in performing selective answering. We also ob-
serve that model ranking based on the accuracy
metric does not match with their ranking based on
selective answering capability, similar to the ob-
servations by Mishra and Arunkumar (2021). We
hope our insights will bring more attention to de-
veloping better models and evaluation metrics for
safety-critical applications.
2 Investigating AUC
Experiment Details:
Our experiments span over
ten different models– Bag-of-Words Sum (BOW-
SUM) (Harris,1954), Word2Vec Sum (W2V SUM)
(Mikolov et al.,2013), GloVe Sum (GLOVE
SUM) (Pennington et al.,2014), Word2Vec CNN
(W2V CNN) (LeCun et al.,1995), GloVe CNN
(GLOVE CNN), Word2Vec LSTM (W2V LSTM)
(Hochreiter and Schmidhuber,1997), GloVe LSTM
(GLOVE LSTM), BERT Base (BERT BASE) (De-
vlin et al.,2018), BERT Large (BERT LARGE)
with GELU (Hendrycks and Gimpel,2016b) and
RoBERTa Large ((ROBERTA LARGE) (Liu et al.,
2019), in line with recent works (Hendrycks et al.,
2020;Mishra et al.,2020;Mishra and Arunkumar,
2021). We analyze these models over two movie
review datasets– (i) SST-2 (Socher et al.,2013) that
contains short expert movie reviews and (ii) IMDb
(Maas et al.,2011) which consists of full-length
inexpert movie reviews. We train models on IMDb
and evaluate on both SST-2 and IMDb. Our intu-
ition behind this is to ensure both IID (IMDb test
set) and OOD (SST-2 test set) evaluation.
2.1 Adversarial Attack on AUC:
AUC Tail:
Consider a case where Ahas higher
overall AUC and lower accuracy than B, in regions
of higher accuracy and lower coverage (Table 1,
Case 1). Then, Bis better because safety critical
applications have a lower tolerance for incorrect
answering and so most often will operate in
the region of higher accuracy. This is seen in
the case of the AUCs of BERT-BASE (A) and
BERT-LARGE (B) for the SST-2 dataset 1.
Curve Fluctuations:
Another case is when
accuracy does not vary in a monotonically decreas-
ing fashion. Even though the model has higher
AUC, a non-monotonically decreasing curve shape
shows that confidence and correctness of answer
are not correlated, making the corresponding
model comparatively undesirable (Table 1, Cases
2-6)– this is seen across all models over both
datasets, especially at regions of low coverage. The
fluctuations for the OOD dataset are more frequent
and have higher magnitude for most models 1.
Plateau Formation:
The range of maximum
softmax probability values that models associate
with predictions varies. For example we see that
LSTM models have a wide maxprob range, while
transformer models (BERT-LARGE, ROBERTA-
LARGE) answer all questions with high maxprob
values. In the latter case, model accuracy stays
above values of 90% in regions of high coverage;
this limited accuracy range forms a plateau in
the AUC curve. This plateau is not indicative of
model performance as the maxprob values (of
incorrect answers) in this region are high and
relatively unvarying compared to other models; it
is therefore undesirable. Such a plateau also makes
it difficult to decide which portion of the AUC
curve to ignore and find an operating point, while
deploying in disaster management applications
(where the tolerance for incorrect answering is
low). Plateau formation is acceptable when models
answer with low maxprob, either always or in
regions of high coverage, irrespective of the level
of accuracy (though the range should be limited).
However, this acceptable curve condition is not
1
See Supplementary Material:DiSCA, NiDMA for more
details
observed in any of the models examined, over both
datasets.
3 Alternative Metrics
3.1 DiSCA Score
Let
a
be the maxprob value when accuracy first
drops below 100%. Let
b
be the lower bound max-
prob value which when used as a cutoff for an-
swering questions, results in an accuracy that is the
worst possible accuracy admissible by the tolerance
level of the domain. Let
n
represent the number of
times the slope of the curve is seen to increase with
increasing coverage, and
d1, d2
and
c1, c2
respec-
tively represent the accuracy and maxprob values
at the two points where this increase in slope oc-
curs, such that
d1ăd2
and
c1ą“ c2
. Let
x
,
y
,
z
represent weights that flexibly define the region
of interest depending on application requirements,
such that
x`y`z1
.
b
is a hyperparameter;
we experiment with a range of bvalues, where the
worst possible accuracy varies from 100% to 50%.
The DiSCA Score is defined as:
DiSCA x
a`y
b´z¨řn
ipn´i`1q ¨ pd2´d1q
c
řn
ipn´i`1q
(1)
where
c#pc1´c2qif pc1´c2q ą 0.001
0.001 otherwise
Our definition of
c
is based on the observation of
very low order differences in maxprob values for
CNN and transformer models.
3.1.1 Observations:
First Term:
From Figure 2(A), we see that the
a
values are lower for the BERT-BASE, W2V-LSTM
and RoBERTA-LARGE models, indicating
that the first incorrect classification occurs at
lower maxprob values than in other models.
Based on equation 1, BERT-BASE, W2V-LSTM
and RoBERTA-LARGE are the top-3 models
respectively based on the first term of DiSCA Score.
Figure 2: avalues for (A)IMDb and (B)SST-2 datasets.
IMDb, the IID dataset is used in DisCA Score, while
SST-2, the OOD dataset is used in NiDMA Score.
Second Term:
Figure 1(A) illustrates the
b
values of various models, over a range of worst
possible accuracies. In RoBERTA-LARGE, when
the accuracy drops below 95%, the maxprob
value is 0.53; it is above 0.8 for other models.
This shows that RoBERTA-LARGE is better
than other models for the 90-95% accuracy bin.
GLOVE-SUM is found to be relatively better
overall, than other models, as its maxprob is a
better indicator of accuracy (seen from sharper
decrease in the figure). BOW-SUM is the worst
model as a significant amount of samples with the
highest confidence values are classified incorrectly,
causing
b
for all accuracy bins to be extremely
high and relatively uniform.
Third Term:
The number of times accuracy
increases with increase in coverage is highest for
the BOW-SUM model, making it the worst model
(Figure 3). Word averaging models (W2V-SUM,
BOW-SUM, and GLOVE-SUM) are seen to
have a higher number of fluctuations on average;
other models have near-zero fluctuations, mostly
occurring at the highest maxprob samples
1
.
The magnitude-number ratio of fluctuations in
RoBERTA-LARGE is high, in comparison to other
models.
Overall Ranking:
In Table 2(a), BERT-BASE
is ranked highest, as it has no fluctuation penalty
and also has a lower maxprob value when accuracy
Figure 1: bfor (A)IMDb, (B) SST-2 datasets. Values from 1-0.5 on the x-axis represent the accuracy thresholds
considered, while ‘Lowest Maxprob’ represents the lowest maxprob value a model assigns at 100% coverage.
MODEl <0.95 <0.90 <0.85 <0.80 <0.75 <0.70 <0.65 <0.60 <0.55
BERT-BASE 0.69 0.69 0.69 0.69 0.69 0.69 0.69 0.69 0.69
W2V-LSTM 0.63 0.65 0.65 0.65 0.65 0.65 0.65 0.65 0.65
GLOVE-CNN 0.54 0.54 0.54 0.54 0.54 0.54 0.54 0.54 0.54
GLOVE-SUM 0.52 0.54 0.56 0.56 0.56 0.56 0.56 0.56 0.56
BERT-LARGE 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48
ROBERTA-LARGE 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45
GLOVE-LSTM 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37 0.37
W2V-SUM 0.35 0.37 0.39 0.39 0.39 0.39 0.39 0.39 0.39
W2V-CNN 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36 0.36
BOW-SUM -6.71 -6.71 -6.71 -6.71 -6.71 -6.71 -6.71 -6.71 -6.71
MODEL <0.95 <0.90 <0.85 <0.80 <0.75 <0.70 <0.65 <0.60 <0.55
W2V-SUM 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71 0.71
GLOVE-LSTM 0.65 0.67 0.70 0.70 0.70 0.70 0.70 0.70 0.70
GLOVE-SUM 0.60 0.61 0.63 0.65 0.68 0.68 0.68 0.68 0.68
BERT-LARGE 0.60 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62
BERT-BASE 0.43 0.45 0.45 0.45 0.45 0.45 0.45 0.45 0.45
ROBERTA-LARGE 0.22 0.24 0.24 0.24 0.24 0.24 0.24 0.24 0.24
W2V-CNN -0.68 -0.66 -0.64 -0.64 -0.64 -0.64 -0.64 -0.64 -0.64
BOW-SUM -0.71 -0.69 -0.68 -0.66 -0.63 -0.63 -0.63 -0.63 -0.63
GLOVE-CNN -2.43 -2.43 -2.42 -2.42 -2.42 -2.42 -2.42 -2.42 -2.42
W2V-LSTM -3.53 -3.53 -3.53 -3.53 -3.53 -3.53 -3.53 -3.53 -3.53
Table 2: (a) LHS: DiSCA (IID)– (IMDb), (b) RHS: DiSCA (OOD)– SST-2; x=y=z=0.33. The column header
indicates the accuracy threshold used to calculate b. Green: Model accuracy falls below the threshold indicated
by the column head, but is higher than the next column threshold. Yellow: If the latter is false. White: Model’s
overall accuracy is higher than the threshold of that column, and therefore never falls below that threshold.
first drops below 100% (
a
). Since the magnitude-
number fluctuation ratio of BERT-LARGE and
RoBERTA-LARGE are higher than that seen for
W2V-LSTM, GLOVE-CNN, and GLOVE-SUM,
the former are ranked lower. Despite GLOVE-
SUM having the overall best maxprob-accuracy
correlation, its higher fluctuation number lowers
the score. BOW-SUM is seen to be the worst
(and only negatively scoring model), in line with
observations made across all individual terms.
3.2 DiDMA Score
DiDMA p˚DiSCA`q˚Computation_Score
(2)
where
Computation_Score 1
Energy
DiDMA Score is obtained by summing weighted
DiSCA Score (
p
)and Computation Score (
q
) based
on application requirements (such that
p
+
q
=1). In
Computation Score, models are scored based on
the energy usage. Higher energy usage implies
lower computation score. We suggest using equa-
tions
2
of a recent work (Henderson et al.,2020)
for the energy calculation. Since, energy usage is
hardware dependent, we do not calculate the term
here; it should be calculated based on the device of
deployment in disaster management.
3.3 NiDMA Score
NiDMA u˚DiDMA `v˚DiSCApOODq
(3)
NiDMA Score is obtained by summing weighted
DiDMA Score(
u
) and DiSCA Score on OOD data
(
v
) based on application requirements (where
u
+
v
=1). Figure 2(B), 1(B) and 3(SST-2) illustrate
the DiSCA Score on OOD datasets.
From Figure 2(B), we see that first term based
ranking of the DiSCA(OOD)Score is not preserved
with respect to DiSCA Score in Figure 2(A);
a
values are lowest for the GLOVE-SUM and
2See Supplementary: DiDMA for more details
摘要:

InvestigatingtheFailureModesoftheAUCmetricandExploringAlternativesforEvaluatingSystemsinSafetyCriticalApplicationsSwaroopMishraAnjanaArunkumarChittaBaralArizonaStateUniversityAbstractWiththeincreasingimportanceofsafetyre-quirementsassociatedwiththeuseofblackboxmodels,evaluationofselectiveansweringca...

展开>> 收起<<
Investigating the Failure Modes of the AUC metric and Exploring Alternatives for Evaluating Systems in Safety Critical Applications Swaroop Mishra Anjana Arunkumar Chitta Baral.pdf

共20页,预览4页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:20 页 大小:7.69MB 格式:PDF 时间:2025-05-03

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 20
客服
关注