
Table 1: In each case, AUC ranks Model A as having higher performance than Model B; however, B has the higher selective answering capability, as explained in Section 2.1. X axis: coverage; Y axis: accuracy.
the strengths and weaknesses of various language models. For example, we observe that newer and larger pre-trained models are not necessarily better at selective answering. We also observe that ranking models by accuracy does not match ranking them by selective answering capability, similar to the observations of Mishra and Arunkumar (2021). We hope our insights will bring more attention to developing better models and evaluation metrics for safety-critical applications.
2 Investigating AUC
Experiment Details:
Our experiments span ten models: Bag-of-Words Sum (BOW SUM) (Harris, 1954), Word2Vec Sum (W2V SUM) (Mikolov et al., 2013), GloVe Sum (GLOVE SUM) (Pennington et al., 2014), Word2Vec CNN (W2V CNN) (LeCun et al., 1995), GloVe CNN (GLOVE CNN), Word2Vec LSTM (W2V LSTM) (Hochreiter and Schmidhuber, 1997), GloVe LSTM (GLOVE LSTM), BERT Base (BERT BASE) (Devlin et al., 2018), BERT Large (BERT LARGE) with GELU (Hendrycks and Gimpel, 2016b), and RoBERTa Large (ROBERTA LARGE) (Liu et al., 2019), in line with recent work (Hendrycks et al., 2020; Mishra et al., 2020; Mishra and Arunkumar, 2021). We analyze these models on two movie review datasets: (i) SST-2 (Socher et al., 2013), which contains short expert movie reviews, and (ii) IMDb (Maas et al., 2011), which consists of full-length non-expert movie reviews. We train all models on IMDb and evaluate on both the IMDb and SST-2 test sets, giving us both IID (IMDb) and OOD (SST-2) evaluation.
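To make the evaluation concrete, the selective answering curve we study can be computed as follows: examples are answered in order of decreasing maxprob (maximum softmax probability), accuracy is recorded at each coverage level, and AUC is the area under the resulting coverage-accuracy curve. The sketch below is a minimal illustration of this standard setup in numpy, not our exact evaluation code; the function and argument names are our own.

```python
import numpy as np

def coverage_accuracy_curve(maxprob, correct):
    """Coverage-accuracy curve for selective answering.

    maxprob: per-example confidence (maximum softmax probability).
    correct: per-example 0/1 indicator of a correct prediction.
    Examples are answered in order of decreasing confidence, and
    accuracy is recorded at each coverage level (fraction answered).
    """
    order = np.argsort(-np.asarray(maxprob))    # most confident first
    hits = np.asarray(correct, dtype=float)[order]
    k = np.arange(1, len(hits) + 1)
    return k / len(hits), np.cumsum(hits) / k   # coverage, accuracy

def auc(coverage, accuracy):
    """Area under the coverage-accuracy curve (trapezoidal rule)."""
    return np.trapz(accuracy, coverage)
```

Given per-example maxprob scores and correctness indicators for, say, BERT-LARGE on the IMDb test set, auc(*coverage_accuracy_curve(maxprob, correct)) produces the kind of overall AUC value compared in Table 1.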
2.1 Adversarial Attack on AUC
AUC Tail:
Consider a case where A has higher overall AUC than B but lower accuracy in the region of higher accuracy and lower coverage (Table 1, Case 1). Then B is the better model, because safety-critical applications have a low tolerance for incorrect answering and will therefore most often operate in the region of higher accuracy. This is seen in the AUCs of BERT-BASE (A) and BERT-LARGE (B) on the SST-2 dataset.¹
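This failure mode can be made concrete by restricting the AUC computation to the low-coverage, high-accuracy region in which such applications operate. Below is a minimal sketch reusing the curve from the previous block; the 20% coverage cutoff is an illustrative assumption, not a value from our experiments.

```python
import numpy as np

def partial_auc(coverage, accuracy, max_coverage=0.2):
    """AUC restricted to the low-coverage (high-confidence) region.

    Under this restriction B can rank above A even when A has the
    higher overall AUC (Table 1, Case 1).
    """
    mask = coverage <= max_coverage  # illustrative cutoff
    return np.trapz(accuracy[mask], coverage[mask])
```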
Curve Fluctuations:
Another case is when accuracy does not decrease monotonically with coverage. Even if such a model has higher AUC, a non-monotonically decreasing curve shows that confidence and correctness are not correlated, making the model comparatively undesirable (Table 1, Cases 2-6). This is seen across all models on both datasets, especially in regions of low coverage. For most models, the fluctuations on the OOD dataset are more frequent and of higher magnitude.¹
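Such fluctuations can be quantified directly from the curve: on an ideal curve, accuracy decreases monotonically with coverage, so every upward jump is a violation. A minimal sketch follows; the choice of summary statistics (count and total magnitude of jumps) is our own.

```python
import numpy as np

def fluctuation_stats(accuracy):
    """Count and total magnitude of upward jumps in accuracy as
    coverage grows (both are zero for a monotonically decreasing,
    i.e., well-behaved, curve)."""
    jumps = np.diff(accuracy)
    jumps = jumps[jumps > 0]
    return len(jumps), float(jumps.sum())
```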
Plateau Formation:
The range of maximum softmax probability (maxprob) values that models associate with predictions varies. For example, the LSTM models have a wide maxprob range, while the transformer models (BERT-LARGE, ROBERTA-LARGE) answer all questions with high maxprob values. In the latter case, model accuracy stays above 90% in regions of high coverage; this limited accuracy range forms a plateau in the AUC curve. The plateau is not indicative of model performance, because the maxprob values of incorrect answers in this region are high and relatively unvarying compared to other models; it is therefore undesirable. Such a plateau also makes it difficult to decide which portion of the AUC curve to ignore when choosing an operating point for deployment in disaster management applications (where the tolerance for incorrect answering is low). Plateau formation is acceptable when models answer with low maxprob, either always or in regions of high coverage, irrespective of the level of accuracy (though the range should be limited).
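As an illustration, such a plateau can be flagged by checking whether accuracy is confined to a narrow band in the high-coverage region, alongside the model's maxprob range; the 0.5 coverage threshold and the 5-point accuracy band below are illustrative assumptions, not values from our experiments.

```python
import numpy as np

def plateau_check(coverage, accuracy, maxprob,
                  min_coverage=0.5, band=0.05):
    """Flag a plateau: accuracy confined to a narrow band at high
    coverage, as observed for BERT-LARGE and ROBERTA-LARGE; also
    report the maxprob range, which is narrow for such models."""
    region = accuracy[coverage >= min_coverage]
    plateau = float(region.max() - region.min()) <= band
    maxprob_range = float(np.max(maxprob) - np.min(maxprob))
    return plateau, maxprob_range
```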
However, this acceptable curve condition is not
¹ See Supplementary Material: DiSCA, NiDMA for more details.