
regression (Kuleshov et al., 2018; Cui et al., 2020; Chung et al., 2021) tasks. Recent work has investigated connections between uncertainty and other properties, such as model interpretability (Antoran et al., 2021; Ley et al., 2022), selective prediction (Xin et al., 2021; Varshney et al., 2022a,b), and out-of-domain generalization (Wald et al., 2021; Qin et al., 2021).
PLMs (Qiu et al., 2020; Min et al., 2021) have achieved state-of-the-art prediction performance on diverse NLP benchmarks (Rajpurkar et al., 2016, 2018; Wang et al., 2019a,b) and demonstrated many desirable properties, such as stronger out-of-domain robustness (Hendrycks et al., 2020) and better uncertainty calibration (Desai and Durrett, 2020). They typically leverage a Transformer architecture (Vaswani et al., 2017) and are pre-trained via self-supervised learning (Jaiswal et al., 2021).
Although Guo et al. (2017) report that larger models tend to be worse calibrated, PLMs have been shown to produce well-calibrated uncertainty in practice (Desai and Durrett, 2020), despite their large model sizes. This unusual calibration behavior casts doubt on observations drawn from traditional neural networks (Ovadia et al., 2019; Mukhoti et al., 2020) or pre-trained vision models (Minderer et al., 2021). Prior work on the calibration of PLMs (Desai and Durrett, 2020; Dan and Roth, 2021) often explores only one or two types of PLMs and ignores uncertainty quantifiers and fine-tuning losses beyond temperature scaling and cross entropy, respectively. As a result, a holistic analysis that explores the full set of these considerations in a PLM-based pipeline is still lacking. Our paper aspires to fill this void via extensive empirical studies.
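For concreteness, temperature scaling, the uncertainty quantifier that prior work focuses on, simply divides a model's logits by a scalar temperature fitted on validation data before the softmax. A minimal NumPy sketch (an illustration, not any particular paper's implementation):

```python
import numpy as np

def temperature_scale(logits, T=1.0):
    """Rescale logits by a scalar temperature T before the softmax.

    T > 1 softens (flattens) the predicted distribution, reducing
    overconfidence; T < 1 sharpens it. T is typically fitted on a
    validation set by minimizing negative log-likelihood.
    """
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)
```

Because dividing by a single scalar preserves the argmax, temperature scaling changes a model's confidence without changing its predictions.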
3 Which Pre-trained Language Model?
3.1 Experiment Setup
To evaluate the calibration performance of PLMs,
we consider a series of NLP classification tasks:
1. Sentiment Analysis identifies the binary sentiment of a text sequence. We treat the IMDb movie review dataset (Maas et al., 2011) as in-domain and the Yelp restaurant review dataset (Zhang et al., 2015) as out-of-domain.
2. Natural Language Inference predicts the relationship between a hypothesis and a premise. We regard the Multi-Genre Natural Language Inference (MNLI) dataset (Williams et al., 2018), which covers a range of genres of spoken and written text, as in-domain and the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), derived from image captions only, as out-of-domain.
3. Commonsense Reasoning determines the most reasonable continuation of a sentence among four candidates. We view the Situations With Adversarial Generations (SWAG) dataset (Zellers et al., 2018) as in-domain and its adversarial variant HellaSWAG (Zellers et al., 2019) as out-of-domain.

           Sentiment   Natural Language   Commonsense
           Analysis    Inference          Reasoning
Xin        IMDb        MNLI               SWAG
Xout       Yelp        SNLI               HellaSWAG
|Y|        2           3                  4
|Dtrain|   25,000      392,702            73,546
|Dval|     12,500      4,907              10,003
|Din|      12,500      4,908              10,003
|Dout|     19,000      4,923              5,021

Table 1: In- and out-of-domain datasets, label space size, and size of each data split for the three NLP tasks.

Hugging Face Name             Model Size   Pre-training Corpus Size   Pre-training Task
bert-base-cased               109M         16G                        Masked LM, NSP
xlnet-base-cased              110M         161G                       Permuted LM
electra-base-discriminator    110M         161G                       Replacement Detection
roberta-base                  125M         161G                       Dynamic Masked LM
deberta-base                  140M         85G                        Dynamic Masked LM
bert-large-cased              335M         16G                        Masked LM, NSP
xlnet-large-cased             340M         161G                       Permuted LM
electra-large-discriminator   335M         161G                       Replacement Detection
roberta-large                 335M         161G                       Dynamic Masked LM
deberta-large                 350M         85G                        Dynamic Masked LM

Table 2: Model size, pre-training corpus size, and pre-training task of the five PLMs, separated into the base (upper) and large (lower) versions.
For each task, we construct Dtrain, Dval, and Din from the corresponding in-domain dataset, and Dout from the corresponding out-of-domain dataset. The original validation set of each dataset is split in half randomly to form a held-out non-blind testing set (i.e., Din or Dout). Table 1 describes the task details.
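As a concrete reading of "calibration error" in what follows, the standard metric is expected calibration error (ECE): predictions are grouped into equal-width confidence bins, and the gaps between per-bin accuracy and per-bin confidence are averaged, weighted by bin size. A minimal NumPy sketch under that standard formulation (not necessarily the exact implementation used in the experiments):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over (0, 1].

    confidences: max softmax probability per prediction.
    correct: 1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # accuracy within the bin
            conf = confidences[mask].mean()  # average confidence within the bin
            ece += mask.mean() * abs(acc - conf)  # weight by bin frequency
    return ece
```

A perfectly calibrated model (bin accuracy equals bin confidence everywhere) scores 0; an always-certain model that is right half the time scores 0.5.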
To understand which PLM delivers the lowest
calibration error, we examine five popular options:
1. BERT (Devlin et al., 2019) utilizes a bidirectional Transformer architecture pre-trained by masked language modeling (LM) and next sentence prediction (NSP).
2. XLNet (Yang et al., 2019) proposes a two-stream self-attention mechanism and a pre-training objective of permuted LM.
3. ELECTRA (Clark et al., 2020) pre-trains a