
Gales (2021), and we leave the extension and exploration of such methods for different uncertainty metrics, models, and tasks to future work.
3.3 Dataset Selection & Creation
In-distribution training sets
We choose three different languages: English (Clinc Plus; Larson et al., 2019), Danish in the form of the Dan+ dataset (Plank et al., 2020), based on news texts from PAROLE-DK (Bilgram and Keson, 1998), and Finnish (UD Treebank; Haverinen et al., 2014; Pyysalo et al., 2015; Kanerva and Ginter, 2022), corresponding to the NLP tasks of sequence classification, named entity recognition, and part-of-speech tagging, respectively. An overview of the data used is given in Table 1. We use standardized low-resource languages in the case of Finnish and Danish, and simulate a low-resource setting using English data.[4]
Starting with a sufficiently sized training set and then sub-sampling allows us to create training sets of arbitrary sizes. By using languages from different families, we hope to be able to draw conclusions that generalize beyond a single language. We employ a specific sampling scheme that tries to maintain the sequence length and class distribution of the original corpus, which we explain and verify in Appendix A.2.
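As a rough illustration (not the exact scheme, which is specified and verified in Appendix A.2), the sketch below shows one way to sub-sample a corpus while approximately preserving its class distribution by drawing from per-class strata. The function name and data format are our own illustrative choices, and the additional matching of the sequence length distribution is omitted here.

```python
import random
from collections import defaultdict

def stratified_subsample(corpus, target_size, seed=42):
    """Sub-sample (text, label) pairs while roughly keeping the label distribution.

    Minimal sketch only: the scheme described in Appendix A.2 additionally
    matches the sequence length distribution of the original corpus.
    """
    random.seed(seed)
    by_label = defaultdict(list)
    for text, label in corpus:
        by_label[label].append((text, label))

    sample = []
    for label, instances in by_label.items():
        # Keep each label's share of the corpus approximately constant.
        n = max(1, round(target_size * len(instances) / len(corpus)))
        sample.extend(random.sample(instances, min(n, len(instances))))

    random.shuffle(sample)
    return sample[:target_size]

# Example: create a simulated low-resource training set of 1,000 instances.
# small_train = stratified_subsample(full_train, target_size=1000)
```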
Out-of-distribution Test Sets
While it is possible to create OOD text by, for instance, withholding classes from the training set or appending text from a different source (Arora et al., 2021), we choose to pick entirely new OOD test sets that are qualitatively different: out-of-scope voice commands by users in Larson et al. (2019),[5] the Twitter split of the Dan+ dataset (Plank et al., 2020), and the Finnish OOD treebank (Kanerva and Ginter, 2022).
In similar works for the image domain, OOD test
sets are often chosen to be convincingly different
from the training distribution, for instance MNIST
versus Fashion-MNIST (Nalisnick et al., 2019; van Amersfoort et al., 2021). While there exist a variety of formalizations of types of distributional shift (Moreno-Torres et al., 2012; Wald et al., 2021; Arora et al., 2021; Federici et al., 2021), it is often hard to determine whether a shift is taking place and, if so, what kind. Winkens et al. (2020) define near OOD as a scenario in which the inlier and outlier distributions are meaningfully related, and far OOD as a case in which they are unrelated. Unfortunately, this distinction is somewhat arbitrary and hard to apply in a language context, where OOD could be defined as anything ranging from a different language or dialect, to a different author or speaker demographic, to a new genre. Therefore, we use a methodology similar to the one used to validate the sub-sampled training sets in order to argue that the selected OOD splits are sufficiently different in nature from the training splits. The exact procedure, along with more detailed results, is described in Appendix A.3.

[4] The definition of low-resource differs greatly between works. One definition by Bird (2022) advocates its use for (would-be) standardized languages with a large number of speakers and a written tradition, but a lack of resources for language technologies. Another option is a task-dependent definition: for dependency parsing, Müller-Eberstein et al. (2021) define low-resource as providing fewer than 5000 annotated sentences in the Universal Dependencies Treebank. Hedderich et al. (2021) and Lignos et al. (2022) lay out a task-dependent spectrum, ranging from several hundred to thousands of instances.

[5] Since all instances in this test set correspond to out-of-scope inputs and not to classes the model was trained on, we cannot evaluate certain metrics in Table 2.
3.4 Model Training
Unfortunately, our datasets do not contain enough data to train transformer-based models from scratch. Therefore, we only fully train LSTM-based models, while using pre-trained transformers, namely BERT (English; Devlin et al., 2019), Danish BERT (Hvingelby et al., 2020), and FinBERT (Finnish; Virtanen et al., 2019), for the other approaches. The whole procedure is depicted in Figure 1. The way we optimize models is described in Appendix C.3. We list training hardware and hyperparameter information in Appendix C.2, with the environmental impact described in Appendix C.5.
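For concreteness, the snippet below sketches how such pre-trained encoders could be loaded with the Hugging Face transformers library. The hub identifiers for the Danish and Finnish checkpoints are illustrative assumptions, not a specification of the exact checkpoints used in our experiments.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative Hugging Face hub identifiers: the English entry is the standard
# BERT base checkpoint; the Danish and Finnish entries are assumed IDs and may
# differ from the checkpoints actually used here.
CHECKPOINTS = {
    "english": "bert-base-uncased",                    # BERT (Devlin et al., 2019)
    "danish": "Maltehb/danish-bert-botxo",             # Danish BERT (assumed ID)
    "finnish": "TurkuNLP/bert-base-finnish-cased-v1",  # FinBERT (assumed ID)
}

def load_pretrained(language: str, num_labels: int):
    """Load a pre-trained encoder with a freshly initialized classification head.

    For the token-level tasks (NER, POS tagging), AutoModelForTokenClassification
    would be used instead of the sequence classification head shown here.
    """
    name = CHECKPOINTS[language]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=num_labels)
    return tokenizer, model

# Example usage:
# tokenizer, model = load_pretrained("english", num_labels=150)
```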
3.5 Evaluation
Apart from evaluating models on task performance, we also evaluate the following calibration and uncertainty metrics, painting a multi-faceted picture of the reliability of models. In all cases, we use the Almost Stochastic Order test (ASO; del Barrio et al., 2018; Dror et al., 2019) for significance testing, which is elaborated on in Appendix C.1.
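As a usage illustration, the snippet below shows how such a comparison could be run, assuming the deepsig package and its aso function, which implements the ASO test. The score values and the decision threshold of 0.5 are illustrative assumptions; the exact testing setup we use is described in Appendix C.1.

```python
import numpy as np
# Assumes the deepsig package (deep-significance), whose aso() function
# implements the Almost Stochastic Order test.
from deepsig import aso

# Illustrative scores only: a task metric for two models across five seeds.
scores_a = np.array([0.81, 0.83, 0.79, 0.82, 0.84])
scores_b = np.array([0.78, 0.80, 0.77, 0.79, 0.81])

# aso() returns eps_min; values close to 0 indicate that model A is almost
# stochastically dominant over model B. The threshold of 0.5 below is an
# illustrative convention, not a statement about our exact setup.
eps_min = aso(scores_a, scores_b)
print(f"eps_min = {eps_min:.3f} -> model A better: {eps_min < 0.5}")
```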
Evaluation of Calibration
First, we measure the calibration of models using the adaptive calibration error (ACE; Nixon et al., 2019), which is an extension of the expected calibration error (ECE; Naeini et al., 2015; Guo et al., 2017).[6]
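To make the difference between the two measures concrete, the sketch below contrasts ECE, which uses equal-width confidence bins, with a simplified ACE-style variant that uses equal-mass (adaptive) bins over the top-label confidence. This is our own minimal illustration and omits parts of the full definition by Nixon et al. (2019), such as per-class binning; see Appendix B for the actual differences.

```python
import numpy as np

def ece(confidences, correct, num_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    error, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            error += mask.sum() / n * abs(correct[mask].mean() - confidences[mask].mean())
    return error

def ace_top_label(confidences, correct, num_bins=10):
    """Simplified ACE-style error: equal-mass (adaptive) bins over the top-label
    confidence, so every bin holds roughly the same number of predictions."""
    order = np.argsort(confidences)
    bins = [b for b in np.array_split(order, num_bins) if len(b) > 0]
    return sum(abs(correct[b].mean() - confidences[b].mean()) for b in bins) / len(bins)

# Illustrative usage with toy predictions:
conf = np.array([0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4])
corr = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0], dtype=float)
print(ece(conf, corr), ace_top_label(conf, corr))
```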
Furthermore, we use the frequentist measure of coverage (Larry, 2004; Kompa et al., 2021). Coverage is based on
[6] See Appendix B for a short overview of the differences.