On the Calibration of Massively Multilingual Language Models
Kabir Ahuja1, Sunayana Sitaram1, Sandipan Dandapat2, Monojit Choudhury2
1Microsoft Research, India
2Microsoft R&D, India
{t-kabirahuja,sadandap,sunayana.sitaram,monojitc}@microsoft.com
Abstract
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work in evaluating these models for their performance on a variety of tasks and languages, little attention has been paid to how well calibrated these models are with respect to the confidence in their predictions. We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages or those which are typologically diverse from English. Next, we empirically show that calibration methods like temperature scaling and label smoothing do reasonably well at improving calibration in the zero-shot scenario. We also find that few-shot examples in the target language can further help reduce the calibration errors, often substantially. Overall, our work contributes towards building more reliable multilingual models by highlighting the issue of their miscalibration, understanding which language- and model-specific factors influence it, and pointing out strategies to improve it.
1 Introduction
Massively Multilingual Language Models (MMLMs) like mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), mT5 (Xue et al., 2021) and mBART (Liu et al., 2020) have been surprisingly effective at zero-shot cross-lingual transfer, i.e., when fine-tuned on an NLP task in one language, they often generalize reasonably well to languages unseen during fine-tuning.
These models have been evaluated for their performance across a range of multilingual tasks (Pan et al., 2017; Nivre et al., 2018; Conneau et al., 2018), and numerous methods like adapters (Pfeiffer et al., 2020), sparse fine-tuning (Ansell et al., 2022) and few-shot learning (Lauscher et al., 2020) have been proposed to further improve the performance of cross-lingual transfer.
Despite these developments, little to no attention has been paid to the calibration of these models across languages, i.e., how reliable the confidence estimates of these models are. As these models increasingly find their way into real-world applications with safety implications, such as Hate Speech Detection (Davidson et al., 2017; Deshpande et al., 2022), it becomes important to take extreme actions only for high-confidence predictions made by the model (Sarkar and KhudaBukhsh, 2021). Hence, calibrated confidences are desirable when deploying such systems in practice.
Guo et al. (2017) showed that modern neural networks used for Image Recognition (He et al., 2016) perform much better than the ones introduced decades ago (Lecun et al., 1998), but are significantly worse calibrated and often over-estimate their confidence on incorrect predictions. For NLP tasks specifically, Desai and Durrett (2020) showed that classifiers trained using pre-trained transformer-based models (Devlin et al., 2019) are well calibrated in both in-domain and out-of-domain settings compared to non-pre-trained baselines (Chen et al., 2017). Notably, Ponti et al. (2021) highlight that since zero-shot cross-lingual transfer represents a shift in the data distribution, the point estimates are likely to be miscalibrated; this forms the core setting of our work.
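To make the notion of calibration error concrete, the sketch below computes expected calibration error (ECE), the equal-width-bin metric popularized by Guo et al. (2017): predictions are grouped into bins by their confidence (maximum softmax probability), and the gap between per-bin accuracy and per-bin confidence is averaged, weighted by bin size. The function name, NumPy interface and 10-bin default are illustrative assumptions, not necessarily the exact implementation used in this paper.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE with equal-width bins: size-weighted average |accuracy - confidence| gap."""
    confidences = np.asarray(confidences, dtype=float)  # max softmax probability per example
    predictions = np.asarray(predictions)               # predicted class ids
    labels = np.asarray(labels)                         # gold class ids
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        bin_acc = (predictions[in_bin] == labels[in_bin]).mean()
        bin_conf = confidences[in_bin].mean()
        ece += in_bin.mean() * abs(bin_acc - bin_conf)  # weight by fraction of examples in bin
    return ece
```

A perfectly calibrated model would have an ECE of zero; over-confident predictions on a shifted target-language distribution inflate the bin-wise gaps.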
In light of this, our work has three main contributions. First, we investigate the calibration of two commonly used MMLMs, mBERT and XLM-R, on four NLU tasks under the zero-shot setting, where the models are fine-tuned in English and calibration errors are computed on unseen languages. We find a clear increase in calibration errors compared to English, as can be seen in Figures 1a and 1b, with calibration being significantly worse for Swahili than for English.
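The abstract mentions temperature scaling as one post-hoc remedy evaluated later; a minimal sketch of the standard procedure (Guo et al., 2017) is given below, where a single scalar temperature is fit on held-out logits by minimizing negative log-likelihood and then used to rescale test logits before the softmax. The variable names, the LBFGS optimizer and the choice of held-out split are assumptions for illustration, not this paper's exact setup.

```python
import torch

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Fit a single scalar temperature T > 0 on held-out logits by minimizing NLL."""
    logits = torch.as_tensor(val_logits, dtype=torch.float32)
    labels = torch.as_tensor(val_labels, dtype=torch.long)
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage: T is fit once on a held-out set, then test logits are divided by T before
# the softmax; predicted labels are unchanged, only the confidences are rescaled.
# calibrated_probs = torch.softmax(test_logits / T, dim=-1)
```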
Second, we look for factors that might affect the