On the Calibration of Massively Multilingual Language Models
Kabir Ahuja¹, Sunayana Sitaram¹, Sandipan Dandapat², Monojit Choudhury²
¹Microsoft Research, India
²Microsoft R&D, India
{t-kabirahuja,sadandap,sunayana.sitaram,monojitc}@microsoft.com
Abstract
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work on evaluating these models for their performance on a variety of tasks and languages, little attention has been paid to how well calibrated these models are with respect to the confidence in their predictions. We first investigate the calibration of MMLMs in the zero-shot setting and observe clear miscalibration in low-resource languages and in languages that are typologically diverse from English. Next, we empirically show that calibration methods like temperature scaling and label smoothing do reasonably well at improving calibration in the zero-shot scenario. We also find that a few examples in the target language can further reduce the calibration errors, often substantially. Overall, our work contributes towards building more reliable multilingual models by highlighting the issue of their miscalibration, understanding which language- and model-specific factors influence it, and pointing out strategies to improve it.
1 Introduction
Massively Multilingual Language Models (MMLMs) like mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), mT5 (Xue et al., 2021) and mBART (Liu et al., 2020) have been surprisingly effective at zero-shot cross-lingual transfer, i.e., when fine-tuned on an NLP task in one language, they often generalize reasonably well to languages unseen during fine-tuning. These models have been evaluated for their performance across a range of multilingual tasks (Pan et al., 2017; Nivre et al., 2018; Conneau et al., 2018), and numerous methods like adapters (Pfeiffer et al., 2020), sparse fine-tuning (Ansell et al., 2022) and few-shot learning (Lauscher et al., 2020) have been proposed to further improve cross-lingual transfer performance.
Despite these developments, little to no attention has been paid to the calibration of these models across languages, i.e., how reliable their confidence predictions are. As these models find their way into more and more real-world applications with safety implications, like hate speech detection (Davidson et al., 2017; Deshpande et al., 2022), it becomes important to take extreme actions only for high-confidence predictions by the model (Sarkar and KhudaBukhsh, 2021). Hence, calibrated confidences are desirable when deploying such systems in practice.
Guo et al. (2017) showed that modern neural networks used for image recognition (He et al., 2016) perform much better than the ones introduced decades ago (Lecun et al., 1998), but are significantly worse calibrated and often over-estimate their confidence on incorrect predictions. For NLP tasks specifically, Desai and Durrett (2020) showed that classifiers trained using pre-trained transformer-based models (Devlin et al., 2019) are well calibrated in both in-domain and out-of-domain settings compared to non-pre-trained baselines (Chen et al., 2017). Notably, Ponti et al. (2021) highlight that, since zero-shot cross-lingual transfer represents a shift in the data distribution, the point estimates are likely to be miscalibrated; this forms the core setting of our work.
In light of this, our work has three main contributions. First, we investigate the calibration of two commonly used MMLMs, mBERT and XLM-R, on four NLU tasks in the zero-shot setting, where the models are fine-tuned in English and calibration errors are computed on unseen languages. We find a clear increase in calibration errors compared to English, as can be seen in Figures 1a and 1b, with calibration being significantly worse for Swahili than for English.
Figure 1: Reliability diagrams (accuracy and fraction of samples vs. confidence) for XLM-R fine-tuned on XNLI using training data in English. (a) English, out of the box, error 6.66; (b) Swahili, out of the box, error 16.99; (c) Swahili, calibrated using English data (TS + LS), error 11.54; (d) Swahili, calibrated using Swahili data (Self-TS + LS), error 5.69. The calibration techniques are described in Section 2.1.
Second, we look for factors that might affect the zero-shot calibration of MMLMs, and find that in most cases the calibration error is strongly correlated with pre-training data size, syntactic similarity, and sub-word overlap between the unseen language and English. This reveals that MMLMs are miscalibrated in the zero-shot setting for low-resource languages and for languages that are typologically distant from English.
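This kind of correlation analysis can be sketched as follows (our own illustration, not the paper's code). The per-language ECE and feature values below are hypothetical placeholders; the real analysis would use the measured ECE(l) values and language features such as pre-training data size (SIZE), syntactic similarity to English (SYN), and sub-word overlap with English (SWO).

```python
from scipy.stats import pearsonr

# Hypothetical placeholder values for one task; real values would come from
# the evaluation runs and the per-language feature extraction.
ece  = {"fr": 8.1, "de": 9.0, "hi": 14.2, "sw": 19.1}   # ECE(l), in percent
size = {"fr": 0.9, "de": 0.8, "hi": 0.4, "sw": 0.1}     # normalized pre-training size

langs = sorted(ece)
r, p = pearsonr([ece[l] for l in langs], [size[l] for l in langs])
print(f"Pearson r between ECE and SIZE: {r:.2f} (p = {p:.3f})")
```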
Finally, we show that model calibration across different languages can be substantially improved by standard calibration techniques like Temperature Scaling (Guo et al., 2017) and Label Smoothing (Pereyra et al., 2017), without collecting any data in the target language (see Figure 1c). Using a few examples in the language (the few-shot setting), we see even larger drops in the calibration errors, as can be seen in Figure 1d.
To the best of our knowledge, ours is the first work to investigate and improve the calibration of MMLMs. We expect this study to be a significant contribution towards building reliable and linguistically fair multilingual models. To encourage future research in this area, we make our code publicly available at https://github.com/microsoft/MMLMCalibration.
2 Calibration of Pre-trained MMLMs
Consider a classifier $h: \mathcal{X} \rightarrow [K]$ obtained by fine-tuning an MMLM for some task with training data in a pivot language $p$, where $[K]$ denotes the set of labels $\{1, 2, \cdots, K\}$. We assume $h$ can predict a confidence for each of the $K$ labels, given by $h_k(x) \in [0, 1]$ for the $k$-th label. $h$ is said to be calibrated if the following equality holds:

$$p(y = k \mid h_k(x) = q) = q$$

In other words, for a perfectly calibrated classifier, if the predicted confidence for a label $k$ on an input $x$ is $q$, then with probability $q$ the input should actually be labelled $k$. Naturally, in practical settings the equality does not hold, and neural network based classifiers are often miscalibrated (Guo et al., 2017). One way of quantifying this notion of miscalibration is through the Expected Calibration Error (ECE), which is defined as the difference in expectation between the confidence of the classifier's predictions and their accuracies (refer to Appendix A.1 for details). In our experiments we compute ECE on each language $l$'s test data and denote the corresponding calibration error as $\mathrm{ECE}(l)$.
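To make the metric concrete, the following is a minimal sketch of an ECE computation with equal-width confidence bins; the bin count of 10 is an illustrative assumption (the exact procedure follows Appendix A.1).

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Sketch of ECE with equal-width confidence bins.

    confidences: max softmax probability per example, shape (N,)
    predictions: predicted label per example, shape (N,)
    labels:      gold label per example, shape (N,)
    """
    confidences = np.asarray(confidences)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # assign each example to the bin containing its confidence
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()            # accuracy within the bin
            conf = confidences[in_bin].mean()       # average confidence within the bin
            ece += in_bin.mean() * abs(acc - conf)  # weight by fraction of samples
    return 100 * ece  # expressed as a percentage, matching the ECE values in Table 1
```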
2.1 Calibration Methods
We briefly review some commonly used methods
for calibrating neural network based classifiers.
1. Temperature Scaling (TS and Self-TS) (Guo et al., 2017) is applied by scaling the output logits by a temperature parameter $T$ before applying the softmax:

$$h_k(x) = \frac{\exp(o_k(x)/T)}{\sum_{k'=1}^{K} \exp(o_{k'}(x)/T)}$$

where $o_k$ denotes the logit corresponding to the $k$-th class. $T$ is a learnable parameter obtained post-training by maximizing the log-likelihood on the dev set while keeping all other network parameters fixed. We experiment with two settings for improving calibration on a target language: using dev data in English to perform temperature scaling (TS), and using the target language's dev data (Self-TS).
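As an illustrative sketch (not the exact fitting procedure used here), the temperature can be learned post-hoc from dev-set logits of the frozen classifier by minimizing the negative log-likelihood; the choice of L-BFGS and the initialization are assumptions.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=50):
    """Learn a single temperature T on held-out (dev) logits.

    logits: tensor of shape (N, K) from the fine-tuned, frozen model
    labels: tensor of shape (N,) with gold class indices
    """
    log_t = torch.zeros(1, requires_grad=True)  # optimize log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        # scale logits by 1/T before computing the cross-entropy (softmax) loss
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At test time, calibrated confidences are softmax(logits / T):
# probs = F.softmax(test_logits / T, dim=-1)
```

For TS the dev logits would come from the English dev set, and for Self-TS from the target language's dev set.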
2. Label Smoothing (LS) (Pereyra et al., 2017) is a regularization technique that penalizes low-entropy output distributions by training on soft labels: the true label is assigned a fixed probability $q = 1 - \alpha$ (with $0 < \alpha < 1$), and the remaining probability mass is distributed uniformly across the remaining classes. Label smoothing has been shown empirically to be competitive with temperature scaling for calibration (Müller et al., 2019), especially in out-of-domain settings (Desai and Durrett, 2020).
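A minimal sketch of a label-smoothed training loss following the formulation above is given below; the default value α = 0.1 is an illustrative assumption, not a hyperparameter reported here.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, labels, alpha=0.1):
    """Cross-entropy against smoothed targets.

    The true class gets probability 1 - alpha; the remaining mass alpha is
    spread uniformly over the other K - 1 classes.
    """
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)

    smooth_targets = torch.full_like(log_probs, alpha / (n_classes - 1))
    smooth_targets.scatter_(-1, labels.unsqueeze(-1), 1.0 - alpha)

    return -(smooth_targets * log_probs).sum(dim=-1).mean()
```

Recent PyTorch releases also expose a built-in `label_smoothing` argument to `F.cross_entropy`, though that variant spreads the mass α over all K classes rather than only the K − 1 incorrect ones.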
3. Few-Shot Learning (FSL). We also investigate whether fine-tuning the MMLM on a few examples in the target language, in addition to the data in English, improves calibration as it does performance (Lauscher et al., 2020). Since these models are expected to be worse calibrated on out-of-domain data than on in-domain data (Desai and Durrett, 2020), we try to improve calibration by reducing the domain shift through few-shot learning.
Apart from these, we also consider combinations of different calibration methods in our experiments, including Label Smoothing with Temperature Scaling (TS + LS or Self-TS + LS) and Few-Shot Learning with Label Smoothing (FSL + LS); a sketch of the latter combination is shown below.
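The following is a rough sketch of the FSL + LS combination (our own illustration, not the released training code): continued fine-tuning on a handful of target-language examples reuses the label-smoothed loss from the previous sketch. The hyperparameters, and the assumption of a Hugging Face-style classification model that returns a `.logits` attribute on batches of tokenized tensors, are ours.

```python
import torch
from torch.utils.data import DataLoader

def few_shot_finetune(model, target_lang_dataset, alpha=0.1,
                      epochs=3, lr=2e-5, batch_size=8):
    """Continue fine-tuning an English-fine-tuned MMLM on a few
    target-language examples, with label smoothing (FSL + LS)."""
    loader = DataLoader(target_lang_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()

    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            # batch is assumed to hold tokenized inputs plus a "labels" field
            labels = batch.pop("labels")
            logits = model(**batch).logits
            loss = label_smoothing_loss(logits, labels, alpha=alpha)
            loss.backward()
            optimizer.step()
    return model
```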
3 Experiments
We seek to answer the following research questions: a) How well calibrated are fine-tuned MMLMs in the zero-shot cross-lingual setting? b) What linguistic and model-specific factors influence calibration errors across languages? c) Can we improve the calibration of fine-tuned models across languages?
3.1 Experimental Setup
Datasets. We consider four multilingual classification datasets to study the calibration of MMLMs: i) the Cross-Lingual NLI Corpus (XNLI) (Conneau et al., 2018), ii) the Multilingual Dataset for Causal Commonsense Reasoning (XCOPA) (Ponti et al., 2020), iii) the Multilingual Amazon Reviews Corpus (MARC) (Keung et al., 2020), and iv) the Cross-lingual Adversarial Dataset for Paraphrase Identification (PAWS-X) (Yang et al., 2019). Statistics of these datasets can be found in Table 5.
Dataset | MMLM | ECE(en) | Avg. ECE(l) over l ∈ L′ | Max ECE(l) over l ∈ L′
XNLI | XLM-R | 7.32 | 13.34 | 19.07 (sw)
XNLI | mBERT | 5.44 | 12.34 | 45.15 (th)
XCOPA | XLM-R | 14.54 | 20.07 | 29.33 (sw)
XCOPA | mBERT | 23.4 | 23.51 | 29.02 (sw)
MARC | XLM-R | 7.15 | 9.65 | 13.45 (zh)
MARC | mBERT | 9.38 | 11.11 | 17.33 (ja)
PAWS-X | XLM-R | 1.93 | 4.28 | 5.88 (ja)
PAWS-X | mBERT | 3.57 | 10.32 | 15.65 (ko)

Table 1: Calibration errors across tasks for XLM-R and mBERT. L′ in the fourth column denotes the set of supported languages in a task other than English. The language in parentheses in column 5 is the language with the maximum calibration error.
Dataset | SIZE | SYN | SWO
XNLI | -0.80 | -0.88 | -0.85
XCOPA | -0.85 | -0.73 | -0.62
MARC | -0.41 | -0.46 | -0.27
PAWS-X | -0.48 | -0.93 | -0.92

Table 2: Pearson correlation coefficient of ECE with the SIZE, SYN, and SWO features of the different languages in the test set, for XLM-R.
Training setup. We consider two commonly used MMLMs in our experiments: Multilingual BERT (mBERT) (Devlin et al., 2019) and XLM-RoBERTa (XLM-R) (Conneau et al., 2020). mBERT is only available in the base variant with 12 layers; for XLM-R we use the large variant with 24 layers. We use English training data to fine-tune the two MMLMs on all the tasks and evaluate the accuracies and ECEs on the test data for the different languages. For the few-shot case we use the validation data in the target languages for continued fine-tuning (FSL) and temperature scaling (Self-TS). Refer to Section A.3 in the Appendix for more details.
3.2 Results
Out of Box Zero-Shot Calibration (OOB)
We
first investigate how well calibrated MMLMs are
on the languages unseen during fine-tuning without
applying any calibration techniques. As can be seen
in the Table 1, the average calibration error on lan-
guages other than English (column 4) is almost al-
ways significantly worse than the errors on English
test data (column 3) for both mBERT and XLMR
across the 4 tasks. Along with the expected calibra-
tion errors across unseen languages we also report
the worst case ECE (in column 5), where we see