On the Calibration of Massively Multilingual Language Models
Kabir Ahuja1, Sunayana Sitaram1, Sandipan Dandapat2, Monojit Choudhury2
1Microsoft Research, India
2Microsoft R&D, India
{t-kabirahuja,sadandap,sunayana.sitaram,monojitc}@microsoft.com
Abstract
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work in evaluating these models for their performance on a variety of tasks and languages, little attention has been paid to how well calibrated these models are with respect to the confidence in their predictions. We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages or those which are typologically diverse from English. Next, we empirically show that calibration methods like temperature scaling and label smoothing do reasonably well at improving calibration in the zero-shot scenario. We also find that few-shot examples in the target language can further help reduce the calibration errors, often substantially. Overall, our work contributes towards building more reliable multilingual models by highlighting the issue of their miscalibration, understanding which language- and model-specific factors influence it, and pointing out strategies to improve it.
1 Introduction
Massively Multilingual Language Models (MMLMs) like mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), mT5 (Xue et al., 2021) and mBART (Liu et al., 2020) have been surprisingly effective at zero-shot cross-lingual transfer, i.e., when fine-tuned on an NLP task in one language, they often generalize reasonably well to languages unseen during fine-tuning.
These models have been evaluated for their performance across a range of multilingual tasks (Pan et al., 2017; Nivre et al., 2018; Conneau et al., 2018), and numerous methods like adapters (Pfeiffer et al., 2020), sparse fine-tuning (Ansell et al., 2022) and few-shot learning (Lauscher et al., 2020) have been proposed to further improve the performance of cross-lingual transfer.
Despite these developments, little to no attention has been paid to the calibration of these models across languages, i.e., how reliable the confidence estimates of these models are. As these models increasingly find their way into real-world applications with safety implications, such as Hate Speech Detection (Davidson et al., 2017; Deshpande et al., 2022), it becomes important to take extreme actions only for high-confidence predictions made by the model (Sarkar and KhudaBukhsh, 2021). Hence, calibrated confidences are desirable when deploying such systems in practice.
Guo et al. (2017) showed that modern neural networks used for Image Recognition (He et al., 2016) perform much better than the ones introduced decades ago (Lecun et al., 1998), but are significantly worse calibrated and often over-estimate their confidence on incorrect predictions. For NLP tasks specifically, Desai and Durrett (2020) showed that classifiers trained using pre-trained transformer-based models (Devlin et al., 2019) are well calibrated in both in-domain and out-of-domain settings compared to non-pre-trained baselines (Chen et al., 2017). Notably, Ponti et al. (2021) highlight that since zero-shot cross-lingual transfer represents a shift in the data distribution, the point estimates are likely to be miscalibrated; this forms the core setting of our work.
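To make the notion of calibration error concrete, the sketch below computes expected calibration error (ECE), the equal-width-bin metric popularized by Guo et al. (2017): predictions are grouped into bins by their confidence (maximum softmax probability), and the gap between per-bin accuracy and per-bin confidence is averaged, weighted by bin size. The function name, NumPy interface and 10-bin default are illustrative assumptions, not necessarily the exact implementation used in this paper.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE with equal-width bins: size-weighted average |accuracy - confidence| gap."""
    confidences = np.asarray(confidences, dtype=float)  # max softmax probability per example
    predictions = np.asarray(predictions)               # predicted class ids
    labels = np.asarray(labels)                         # gold class ids
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        bin_acc = (predictions[in_bin] == labels[in_bin]).mean()
        bin_conf = confidences[in_bin].mean()
        ece += in_bin.mean() * abs(bin_acc - bin_conf)  # weight by fraction of examples in bin
    return ece
```

A perfectly calibrated model would have an ECE of zero; over-confident predictions on a shifted target-language distribution inflate the bin-wise gaps.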
In light of this, our work has three main contributions. First, we investigate the calibration of two commonly used MMLMs, mBERT and XLM-R, on four NLU tasks under the zero-shot setting, where the models are fine-tuned in English and calibration errors are computed on unseen languages. We find a clear increase in calibration errors compared to English, as can be seen in Figures 1a and 1b, with calibration being significantly worse for Swahili than for English.
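The abstract mentions temperature scaling as one post-hoc remedy evaluated later; a minimal sketch of the standard procedure (Guo et al., 2017) is given below, where a single scalar temperature is fit on held-out logits by minimizing negative log-likelihood and then used to rescale test logits before the softmax. The variable names, the LBFGS optimizer and the choice of held-out split are assumptions for illustration, not this paper's exact setup.

```python
import torch

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Fit a single scalar temperature T > 0 on held-out logits by minimizing NLL."""
    logits = torch.as_tensor(val_logits, dtype=torch.float32)
    labels = torch.as_tensor(val_labels, dtype=torch.long)
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so that T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage: T is fit once on a held-out set, then test logits are divided by T before
# the softmax; predicted labels are unchanged, only the confidences are rescaled.
# calibrated_probs = torch.softmax(test_logits / T, dim=-1)
```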
Second, we look for factors that might affect the