Too Brittle To Touch: Comparing the Stability of Quantization and
Distillation Towards Developing Lightweight Low-Resource MT Models
Harshita Diddee1, Sandipan Dandapat2, Monojit Choudhury1,
Tanuja Ganu1, Kalika Bali1
1Microsoft Research, India
2Microsoft R&D, India
{t-hdiddee,sadandap,monojitc,taganu,kalikab}@microsoft.com
Abstract
Leveraging shared learning through Massively
Multilingual Models, state-of-the-art machine
translation (MT) models are often able to adapt
to the paucity of data for low-resource lan-
guages. However, this performance comes at
the cost of significantly bloated models which
are not practically deployable. Knowledge
Distillation is one popular technique to de-
velop competitive lightweight models: In this
work, we first evaluate its use to compress
MT models focusing specifically on languages
with extremely limited training data. Through
our analysis across 8 languages, we find that
the variance in the performance of the dis-
tilled models due to their dependence on pri-
ors including the amount of synthetic data
used for distillation, the student architecture,
training hyper-parameters and confidence of
the teacher models, makes distillation a brittle
compression mechanism. To mitigate this, we
explore the use of post-training quantization
for the compression of these models. Here,
we find that while distillation provides gains
across some low-resource languages, quanti-
zation provides more consistent performance
trends for the entire range of languages, espe-
cially the lowest-resource languages in our tar-
get set.
1 Introduction
While NLP has made giant strides in producing
more accurate models, these benefits are often
not transferred representatively to end-users who
would eventually use a language technology (Etha-
yarajh and Jurafsky,2020;Caselli et al.,2021).
Bloated sizes, cumbersome inference times (Tao
et al.,2022a) and a limited set of languages that
these models serve are a few reasons for this. More
specifically, their usage is hindered by access bottlenecks such as (a) Infrastructural Obstacles: a large percentage of end-users do not have sustained access to the internet or to high-compute devices to enjoy stable access to cloud inferencing of current NLP models (Ranathunga and de Silva, 2022; Diddee et al., 2022); (b) Latency Requirements: certain NLP services (chat-bots, real-time assistance interfaces, etc.) require very low inference times, which necessitates lightweight models; and (c) Privacy Constraints: the outflow of sensitive user data that is sent for inference to remotely hosted NLP models also has well-documented issues (Srinath et al., 2021; Huang and Chen, 2021; Huang et al., 2020; Diddee and Kansra, 2020).
Within the research that focuses on evaluating
and mitigating these practical constraints, the focus
on low-resource language setups has been fairly
limited (Ganesh et al.,2021). For instance, while
the compression of large language models has re-
ceived consistent attention through analysis of prun-
ing (Behnke and Heafield,2020;Behnke et al.,
2021), distillation (Bapna et al.,2022;Mghab-
bar and Ratnamogan,2020;Kim and Rush,2016;
Junczys-Dowmunt et al.,2018) and even quantiza-
tion (Bondarenko et al.,2021;Zadeh et al.,2020)
- much of this work has focused on compressing
language models for high-resource languages.
In this paper, we report the results of a compara-
tive analysis of the performance of distillation and
quantization. By focusing on compressing seq2seq
multilingual models across a range of languages
with data ranging from 7,000 to 3M samples, we especially demonstrate the different priors that need to be ascertained for the successful distillation of the model. We are unaware of any previous study that demonstrates the performance of these mechanisms on such low-resource languages.
The utility of this work is in commenting on the
feasibility of these two compression techniques for
rapid development and deployment of MT Mod-
els for low resource languages (Joshi et al.,2020).
More specifically, we believe that distillation’s re-
liance on several priors can be addressed naively
through a resource-intensive exercise, where the
optimal values of these priors are computed exhaus-
tively. However, in the absence of such a budget,
we expect this to be a major impediment in the
development of lightweight models for such lan-
guages. Since low resource language communities
may also be marginalised in other ways, exhaustive
investment of data and compute might not be feasi-
ble for such communities as well as the language
technologists working on these languages (Zhang
et al.,2022;Diddee et al.,2022;Markl,2022).
The main contributions of this work are:
1. We distill competitive baseline models for 8 low-resource languages (Bribri, Wixarica, Gondi, Mundari, Assamese, Odia, Punjabi and Gujarati) and evaluate the sensitivity of the resulting models to priors including (a) the amount of synthetic data used for training, (b) the architecture of the student model, (c) the training hyper-parameter configuration and (d) the confidence of the teacher models.
2. We then quantize these models to observe whether quantization provides a more consistent compression mechanism for these languages. Based on our analysis, we highlight the surprising stability of naive Post-Training Quantization over distillation, especially in the compression of extremely low-resource languages (training data between 5,000 and 25,000 samples).
We release a set of lightweight MT models with offline support for these languages, along with the scripts for generation and offline inference, to further reproducible research in this domain1.
2 Approach - Model and Size
Adaptations
In this section, we describe the languages (2.1),
architectures under consideration (2.1), the adap-
tations that we make for training and fine-tuning
these models (2.2) and the adaptations we make to
compress their size.
2.1 Languages
We perform our analysis on the eight languages
shown in Table 1. These languages cover a wide
range of availability of monolingual and parallel
data, spanning from classes 0 to 3 as defined in
Joshi et al. (2020). Additionally, they differ in their scripts and in their inclusion in the pretraining corpus, which results in interesting modelling adaptations that need to be made to develop their baselines. In this work, we only study the High-Resource Language (HRL) → Low-Resource Language (LRL) translation direction. The source languages for all our target languages are mentioned in Table 1.

1 Codebase and Open-Sourced Models
Family of Models
For this work, we leverage two model classes to carry out our analysis: (I) a seq2seq transformer (Vaswani et al., 2017), hereafter referred to as the vanilla transformer, with 6 encoder and 6 decoder layers, a vocabulary size varying between 8K and 32K, and 8 attention heads; and (II) mT5-small (Xue et al., 2021), with 8 encoder and 8 decoder layers, a vocabulary size of 250,100 and 6 attention heads.
We train the vanilla transformer from scratch,
hereafter referred to as transformer, to develop a
naive baseline for our experiments, and further fine-
tune the mT5-small, hereafter referred to as mT5,
with certain adaptations for all the languages, as
discussed in section 2.2.
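To make the setup concrete, a minimal sketch of instantiating these two model classes is shown below; it assumes PyTorch and the Hugging Face transformers library (which the paper does not name explicitly), and the model width is a placeholder.

```python
# Sketch only: the paper does not specify its framework; we assume PyTorch and
# the Hugging Face `transformers` implementation of mT5-small.
import torch.nn as nn
from transformers import MT5ForConditionalGeneration, AutoTokenizer

# (I) Vanilla seq2seq transformer trained from scratch: 6 encoder and 6 decoder
# layers, 8 attention heads (the 8K-32K subword vocabulary and embeddings are
# handled separately by the SentencePiece tokenizer of Section 3.2).
vanilla = nn.Transformer(
    d_model=512,              # hypothetical width; not stated in the paper
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    batch_first=True,
)

# (II) mT5-small, fine-tuned with the language-specific adaptations of Section 2.2.
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
mt5_tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
```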
For ease of reporting, we define the highest-performing model (denoted by HM) over our family of models as

HM = argmax_M A(M)

where M is a model class with performance A(M) after training (A is a metric such as BLEU (Papineni et al., 2002) or chrF (Popović, 2016) used to monitor the task-specific performance of the model).
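As a trivial illustration of this selection rule (the scores below are invented purely to show the mechanics and are not results from the paper):

```python
# Illustration of HM = argmax_M A(M); the BLEU values are placeholders, not
# results reported in this paper.
validation_bleu = {"transformer": 10.5, "mT5": 13.2}
hm = max(validation_bleu, key=validation_bleu.get)
print(hm)  # -> "mT5"
```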
2.2 Model Adaptations: Language Specific
Approaches
Here we describe the strategies required to adapt
these models to different low-resource languages:
During fine-tuning, we adapt the pretrained mT5 tokenizer to unseen scripts (encountered for Odia) by transliterating the text into the script of the closest, highest-resource language included in the pretraining corpus of the pretrained model (Khemchandani et al., 2021; Ramesh et al., 2021, 2022). For our extremely low-resource languages, we used Lexicon Adaptation (Wang et al., 2022) to augment target-side monolingual data wherever a bilingual lexicon could be leveraged; detailed performance on Hindi-Gondi is provided in Appendix A.2.
Language    Class   Source Language   Monolingual Data   Parallel Data   Shared Script   Included in Pretraining
Bribri      0       Spanish
Wixarica    0       Spanish
Mundari     0       Hindi
Gondi       0       Hindi
Assamese    1       English
Odia        1       English
Punjabi     2       English
Gujarati    1       English

Table 1: Languages under consideration. Note that, except for a language's inclusion in the pretraining corpus of our chosen pretrained language models, all factors are independent of our experimental setup. The Source Language column lists the source language of the translation pairs.
However, since such methods were not exten-
sible to all the languages in our target language set,
we report final experimental results on the models
which did not leverage any additional data other
than the data mentioned in A.1. Since we analyze
the HRL to LRL direction and 4 out of 8 (Bribri,
Wixarica, Gondi and Mundari) of our target lan-
guages have little to negligible monolingual data -
we were also unable to leverage Back-Translation
to augment our language-specific parallel corpus
(Edunov et al.,2018).
2.3 Size Adaptation: Knowledge Distillation
Knowledge distillation involves training a smaller student network to mimic the token-level probabilities of a larger, more accurate teacher model. We distill our models using Hard Distillation (Kim and Rush, 2016): we take a set of monolingual sentences in the HRL and forward-translate them using the HM to generate synthetic labels that a lighter student model is then trained on.
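A minimal sketch of this procedure is given below, assuming a Hugging Face seq2seq teacher; the checkpoint path and helper names are placeholders rather than the authors' released code.

```python
# Hard (sequence-level) distillation sketch: forward-translate HRL monolingual
# sentences with the teacher (HM) to build a synthetic parallel corpus for the
# student. Checkpoint path and function names are placeholders.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_path = "path/to/highest-performing-model"   # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(teacher_path)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_path).eval()

def generate_pseudo_labels(hrl_sentences, batch_size=32, max_length=128):
    """Translate HRL monolingual sentences into synthetic LRL targets."""
    pseudo_parallel = []
    for i in range(0, len(hrl_sentences), batch_size):
        batch = hrl_sentences[i : i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = teacher.generate(**inputs, max_length=max_length, num_beams=5)
        targets = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        pseudo_parallel.extend(zip(batch, targets))
    return pseudo_parallel

# The lighter student is then trained on `pseudo_parallel` with the usual
# per-token cross-entropy objective (Section 3.2).
```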
2.3.1 Estimation of Optimal Values for Priors
We define a prior as any attribute of the compres-
sion mechanism that needs to be initialized mean-
ingfully and/or optimized for optimal performance
- akin to hyperparameters. We use this term specifically to put all the dependent variables, such as the training data and the prediction confidence of the uncompressed models, in a single bucket, rather than using a term like hyperparameters that already carries a traditional meaning in the literature. The experimental sweeps for these priors are briefly explained in this section. Note that we focus largely on distillation while estimating these priors, because quantization provides competitive models even with the default choices established in the literature, whereas with distillation the estimation of these priors is critical to achieving a competitive compressed model variant in most cases.
Prior 1: Optimal Student Architecture
Following prior work like Bapna et al. (2022), we swept across 3 candidate architectures, all variants of a seq2seq transformer, two of which used deep encoders and shallower decoders: (a) 8 encoder + 6 decoder layers, (b) 6 encoder + 4 decoder layers and (c) 6 encoder + 3 decoder layers. We chose the architecture that gave the best BLEU performance after 30 epochs. Architecture sweeps were run for Gondi, Assamese and Odia, as these languages cover a wide range of training data sizes.
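A sketch of this sweep is shown below; `train_fn` and `evaluate_bleu` are assumed helper functions, not part of the paper's codebase.

```python
# Prior 1 sketch: sweep three candidate student architectures per language and
# keep the one with the best BLEU after 30 epochs. Helpers are placeholders.
CANDIDATE_STUDENTS = [
    {"encoder_layers": 8, "decoder_layers": 6},
    {"encoder_layers": 6, "decoder_layers": 4},
    {"encoder_layers": 6, "decoder_layers": 3},
]

def sweep_architectures(train_fn, evaluate_bleu,
                        languages=("gondi", "assamese", "odia")):
    best = {}
    for lang in languages:
        scored = []
        for arch in CANDIDATE_STUDENTS:
            student = train_fn(lang, arch, epochs=30)     # assumed helper
            scored.append((evaluate_bleu(student, lang), arch))
        best[lang] = max(scored, key=lambda s: s[0])[1]   # highest-BLEU variant
    return best
```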
Prior 2: Optimal Training Hyperparameters
We swept across a set of hyper-parameter configurations for Bribri, Gondi, Assamese and Gujarati to identify the optimal set for the distilled student models. Our goal here was to specifically study the transferability of a hyper-parameter set that performed competitively for one or more languages to all the languages in our target set.
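For illustration, such a sweep can be expressed as a small grid; the specific values below are placeholders and not the paper's actual search space.

```python
# Prior 2 sketch: enumerate hyper-parameter sets from a small grid. The values
# are illustrative placeholders.
from itertools import product

GRID = {
    "learning_rate": [1e-3, 5e-4],
    "batch_size": [32, 64, 128],
    "dropout": [0.1, 0.3],
}

def hyperparameter_sets(grid=GRID):
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Each set is evaluated on Bribri, Gondi, Assamese and Gujarati to check whether
# a configuration that works for one language transfers to the others.
```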
Prior 3: Amount of Training Data for the Student
We swept across 3 candidate sizes of our synthetic dataset: 100K, 250K and 500K pseudo-labels. Since this decision could also depend heavily on the quality of the labels generated per language, we ran this sweep for Bribri, Gondi, Odia and Gujarati, as the quality of the labels generated by the teachers for these languages would be expected to demonstrate significant variation.
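A minimal sketch of this subsampling step, reusing the pseudo-label pool produced by the distillation sketch above:

```python
# Prior 3 sketch: subsample the teacher-generated pseudo-labels at the three
# candidate sizes (100K, 250K and 500K pairs).
import random

def subsample_pseudo_labels(pseudo_parallel, sizes=(100_000, 250_000, 500_000),
                            seed=0):
    rng = random.Random(seed)
    pool = list(pseudo_parallel)
    return {n: rng.sample(pool, k=min(n, len(pool))) for n in sizes}
```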
Prior 4: Optimal Teacher Architecture
To preliminarily quantify the effect of the choice of teacher architecture, and of the quantity of data the teacher is trained on, on the compressibility of the model, we evaluated the confidence of our teacher models on the predictions they generated. For this, we sampled 100 instances from each of our test sets and monitored the logit distribution of our teacher models. Specifically, we calculated the average entropy of the token-level softmax distributions over a sequence. Taking inspiration from the unsupervised estimation of the quality of machine translation outputs through similar methods (Fomicheva et al., 2020), we hypothesised that the lower the entropy of our model, the more confident it would be in its predictions for a given sample. The intuition here is that if a model is confident about its prediction, its logit distribution would be highly skewed rather than resembling a uniform distribution (which would indicate indecisiveness in predicting the right token, and therefore the right sequence). Eventually, this could be used to gauge the quality of the pseudo-labels that the students were being trained on.
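A sketch of this confidence estimate is given below, assuming a Hugging Face seq2seq teacher whose per-step logits are exposed during generation; this is a possible realisation, not the authors' exact code.

```python
# Prior 4 sketch: average the entropy of the token-level softmax distributions
# over a generated sequence. Lower average entropy is taken to indicate a more
# confident (more skewed) teacher.
import torch
import torch.nn.functional as F

def mean_token_entropy(teacher, tokenizer, source_sentence, max_length=128):
    inputs = tokenizer(source_sentence, return_tensors="pt")
    with torch.no_grad():
        out = teacher.generate(
            **inputs,
            max_length=max_length,
            output_scores=True,
            return_dict_in_generate=True,
        )
    entropies = []
    for step_logits in out.scores:               # one logit tensor per decoded token
        probs = F.softmax(step_logits, dim=-1)
        step_entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
        entropies.append(step_entropy)
    return torch.stack(entropies).mean().item()  # average over the sequence
```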
2.4 Size Adaptation: Quantization
Quantization is a common way to reduce the com-
putational time and memory consumption of neu-
ral networks (Wu et al.,2020). Here, a lower-bit
representation of weights and activation functions
is used to achieve a lower memory footprint. In
this work, we perform post-training quantization,
where after training the base model with full pre-
cision of floating point 32 bits (fp-32), we convert
the weights and activations of the model to 8 bit
integers (int-8). Note that during inference, we
still preserve the precision of the input and output
encoder-decoder distributions as fp-32. In theory, this brings down the memory consumption of the model by nearly 4x, though we see an effective reduction of about 3x in practice. More details on the memory reductions achieved are specified in Appendix A.4.
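A sketch of one common realisation, PyTorch dynamic post-training quantization, is shown below; the paper does not name its quantization toolkit or exact scheme, so treat this as an assumption rather than the authors' setup.

```python
# Post-training quantization sketch with PyTorch dynamic quantization: linear
# layer weights are stored as int8 while inputs/outputs stay fp32. One possible
# realisation only; the paper does not specify its toolkit.
import torch
from transformers import MT5ForConditionalGeneration

model_fp32 = MT5ForConditionalGeneration.from_pretrained("google/mt5-small").eval()

model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,
    {torch.nn.Linear},   # modules to quantize
    dtype=torch.qint8,
)

# model_int8 is a drop-in replacement at inference time; in practice this gives
# roughly a 3x memory reduction rather than the theoretical 4x.
```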
3 Experimental Setup
3.1 Data
(a) Bribri and Wixarica: We use the training data of 7K and 8K sentences, respectively, from Feldman and Coto-Solano (2020) and evaluate on test data from Mager et al. (2021). (b) Gondi: We use 26K sentences from the data open-sourced by CGNET Swara (CGNET, 2019) and split it into training and test sets.2 (c) Mundari: We use a dataset of 10K sentences provided by the Indian Institute of Technology, Kharagpur3, and split it into training and test sets. (d) Assamese, Odia, Punjabi and Gujarati: We use the training data from Ramesh et al. (2022) (with 0.14M, 1M, 2.4M and 3M sentences, respectively) and evaluate on test data from FLORES200 (Goyal et al., 2022) for Assamese and from WAT2021 (Nakazawa et al., 2021) for the remaining languages. Additional details about the datasets (sizes and splits) are mentioned in Appendix A.1.

2 To avoid any test-set leaks, we deduplicate the data by removing tuples (S_i, T_i), where S_i is the i-th sentence in the source language and T_i is the i-th sentence in the target language, between the train and the test set.
3 Data to be released soon.
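The deduplication described in footnote 2 amounts to dropping any training pair that also occurs in the test set; a minimal sketch:

```python
# Footnote 2 sketch: remove train/test leaks by dropping any (source, target)
# pair from the training data that also appears in the test data.
def deduplicate(train_pairs, test_pairs):
    test_set = set(test_pairs)                # set of (S_i, T_i) tuples
    return [pair for pair in train_pairs if pair not in test_set]
```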
3.2 Training Setup
Hyperparameters:
We use the transformer and mT5 as our model classes, as described previously in Section 2. The hyperparameters for our transformer model were optimized for fine-tuning on Odia, trained on 1M sentence pairs. For fine-tuning, we use the Adafactor optimizer (Shazeer and Stern, 2018) with a linearly decaying learning rate of 1e-3. Since training with smaller batches is known to be more effective for extremely low-resource language training (Atrio and Popescu-Belis, 2022), we tuned the training batch size for every language, varying it from 32 to 256 (with gradient accumulation of 2), though we did not see very significant variation in performance from this tuning. For our stopping criterion, we fine-tuned all models for 60 epochs (which concluded with considerably overfit models) and then selected the checkpoint with the best validation BLEU (using the 13a tokenizer, which mimics the mteval-v13a script from Moses) (Post, 2018).
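Checkpoint selection by validation BLEU with the 13a tokenizer can be sketched with the sacrebleu package (an assumption about tooling; the paper only cites Post, 2018):

```python
# Sketch of checkpoint selection: score each checkpoint's validation hypotheses
# with sacreBLEU (13a tokenizer) and keep the best. `checkpoint_outputs` maps
# checkpoint names to lists of hypothesis strings; `references` is the list of
# reference translations for the validation set.
import sacrebleu

def select_best_checkpoint(checkpoint_outputs, references):
    best_name, best_bleu = None, float("-inf")
    for name, hypotheses in checkpoint_outputs.items():
        bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="13a").score
        if bleu > best_bleu:
            best_name, best_bleu = name, bleu
    return best_name, best_bleu
```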
We use the sentencepiece tokenizer to build tok-
enizers for training the baselines for each of the lan-
guages (Kudo and Richardson,2018). We use the
per-token cross-entropy loss for fine-tuning all our
models. Following Xu et al. (2021), we opt for a
relatively smaller vocabulary size with the intent of
learning more meaningful subword representations
for our extremely low-resource languages. Specif-
ically, we use a vocabulary size of 8K for Gondi,
Mundari, Bribri and Wixarica, compared to 32K
used for Assamese, Odia, Punjabi and Gujarati.
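A sketch of building one such tokenizer with the sentencepiece package is given below; the file paths and the unigram model type are assumptions, and only the vocabulary sizes come from the paper.

```python
# Sketch of training a language-specific SentencePiece tokenizer (8K vocabulary
# for Gondi, Mundari, Bribri and Wixarica; 32K for the others). Paths and the
# `unigram` model type are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/gondi_train.txt",   # hypothetical path to training text
    model_prefix="gondi_sp",
    vocab_size=8000,
    model_type="unigram",
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="gondi_sp.model")
pieces = sp.encode("example sentence", out_type=str)
```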
Experimental Setup for Distillation
For Mundari and Gondi, we utilize 500K Hindi sentences sampled from the Samanantar corpus (Ramesh et al., 2022); we use the corresponding English corpus to sample English sentences for