Too Brittle To Touch: Comparing the Stability of Quantization and
Distillation Towards Developing Lightweight Low-Resource MT Models
Harshita Diddee1, Sandipan Dandapat2, Monojit Choudhury1,
Tanuja Ganu1, Kalika Bali1
1Microsoft Research, India
2Microsoft R&D, India
{t-hdiddee,sadandap,monojitc,taganu,kalikab}@microsoft.com
Abstract
Leveraging shared learning through Massively
Multilingual Models, state-of-the-art machine
translation (MT) models are often able to adapt
to the paucity of data for low-resource lan-
guages. However, this performance comes at
the cost of significantly bloated models which
are not practically deployable. Knowledge
Distillation is one popular technique to de-
velop competitive lightweight models: In this
work, we first evaluate its use to compress
MT models focusing specifically on languages
with extremely limited training data. Through
our analysis across 8 languages, we find that
the variance in the performance of the dis-
tilled models due to their dependence on pri-
ors including the amount of synthetic data
used for distillation, the student architecture,
training hyper-parameters and confidence of
the teacher models, makes distillation a brittle
compression mechanism. To mitigate this, we
explore the use of post-training quantization
for the compression of these models. Here,
we find that while distillation provides gains
across some low-resource languages, quanti-
zation provides more consistent performance
trends for the entire range of languages, espe-
cially the lowest-resource languages in our tar-
get set.
1 Introduction
While NLP has made giant strides in producing
more accurate models, these benefits are often
not transferred representatively to end-users who
would eventually use a language technology (Etha-
yarajh and Jurafsky,2020;Caselli et al.,2021).
Bloated sizes, cumbersome inference times (Tao
et al.,2022a) and a limited set of languages that
these models serve are a few reasons for this. More
specifically, their usage is hindered by access bottlenecks such as (a) Infrastructural Obstacles: a large percentage of end-users do not have sustained access to the internet or to high-compute devices to enjoy stable access to cloud inferencing of current NLP models (Ranathunga and de Silva, 2022; Diddee et al., 2022); (b) Latency Requirements: certain NLP services (chat-bots, real-time assistance interfaces, etc.) require very low inference times, which necessitates lightweight models; and (c) Privacy Constraints: the outflow of sensitive user data that is sent for inference to remotely hosted NLP models also has well-documented issues (Srinath et al., 2021; Huang and Chen, 2021; Huang et al., 2020; Diddee and Kansra, 2020).
Within the research that focuses on evaluating
and mitigating these practical constraints, the focus
on low-resource language setups has been fairly
limited (Ganesh et al.,2021). For instance, while
the compression of large language models has re-
ceived consistent attention through analysis of prun-
ing (Behnke and Heafield,2020;Behnke et al.,
2021), distillation (Bapna et al.,2022;Mghab-
bar and Ratnamogan,2020;Kim and Rush,2016;
Junczys-Dowmunt et al.,2018) and even quantiza-
tion (Bondarenko et al.,2021;Zadeh et al.,2020)
- much of this work has focused on compressing
language models for high-resource languages.
In this paper, we report the results of a compara-
tive analysis of the performance of distillation and
quantization. By focusing on compressing seq2seq
multilingual models across a range of languages
with data ranging from 7,000 to 3M samples, we especially demonstrate the different priors that need to be ascertained for the successful distillation of the model. We are unaware of any previous study that demonstrates the performance of these mechanisms on such low-resource languages.
The utility of this work is in commenting on the
feasibility of these two compression techniques for
rapid development and deployment of MT Mod-
els for low resource languages (Joshi et al.,2020).
More specifically, we believe that distillation’s re-
liance on several priors can be addressed naively
through a resource-intensive exercise, where the
optimal values of these priors are computed exhaus-
tively. However, in the absence of such a budget,
we expect this to be a major impediment in the
development of lightweight models for such lan-
guages. Since low resource language communities
may also be marginalised in other ways, exhaustive
investment of data and compute might not be feasi-
ble for such communities as well as the language
technologists working on these languages (Zhang
et al.,2022;Diddee et al.,2022;Markl,2022).
The main contributions of this work are:
1. We distill competitive baseline models for 8 low-resource languages (Bribri, Wixarica, Gondi, Mundari, Assamese, Odia, Punjabi and Gujarati) and evaluate the sensitivity of the resulting models to priors including (a) the amount of synthetic data used for training, (b) the architecture of the student model, (c) the training hyper-parameter configuration and (d) the confidence of the teacher models.
2. We then quantize these models to observe whether quantization provides a more consistent compression mechanism for these languages. Based on our analysis, we highlight the surprising stability of naive Post-Training Quantization over distillation, especially in the compression of extremely low-resource languages (training data between 5,000 and 25,000 samples).
We release a set of lightweight MT models with offline support for these languages, along with the scripts for generation and offline inference, to further reproducible research in this domain1.
2 Approach - Model and Size
Adaptations
In this section, we describe the languages (2.1),
architectures under consideration (2.1), the adap-
tations that we make for training and fine-tuning
these models (2.2) and the adaptations we make to
compress their size.
2.1 Languages
We perform our analysis on the eight languages
shown in Table 1. These languages cover a wide
range of availability of monolingual and parallel
data, spanning from classes 0 to 3 as defined in
Joshi et al. (2020). Additionally, they differ in their scripts and in their inclusion in the pretraining corpus, which results in interesting modelling adaptations that need to be made to develop their baselines. In this work, we only study the High-Resource Language (HRL) → Low-Resource Language (LRL) translation direction. The source languages for all our target languages are mentioned in Table 1.

1 Codebase and Open-Sourced Models
Family of Models
For this work, we leverage two model classes to carry out our analysis: (I) a seq2seq transformer (Vaswani et al., 2017), hereafter referred to as the vanilla transformer, with 6 encoder and 6 decoder layers, a vocabulary size varying between 8K and 32K, and 8 attention heads; and (II) mT5-small (Xue et al., 2021), with 8 encoder and 8 decoder layers, a vocabulary size of 250,100 and 6 attention heads.
We train the vanilla transformer from scratch,
hereafter referred to as transformer, to develop a
naive baseline for our experiments, and further fine-
tune the mT5-small, hereafter referred to as mT5,
with certain adaptations for all the languages, as
discussed in section 2.2.
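To make the setup concrete, a minimal sketch of instantiating these two model classes is shown below; it assumes PyTorch and the Hugging Face transformers library (which the paper does not name explicitly), and the model width is a placeholder.

```python
# Sketch only: the paper does not specify its framework; we assume PyTorch and
# the Hugging Face `transformers` implementation of mT5-small.
import torch.nn as nn
from transformers import MT5ForConditionalGeneration, AutoTokenizer

# (I) Vanilla seq2seq transformer trained from scratch: 6 encoder and 6 decoder
# layers, 8 attention heads (the 8K-32K subword vocabulary and embeddings are
# handled separately by the SentencePiece tokenizer of Section 3.2).
vanilla = nn.Transformer(
    d_model=512,              # hypothetical width; not stated in the paper
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    batch_first=True,
)

# (II) mT5-small, fine-tuned with the language-specific adaptations of Section 2.2.
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
mt5_tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
```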
For ease of reporting, we define the highest-performing model (denoted by HM) over our family of models as

HM = argmax_M A(M)

where M is a model class with performance A(M) after training (A is a metric such as BLEU (Papineni et al., 2002) or chrF (Popović, 2016) used to monitor the task-specific performance of the model).
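As a trivial illustration of this selection rule (the scores below are invented purely to show the mechanics and are not results from the paper):

```python
# Illustration of HM = argmax_M A(M); the BLEU values are placeholders, not
# results reported in this paper.
validation_bleu = {"transformer": 10.5, "mT5": 13.2}
hm = max(validation_bleu, key=validation_bleu.get)
print(hm)  # -> "mT5"
```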
2.2 Model Adaptations: Language Specific
Approaches
Here we describe the strategies required to adapt
these models to different low-resource languages:
During fine-tuning, we adapt the pretrained mT5 tokenizer to unseen scripts (encountered for Odia) by transliterating the text into the script of the closest, highest-resource language included in the pretraining corpus of the pretrained model (Khemchandani et al., 2021; Ramesh et al., 2021, 2022). For our extremely low-resource languages, we used Lexicon Adaptation (Wang et al., 2022) to augment target-side monolingual data wherever a bilingual lexicon could be leveraged; detailed performance on Hindi-Gondi is provided in Appendix A.2.
Language    Class   Source Language   Monolingual Data   Parallel Data   Shared Script   Included in Pretraining
Bribri      0       Spanish
Wixarica    0       Spanish
Mundari     0       Hindi
Gondi       0       Hindi
Assamese    1       English
Odia        1       English
Punjabi     2       English
Gujarati    1       English

Table 1: Languages under consideration. Note that, except for a language's inclusion in the pretraining corpus of our chosen pretrained language models, all factors are independent of our experimental setup. The Source Language column lists the source language of the translation pairs.
However, since such methods were not exten-
sible to all the languages in our target language set,
we report final experimental results on the models
which did not leverage any additional data other
than the data mentioned in A.1. Since we analyze
the HRL to LRL direction and 4 out of 8 (Bribri,
Wixarica, Gondi and Mundari) of our target lan-
guages have little to negligible monolingual data -
we were also unable to leverage Back-Translation
to augment our language-specific parallel corpus
(Edunov et al.,2018).
2.3 Size Adaptation: Knowledge Distillation
Knowledge distillation involves training a smaller student network to mimic the token-level probabilities of a larger, more accurate teacher model. We distill our models using Hard Distillation (Kim and Rush, 2016): we take a set of monolingual sentences in the HRL and forward-translate them using the HM to generate synthetic labels that a lighter student model is then trained on.
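A minimal sketch of this procedure is given below, assuming a Hugging Face seq2seq teacher; the checkpoint path and helper names are placeholders rather than the authors' released code.

```python
# Hard (sequence-level) distillation sketch: forward-translate HRL monolingual
# sentences with the teacher (HM) to build a synthetic parallel corpus for the
# student. Checkpoint path and function names are placeholders.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_path = "path/to/highest-performing-model"   # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(teacher_path)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_path).eval()

def generate_pseudo_labels(hrl_sentences, batch_size=32, max_length=128):
    """Translate HRL monolingual sentences into synthetic LRL targets."""
    pseudo_parallel = []
    for i in range(0, len(hrl_sentences), batch_size):
        batch = hrl_sentences[i : i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            outputs = teacher.generate(**inputs, max_length=max_length, num_beams=5)
        targets = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        pseudo_parallel.extend(zip(batch, targets))
    return pseudo_parallel

# The lighter student is then trained on `pseudo_parallel` with the usual
# per-token cross-entropy objective (Section 3.2).
```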
2.3.1 Estimation of Optimal Values for Priors
We define a prior as any attribute of the compres-
sion mechanism that needs to be initialized mean-
ingfully and/or optimized for optimal performance
- akin to hyperparameters. We use this term specifically to put all the dependent variables, such as the training data and the prediction confidence of the uncompressed models, in a single bucket, rather than using a term like hyperparameters that already carries a traditional meaning in the literature. The experimental sweeps for these priors are briefly explained in this section. Note that we focus largely on distillation while estimating these priors, because quantization provides competitive models even with the default choices established in the literature, whereas with distillation the estimation of these priors is critical to achieving a competitive compressed model variant in most cases.
Prior 1: Optimal Student Architecture
Following prior work like Bapna et al. (2022), we swept across 3 candidate architectures, all variants of a seq2seq transformer, two of which used deep encoders and shallower decoders: (a) 8 encoder + 6 decoder layers, (b) 6 encoder + 4 decoder layers and (c) 6 encoder + 3 decoder layers. We chose the architecture that gave the best BLEU performance after 30 epochs. Architecture sweeps were run for Gondi, Assamese and Odia, as these languages cover a wide range of training data sizes.
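A sketch of this sweep is shown below; `train_fn` and `evaluate_bleu` are assumed helper functions, not part of the paper's codebase.

```python
# Prior 1 sketch: sweep three candidate student architectures per language and
# keep the one with the best BLEU after 30 epochs. Helpers are placeholders.
CANDIDATE_STUDENTS = [
    {"encoder_layers": 8, "decoder_layers": 6},
    {"encoder_layers": 6, "decoder_layers": 4},
    {"encoder_layers": 6, "decoder_layers": 3},
]

def sweep_architectures(train_fn, evaluate_bleu,
                        languages=("gondi", "assamese", "odia")):
    best = {}
    for lang in languages:
        scored = []
        for arch in CANDIDATE_STUDENTS:
            student = train_fn(lang, arch, epochs=30)     # assumed helper
            scored.append((evaluate_bleu(student, lang), arch))
        best[lang] = max(scored, key=lambda s: s[0])[1]   # highest-BLEU variant
    return best
```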
Prior 2: Optimal Training Hyperparameters
We swept across a set of hyper-parameter configurations for Bribri, Gondi, Assamese and Gujarati to identify the optimal set for the distilled student models. Our goal here was to specifically study the transferability of a hyper-parameter set that performed competitively for one or more languages to all the languages in our target set.
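For illustration, such a sweep can be expressed as a small grid; the specific values below are placeholders and not the paper's actual search space.

```python
# Prior 2 sketch: enumerate hyper-parameter sets from a small grid. The values
# are illustrative placeholders.
from itertools import product

GRID = {
    "learning_rate": [1e-3, 5e-4],
    "batch_size": [32, 64, 128],
    "dropout": [0.1, 0.3],
}

def hyperparameter_sets(grid=GRID):
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Each set is evaluated on Bribri, Gondi, Assamese and Gujarati to check whether
# a configuration that works for one language transfers to the others.
```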
Prior 3: Amount of Training Data for the Student
We swept across 3 candidate sizes of our synthetic dataset: 100K, 250K and 500K pseudo-labels. Since this decision could also depend heavily on the quality of the labels generated per language, we ran this sweep for Bribri, Gondi, Odia and Gujarati, as the quality of the labels generated by the teachers for these languages would be expected to demonstrate significant variation.
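A minimal sketch of this subsampling step, reusing the pseudo-label pool produced by the distillation sketch above:

```python
# Prior 3 sketch: subsample the teacher-generated pseudo-labels at the three
# candidate sizes (100K, 250K and 500K pairs).
import random

def subsample_pseudo_labels(pseudo_parallel, sizes=(100_000, 250_000, 500_000),
                            seed=0):
    rng = random.Random(seed)
    pool = list(pseudo_parallel)
    return {n: rng.sample(pool, k=min(n, len(pool))) for n in sizes}
```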
Prior 4: Optimal Teacher Architecture
To preliminarily quantify the effect of the choice of teacher architecture, and of the quantity of data the teacher is trained on, on the compressibility of the model, we evaluated the confidence of our teacher models on the predictions they generated. For this, we sampled 100 instances from each of our test sets and monitored the logit distribution of our teacher models. Specifically, we calculated the average entropy of the token-level softmax distributions over a sequence. Taking inspiration from the unsupervised estimation of the quality of machine translation outputs through similar methods (Fomicheva et al., 2020), we hypothesised that the lower the entropy of our model, the more confident it would be in its predictions for a given sample. The intuition here is that if a model is confident about its prediction, its logit distribution would be highly skewed rather than resembling a uniform distribution (which would indicate indecisiveness in predicting the right token, and therefore the right sequence). Eventually, this could be used to gauge the quality of the pseudo-labels that the students were being trained on.
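A sketch of this confidence estimate is given below, assuming a Hugging Face seq2seq teacher whose per-step logits are exposed during generation; this is a possible realisation, not the authors' exact code.

```python
# Prior 4 sketch: average the entropy of the token-level softmax distributions
# over a generated sequence. Lower average entropy is taken to indicate a more
# confident (more skewed) teacher.
import torch
import torch.nn.functional as F

def mean_token_entropy(teacher, tokenizer, source_sentence, max_length=128):
    inputs = tokenizer(source_sentence, return_tensors="pt")
    with torch.no_grad():
        out = teacher.generate(
            **inputs,
            max_length=max_length,
            output_scores=True,
            return_dict_in_generate=True,
        )
    entropies = []
    for step_logits in out.scores:               # one logit tensor per decoded token
        probs = F.softmax(step_logits, dim=-1)
        step_entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
        entropies.append(step_entropy)
    return torch.stack(entropies).mean().item()  # average over the sequence
```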
2.4 Size Adaptation: Quantization
Quantization is a common way to reduce the com-
putational time and memory consumption of neu-
ral networks (Wu et al.,2020). Here, a lower-bit
representation of weights and activation functions
is used to achieve a lower memory footprint. In
this work, we perform post-training quantization,
where after training the base model with full pre-
cision of floating point 32 bits (fp-32), we convert
the weights and activations of the model to 8 bit
integers (int-8). Note that during inference, we
still preserve the precision of the input and output
encoder-decoder distributions as fp-32. In theory, this brings down the memory consumption of the model by nearly 4x, though we see an effective reduction of about 3x in practice. More details on the memory reductions achieved are specified in Appendix A.4.
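A sketch of one common realisation, PyTorch dynamic post-training quantization, is shown below; the paper does not name its quantization toolkit or exact scheme, so treat this as an assumption rather than the authors' setup.

```python
# Post-training quantization sketch with PyTorch dynamic quantization: linear
# layer weights are stored as int8 while inputs/outputs stay fp32. One possible
# realisation only; the paper does not specify its toolkit.
import torch
from transformers import MT5ForConditionalGeneration

model_fp32 = MT5ForConditionalGeneration.from_pretrained("google/mt5-small").eval()

model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,
    {torch.nn.Linear},   # modules to quantize
    dtype=torch.qint8,
)

# model_int8 is a drop-in replacement at inference time; in practice this gives
# roughly a 3x memory reduction rather than the theoretical 4x.
```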
3 Experimental Setup
3.1 Data
(a) Bribri and Wixarica: We use the training data of 7K and 8K sentences, respectively, from Feldman and Coto-Solano (2020) and evaluate on test data from Mager et al. (2021). (b) Gondi: We use 26K sentences from the data open-sourced by CGNET Swara (CGNET, 2019) and split it into training and test sets.2 (c) Mundari: We use a dataset of 10K sentences provided by the Indian Institute of Technology, Kharagpur3, and split it into training and test sets. (d) Assamese, Odia, Punjabi and Gujarati: We use the training data from Ramesh et al. (2022) (with 0.14M, 1M, 2.4M and 3M sentences, respectively) and evaluate on test data from FLORES200 (Goyal et al., 2022) for Assamese and from WAT2021 (Nakazawa et al., 2021) for the remaining languages. Additional details about the datasets (sizes and splits) are mentioned in Appendix A.1.

2 To avoid any test-set leaks, we deduplicate the data by removing tuples (S_i, T_i), where S_i is the i-th sentence in the source language and T_i is the i-th sentence in the target language, between the train and the test set.
3 Data to be released soon.
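The deduplication described in footnote 2 amounts to dropping any training pair that also occurs in the test set; a minimal sketch:

```python
# Footnote 2 sketch: remove train/test leaks by dropping any (source, target)
# pair from the training data that also appears in the test data.
def deduplicate(train_pairs, test_pairs):
    test_set = set(test_pairs)                # set of (S_i, T_i) tuples
    return [pair for pair in train_pairs if pair not in test_set]
```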
3.2 Training Setup
Hyperparameters:
We use the transformer and mT5 as our model classes, as described previously in Section 2. The hyperparameters for our transformer model were optimized for fine-tuning on Odia, trained on 1M sentence pairs. For fine-tuning, we use the Adafactor optimizer (Shazeer and Stern, 2018) with a linearly decaying learning rate of 1e-3. Since training with smaller batches is known to be more effective for extremely low-resource language training (Atrio and Popescu-Belis, 2022), we tuned the training batch size for every language, varying it from 32 to 256 (with gradient accumulation of 2), though we did not see very significant variation in performance from this tuning. For our stopping criterion, we fine-tuned all models for 60 epochs (which concluded with considerably overfit models) and then selected the checkpoint with the best validation BLEU (using the 13a tokenizer, which mimics the mteval-v13a script from Moses) (Post, 2018).
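Checkpoint selection by validation BLEU with the 13a tokenizer can be sketched with the sacrebleu package (an assumption about tooling; the paper only cites Post, 2018):

```python
# Sketch of checkpoint selection: score each checkpoint's validation hypotheses
# with sacreBLEU (13a tokenizer) and keep the best. `checkpoint_outputs` maps
# checkpoint names to lists of hypothesis strings; `references` is the list of
# reference translations for the validation set.
import sacrebleu

def select_best_checkpoint(checkpoint_outputs, references):
    best_name, best_bleu = None, float("-inf")
    for name, hypotheses in checkpoint_outputs.items():
        bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="13a").score
        if bleu > best_bleu:
            best_name, best_bleu = name, bleu
    return best_name, best_bleu
```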
We use the sentencepiece tokenizer to build tok-
enizers for training the baselines for each of the lan-
guages (Kudo and Richardson,2018). We use the
per-token cross-entropy loss for fine-tuning all our
models. Following Xu et al. (2021), we opt for a
relatively smaller vocabulary size with the intent of
learning more meaningful subword representations
for our extremely low-resource languages. Specif-
ically, we use a vocabulary size of 8K for Gondi,
Mundari, Bribri and Wixarica, compared to 32K
used for Assamese, Odia, Punjabi and Gujarati.
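A sketch of building one such tokenizer with the sentencepiece package is given below; the file paths and the unigram model type are assumptions, and only the vocabulary sizes come from the paper.

```python
# Sketch of training a language-specific SentencePiece tokenizer (8K vocabulary
# for Gondi, Mundari, Bribri and Wixarica; 32K for the others). Paths and the
# `unigram` model type are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/gondi_train.txt",   # hypothetical path to training text
    model_prefix="gondi_sp",
    vocab_size=8000,
    model_type="unigram",
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="gondi_sp.model")
pieces = sp.encode("example sentence", out_type=str)
```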
Experimental Setup for Distillation
For Mundari and Gondi, we utilize 500K Hindi sentences sampled from the Samanantar corpus (Ramesh et al., 2022); we use the corresponding English corpus to sample English sentences for