Knowledge Distillation Transfer Sets and their Impact on Downstream
NLU Tasks
Charith Peris
Amazon, Cambridge, USA
perisc@amazon.com
Lizhen Tan
Amazon, Cambridge, USA
ltn@amazon.com
Thomas Gueudre
Amazon, Turin, Italy
tgueudre@amazon.it
Turan Gojayev
Amazon, Berlin, Germany
tgojayev@amazon.de
Pan Wei
Amazon, Cambridge, USA
panwei@amazon.com
Gokmen Oz
Amazon, Cambridge, USA
ogokmen@amazon.com
Abstract
Teacher-student knowledge distillation is a popular technique for compressing today's prevailing large language models into manageable sizes that fit low-latency downstream applications. Both the teacher and the choice of transfer set used for distillation are crucial ingredients in creating a high-quality student. Yet the generic corpora used to pretrain the teacher and the corpora associated with the downstream target domain are often significantly different, which raises a natural question: should the student be distilled over the generic corpora, so as to learn from high-quality teacher predictions, or over the downstream task corpora to align with finetuning? Our study investigates this trade-off using Domain Classification (DC) and Intent Classification/Named Entity Recognition (ICNER) as downstream tasks. We distill several multilingual students from a larger multilingual LM with varying proportions of generic and task-specific datasets, and report their performance after finetuning on DC and ICNER. We observe significant improvements across tasks and test sets when only task-specific corpora are used. We also report on how the impact of adding task-specific data to the transfer set correlates with the similarity between generic and task-specific data. Our results clearly indicate that, while distillation from a generic LM benefits downstream tasks, students learn better using target domain data even if it comes at the price of noisier teacher predictions. In other words, target domain data still trumps teacher knowledge.
1 Introduction
In the recent past, large language models (LMs; BERT-Large, Devlin et al., 2019; GPT-2, Radford et al., 2019; T5, Raffel et al., 2020) pretrained in a self-supervised manner on massive web corpora have consistently shown state-of-the-art performance for multiple natural language understanding (NLU) tasks. Therefore, it is no surprise that these models are of much interest for virtual assistants such as Amazon Alexa, Apple Siri, and Google Assistant. Some studies have shown that these large models trained on generic corpora seem to be more robust to data distributional shifts, relying less on domain-specific training data to perform well (Brown et al., 2020).
Since large models cannot be directly used for low-latency applications on devices with limited computing capacity, many techniques have been developed to compress them in size. Knowledge distillation (referred to simply as distillation hereafter; Hinton et al., 2015) has shown promising results, especially at the high compression rates typically required in NLU (Jiao et al., 2020; Soltan et al., 2021). In this paradigm, lightweight models, referred to as students, are trained to mimic the teacher's predictions over a transfer set (Hinton et al., 2015). When the pretraining and task-specific corpora have significantly different distributions, as is often the case, the choice of data for the transfer set can be ambiguous. On the one hand, using pretraining corpora in the transfer set ensures high-quality teacher predictions that are important for effective distillation. On the other, using the downstream corpora, although it might cause noisier teacher predictions, ensures the adaptation of the student to its final use case.
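For concreteness, a minimal sketch of such a teacher-mimicking objective, in the style of Hinton et al. (2015) and assuming PyTorch, is given below; it is illustrative rather than the exact loss used in our experiments (see Section A.3).

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    token distributions, as in Hinton et al. (2015)."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" matches the mathematical definition of KL divergence;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```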
To investigate this trade-off, we present a set of experiments where we distill several multilingual students from a large multilingual teacher LM trained using a masked language modeling (MLM) objective. We perform the distillations using transfer sets that comprise generic and task-specific data in varying proportions. The students are then finetuned and evaluated on two downstream NLU tasks of interest: a Domain Classification (DC) task and a joint Intent Classification/Named Entity Recognition (ICNER) task. For each input utterance, DC predicts the relevant domain (Books, Music, Shopping, etc.), IC identifies the user's intent (find a book, play a song, buy an item, etc.), and NER extracts the entities in the utterance (dates, names, locations, etc.).
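As a hypothetical illustration (the utterance, domain, intent, and slot labels below are invented for exposition, not drawn from our label set), the three predictions for a single request might look as follows:

```python
# Hypothetical utterance and labels, for illustration only.
utterance = "play the beatles in the kitchen"
predictions = {
    "domain": "Music",                    # DC: which domain handles the request
    "intent": "PlayMusicIntent",          # IC: what the user wants to do
    "entities": {                         # NER: entities extracted from the utterance
        "ArtistName": "the beatles",
        "DeviceLocation": "kitchen",
    },
}
```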
Our contributions: (1) We confirm for our setup that model preparation via distillation from a larger LM is more beneficial for downstream task performance when compared to encoder training from scratch. (2) We show that the largest improvements are seen when using only the downstream task's unlabelled data during the distillation process. Even though teacher predictions are expected to be noisy over data that is different from the pretraining corpora, our results clearly indicate that students learn best in this setting. (3) Because our ICNER corpora are divided per domain, we are also able to provide a finer-grained analysis of the impact of corpora similarity on downstream results. (4) Finally, we also confirm that further adaptation of the teacher to the target-domain data results in improved student performance across tasks.
2 Relevant Work
Building models with inference speeds that are suitable for production systems is of utmost importance in the industrial setting. Therefore, techniques for model compression (quantization, Gong et al., 2014; pruning redundant connections, Han et al., 2015) have been active research topics, with distillation (Romero et al., 2015; Hinton et al., 2015; Jiao et al., 2020) showing much promise for NLU models (Sanh et al., 2019). Distillation processes and their data have evolved over the past few years. In the teacher-student framework proposed by Hinton et al. (2015), the authors recommend using the original pretraining set as the transfer set. Jiao et al. (2020) propose a more complex two-stage process with generic and task-specific distillation phases, each with its own data sets, designed to augment the performance of the final model towards the task at hand.
Our work is focused on exploring how varying the proportions of generic and task-specific data within the transfer set of a single distillation process impacts downstream NLU performance. Since our scope does not include optimizing the distillation process itself, we use a cheaper alternative to Jiao et al. (2020): a single-stage distillation setup (see Section A.3 for details).
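The following sketch illustrates the single-stage setup, assuming PyTorch, HuggingFace-style MLM encoders, and the soft_target_loss function sketched in Section 1; it is illustrative only and not our actual training code (see Section A.3).

```python
import torch

def distill_single_stage(student, teacher, transfer_loader, optimizer, epochs=1):
    """One distillation pass over a (possibly mixed) transfer set; unlike
    Jiao et al. (2020), there is no separate task-specific distillation phase."""
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for batch in transfer_loader:                     # masked generic and/or task-specific text
            with torch.no_grad():
                teacher_logits = teacher(**batch).logits  # soft targets from the frozen teacher
            student_logits = student(**batch).logits
            loss = soft_target_loss(student_logits, teacher_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```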
Gururangan et al. (2020) showed, for the pretraining phase, that continued domain-adaptive and task-adaptive pretraining using the downstream task's unlabeled data can improve performance. Our work presents similar results for the distillation phase.
3 Data
3.1 Distillation data
For distillation, we created the transfer sets by mixing two types of data with different distributions:

Generic data: This data set consisted of Wikipedia and Common Crawl processed by an in-house tokenizer.
Task-specific data: This in-house data set comprised de-identified utterances from a voice assistant across domains of interest. The text data collected here was the output of an Automatic Speech Recognition (ASR) model, which assigned a confidence score per utterance. In order to retain only the highest-quality data, we filtered it by an ASR score threshold (see the sketch below). The data was de-identified prior to use.
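The sketch below illustrates this filtering step; the field names and the threshold value are hypothetical, shown only to make the criterion concrete.

```python
# Illustrative only: field names and the threshold value are hypothetical.
ASR_SCORE_THRESHOLD = 0.9

def filter_by_asr_score(utterances, threshold=ASR_SCORE_THRESHOLD):
    """Keep only de-identified utterances whose ASR confidence meets the threshold."""
    return [u["text"] for u in utterances if u["asr_score"] >= threshold]
```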
Our distilled students were trained as part of a larger program; as a result, nine European and Indic languages were used for distillation. The language list and counts are shown in Table A1.
We built transfer sets with three ratios of generic to task-specific data: (1) generic-only (baseline), (2) 7:3 generic to task-specific, to mimic the commonly encountered low task-specific data setting, and (3) task-specific-only. To obtain a comparable distribution of data from each language, we created samples of equal size per language using either generic data only, task-specific data only, or a combination of both at the targeted ratio. Upsampling was used whenever a source data set contained fewer examples than required. The 7:3 ratio consisted of Wikipedia, Common Crawl, and task-specific data upsampled to counts of 35M, 35M, and 30M respectively, for each language. For two languages, Indian English and Marathi, where some data constituents were unobtainable, the available data was used in proportion (see Table A1). Once the data sets were created with the targeted mixing ratio, they were split into train and validation sets with a ratio of 0.995:0.005 and then used as the transfer sets.
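A sketch of this construction for a single language is shown below; the function and variable names are hypothetical, and the sketch omits the per-source (Wikipedia vs. Common Crawl) bookkeeping described above.

```python
import random

def build_transfer_set(generic, task_specific, generic_n, task_n, seed=0):
    """Mix generic and task-specific utterances at a target ratio, upsampling
    (sampling with replacement) whenever a source has fewer examples than required,
    then split 0.995:0.005 into train and validation sets."""
    rng = random.Random(seed)

    def sample(pool, n):
        if len(pool) >= n:
            return rng.sample(pool, n)
        return [rng.choice(pool) for _ in range(n)]   # upsample with replacement

    mixed = sample(generic, generic_n) + sample(task_specific, task_n)
    rng.shuffle(mixed)
    cut = int(len(mixed) * 0.995)
    return mixed[:cut], mixed[cut:]

# Toy usage with tiny pools (real counts were on the order of 35M + 35M generic
# and 30M task-specific utterances per language in the 7:3 setting):
generic_utts = ["a wikipedia sentence", "a common crawl sentence"] * 100
task_utts = ["play some music", "what is the weather"] * 100
train_set, valid_set = build_transfer_set(generic_utts, task_utts, generic_n=140, task_n=60)
```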