ance DC predicts the relevant domain (Books, Mu-
sic, Shopping, etc.), IC identifies the user’s intent
(find a book, play a song, buy an item, etc.) and
NER extracts the entities in the utterance (dates,
names, locations, etc.).
Our contributions: (1) We confirm for our
setup that model preparation via distillation from a
larger LM is more beneficial for downstream task
performance when compared to encoder training
from scratch. (2) We show that the largest improve-
ments are seen when using only the downstream
task’s unlabelled data during the distillation pro-
cess. Even though teacher predictions are expected
to be noisy over data that is different from pre-
training corpora, our results clearly indicate that
students learn best in this setting. (3) Because our
ICNER corpus is divided per domain, we are also able to provide a finer-grained analysis of the impact of corpus similarity on downstream results.
(4) Finally, we also confirm that further adaptation
of the teacher to the target-domain data results in
improved student performance across tasks.
2 Relevant Work
Building models with inference speeds that are
suitable for production systems is of utmost impor-
tance in the industrial setting. Therefore, techniques for model compression, such as quantization (Gong et al., 2014) and pruning of redundant connections (Han et al., 2015), have been active research topics, with distillation (Romero et al., 2015; Hinton et al., 2015; Jiao et al., 2020) showing much promise for NLU models (Sanh et al., 2019). Distillation processes
and their data have evolved over the past few years.
In their teacher-student framework, Hinton et al. (2015) recommend using the original pretraining set as the transfer set. Jiao et al. (2020) propose a more complex two-stage process with
generic and task-specific distillation phases, each
with its own data set, designed to improve the performance of the final model on the task at hand.
Our work is focused on exploring how varying the proportion of generic to task-specific data within the transfer set of a single distillation process impacts downstream NLU performance. Since our
scope does not include optimizing the distillation
process itself, we use a cheaper alternative to Jiao et al. (2020), namely a single-stage distillation setup, to conduct our exploration (see Section A.3 for details).
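For context, the snippet below is a minimal sketch of one generic single-stage distillation step, assuming the standard soft-target objective of Hinton et al. (2015); the temperature value and the absence of a hard-label term are illustrative choices, not necessarily the configuration detailed in Section A.3.

```python
import torch
import torch.nn.functional as F

def distillation_step(student_logits, teacher_logits, temperature=2.0):
    """One soft-target distillation step in the style of Hinton et al. (2015).

    Both logit tensors have shape (batch, num_labels). The temperature and
    the lack of a hard-label term are illustrative assumptions only.
    """
    # Soften teacher and student distributions with the same temperature.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the softened distributions, rescaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```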
Gururangan et al. (2020) showed, for the pretraining phase, that continued domain-adaptive and task-adaptive pretraining on the downstream task's unlabelled data can improve performance. Our work presents similar results for the distillation phase.
3 Data
3.1 Distillation data
For distillation, we created the transfer sets by mix-
ing two types of data with different distributions:
• Generic data: This data set consisted of Wikipedia and Common Crawl processed by an in-house tokenizer.
• Task-specific data: This in-house data set comprised utterances from a voice assistant across the domains of interest. The text was the output of an Automatic Speech Recognition (ASR) model, which assigned a confidence score per utterance; to retain only the highest-quality data, we filtered it by an ASR score threshold (sketched below). All data was de-identified prior to use.
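The snippet below sketches the confidence-based filtering step mentioned above; the field name asr_score and the 0.9 threshold are hypothetical illustrations, not the production pipeline's values.

```python
# Minimal sketch of filtering utterances by ASR confidence; the field name
# "asr_score" and the 0.9 threshold are hypothetical illustrations.
ASR_SCORE_THRESHOLD = 0.9

def filter_by_asr_score(utterances, threshold=ASR_SCORE_THRESHOLD):
    """Keep only utterances whose ASR confidence meets the threshold."""
    return [u for u in utterances if u["asr_score"] >= threshold]
```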
Our distilled students were trained as part of a larger program; as a result, a collection of nine European and Indic languages was used for distillation. The language list and counts are shown in
Table A1.
We built transfer sets with three ratios of generic to task-specific data: (1) generic-only (baseline); (2) 7:3 generic to task-specific, to mimic the commonly encountered setting in which task-specific data is scarce; and (3) task-specific-only. To obtain a comparable distribution of data from each language, we created samples of equal size per language, using either generic data only, task-specific data only, or a combination of both at the targeted ratio. Upsampling was used whenever a source data set contained fewer utterances than required. For the 7:3 ratio, the Wikipedia, Common Crawl and task-specific data were upsampled to counts of 35M, 35M and 30M respectively, for each language. For two languages, Indian-English and Marathi, where some data constituents were unobtainable, the available data was used in proportion (see Table A1). Once the data sets were created with the targeted mixing ratio, they were split into train and validation sets with a ratio of 0.995:0.005 and then used as the transfer sets.
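The sketch below makes the per-language mixing procedure concrete: each source is brought to its target count (upsampling with replacement when the source is too small), the sources are combined at the targeted ratio, and the result is split 0.995:0.005 into train and validation sets. The function names, seeding, and exact sampling logic are illustrative assumptions, not the production implementation.

```python
import random

def sample_to_count(rows, target):
    """Subsample if the source is large enough, otherwise upsample with replacement."""
    if len(rows) >= target:
        return random.sample(rows, target)
    return rows + random.choices(rows, k=target - len(rows))

def build_transfer_set(generic, task_specific, generic_count, task_count,
                       val_fraction=0.005, seed=0):
    """Build one language's transfer set at the targeted generic:task-specific ratio.

    For the 7:3 setting described above, generic_count would be 70M
    (35M Wikipedia + 35M Common Crawl) and task_count 30M per language;
    the helper names and sampling details here are illustrative.
    """
    random.seed(seed)
    mixed = sample_to_count(generic, generic_count) + \
            sample_to_count(task_specific, task_count)
    random.shuffle(mixed)
    n_val = int(len(mixed) * val_fraction)
    return mixed[n_val:], mixed[:n_val]  # train split, validation split
```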