ance DC predicts the relevant domain (Books, Mu-
sic, Shopping, etc.), IC identifies the user’s intent
(find a book, play a song, buy an item, etc.) and
NER extracts the entities in the utterance (dates,
names, locations, etc.).
Our contributions: (1) We confirm for our
setup that model preparation via distillation from a
larger LM is more beneficial for downstream task
performance when compared to encoder training
from scratch. (2) We show that the largest improve-
ments are seen when using only the downstream
task’s unlabelled data during the distillation pro-
cess. Even though teacher predictions are expected
to be noisy over data that is different from pre-
training corpora, our results clearly indicate that
students learn best in this setting. (3) Because our
ICNER corpus is divided per domain, we are also able to provide a finer-grained analysis of the impact of corpus similarity on downstream results.
(4) Finally, we also confirm that further adaptation
of the teacher to the target-domain data results in
improved student performance across tasks.
2 Relevant Work
Building models with inference speeds that are
suitable for production systems is of utmost impor-
tance in the industrial setting. Therefore, techniques for model compression, such as quantization (Gong et al., 2014) and pruning of redundant connections (Han et al., 2015), have been active research topics, with distillation (Romero et al., 2015; Hinton et al., 2015; Jiao et al., 2020) showing much promise for NLU models (Sanh et al., 2019). Distillation processes
and their data have evolved over the past few years.
In their teacher-student framework, Hinton et al. (2015) recommend using the original pretraining set as the transfer set. Jiao et al. (2020) propose a more complex two-stage process with
generic and task-specific distillation phases, each
with its own data set, designed to improve the performance of the final model on the task at hand.
Our work is focused on exploring how varying the proportion of generic to task-specific data within the transfer set of a single distillation process impacts downstream NLU performance. Since our
scope does not include optimizing the distillation
process itself, we use a cheaper alternative to Jiao et al. (2020), namely a single-stage distillation setup, to conduct our exploration (see Section A.3 for details).
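For context, the snippet below is a minimal sketch of one generic single-stage distillation step, assuming the standard soft-target objective of Hinton et al. (2015); the temperature value and the absence of a hard-label term are illustrative choices, not necessarily the configuration detailed in Section A.3.

```python
import torch
import torch.nn.functional as F

def distillation_step(student_logits, teacher_logits, temperature=2.0):
    """One soft-target distillation step in the style of Hinton et al. (2015).

    Both logit tensors have shape (batch, num_labels). The temperature and
    the lack of a hard-label term are illustrative assumptions only.
    """
    # Soften teacher and student distributions with the same temperature.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the softened distributions, rescaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```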
Gururangan et al. (2020) showed, for the pretraining phase, that continued domain-adaptive and task-adaptive pretraining on the downstream task's unlabelled data can improve performance. Our work presents similar results for the distillation phase.
3 Data
3.1 Distillation data
For distillation, we created the transfer sets by mix-
ing two types of data with different distributions:
• Generic data: This data set consisted of Wikipedia and Common Crawl processed by an in-house tokenizer.
• Task-specific data: This in-house data set comprised utterances from a voice assistant across the domains of interest. The text was the output of an Automatic Speech Recognition (ASR) model, which assigned a confidence score per utterance; to retain only the highest-quality data, we filtered it by an ASR score threshold (sketched below). All data was de-identified prior to use.
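The snippet below sketches the confidence-based filtering step mentioned above; the field name asr_score and the 0.9 threshold are hypothetical illustrations, not the production pipeline's values.

```python
# Minimal sketch of filtering utterances by ASR confidence; the field name
# "asr_score" and the 0.9 threshold are hypothetical illustrations.
ASR_SCORE_THRESHOLD = 0.9

def filter_by_asr_score(utterances, threshold=ASR_SCORE_THRESHOLD):
    """Keep only utterances whose ASR confidence meets the threshold."""
    return [u for u in utterances if u["asr_score"] >= threshold]
```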
Our distilled students were trained as part of a larger program; as a result, a collection of nine European and Indic languages was used for distillation. The language list and counts are shown in
Table A1.
We built transfer sets with three ratios of generic to task-specific data: (1) generic-only (baseline); (2) 7:3 generic to task-specific, to mimic the commonly encountered setting in which task-specific data is scarce; and (3) task-specific-only. To obtain a comparable distribution of data from each language, we created samples of equal size per language, using either generic data only, task-specific data only, or a combination of both at the targeted ratio. Upsampling was used whenever a source data set contained fewer utterances than required. For the 7:3 ratio, the Wikipedia, Common Crawl and task-specific data were upsampled to counts of 35M, 35M and 30M respectively, for each language. For two languages, Indian-English and Marathi, where some data constituents were unobtainable, the available data was used in proportion (see Table A1). Once the data sets were created with the targeted mixing ratio, they were split into train and validation sets with a ratio of 0.995:0.005 and then used as the transfer sets.
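The sketch below makes the per-language mixing procedure concrete: each source is brought to its target count (upsampling with replacement when the source is too small), the sources are combined at the targeted ratio, and the result is split 0.995:0.005 into train and validation sets. The function names, seeding, and exact sampling logic are illustrative assumptions, not the production implementation.

```python
import random

def sample_to_count(rows, target):
    """Subsample if the source is large enough, otherwise upsample with replacement."""
    if len(rows) >= target:
        return random.sample(rows, target)
    return rows + random.choices(rows, k=target - len(rows))

def build_transfer_set(generic, task_specific, generic_count, task_count,
                       val_fraction=0.005, seed=0):
    """Build one language's transfer set at the targeted generic:task-specific ratio.

    For the 7:3 setting described above, generic_count would be 70M
    (35M Wikipedia + 35M Common Crawl) and task_count 30M per language;
    the helper names and sampling details here are illustrative.
    """
    random.seed(seed)
    mixed = sample_to_count(generic, generic_count) + \
            sample_to_count(task_specific, task_count)
    random.shuffle(mixed)
    n_val = int(len(mixed) * val_fraction)
    return mixed[n_val:], mixed[:n_val]  # train split, validation split
```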