
adaptation.
We consider a Constrained Domain Adaptation task,
where the adaptation for the new domain is done without
access to any of the original domain data. We also strictly
limit the allowed degradation on the original domain after
adaptation [12]. Our main contributions are the following:
1. We add adapter modules [8] to the encoder, decoder,
and joint modules of the Conformer Transducer.
2. We train the adapters using various regularization tech-
niques alongside a constrained training schedule, and
show considerable improvement on the new domain
while limiting degradation on the original domain.
3. Finally, we propose a scoring scheme to select mod-
els that perform well in the constrained adaptation set-
ting and evaluate the proposed approach on the Google
Speech Commands [15] benchmark and the UK and
Ireland English Dialect speech dataset [16].
2. CONSTRAINED DOMAIN ADAPTATION
2.1. Degradation control on original domain
In order to reduce the accuracy loss on the original domain,
we formalize the first constraint as follows. During con-
strained domain adaptation, a candidate solution (C) must
be evaluated on some evaluation set of the original domain
prior to adaptation (o) and after adaptation (o∗) so as to limit
the absolute degradation of WER to at most κ, where κ is
some predetermined acceptable degradation in WER on the
evaluation datasets of the original domain.
To formalize the above constraint, we first define Word
Error Rate (WER) degradation after adaptation as:
\mathrm{WERDeg}_{o} = \max\left(0,\ \mathrm{WER}_{o^{*}} - \mathrm{WER}_{o}\right) \quad (1)
where the subscripts o∗ and a∗ denote evaluation on the original and adapted domains after the adaptation process, and o and a denote evaluation on the original and adapted domains prior to the adaptation process, respectively.
We then define the weight of degradation on the original domain as OSCALE, which is a scaling factor computed as:
\mathrm{OSCALE} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\ \frac{\kappa_{i} - \mathrm{WERDeg}_{o,i}}{\kappa_{i}}\right) \quad (2)
where N is the number of evaluation datasets from the original domain that the model was initially trained on and κ is the maximum tolerable absolute degradation of word error rate on the original domain.
Next, we define relative WER improvement (WERR) on the new domain as AWERR, such that
\mathrm{AWERR} = \max\left(0,\ \frac{\mathrm{WER}_{a} - \mathrm{WER}_{a^{*}}}{\mathrm{WER}_{a}}\right) \quad (3)
In this formulation, we only accept candidates which improve
in WER on the new domain.
Combining the above definitions, we propose the following candidate selection metric, which, when maximized, yields candidates that maximize the relative WER improvement on the new domain while simultaneously minimizing the degradation on the original domain. We define this metric as a score function in Eqn. 4:
\mathrm{Score} = \mathrm{OSCALE} \cdot \mathrm{AWERR} \quad (4)
We select the value of κ to be 3%, such that the absolute increase in WER on the original dataset is constrained to 3%. It is possible to select a stricter threshold; however, the number of candidate solutions that satisfy the constraints then decreases significantly, and exceedingly few valid candidates
exist for the fine-tuning case. This score is maximized when
the candidate attains the largest relative WER improvement
on the new domain after scaling by a factor in the range of
[0, 1], which indicates the weight of degradation of WER on the original domain. Note that the score has a minimum value of 0 if the absolute degradation of WER on the original domain surpasses κ, or if the WER on the new domain becomes worse after adaptation.
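For concreteness, the selection procedure of Eqns. 1-4 can be sketched as the following illustrative Python snippet, assuming WERs are given as percentages and a common threshold κ for every original-domain evaluation set (function and argument names are ours, for illustration only):

from typing import Sequence

def candidate_score(wer_orig_before: Sequence[float],
                    wer_orig_after: Sequence[float],
                    wer_new_before: float,
                    wer_new_after: float,
                    kappa: float = 3.0) -> float:
    # Eqn. 1: absolute WER degradation on each original-domain evaluation set
    degradations = [max(0.0, after - before)
                    for before, after in zip(wer_orig_before, wer_orig_after)]
    # Eqn. 2: OSCALE, the weight of degradation on the original domain, in [0, 1]
    oscale = sum(max(0.0, (kappa - deg) / kappa) for deg in degradations) / len(degradations)
    # Eqn. 3: AWERR, relative WER improvement on the new domain, clipped at 0
    awerr = max(0.0, (wer_new_before - wer_new_after) / wer_new_before)
    # Eqn. 4: final score; it is 0 if every original-domain set degrades beyond kappa
    # or if the new-domain WER does not improve
    return oscale * awerr

A candidate adapted model is then kept or discarded by comparing its score against that of other candidates obtained during the constrained training schedule.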
2.2. Adaptation without access to original dataset
During constrained domain adaptation, a candidate solution
must only use the data of the new domain for adaptation, with-
out access to any data from the original domain. It may use
data from the original domain only for evaluation, in order
to determine the severity of degradation on that domain after
adaptation. When applying this constraint, we cannot freely
compute the Coverage (COV) metric [12] since it computes
the difference between fine-tuning and CJT, though we may
still utilize Learning Without Forgetting (LWF) [18], which
distills the model’s knowledge using just the data of the new
domain.
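As an illustration of the latter, one common way to instantiate such distillation is to keep a frozen copy of the pre-adaptation model as a teacher and penalize divergence between its output distribution and that of the adapted model on new-domain batches only. The sketch below assumes a temperature-scaled KL term over output logits and a hypothetical weighting against the transducer loss; the exact formulation in [18] is not reproduced here.

import torch
import torch.nn.functional as F

def lwf_distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 2.0) -> torch.Tensor:
    # Both tensors come from the same new-domain batch; the frozen
    # pre-adaptation model provides teacher_logits. Shape: (..., vocab_size).
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in standard distillation
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)

# Hypothetical total adaptation loss:
#   loss = transducer_loss + lwf_weight * lwf_distillation_loss(student_logits, teacher_logits)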
3. SPEECH TRANSDUCERS WITH ADAPTERS
Adapters are small sub-networks injected into specific layers
of a pre-trained neural network, such that the original param-
eters of the network are frozen, avoiding gradient updates,
and only the injected parameters of the adapter sub-networks
are updated. The original adapters from [8] are two-layer residual feed-forward networks with an intermediate activation function, usually ReLU [19]. Optionally, layer normalization is applied to the input or output of the adapter.
Adapters have also been applied to multilingual ASR [20],
cross-lingual ASR [21], self-supervised ASR [22] and con-
textual ASR [2]. Adapters have been found useful in reducing
Catastrophic Forgetting in ASR [12].
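As a rough sketch of the adapter architecture described above (the bottleneck width and the placement of the layer normalization are illustrative choices, not prescribed values):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Two-layer residual feed-forward adapter with optional input LayerNorm."""

    def __init__(self, d_model: int, bottleneck_dim: int = 64, use_layer_norm: bool = True):
        super().__init__()
        self.norm = nn.LayerNorm(d_model) if use_layer_norm else nn.Identity()
        self.down_proj = nn.Linear(d_model, bottleneck_dim)  # down-projection
        self.activation = nn.ReLU()                          # intermediate activation [19]
        self.up_proj = nn.Linear(bottleneck_dim, d_model)    # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: only the adapter parameters receive gradients;
        # the surrounding pre-trained layers stay frozen.
        return x + self.up_proj(self.activation(self.down_proj(self.norm(x))))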
Following the unified framework for adapters proposed
by He et al. [23], we consider three types of adapters for