
[Fig. 1 diagram: Speech feeds a Causal Encoder followed by a Non-causal Encoder, which together with an RNN-T decoder form the Multilingual First-Pass Model. The Multilingual Deliberation Decoder (a Transformer decoder with two attention streams) attends to the encoder output e and the first-pass hypotheses yr (encoded as hb), producing attention contexts ce and cb and the second-pass output yd.]

Fig. 1. Multilingual deliberation based on a truly multilingual cascaded encoder model as the first pass.
multiple choices in designing a deliberation model in a multilingual setup: e.g., one can use a per-language deliberation decoder after a first-pass model, or a single deliberation decoder for all languages. The former is closer to monolingual deliberation and may need explicit language information at inference time, while the latter is truly multilingual and simpler for both training and inference. In this work, we investigate a truly multilingual deliberation model, i.e., a single deliberation decoder shared across multiple languages.
There are a few distinctions between multilingual deliberation and its monolingual counterpart: 1) the hypotheses from the first-pass model are in multiple languages, and it is unclear whether a single deliberation text encoder can model multilingual inputs; 2) a single deliberation decoder attends to both text and audio encodings in different languages; and 3) it is unclear whether deliberation also needs increased capacity in the multilingual scenario (similar to [10]) to achieve improvements comparable to those in monolingual scenarios.
We show the diagram of the proposed multilingual deliberation model in Fig. 1. To achieve a truly multilingual system, we use the language-agnostic multilingual model proposed in [11] as the first pass. The first-pass model does not require any explicit language information in either the encoder or the decoder. Note that [11] uses only left context in its audio encoders, whereas we use a non-causal cascaded encoder [23] to leverage the audio right context. Our cascaded encoder is also multilingual and does not explicitly use any language information. For all languages, we sample the first-pass decoder softmax [20] to generate multilingual hypothesis outputs.
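Concretely, sampling from the decoder softmax can be sketched as follows. This is a minimal illustration assuming per-step logits; the function name and temperature parameter are our assumptions, not the implementation of [20]:

import tensorflow as tf

# Hedged sketch: draw one wordpiece id per beam from the first-pass
# decoder's output distribution at the current step.
def sample_hypothesis_step(logits: tf.Tensor, temperature: float = 1.0) -> tf.Tensor:
    # logits: [batch, vocab_size] decoder outputs for the current step.
    return tf.random.categorical(logits / temperature, num_samples=1)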
To achieve a truly multilingual ASR system, our multilingual text encoder does not use any explicit language information and encodes text inputs of multiple languages using shared parameters. Typically, a bidirectional LSTM [14] or a conformer encoder [25] is used as the text encoder. Similarly, to achieve a truly multilingual rescorer, we use a single deliberation decoder based on transformer layers [24]. No explicit language information is used during rescoring. The multilingual deliberation decoder attends to both the non-causal cascaded encoder output (e) and the first-pass hypotheses (yr) generated from the non-causal encoder.
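To make the two-source attention concrete, below is a minimal Keras-style sketch of one decoder layer. The layer sizes and the summation used to fuse the two attention contexts (ce and cb in Fig. 1) are our assumptions, not the Lingvo implementation; concatenation followed by a projection is another common fusion choice.

import tensorflow as tf

class TwoSourceDecoderLayer(tf.keras.layers.Layer):
    """One transformer decoder layer with two cross-attention sources."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        head_dim = d_model // num_heads
        self.self_attn = tf.keras.layers.MultiHeadAttention(num_heads, head_dim)
        self.audio_attn = tf.keras.layers.MultiHeadAttention(num_heads, head_dim)
        self.text_attn = tf.keras.layers.MultiHeadAttention(num_heads, head_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.norm3 = tf.keras.layers.LayerNormalization()

    def call(self, y, e, h):
        # y: [B, U, d] target embeddings; e: [B, T, d] non-causal encoder
        # output; h: [B, U', d] text-encoded first-pass hypotheses (hb).
        y = self.norm1(y + self.self_attn(y, y, use_causal_mask=True))
        c_e = self.audio_attn(y, e)    # attention context over audio (ce)
        c_b = self.text_attn(y, h)     # attention context over hypotheses (cb)
        y = self.norm2(y + c_e + c_b)  # sum fusion of the two contexts
        return self.norm3(y + self.ffn(y))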
In an ASR task covering 9 languages, we use 16,384 wordpieces as the output vocabulary for both the first-pass multilingual model and the second-pass deliberation. The wordpiece model is generated on the text transcriptions of our training data by pooling all the languages together. We use a word count threshold of 20 to drop uncommon words when building the wordpiece model. For Chinese, due to the ambiguity of word segmentation, we simply use characters instead of words, while for Japanese we use MeCab [26] to segment the text into words. During training, we similarly pool all the data together and sample each batch according to the natural distribution. We freeze the first-pass model and optimize only the deliberation network, using a cross-entropy loss between the output of the network and the ground-truth transcripts.
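As an illustration of building the pooled wordpiece vocabulary, the sketch below uses the open-source HuggingFace tokenizers library as a stand-in for the internal wordpiece builder; the file path is hypothetical, and min_frequency only approximates the word-count threshold described above.

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
# Chinese is pre-segmented into characters and Japanese into MeCab
# words before this step.
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=16_384,   # shared output vocabulary for all 9 languages
    min_frequency=20,    # approximates the word-count threshold of 20
    special_tokens=["[UNK]"],
)
# Transcripts of all languages are pooled into one training corpus.
tokenizer.train(["pooled_transcripts.txt"], trainer)
tokenizer.save("multilingual_wordpiece.json")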
Locale   Language          Counts (M)   Hours (K)
en-US    English (USA)        34.0        52.6
zh-TW    Chinese              16.6        22.0
fr-FR    French               16.5        23.7
de-DE    German               15.3        23.4
ja-JP    Japanese             14.9        20.2
es-US    Spanish (USA)        14.2        23.8
es-ES    Spanish (Spain)      12.9        20.1
it-IT    Italian              11.8        19.8
en-GB    English (UK)          6.0         8.6
Total                        142.2       214.2

Table 1. Training data for the 9 languages. Utterance counts are in millions (M) and durations are in thousands (K) of hours.
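For intuition, sampling batches according to the natural distribution (described above) could be sketched as below, with mixing weights proportional to the utterance counts in Table 1. The per-locale datasets here are toy stand-ins; real pipelines would yield (speech, transcript) examples.

import tensorflow as tf

# Utterance counts from Table 1 (millions), en-US through en-GB.
counts = [34.0, 16.6, 16.5, 15.3, 14.9, 14.2, 12.9, 11.8, 6.0]
weights = [c / sum(counts) for c in counts]  # natural distribution

# Toy stand-ins for the 9 per-locale tf.data pipelines.
per_language_datasets = [
    tf.data.Dataset.from_tensor_slices(tf.fill([100], i)).repeat()
    for i in range(9)
]
mixed = tf.data.Dataset.sample_from_datasets(per_language_datasets, weights)
batches = mixed.batch(4096)  # global batch size from Sec. 2.2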
2.2. Scaling Up Deliberation
[10] shows that scaling up the number of model parameters is an efficient way to increase modeling capacity for multilingual ASR. Inspired by [10], we scale up the deliberation architecture in different ways to increase its capacity for multilingual rescoring. As shown in Fig. 1, there are two major components in the deliberation decoder: 1) the multilingual text encoder, and 2) the multilingual two-source transformer decoder. In this work, we empirically study the effect of increasing the capacity of these two components. We experiment with increasing either the width or the depth of the text encoder for modeling the hypotheses in multiple languages. While [11] finds that increasing the encoder size of a transducer model is an effective way to add capacity, we instead scale the attention-based decoder up to 1B parameters and compare that to [11] with a large cascaded encoder. We also explore alternatives such as scaling up the non-causal cascaded encoder first and then adding deliberation, or scaling up the deliberation decoder alone. Training large-scale models is challenging, and we use Adafactor [27] and repeated layers [28] to reduce high-bandwidth memory use. Our models are trained in TensorFlow [29] using the Lingvo framework [28] on 4×4×8 Tensor Processing Unit (TPU) v4 slices [30] with a global batch size of 4,096, and optimized using synchronized stochastic gradient descent. We cap the gradient norm of each parameter at 5.0 to stabilize training.
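The optimization recipe can be sketched as follows. Here first_pass and deliberation are trivial Keras-style stand-ins for the Lingvo models, and the learning rate is assumed since the text gives no schedule; only the frozen first pass, the cross-entropy loss, and Adafactor with a 5.0 gradient-norm cap come from the description above.

import tensorflow as tf

vocab = 16_384
first_pass = tf.keras.Sequential([tf.keras.layers.Dense(256)])       # stand-in
deliberation = tf.keras.Sequential([tf.keras.layers.Dense(vocab)])   # stand-in
first_pass.trainable = False  # only the deliberation network is optimized

optimizer = tf.keras.optimizers.Adafactor(
    learning_rate=1e-3,  # assumed; the text gives no schedule
    clipnorm=5.0,        # cap each parameter's gradient norm at 5.0
)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(speech, targets):
    enc = first_pass(speech, training=False)  # frozen first-pass encodings
    with tf.GradientTape() as tape:
        logits = deliberation(enc, training=True)
        loss = loss_fn(targets, logits)  # CE against ground-truth transcripts
    grads = tape.gradient(loss, deliberation.trainable_variables)
    optimizer.apply_gradients(zip(grads, deliberation.trainable_variables))
    return loss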
3. EXPERIMENTAL DETAILS
3.1. Data
Our multilingual training data consists of Voice Search speech from 9 languages in different locales, including English (USA), Chinese, French, German, Japanese, Spanish (USA), Spanish (Spain), Italian, and English (UK) (see Table 1 for more details). The data consist of around 142M utterances with a total duration of around 214K hours. The training data is anonymized and human-transcribed, and the per-language training data ranges from 6M to 34M utterances. For each language, we use a test set of utterances sampled from Voice Search traffic; the test sets range from 23.8K to 300K utterances. The test sets do not overlap with the training set and are likewise anonymized and human-transcribed.
3.2. Model Description
We use a baseline multilingual model similar to [11], which is language agnostic. The baseline model consists of a 12-layer causal conformer encoder and a 5-layer non-causal cascaded encoder. The