SCALING UP DELIBERATION FOR MULTILINGUAL ASR
Ke Hu, Bo Li, Tara N. Sainath
Google LLC, USA
{huk,boboli,tsainath}@google.com
ABSTRACT
Multilingual end-to-end automatic speech recognition models are attractive due to their simplicity in training and deployment. Recent
work on large-scale training of such models has shown promising
results compared to monolingual models. However, the work often
focuses on multilingual models themselves in a single-pass setup. In
this work, we investigate second-pass deliberation for multilingual
speech recognition. Our proposed deliberation is multilingual, i.e.,
the text encoder encodes hypothesis text from multiple languages,
and the decoder attends to multilingual text and audio. We inves-
tigate scaling the deliberation text encoder and decoder, and com-
pare scaling the deliberation decoder and the first-pass cascaded en-
coder. We show that deliberation improves the average WER on 9
languages by 4% relative compared to the single-pass model. By
increasing the size of the deliberation up to 1B parameters, the aver-
age WER improvement increases to 9%, with up to 14% for certain
languages. Our deliberation rescorer is based on transformer layers
and can be parallelized during rescoring.
Index Terms: Multilingual deliberation, multilingual automatic speech recognition, large-scale training
1. INTRODUCTION
There has been growing interest in developing multilingual end-to-
end (E2E) automatic speech recognition (ASR) models in the past
few years [1,2,3,4,5,6,7]. Previous work on multilingual E2E
models has explored different model structures such as connectionist
temporal classification (CTC) models [2], attention-decoder based
models [1,3,5], and streaming models [4,6]. While the previous re-
search mainly focuses on various E2E model structures, recent large-
scale data sets have motivated work in developing giant multilingual
ASR models [8,9,10,11,12]. For example, [11] has proposed a
truly multilingual on-device transducer model for 9 languages with-
out explicitly using language information. By increasing the model
capacity to 1B parameters, [11] shows that the model performs gen-
erally better than monolingual models for all languages. In another
large-scale training effort, [8] builds a multilingual E2E model with up to 1B parameters and improves the quality of all variants of the
multilingual model [8]. However, the training data used in [8] is rel-
atively small compared to [11]. Although the aforementioned studies
pool all languages together and train from scratch, [12] has proposed
a life-long learning strategy for large-scale multilingual model train-
ing. By adding languages incrementally and increasing the model
size up to 1B, [12] shows that a lifelong learning strategy is more
effective than training from scratch.
Recently, two-pass ASR models have been shown to further im-
prove first-pass model performance [13,14,15,16]. While increas-
ing the capacity of the multilingual models themselves improves
recognition quality, it is unclear whether a second-pass model works
in a multilingual setup and how much it can benefit from a big-
ger size. In [17], the authors have investigated both multilingual
and monolingual second-pass rescoring for a 6-language multilin-
gual model and obtained significant improvements in both scenar-
ios. However, the largest model used is around 300M parameters, and it is unclear how the second-pass performance changes as capacity increases. Traditional shallow fusion [18] and neural language model (LM) rescoring [19] are mostly language dependent, i.e., each language has a monolingual LM for post-processing.
Deliberation networks are one type of two-pass model and have achieved state-of-the-art results in monolingual ASR for English [20,14,21,16,22]. In deliberation, the first pass is usually a transducer [14] that generates text hypotheses, and the second pass uses two-source attention to attend to both the first-pass hypotheses and the audio encoder outputs for redecoding or rescoring [14]. Note that in the monolingual
scenario, the first-pass hypotheses are in a single language to delib-
erate [14]. It is unknown how deliberation performs in a multilingual
scenario where hypothesis texts are in different languages and audio
inputs are multilingual as well.
In this work, we investigate deliberation for multilingual second-
pass rescoring. We extend the model from [11] and add a cascaded
encoder [23] as the first-pass multilingual model. A multilingual
deliberation decoder is then used for second-pass rescoring. The
deliberation decoder consists of a multilingual text encoder and a
transformer-based multi-source attention decoder [24]. We show
that, in a truly multilingual setup without explicitly using any lan-
guage information, deliberation improves the multilingual first-pass
by 4% in terms of average WER, even though both the hypothesis text and the audio encoder outputs come from multiple languages. Second, we show that by scaling up the deliberation text encoder and the deliberation decoder, further improvements are achieved. Specifically, it is more effective to increase the width and depth of the deliberation decoder than those of the text encoder or the first-pass cascaded encoder. By increasing the multilingual deliberation up to 1B parameters, we improve the average WER by 9% relative for 9 languages. The improvement is consistent for all languages, reaching up to 14% for certain languages. Compared to a first-pass cascaded encoder of a similar size (1B), the
deliberation model has a relative average WER improvement of 4%
and per-language improvement up to 8%. As far as we know, this is
the first work on large-scale training of deliberation for multilingual
ASR.
2. SYSTEM DESCRIPTION
2.1. Multilingual Deliberation
Previous work shows that deliberation achieves state-of-the-art results for monolingual models [14,24,22]. However, it remains
unknown how deliberation works for multilingual models. There are
[Figure 1 block diagram: the Multilingual First-Pass Model (Causal Encoder, Non-causal Encoder, RNN-T decoder) takes speech and produces first-pass hypotheses yr; a Multilingual Text Encoder encodes the hypotheses; the Multilingual Deliberation Decoder, a transformer decoder with attention over the audio encodings e and the text encodings, produces the output yd.]
Fig. 1. Multilingual deliberation based on a truly multilingual cas-
caded encoder model as the first pass.
multiple choices in designing a deliberation model in a multilingual
setup, e.g., one can use a per-language deliberation decoder after a
first-pass model, or a single deliberation decoder for all languages.
The former is more similar to a monolingual deliberation and may need explicit language information at inference time, while the latter is truly multilingual and simpler for training and inference. In this work, we investigate a truly multilingual deliberation model, i.e., a single deliberation decoder shared across all languages.
There are a few distinctions between multilingual deliberation and its monolingual versions: 1) The hypotheses from the first-pass model are in multiple languages, and it is unclear whether a single deliberation text encoder can model multilingual inputs; 2) A single deliberation decoder attends to both text and audio encodings in different languages; and 3) It is unknown whether deliberation also needs increased capacity in the multilingual scenario (similar to [10]) to achieve improvements comparable to those in monolingual scenarios.
We show the diagram of the proposed multilingual deliberation
model in Fig. 1. To achieve a truly multilingual system, we use
a language agnostic multilingual model proposed in [11] as the first
pass. The first-pass model does not require any explicit language
information in either encoder or decoder. Note that [11] only uses
left context for audio encoders, and we use a non-causal cascaded
encoder [23] to leverage the audio right context. Note that our cas-
caded encoder is also multilingual and does not explicitly use any
language information. For all languages, we sample the first-pass
decoder softmax [20] to generate multilingual hypothesis outputs.
To achieve a truly multilingual ASR system, our multilingual text
encoder does not use any explicit language information and encodes
text inputs of multiple languages using shared parameters. Usually a
bidirectional LSTM [14] or a conformer encoder [25] is used as the
text encoder. Similarly, to achieve a truly multilingual rescorer, we
use a single deliberation decoder based on transformer layers [24].
No explicit language information is used during rescoring. The multilingual deliberation decoder attends to both the non-causal cascaded encoder output (e) and the first-pass hypotheses (yr) decoded from the non-causal encoder.
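As a rough illustration of the two-source attention described above, the sketch below is written in plain TensorFlow/Keras rather than the Lingvo implementation used in this work; the layer sizes, the simple summation of the two attention contexts, and the use of MultiHeadAttention's use_causal_mask option (available in recent TensorFlow releases) are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf


class TwoSourceDecoderLayer(tf.keras.layers.Layer):
    """One transformer decoder layer with two-source attention: cross-attention
    over the audio encodings (e) and over the encoded first-pass hypotheses.
    Sizes are illustrative assumptions, not the paper's configuration."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, **kwargs):
        super().__init__(**kwargs)
        head_dim = d_model // num_heads
        self.self_attn = tf.keras.layers.MultiHeadAttention(num_heads, head_dim)
        self.audio_attn = tf.keras.layers.MultiHeadAttention(num_heads, head_dim)
        self.text_attn = tf.keras.layers.MultiHeadAttention(num_heads, head_dim)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.norm3 = tf.keras.layers.LayerNormalization()

    def call(self, targets, audio_enc, text_enc):
        # Causal self-attention over the (shifted) target wordpiece embeddings.
        x = self.norm1(targets + self.self_attn(targets, targets,
                                                use_causal_mask=True))
        # Two-source attention: one over the non-causal cascaded encoder
        # outputs (e), one over the multilingual text-encoder outputs.
        audio_ctx = self.audio_attn(x, audio_enc)
        text_ctx = self.text_attn(x, text_enc)
        # Combine the two contexts (a simple sum here) with a residual.
        x = self.norm2(x + audio_ctx + text_ctx)
        # Position-wise feed-forward with a residual connection.
        return self.norm3(x + self.ffn(x))
```

The text encodings passed as text_enc could come, for example, from a bidirectional LSTM or conformer text encoder over the first-pass hypotheses, as mentioned above; that encoder is likewise shared across all languages.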
In an ASR task of 9 languages, we use 16,384 wordpieces as the output vocabulary for both the first-pass multilingual model and the second-pass deliberation. The wordpiece model is generated on the
text transcriptions of our training data by pooling all the languages
together. We use a word count threshold of 20 to drop uncommon
words when building the wordpiece model. For Chinese, due to the ambiguity in word segmentation, we simply use characters instead of words. For Japanese, we use MeCab [26] to segment the text into words. During training, we similarly pool all the data together and sample each batch according to the natural distribution. We freeze the first-pass model and optimize only the deliberation network, using a cross-entropy loss between the output of the network and the ground-truth transcripts.
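The following is a minimal sketch of this training setup, assuming the frozen first-pass model and the deliberation rescorer are available as Keras models (first_pass and deliberation) with the input/output signatures indicated in the comments; these names and signatures are hypothetical stand-ins for the actual Lingvo code, and padding masking is omitted for brevity.

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()  # stand-in; the paper uses Adafactor [27]


def train_step(first_pass, deliberation, speech, targets):
    """One update of the deliberation rescorer with a frozen first pass."""
    # First pass (frozen): audio encodings (e) and sampled hypothesis
    # wordpieces (yr); no gradients flow into this model.
    audio_enc, hyp_tokens = first_pass(speech, training=False)
    with tf.GradientTape() as tape:
        # The deliberation decoder attends to both the audio encodings and the
        # encoded hypotheses, and predicts the next ground-truth wordpiece.
        logits = deliberation((audio_enc, hyp_tokens, targets[:, :-1]),
                              training=True)
        # Cross-entropy against the ground-truth transcripts.
        loss = loss_fn(targets[:, 1:], logits)
    # Only the deliberation network's parameters are updated.
    grads = tape.gradient(loss, deliberation.trainable_variables)
    optimizer.apply_gradients(zip(grads, deliberation.trainable_variables))
    return loss
```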
Locale Language Counts (M) Hours (K)
en-US English (USA) 34.0 52.6
zh-TW Chinese 16.6 22.0
fr-FR French 16.5 23.7
de-DE German 15.3 23.4
ja-JP Japanese 14.9 20.2
es-US Spanish (USA) 14.2 23.8
es-ES Spanish (Spain) 12.9 20.1
it-IT Italian 11.8 19.8
en-GB English (UK) 6.0 8.6
Total 142.2 214.2
Table 1. Training data for 9 languages. Utterance counts are in millions (M) and durations are in thousands (K) of hours.
2.2. Scaling Up Deliberation
[10] shows that scaling up the number of model parameters is an ef-
ficient way to increase modeling capacity for multilingual ASR. In-
spired by [10], we scale up the deliberation architecture in different
ways to increase the capacity for multilingual rescoring. As shown
in Fig. 1, there are two major components in the deliberation de-
coder: 1) Multilingual text encoder, and 2) Multilingual two-source
transformer decoder. In this work, we empirically study the effect of
increasing the capacity of the two components. We experiment with
increasing both the width and the depth of the text encoder for modeling the hypotheses in multiple languages. While [11] finds that increasing the encoder size in a transducer model is effective in increasing model capacity, we increase the attention-based decoder up to 1B parameters and compare that to [11] with a large cascaded encoder. We also explore alternatives such as scaling up the non-causal cascaded encoder first and then adding deliberation, or scaling up the deliberation decoder alone. Training large-scale models is challenging, and we use Adafactor [27] and repeated layers [28] to reduce high-bandwidth memory usage. Our models are trained in TensorFlow [29] using the Lingvo framework [28] on 4×4×8 Tensor Processing Unit (TPU) v4 slices [30] with a global batch size of 4,096, and optimized using synchronized stochastic gradient descent. We cap the gradient norm of each parameter at 5.0 to stabilize training.
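As a small sketch of the per-parameter gradient capping mentioned above (with a stock optimizer standing in for Adafactor):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()  # stand-in; the paper uses Adafactor [27]


def apply_clipped_gradients(grads_and_vars, max_norm=5.0):
    """Cap each parameter's gradient norm at `max_norm` before applying it."""
    clipped = [(tf.clip_by_norm(g, max_norm), v)
               for g, v in grads_and_vars
               if g is not None]
    optimizer.apply_gradients(clipped)
```

Note that tf.clip_by_norm clips each parameter's gradient independently, as opposed to tf.clip_by_global_norm, which rescales all gradients jointly.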
3. EXPERIMENTAL DETAILS
3.1. Data
Our multilingual training data consists of Voice Search speech from
9 languages in different locales, including English (USA), Chinese,
French, German, Japanese, Spanish (USA), Spanish (Spain), Ital-
ian and English (UK) (see Table 1 for more details). They consist of
around 142M utterances and the total duration is around 214K hours.
The training data is anonymized and human transcribed. The per-language training data ranges from 6M to 34M utterances. For each language, we use a test set sampled from the Voice Search traffic, ranging from 23.8K to 300K utterances. The test sets do not overlap with the training data and are also anonymized and human transcribed.
3.2. Model Description
We use a baseline multilingual model similar to [11], which is language agnostic. The baseline model consists of a 12-layer causal
conformer encoder and a 5-layer non-causal cascaded encoder. The