
UFO2: A UNIFIED PRE-TRAINING FRAMEWORK FOR ONLINE AND OFFLINE SPEECH
RECOGNITION
Li Fu, Siqi Li, Qingtao Li, Liping Deng, Fangzhu Li, Lu Fan, Meng Chen, Xiaodong He
JD AI Research, Beijing, China
ABSTRACT
In this paper, we propose a Unified pre-training Framework for Online and Offline (UFO2) Automatic Speech Recognition (ASR), which 1) simplifies the two separate training workflows for online and offline modes into one process, and 2) improves the Word Error Rate (WER) performance with limited utterance annotation. Specifically, we extend the conventional offline-mode Self-Supervised Learning (SSL)-based ASR approach to a unified manner, where model training is conditioned on both full-context and dynamic-chunked inputs. To enhance the pre-trained representation model, a stop-gradient operation is applied to decouple the online-mode objectives from the quantizer. Moreover, in both the pre-training and downstream fine-tuning stages, joint losses are proposed to train the unified model with full weight sharing across the two modes. Experimental results on the LibriSpeech dataset show that UFO2 outperforms the SSL-based baseline with 29.7% and 18.2% relative WER reductions in offline and online modes, respectively.
Index Terms—Automatic speech recognition, self-supervised
learning, online and offline unified model
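To illustrate the stop-gradient decoupling and joint losses summarized in the abstract, the following PyTorch-style sketch gives one possible reading of the description; the function names and loss form are our assumptions, not the authors' released code:

```python
import torch

def joint_pretrain_loss(c_off, c_on, q, contrastive_loss):
    """c_off / c_on: context features from the full-context / dynamic-chunked
    branches; q: quantized targets from the learnable quantizer."""
    loss_off = contrastive_loss(c_off, q)         # gradients reach the quantizer
    loss_on = contrastive_loss(c_on, q.detach())  # stop-gradient: they do not
    return loss_off + loss_on                     # joint loss, shared weights
```

Detaching the quantized targets in the online branch keeps the harder chunked objective from disturbing the quantizer, while a single set of encoder weights serves both modes.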
1. INTRODUCTION
In recent years, Self-Supervised Learning (SSL) has received much attention in the Automatic Speech Recognition (ASR) domain [1–7]. Generally, the SSL-based ASR approach first pre-trains a speech representation encoder on numerous unlabeled utterances via self-supervised strategies (e.g., masking, quantization, and contrastive learning [1]), and then fine-tunes the model on labeled data with ASR objectives. This paradigm has shown great potential for improving ASR performance with limited labeled speech, which is especially valuable when human-annotated utterances are expensive or scarce [8].
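For concreteness, here is a minimal PyTorch-style sketch of the contrastive objective used in such pre-training, in the spirit of Wav2vec2 [1]; the names, shapes, and sampling scheme are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, mask, negatives, temperature=0.1):
    """Wav2vec2-style contrastive objective over masked frames.

    context:   (B, T, D) encoder outputs c_t
    quantized: (B, T, D) quantized targets q_t
    mask:      (B, T) bool, True where the input frame was masked
    negatives: (K, B, T, D) distractor targets sampled from other frames
    """
    c = context[mask]                          # (N, D) masked positions only
    pos = quantized[mask].unsqueeze(0)         # (1, N, D) positive target
    neg = negatives[:, mask]                   # (K, N, D) distractors
    candidates = torch.cat([pos, neg], dim=0)  # (1 + K, N, D)

    # Cosine similarity of each masked context vector to its candidates,
    # then cross-entropy with the positive always at index 0 (InfoNCE).
    logits = F.cosine_similarity(c.unsqueeze(0), candidates, dim=-1) / temperature
    labels = torch.zeros(logits.size(1), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.t(), labels)
```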
Based on how tokens are emitted, ASR systems are typically categorized into two modes: 1) online mode (a.k.a. streaming), which is designed to emit each hypothesized word as quickly and accurately as possible while it is spoken [9–12], and 2) offline mode (a.k.a. non-streaming), which aims to accurately emit the complete hypotheses after processing a full utterance [13–15]. However, most existing SSL-based ASR methods perform the pre-training in an offline manner, i.e., each extracted representation is conditioned on full-context inputs [16]. For a downstream online ASR model, where no (or only limited) future context is permitted, accuracy may be hindered by the mode inconsistency between pre-training and fine-tuning [17]. One could instead pre-train the representation encoder in an online manner, but representation learning then carries a heavy burden because a large proportion of each utterance (i.e., the future context) is unavailable [18].
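To make the two conditioning regimes concrete, the sketch below (a hypothetical illustration, not the paper's code) builds the self-attention visibility mask for each mode: offline attention sees the full utterance, while chunked attention restricts each frame to its own chunk and earlier ones:

```python
import torch

def attention_mask(num_frames, chunk_size=None):
    """Boolean mask where mask[i, j] = True lets frame i attend to frame j.

    chunk_size=None -> offline mode: full-context attention.
    chunk_size=k    -> online/chunked mode: frame i attends only to frames
                       in its own chunk and in earlier chunks (no future
                       context beyond the current chunk boundary).
    """
    if chunk_size is None:
        return torch.ones(num_frames, num_frames, dtype=torch.bool)
    chunk_idx = torch.arange(num_frames) // chunk_size  # chunk id per frame
    return chunk_idx.unsqueeze(1) >= chunk_idx.unsqueeze(0)

# Offline: all True. Online with chunk_size=2: frames 0-1 see frames 0-1,
# frames 2-3 see frames 0-3, and so on.
full_mask = attention_mask(6)
chunked_mask = attention_mask(6, chunk_size=2)
```

In dynamic-chunk training [21], the chunk size is sampled randomly during training, so a single model can serve a range of latency requirements at inference.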
In contrast to offline-mode SSL, there are only a few works on pre-training for online ASR models. Chiu et al. [17] explored replacing the learnable quantizer in [1] with a Random-projection Quantizer (RQ), to make the quantized representation independent
of the recognition mode.

Fig. 1. Overview of the proposed UFO2: Encoder features $e$ are masked and fed to Conformer blocks to extract offline/online latent features ($h^{\text{off}}_*$/$h^{\text{on}}_*$) conditioned on full-context/dynamic-chunked inputs. Then context features $c^{\text{off}}$/$c^{\text{on}}$ and quantized features $q$/$\tilde{q}$ (w/ stop gradient) are used for the joint losses $L^{\text{off}}$ and $L^{\text{on}}$.

Although the RQ method was separately
evaluated on online and offline models with 0.6 billion parameters, as noted in [17], both the random strategy of the quantizer and smaller model sizes, which are more efficient for online ASR tasks, still need further investigation. Cao et al. [19] trained an SSL-based offline ASR model (the teacher) and then adopted knowledge distillation [20] to guide the fine-tuning of an online model (the student). Nevertheless, besides introducing additional offline-model optimization and distillation strategies into the SSL framework, the online model to be fine-tuned was still initialized from an offline-mode representation encoder. Moreover, in existing SSL-based works, the online and offline ASR systems were developed separately, incurring high costs in model development and training workflows for applications in different modes [21–23].
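Returning to the RQ approach above, a rough sketch of the idea follows (hypothetical names, and omitting details such as the feature normalization used in [17]): features are projected by a frozen random matrix and mapped to the index of the nearest entry in a frozen random codebook:

```python
import torch

class RandomProjectionQuantizer:
    """Frozen random projection + frozen random codebook (BEST-RQ style)."""

    def __init__(self, in_dim, code_dim, num_codes, seed=0):
        g = torch.Generator().manual_seed(seed)
        # Neither tensor is ever updated during training.
        self.proj = torch.randn(in_dim, code_dim, generator=g)
        self.codebook = torch.randn(num_codes, code_dim, generator=g)

    def __call__(self, features):
        """features: (B, T, in_dim) -> discrete target ids (B, T)."""
        z = features @ self.proj  # (B, T, code_dim)
        # Index of the nearest codebook entry per frame (Euclidean distance).
        dists = torch.cdist(z, self.codebook.expand(z.size(0), -1, -1))
        return dists.argmin(dim=-1)
```

Because neither the projection nor the codebook is learned, the discrete targets are identical regardless of whether the encoder consumes full-context or chunked inputs.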
In this paper, a novel Unified pre-training Framework is proposed to improve speech representation learning for downstream Online and Offline (UFO2) ASR tasks. In particular, UFO2 simplifies the training workflows by unifying the online and offline modes into a single model. As shown in Fig. 1, unlike the most representative SSL-based approach, Wav2vec2 [1], and its variants, e.g., Wav2vec2-Conformer [25], we extend the offline-mode approach to a unified one via four strategies on feature extraction and training objectives. 1) Dual-mode attention. To train a unified representation encoder, the full-context Multi-Headed Self-Attention (MHSA) in the Conformer block [24] is used to extract offline-mode features conditioned on the complete utterance. Simultaneously, the dynamic-chunked MHSA [21] is adopted to mimic different latency