FOUR-IN-ONE: A JOINT APPROACH TO INVERSE TEXT NORMALIZATION,
PUNCTUATION, CAPITALIZATION, AND DISFLUENCY
FOR AUTOMATIC SPEECH RECOGNITION
Sharman Tan, Piyush Behre, Nick Kibre, Issac Alphonso, Shuangyu Chang
Microsoft Corporation
ABSTRACT
Features such as punctuation, capitalization, and formatting of entities are important for readability, understanding, and natural language processing tasks. However, Automatic Speech Recognition (ASR) systems produce spoken-form text devoid of formatting, and tagging approaches to formatting address just one or two features at a time. In this paper, we unify spoken-to-written text conversion via a two-stage process: First, we use a single transformer tagging model to jointly produce token-level tags for inverse text normalization (ITN), punctuation, capitalization, and disfluencies. Then, we apply the tags to generate written-form text and use weighted finite state transducer (WFST) grammars to format tagged ITN entity spans. Despite joining four models into one, our unified tagging approach matches or outperforms task-specific models across all four tasks on benchmark test sets across several domains.
Index Terms— automatic speech recognition, multi-task learning, inverse text normalization, spoken-text formatting, automatic punctuation
1. INTRODUCTION
Automatic Speech Recognition (ASR) systems produce unstructured spoken-form text that lacks the formatting of written-form text. Converting ASR outputs into written form involves applying features such as inverse text normalization (ITN), punctuation, capitalization, and disfluency removal. ITN formats entities such as numbers, dates, times, and addresses. Disfluency removal strips the spoken-form text of interruptions such as false starts, corrections, repetitions, and filled pauses. For example, the spoken-form string "um the meeting is on may fifth at three thirty p m" would be rendered as "The meeting is on May 5th at 3:30 PM."
Spoken-to-written text conversion is critical for readability and understanding [1], as well as for accurate downstream text processing. Prior works have emphasized the importance of well-formatted text for natural language processing (NLP) tasks including part-of-speech (POS) tagging [2], named entity recognition (NER) [3], machine translation [4], information extraction [5], and summarization [6].
The problem of spoken-to-written text conversion is complex. Punctuation restoration requires effectively capturing long-range dependencies in text. Techniques have evolved to do so, from n-gram and classical machine learning approaches to recurrent neural networks and, most recently, transformers [7]. Punctuation and capitalization may vary across domains, and prior works have examined legal [8] and medical [9] texts. ASR errors and production resource constraints pose additional challenges [10].
ITN often involves weighted finite state transducer (WFST) grammars [11] or sequence-to-sequence models [12]. Punctuation, capitalization, and disfluency removal are approached as machine translation [13] or, more commonly, as sequence labeling problems. Sequence labeling tags each token in the spoken-form text to signify the desired formatting. The translation approach is attractive as an end-to-end solution, but sequence labeling enforces structure and enables customization for domains that may only require partial formatting.
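
To make the sequence-labeling formulation concrete, the sketch below tags one spoken-form utterance for all four tasks and applies the non-ITN tags directly. The tag inventory and the apply_tags helper are illustrative assumptions, not the scheme used in this paper; a tagged ITN span would be handed to a WFST grammar rather than rewritten here.

# A minimal sketch of token-level tagging for spoken-to-written conversion.
# The tag names below are hypothetical, not this paper's actual inventory.

tokens = ["um", "it", "costs", "twenty", "five", "dollars"]

tags = {
    # ITN: B-/I- mark an entity span to be formatted by a WFST grammar.
    "itn":        ["O", "O", "O", "B-MONEY", "I-MONEY", "I-MONEY"],
    # Punctuation: symbol to append after the token, if any.
    "punct":      ["O", "O", "O", "O", "O", "PERIOD"],
    # Capitalization: U = uppercase the first letter.
    "cap":        ["O", "U", "O", "O", "O", "O"],
    # Disfluency: D = drop the token from the written form.
    "disfluency": ["D", "O", "O", "O", "O", "O"],
}

def apply_tags(tokens, tags):
    """Apply disfluency, capitalization, and punctuation tags."""
    out = []
    for i, tok in enumerate(tokens):
        if tags["disfluency"][i] == "D":
            continue                # remove the disfluent token
        if tags["cap"][i] == "U":
            tok = tok.capitalize()  # sentence-initial after removal
        if tags["punct"][i] == "PERIOD":
            tok += "."
        out.append(tok)
    return " ".join(out)

print(apply_tags(tokens, tags))  # -> It costs twenty five dollars.
# A WFST grammar would then rewrite the tagged span as "$25",
# giving the final written form "It costs $25."
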
Recent works have used pre-trained transformers to jointly predict punctuation together with capitalization [14] and disfluency [15]. Prosodic features also proved helpful for punctuation early on [16] and have since been used for punctuation and disfluency detection [17, 18]. Despite these joint approaches, no work thus far has completely unified tagging for all four tasks.
We frame spoken-to-written text conversion as a two-stage process. The first stage jointly tags spoken-form text for ITN, punctuation, capitalization, and disfluencies. The second stage applies each tag sequence and outputs written-form text, employing WFST grammars for ITN and simple conversions for the remaining tasks. To our knowledge, we are the first to jointly train a model to tag for the four tasks.
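
As a flavor of the second stage, the toy grammar below rewrites a cardinal span with the open-source pynini WFST library. The rules and the choice of pynini are our illustrative assumptions; the paper does not specify its grammar toolkit, and production ITN grammars compose many weighted, context-dependent rules.

import pynini

# Toy ITN rules, illustrative only: map number words to digits.
units = pynini.string_map([("one", "1"), ("two", "2"), ("three", "3")])
tens = pynini.string_map([("twenty", "2"), ("thirty", "3")])

# "twenty three" -> "23": a tens word, a deleted space, a units word.
two_digit = tens + pynini.cross(" ", "") + units

def rewrite(spoken: str) -> str:
    # Compose the input with the grammar and read off the best path.
    lattice = pynini.compose(spoken, two_digit)
    return pynini.shortestpath(lattice).string()

print(rewrite("twenty three"))  # -> 23
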
We make the following key contributions:
• We introduce a novel two-stage approach to spoken-to-written text conversion consisting of a single joint tagging model followed by a tag application stage, as described in section 2
• We define text processing pipelines for spoken- and written-form public datasets to jointly predict token-level ITN, punctuation, capitalization, and disfluency tags, as described in sections 3 and 4
• We report joint model performance on par with or exceeding task-specific models for each of the four tasks on a wide range of test sets, as described in section 5