FOUR-IN-ONE: A JOINT APPROACH TO INVERSE TEXT NORMALIZATION,
PUNCTUATION, CAPITALIZATION, AND DISFLUENCY
FOR AUTOMATIC SPEECH RECOGNITION
Sharman Tan, Piyush Behre, Nick Kibre, Issac Alphonso, Shuangyu Chang
Microsoft Corporation
ABSTRACT
Features such as punctuation, capitalization, and format-
ting of entities are important for readability, understanding,
and natural language processing tasks. However, Automatic
Speech Recognition (ASR) systems produce spoken-form
text devoid of formatting, and tagging approaches to format-
ting address just one or two features at a time. In this paper,
we unify spoken-to-written text conversion via a two-stage
process: First, we use a single transformer tagging model
to jointly produce token-level tags for inverse text normal-
ization (ITN), punctuation, capitalization, and disfluencies.
Then, we apply the tags to generate written-form text and use
weighted finite state transducer (WFST) grammars to format
tagged ITN entity spans. Despite joining four models into
one, our unified tagging approach matches or outperforms
task-specific models across all four tasks on benchmark test
sets across several domains.
Index Terms— automatic speech recognition, multi-task
learning, inverse text normalization, spoken-text formatting,
automatic punctuation
1. INTRODUCTION
Automatic Speech Recognition (ASR) systems produce un-
structured spoken-form text that lacks the formatting of
written-form text. Converting ASR outputs into written form
involves applying features such as inverse text normalization
(ITN), punctuation, capitalization, and disfluency removal.
ITN formats entities such as numbers, dates, times, and ad-
dresses. Disfluency removal strips the spoken-form text of
interruptions such as false starts, corrections, repetitions, and
filled pauses.
Spoken-to-written text conversion is critical for readabil-
ity and understanding [1], as well as for accurate downstream
text processing. Prior works have emphasized the importance
of well-formatted text for natural language processing (NLP)
tasks including part-of-speech (POS) tagging [2], named en-
tity recognition (NER) [3], machine translation [4], informa-
tion extraction [5], and summarization [6].
The problem of spoken-to-written text conversion is com-
plex. Punctuation restoration requires effectively capturing
long-range dependencies in text. Techniques have evolved to
do so, from n-gram and classical machine learning approaches
to recurrent neural networks and, most recently, transform-
ers [7]. Punctuation and capitalization may vary across do-
mains, and prior works have examined legal [8] and medi-
cal [9] texts. ASR errors and production resource constraints
pose additional challenges [10].
ITN often involves weighted finite state transducer (WFST)
grammars [11] or sequence-to-sequence models [12]. Punctu-
ation, capitalization, and disfluency removal are approached
as machine translation [13] or, more commonly, sequence
labeling problems. Sequence labeling tags each token in the
spoken-form text to signify the desired formatting. The trans-
lation approach is attractive as an end-to-end solution, but
sequence labeling enforces structure and enables customiza-
tion for domains that may only require partial formatting.
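To make the sequence-labeling framing concrete, the sketch below shows one spoken-form utterance with one tag per token per task. The tag names and BIO-style scheme are our own illustration, not the paper's actual label inventory:

```python
# Hypothetical token-level tags for the four tasks on one utterance.
# Tag names and the BIO scheme are illustrative only.
tokens = ["um", "wake", "me", "at", "eight", "thirty"]

tags = {
    # ITN: B-/I- spans mark entities to be rewritten (e.g., "8:30").
    "itn":            ["O", "O", "O", "O", "B-TIME", "I-TIME"],
    # Punctuation: the mark (if any) appended after the word.
    "punctuation":    ["O", "O", "O", "O", "O", "PERIOD"],
    # Capitalization: CAP = capitalize first letter.
    "capitalization": ["O", "CAP", "O", "O", "O", "O"],
    # Disfluency: DISFL-tagged tokens are removed.
    "disfluency":     ["DISFL", "O", "O", "O", "O", "O"],
}

# Every task emits one tag per token, so the sequences stay aligned.
assert all(len(seq) == len(tokens) for seq in tags.values())
```

Keeping all four tag sequences aligned to the same tokens is what lets a single model predict them jointly from one shared representation.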
Recent works have used pre-trained transformers to jointly
predict punctuation together with capitalization [14] and dis-
fluency [15]. Prosodic features have also proven helpful for
punctuation early on [16] and have since been used for punc-
tuation and disfluency detection [17, 18]. Despite these joint
approaches, no work thus far has completely unified tagging
for all four tasks.
We frame spoken-to-written text conversion as a two-
stage process. The first stage jointly tags spoken-form text
for ITN, punctuation, capitalization, and disfluencies. The
second stage applies each tag sequence and outputs written-
form text, employing WFST grammars for ITN and simple
conversions for the remaining tasks. To our knowledge, we
are the first to jointly train a model to tag for the four tasks.
We make the following key contributions:
• We introduce a novel two-stage approach to spoken-to-written
text conversion consisting of a single joint tagging model
followed by a tag application stage, as described in Section 2.
• We define text processing pipelines for spoken- and
written-form public datasets to jointly predict token-level ITN,
punctuation, capitalization, and disfluency tags, as described
in Sections 3 and 4.
• We report joint model performance on par with or exceeding
task-specific models for each of the four tasks on a wide range
of test sets, as described in Section 5.
978-1-6654-7189-3/22/$31.00 ©2023 IEEE
arXiv:2210.15063v1 [cs.CL] 26 Oct 2022
2. PROPOSED METHOD
In this section, we describe our two-stage approach to for-
matting spoken-form ASR outputs. Figure 1 illustrates the
end-to-end workflow of our proposed method.
2.1. Joint labeling of ITN, punctuation, capitalization,
and disfluency
Stage 1 addresses ASR formatting as a multiple sequence la-
beling problem. We first tokenize the spoken-form text and
then use a transformer encoder [7] to learn a shared represen-
tation of the input. Four task-specific classification heads –
corresponding to ITN, punctuation, capitalization, and disflu-
ency – predict four token-level tag sequences from the shared
representation. Each classification head consists of a dropout
layer followed by a fully connected layer.
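The architecture can be sketched as follows. This is a minimal PyTorch stand-in: the paper uses a pre-trained transformer encoder, whereas here the encoder, vocabulary size, hidden size, and tag-set sizes are all placeholders chosen for illustration:

```python
import torch
import torch.nn as nn

# Placeholder sizes; the paper's encoder and tag inventories differ.
VOCAB, D_MODEL = 1000, 64
NUM_TAGS = {"itn": 5, "punctuation": 4, "capitalization": 3, "disfluency": 2}

class JointTagger(nn.Module):
    """Shared encoder with four token-level classification heads."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Each head: a dropout layer followed by a fully connected layer.
        self.heads = nn.ModuleDict({
            task: nn.Sequential(nn.Dropout(0.1), nn.Linear(D_MODEL, n))
            for task, n in NUM_TAGS.items()
        })

    def forward(self, token_ids):
        shared = self.encoder(self.embed(token_ids))  # (batch, seq, D_MODEL)
        # Four tag-logit sequences, all from the shared representation.
        return {task: head(shared) for task, head in self.heads.items()}

model = JointTagger()
logits = model(torch.randint(0, VOCAB, (2, 16)))  # batch of 2, 16 tokens
```

Each head produces per-token logits over its own tag set, so a single forward pass yields all four tag sequences at once.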
We use the cross-entropy (CE) loss function and jointly
optimize all four tasks by minimizing an evenly weighted
combination of the losses, as shown in Eq. (1):

CE_joint = (CE_i + CE_p + CE_c + CE_d) / 4    (1)

where CE_i, CE_p, CE_c, and CE_d are the cross-entropy loss
functions for ITN, punctuation, capitalization, and disfluency,
respectively. Our task-specific experiments, described in Sec-
tion 4, optimize just the loss for the single task at hand.
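The joint objective in Eq. (1) amounts to averaging four per-task cross-entropy losses. A minimal sketch with random logits and labels (tag-set sizes are placeholders, not the paper's):

```python
import torch
import torch.nn.functional as F

batch, seq = 2, 16
# Placeholder tag-set sizes for the four tasks.
num_tags = {"itn": 5, "punctuation": 4, "capitalization": 3, "disfluency": 2}

losses = {}
for task, n in num_tags.items():
    logits = torch.randn(batch, seq, n)          # per-token tag logits
    labels = torch.randint(0, n, (batch, seq))   # per-token gold tags
    # cross_entropy expects (N, C) logits; flatten the token dimension.
    losses[task] = F.cross_entropy(logits.reshape(-1, n), labels.reshape(-1))

# CE_joint = (CE_i + CE_p + CE_c + CE_d) / 4
ce_joint = sum(losses.values()) / len(losses)
```

A task-specific baseline, by contrast, would backpropagate just one of the four losses.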
2.2. Tag application
Stage 2 uses the four tag sequences to format the spoken-form
ASR outputs as their written form. Since the tag sequences
are token-level, we convert them to word-level for tag appli-
cation.
To format ITN entities, we extract each span of ITN to-
kens that are consecutively tagged as the same ITN entity type
and span. Then, we apply WFST grammar for that entity type
to generate the written form.
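Span extraction can be sketched as below, assuming BIO-style ITN tags (the tag scheme is our own illustration; the real system hands each extracted span to the WFST grammar for its entity type):

```python
def extract_itn_spans(tokens, itn_tags):
    """Group consecutive tokens tagged with the same ITN entity type
    into (entity_type, token_span) pairs. Assumes illustrative BIO-style
    tags such as B-TIME / I-TIME; 'O' marks tokens outside any entity."""
    spans, current_type, current = [], None, []
    for tok, tag in zip(tokens, itn_tags):
        if tag.startswith("B-"):
            if current:                      # close the previous span
                spans.append((current_type, current))
            current_type, current = tag[2:], [tok]
        elif tag.startswith("I-") and tag[2:] == current_type:
            current.append(tok)              # extend the open span
        else:
            if current:
                spans.append((current_type, current))
            current_type, current = None, []
    if current:                              # span running to end of input
        spans.append((current_type, current))
    return spans

spans = extract_itn_spans(
    ["wake", "me", "at", "eight", "thirty"],
    ["O", "O", "O", "B-TIME", "I-TIME"],
)
# spans == [("TIME", ["eight", "thirty"])]
```

Each resulting span would then be rewritten by the entity-specific WFST grammar (here, "eight thirty" → "8:30").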
ITN formatting may change the number of words in the
sequence, so we preserve alignments between the original
spoken-form tokens and the formatted ITN entities. When
multiple spoken-form tokens map to a single WFST output,
we only apply the last punctuation tag and the first capitaliza-
tion tag. For punctuation, we append the indicated punctua-
tion tags to the corresponding words. For capitalization, we
capitalize the first letter or entirety of words, as tagged.
To remove disfluencies, we simply remove the disfluency-
tagged words from the text sequence.
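The punctuation, capitalization, and disfluency conversions above are simple word-level operations. A minimal sketch, with an illustrative tag inventory of our own (PERIOD, COMMA, CAP, UPPER, DISFL, O):

```python
def apply_tags(words, punct, cap, disfl):
    """Render written-form text from word-level tag sequences.
    Tag names are illustrative, not the paper's label inventory."""
    marks = {"PERIOD": ".", "COMMA": ",", "QUESTION": "?"}
    out = []
    for w, p, c, d in zip(words, punct, cap, disfl):
        if d == "DISFL":              # drop disfluency-tagged words
            continue
        if c == "CAP":                # capitalize the first letter
            w = w[0].upper() + w[1:]
        elif c == "UPPER":            # capitalize the entire word
            w = w.upper()
        out.append(w + marks.get(p, ""))  # append punctuation, if any
    return " ".join(out)

text = apply_tags(
    ["uh", "okay", "see", "you", "monday"],
    ["O", "COMMA", "O", "O", "PERIOD"],
    ["O", "CAP", "O", "O", "CAP"],
    ["DISFL", "O", "O", "O", "O"],
)
# text == "Okay, see you Monday."
```

Because all three tag sequences are applied over the same word sequence, the conversions compose without re-tokenizing.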
Although we compare task-specific and joint models for
our experiments, using four independent task-specific mod-
els in real scenarios may result in undesirable conflicts be-
tween features. For instance, predicted punctuation may not
line up with predicted beginning-of-sentence capitalization.
The joint model’s shared representations encourage predictions
for ITN, punctuation, capitalization, and disfluency to be
consistent, avoiding such conflicts later in the tag application
stage.

Fig. 1. Flow chart for joint prediction of ITN, punctuation,
capitalization, and disfluency.
3. DATA PROCESSING PIPELINE
3.1. Datasets
We use public datasets from various domains as well as addi-
tional sets specifically targeting ITN and disfluency. Table 1
shows the word count distributions by percentage among the
sets.
OpenWebText [19]: This dataset consists of web content ex-
tracted from URLs shared on Reddit with at least three up-
votes.
Stack Exchange¹: This dataset consists of user-contributed
content on the Stack Exchange network.
OpenSubtitles2016 [20]: This dataset consists of movie and
TV subtitles.
Multimodal Aligned Earnings Conference (MAEC) [21]:
This dataset consists of transcribed earnings calls based on
S&P 1500 companies.
¹ https://archive.org/details/stackexchange