FOUR-IN-ONE: A JOINT APPROACH TO INVERSE TEXT NORMALIZATION,
PUNCTUATION, CAPITALIZATION, AND DISFLUENCY
FOR AUTOMATIC SPEECH RECOGNITION
Sharman Tan, Piyush Behre, Nick Kibre, Issac Alphonso, Shuangyu Chang
Microsoft Corporation
ABSTRACT
Features such as punctuation, capitalization, and formatting of entities are important for readability, understanding, and natural language processing tasks. However, Automatic Speech Recognition (ASR) systems produce spoken-form text devoid of formatting, and tagging approaches to formatting address just one or two features at a time. In this paper, we unify spoken-to-written text conversion via a two-stage process: First, we use a single transformer tagging model to jointly produce token-level tags for inverse text normalization (ITN), punctuation, capitalization, and disfluencies. Then, we apply the tags to generate written-form text and use weighted finite state transducer (WFST) grammars to format tagged ITN entity spans. Despite joining four models into one, our unified tagging approach matches or outperforms task-specific models across all four tasks on benchmark test sets across several domains.
Index Terms— automatic speech recognition, multi-task learning, inverse text normalization, spoken-text formatting, automatic punctuation
1. INTRODUCTION
Automatic Speech Recognition (ASR) systems produce unstructured spoken-form text that lacks the formatting of written-form text. Converting ASR outputs into written form involves applying features such as inverse text normalization (ITN), punctuation, capitalization, and disfluency removal. ITN formats entities such as numbers, dates, times, and addresses. Disfluency removal strips the spoken-form text of interruptions such as false starts, corrections, repetitions, and filled pauses. For example, the spoken-form string "um the meeting is on may fifth at three thirty p m" would be rendered as "The meeting is on May 5th at 3:30 PM."
Spoken-to-written text conversion is critical for readability and understanding [1], as well as for accurate downstream text processing. Prior works have emphasized the importance of well-formatted text for natural language processing (NLP) tasks including part-of-speech (POS) tagging [2], named entity recognition (NER) [3], machine translation [4], information extraction [5], and summarization [6].
The problem of spoken-to-written text conversion is complex. Punctuation restoration requires effectively capturing long-range dependencies in text. Techniques have evolved to do so, from n-gram and classical machine learning approaches to recurrent neural networks and, most recently, transformers [7]. Punctuation and capitalization may vary across domains, and prior works have examined legal [8] and medical [9] texts. ASR errors and production resource constraints pose additional challenges [10].
ITN often involves weighted finite state transducer (WFST) grammars [11] or sequence-to-sequence models [12]. Punctuation, capitalization, and disfluency removal are approached as machine translation [13] or, more commonly, as sequence labeling problems. Sequence labeling tags each token in the spoken-form text to signify the desired formatting. The translation approach is attractive as an end-to-end solution, but sequence labeling enforces structure and enables customization for domains that may only require partial formatting.
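
To make the sequence-labeling formulation concrete, the sketch below tags one spoken-form utterance for all four tasks and applies the non-ITN tags directly. The tag inventory and the apply_tags helper are illustrative assumptions, not the scheme used in this paper; a tagged ITN span would be handed to a WFST grammar rather than rewritten here.

# A minimal sketch of token-level tagging for spoken-to-written conversion.
# The tag names below are hypothetical, not this paper's actual inventory.

tokens = ["um", "it", "costs", "twenty", "five", "dollars"]

tags = {
    # ITN: B-/I- mark an entity span to be formatted by a WFST grammar.
    "itn":        ["O", "O", "O", "B-MONEY", "I-MONEY", "I-MONEY"],
    # Punctuation: symbol to append after the token, if any.
    "punct":      ["O", "O", "O", "O", "O", "PERIOD"],
    # Capitalization: U = uppercase the first letter.
    "cap":        ["O", "U", "O", "O", "O", "O"],
    # Disfluency: D = drop the token from the written form.
    "disfluency": ["D", "O", "O", "O", "O", "O"],
}

def apply_tags(tokens, tags):
    """Apply disfluency, capitalization, and punctuation tags."""
    out = []
    for i, tok in enumerate(tokens):
        if tags["disfluency"][i] == "D":
            continue                # remove the disfluent token
        if tags["cap"][i] == "U":
            tok = tok.capitalize()  # sentence-initial after removal
        if tags["punct"][i] == "PERIOD":
            tok += "."
        out.append(tok)
    return " ".join(out)

print(apply_tags(tokens, tags))  # -> It costs twenty five dollars.
# A WFST grammar would then rewrite the tagged span as "$25",
# giving the final written form "It costs $25."
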
Recent works have used pre-trained transformers to jointly predict punctuation together with capitalization [14] and disfluency [15]. Prosodic features also proved helpful for punctuation early on [16] and have since been used for punctuation and disfluency detection [17, 18]. Despite these joint approaches, no work thus far has completely unified tagging for all four tasks.
We frame spoken-to-written text conversion as a two-stage process. The first stage jointly tags spoken-form text for ITN, punctuation, capitalization, and disfluencies. The second stage applies each tag sequence and outputs written-form text, employing WFST grammars for ITN and simple conversions for the remaining tasks. To our knowledge, we are the first to jointly train a model to tag for the four tasks.
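
As a flavor of the second stage, the toy grammar below rewrites a cardinal span with the open-source pynini WFST library. The rules and the choice of pynini are our illustrative assumptions; the paper does not specify its grammar toolkit, and production ITN grammars compose many weighted, context-dependent rules.

import pynini

# Toy ITN rules, illustrative only: map number words to digits.
units = pynini.string_map([("one", "1"), ("two", "2"), ("three", "3")])
tens = pynini.string_map([("twenty", "2"), ("thirty", "3")])

# "twenty three" -> "23": a tens word, a deleted space, a units word.
two_digit = tens + pynini.cross(" ", "") + units

def rewrite(spoken: str) -> str:
    # Compose the input with the grammar and read off the best path.
    lattice = pynini.compose(spoken, two_digit)
    return pynini.shortestpath(lattice).string()

print(rewrite("twenty three"))  # -> 23
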
We make the following key contributions:
• We introduce a novel two-stage approach to spoken-to-written text conversion consisting of a single joint tagging model followed by a tag application stage, as described in section 2
• We define text processing pipelines for spoken- and written-form public datasets to jointly predict token-level ITN, punctuation, capitalization, and disfluency tags, as described in sections 3 and 4
• We report joint model performance on par with or exceeding task-specific models for each of the four tasks on a wide range of test sets, as described in section 5