SMART SPEECH SEGMENTATION USING ACOUSTO-LINGUISTIC FEATURES
WITH LOOK-AHEAD
Piyush Behre Naveen Parihar Sharman Tan Amy Shah Eva Sharma
Geoffrey Liu Shuangyu Chang Hosam Khalil Chris Basoglu Sayan Pathak
Microsoft Corporation
ABSTRACT
Segmentation for continuous Automatic Speech Recog-
nition (ASR) has traditionally used silence timeouts or voice
activity detectors (VADs), which are both limited to acoustic
features. This segmentation is often overly aggressive, given
that people naturally pause to think as they speak. Conse-
quently, segmentation happens mid-sentence, hindering both
punctuation and downstream tasks like machine translation
for which high-quality segmentation is critical. Model-based
segmentation methods that leverage acoustic features are
powerful, but without an understanding of the language itself,
these approaches are limited. We present a hybrid approach
that leverages both acoustic and language information to
improve segmentation. Furthermore, we show that including
one word as a look-ahead boosts segmentation quality. On average, our models improve the segmentation F0.5 score by 9.8%
over baseline. We show that this approach works for multiple
languages. For the downstream task of machine translation,
it improves the translation BLEU score by an average of 1.05
points.
Index Terms—Speech recognition, audio segmentation,
decoder segmentation, continuous recognition
1. INTRODUCTION
As Automatic Speech Recognition (ASR) quality has im-
proved, the focus has gradually shifted from short utter-
ance scenarios such as Voice Search and Voice Assistants
to long-utterance scenarios such as Voice Typing and Meeting Transcription. In short-utterance scenarios, speech end-pointing is important for user-perceived latency and the overall user experience. For Voice Search and Voice Assistants, the primary goal is task completion, and elements of written-form language such as punctuation are less critical; the ASR output is rarely revisited after the task completes.
For long-form scenarios, the primary goal is to generate highly readable, well-formatted transcripts. Voice Typing aims to replace keyboard typing for tasks such as composing e-mails or documents, which are more “permanent” than search queries. Punctuation and capitalization errors therefore become as important as recognition errors.
Recent research has demonstrated that ASR models suffer
from several problems in the context of long-form utterances,
such as lack of generalization from short to long utterances
[1] and high WER and deletion rates [2, 3, 4]. The common practice for long-form ASR is to segment the input stream. Segmentation quality is critical for WER and for punctuation, which in turn is critical for readability
[5]. Furthermore, segmentation directly impacts downstream
tasks such as machine translation. Prior works have demon-
strated that improvements in segmentation and punctuation
lead to significant BLEU gains in machine translation [6, 7].
Conventionally, simple silence timeouts or voice activ-
ity detectors (VADs) have been used to determine segment
boundaries [8, 9]. Over the years, researchers have taken
more complex and model-based approaches to predicting
end-of-segment boundaries [10, 11, 12]. However, a clear
drawback of VAD and many such model-based approaches
is that they leverage only acoustic information, forgoing potential gains from incorporating semantic information from text [13]. Many works have addressed this issue in end-of-
query prediction, combining the prediction task with ASR
via end-to-end models [14, 15, 16, 17, 18]. Similarly, [19]
leveraged both acoustic and textual features via an end-to-end
segmenter for long-form ASR.
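The conventional silence-timeout scheme discussed above can be sketched in a few lines. This toy illustration is not any particular production VAD; the frame energies, threshold, and timeout length are invented for illustration. A boundary is emitted once frame energy stays below a threshold for a fixed number of consecutive frames, which is exactly why a long mid-sentence thinking pause triggers an unwanted cut.

```python
# Toy silence-timeout segmenter: emit a segment boundary after a run of
# `timeout_frames` consecutive low-energy frames. All values illustrative.

def silence_timeout_segments(frame_energies, threshold=0.1, timeout_frames=5):
    """Return (start, end) frame-index pairs for detected speech segments."""
    segments = []
    start = None          # first speech frame of the current segment
    last_speech = None    # most recent frame above the energy threshold
    silent_run = 0        # consecutive sub-threshold frames seen so far
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_speech = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= timeout_frames:      # timeout: close the segment
                segments.append((start, last_speech + 1))
                start, silent_run = None, 0
    if start is not None:                          # flush a trailing segment
        segments.append((start, last_speech + 1))
    return segments
```

Note that a pause shorter than the timeout is absorbed into one segment, while a longer pause splits the stream regardless of whether the speaker has finished a sentence; the sketch has no access to linguistic context at all.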
Our main contributions are as follows:
• We demonstrate that linguistic features improve de-
coder segmentation decisions
• We use look-ahead to further improve segmentation de-
cisions by leveraging more surrounding context
• We extend our approach to other languages and estab-
lish BLEU score gains on the downstream task of ma-
chine translation
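The contributions above can be illustrated with a toy decision rule. This is a sketch of the general idea, not the authors' model: it combines an acoustic cue (pause length) with a linguistic cue (a hypothetical end-of-sentence probability from a language model) and defers the decision until one look-ahead word is available. All weights, thresholds, and scores are invented for illustration.

```python
# Toy hybrid boundary decision (illustrative only, not the paper's model):
# fuse an acoustic pause cue with a linguistic end-of-sentence cue, and use
# one look-ahead word of right context before committing to a boundary.

def boundary_score(pause_sec, eos_prob, w_acoustic=0.5, w_linguistic=0.5):
    """Weighted evidence that a segment boundary follows the current word."""
    acoustic = min(pause_sec / 1.0, 1.0)   # saturate pause evidence at 1 s
    return w_acoustic * acoustic + w_linguistic * eos_prob

def decide_with_lookahead(pause_sec, eos_prob, next_word_starts_sentence,
                          threshold=0.6):
    """Commit to a boundary only after observing one look-ahead word."""
    score = boundary_score(pause_sec, eos_prob)
    # The look-ahead word supplies right context: if the next word looks
    # sentence-initial (e.g. a discourse marker), it supports the boundary.
    if next_word_starts_sentence:
        score += 0.2
    return score >= threshold
```

Under this rule, a long pause after a clearly sentence-final word produces a boundary, while the same pause mid-sentence (low end-of-sentence probability, non-initial next word) does not, which is the failure mode of purely acoustic segmentation that the hybrid approach targets.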
2. METHODS
2.1. Models
We describe three end-pointing techniques, each progres-
sively improving upon the previous. A key contribution
arXiv:2210.14446v2 [cs.CL] 27 Oct 2022