SMART SPEECH SEGMENTATION USING ACOUSTO-LINGUISTIC FEATURES
WITH LOOK-AHEAD
Piyush Behre Naveen Parihar Sharman Tan Amy Shah Eva Sharma
Geoffrey Liu Shuangyu Chang Hosam Khalil Chris Basoglu Sayan Pathak
Microsoft Corporation
ABSTRACT
Segmentation for continuous Automatic Speech Recognition (ASR) has traditionally used silence timeouts or voice activity detectors (VADs), both of which are limited to acoustic features. This segmentation is often overly aggressive, given that people naturally pause to think as they speak. Consequently, segmentation happens mid-sentence, hindering both punctuation and downstream tasks like machine translation for which high-quality segmentation is critical. Model-based segmentation methods that leverage acoustic features are powerful, but without an understanding of the language itself, these approaches are limited. We present a hybrid approach that leverages both acoustic and language information to improve segmentation. Furthermore, we show that including one word as a look-ahead boosts segmentation quality. On average, our models improve the segmentation F0.5 score by 9.8% over the baseline. We show that this approach works for multiple languages. For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
Index Terms— Speech recognition, audio segmentation, decoder segmentation, continuous recognition
1. INTRODUCTION
As Automatic Speech Recognition (ASR) quality has improved, the focus has gradually shifted from short-utterance scenarios such as Voice Search and Voice Assistants to long-utterance scenarios such as Voice Typing and Meeting Transcription. In short-utterance scenarios, speech end-pointing is important for user-perceived latency and the overall user experience. Voice Search and Voice Assistants are scenarios where the primary goal is task completion, so elements of written-form language such as punctuation are less critical; the output of ASR is rarely revisited after task completion.

For long-form scenarios, the primary goal is to generate highly readable, well-formatted transcripts. Voice Typing aims to replace typing on a keyboard for tasks such as composing e-mails or documents, which are more “permanent” than search queries. Here, punctuation and capitalization errors become as important as recognition errors.
Recent research has demonstrated that ASR models suffer from several problems in the context of long-form utterances, such as a lack of generalization from short to long utterances [1] and high WER and deletion errors [2, 3, 4]. The common practice in long-form ASR is to segment the input stream. Segmentation quality is critical for optimal WER and punctuation, which is in turn critical for readability [5]. Furthermore, segmentation directly impacts downstream tasks such as machine translation. Prior works have demonstrated that improvements in segmentation and punctuation lead to significant BLEU gains in machine translation [6, 7].
Conventionally, simple silence timeouts or voice activity detectors (VADs) have been used to determine segment boundaries [8, 9]. Over the years, researchers have taken more complex, model-based approaches to predicting end-of-segment boundaries [10, 11, 12]. However, a clear drawback of VAD and many such model-based approaches is that they leverage only acoustic information, forgoing potential gains from incorporating semantic information from text [13]. Many works have addressed this issue in end-of-query prediction, combining the prediction task with ASR via end-to-end models [14, 15, 16, 17, 18]. Similarly, [19] leveraged both acoustic and textual features via an end-to-end segmenter for long-form ASR.
Our main contributions are as follows:

- We demonstrate that linguistic features improve decoder segmentation decisions.
- We use look-ahead to further improve segmentation decisions by leveraging more surrounding context.
- We extend our approach to other languages and establish BLEU score gains on the downstream task of machine translation.
2. METHODS
2.1. Models
We describe three end-pointing techniques, each progressively improving upon the previous. A key contribution of this paper is introducing an RNN-LM into the segmentation decision-making process, which becomes even more powerful when using a look-ahead. Once segments are produced, they continue through a punctuation stage, where a transformer-based punctuation model punctuates each segment. This punctuation model is fixed for all following setups.

Fig. 1: Flow chart illustrating the hybrid segmentation setup incorporating decisions from the VAD-EOS and LM-EOS models. x represents LFB-80 features, while w represents word embeddings.
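To make the data flow concrete, a high-level sketch of the Fig. 1 pipeline follows. It is illustrative only: the vad_eos, asr, lm_eos, and punctuate interfaces are hypothetical placeholders, not the paper's actual components.

# High-level sketch of the Fig. 1 pipeline: acoustic features (x) feed the
# VAD-EOS model, recognized words (w) feed the LM-EOS model, and finished
# segments feed a fixed transformer-based punctuation model.
# All component interfaces here are hypothetical placeholders.

def run_pipeline(audio_frames, vad_eos, asr, lm_eos, punctuate):
    words, segments = [], []
    for frame in audio_frames:
        words.extend(asr.decode(frame))          # incremental ASR hypotheses
        if vad_eos.boundary(frame) and lm_eos.boundary(words):
            segments.append(punctuate(words))    # punctuate the closed segment
            words = []
    if words:                                    # flush the final segment
        segments.append(punctuate(words))
    return segments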
2.1.1. Acoustic/prosodic-only signals (v1)
In this baseline system, segmentation decisions are based on a pre-defined silence threshold; the default in such systems is typically 500 ms. This threshold may vary by locale, given that speech rate as well as the frequency and duration of pauses may vary from language to language. It may also vary by scenario. For instance, people tend to speak faster in conversation than in dictation, so the optimal silence-based timeout threshold may be higher for dictation than for conversational scenarios like meeting transcription. In addition to the silence threshold, the system uses VAD models, which produce better speech end-pointing than a simple silence-based timeout approach [8, 9].
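As a concrete illustration of the v1 logic, here is a minimal sketch of silence-timeout end-pointing. Only the 500 ms default comes from the text above; the 10 ms frame hop and the per-frame boolean VAD flags are assumptions.

# Minimal sketch of v1-style end-pointing: emit a segment boundary once
# accumulated trailing silence exceeds a threshold (500 ms default, per the
# paper). `frames_is_speech` is one boolean per 10 ms frame, as a VAD might
# produce; both the hop size and the flag format are assumptions.

FRAME_MS = 10               # frame hop, assumed
SILENCE_TIMEOUT_MS = 500    # default threshold from the paper

def v1_segment(frames_is_speech):
    """Yield (start_frame, end_frame) segments from per-frame VAD flags."""
    segments = []
    start, silence_ms = None, 0
    for i, is_speech in enumerate(frames_is_speech):
        if is_speech:
            if start is None:
                start = i       # first speech frame opens a segment
            silence_ms = 0
        elif start is not None:
            silence_ms += FRAME_MS
            if silence_ms >= SILENCE_TIMEOUT_MS:
                segments.append((start, i))   # timeout closes the segment
                start, silence_ms = None, 0
    if start is not None:
        segments.append((start, len(frames_is_speech)))
    return segments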
2.1.2. Acousto-linguistic signals (v2)
In natural speech, people often pause disproportionately, for instance to think mid-sentence. As a result, the v1 setup can end-point too aggressively. In the v2 setup, we introduce a language model to offer a second opinion based on linguistic features. We call this an LM-EOS (Language Model – End of Segment predictor) model, as shown in Fig. 1. Since the v2 setup incorporates both acoustic and linguistic features in its decision-making, it avoids obvious error cases from v1.
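The v2 decision rule can be sketched as follows. The lm_eos_prob scorer and the 0.5 threshold are hypothetical; the paper specifies only that the LM offers a second opinion once the VAD proposes a boundary.

# Sketch of the v2 hybrid decision: the VAD proposes a boundary on silence,
# and the LM-EOS model vetoes it if the words so far look mid-sentence.
# `lm_eos_prob` is a hypothetical scorer (e.g. an RNN-LM with an <eos>
# output); the 0.5 threshold is illustrative, not from the paper.

def hybrid_should_segment(vad_fired, words_so_far, lm_eos_prob, threshold=0.5):
    """Segment only when acoustic and linguistic evidence agree."""
    if not vad_fired:               # no acoustic evidence of a boundary
        return False
    # LM-EOS gives a second opinion from the words recognized so far.
    return lm_eos_prob(words_so_far) >= threshold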
2.1.3. Acousto-linguistic signals with look-ahead (v3)
In v2, LM-EOS has access only to left context when predicting end-of-segment boundaries. As prior work has established, such a setup is severely limiting for punctuation tasks, where right context is important for optimal punctuation quality [7]. Therefore, in the v3 setup we incorporate right context into the LM-EOS predictions.
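Extending the previous sketch, v3 simply defers the decision until one look-ahead word is available and lets the scorer see it. Again, lm_eos_prob_la is a hypothetical interface, not the paper's model.

# Sketch of v3: defer the v2 decision until one look-ahead word arrives,
# and condition the LM-EOS score on it. `lm_eos_prob_la` is hypothetical:
# it scores whether a boundary after `words` is valid given the next word.

def v3_should_segment(vad_fired, words, next_word, lm_eos_prob_la,
                      threshold=0.5):
    """Decide with one word of right context (look-ahead)."""
    if not vad_fired:
        return False
    if next_word is None:
        return False    # look-ahead word not yet available: defer
    return lm_eos_prob_la(words, next_word) >= threshold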
2.2. Model Training
Our VAD follows prior work, which has extensively covered VAD implementation details [8]. Here, we describe LM-EOS training in detail. First, let us establish the goal for these models.
2.2.1. LM-EOS model with no look-ahead
The model used in v2 is trained to predict whether the input sequence is a valid end of a sentence, looking only at the past. As the examples below illustrate, predicting from left context alone can be quite limiting. To train this model, we use the Open Web Text corpus [20], splitting the data into rows with one sentence per row. Each sentence must end in a period or a question mark, and we discard any sentence containing punctuation other than periods, commas, and question marks. We then normalize the rows into spoken form using a WFST (Weighted Finite-State Transducer)-based text normalization tool. The LM-EOS model should predict end-of-segment (⟨eos⟩) for every one of these rows, as each row is a complete sentence. To balance this set of sentences with countercases, we take each sentence and delete its last word; for each of these modified sequences, the model is trained to predict non-EOS. Examples of the resulting training sequences are shown in Table 1.
Id   Input                           Output
A1   how is the weather in seattle   O O O O O eos
A2   how is the weather in           O O O O O
B1   i'm new in town                 O O O eos
B2   i'm new in                      O O O
C1   wake me up at noon tomorrow     O O O O O eos
C2   wake me up at noon              O O O O O

Table 1: v2 training data sample rows
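The construction of Table 1 can be sketched as follows. The normalize_spoken function is a trivial stand-in for the WFST-based normalizer, and the punctuation filter is approximate; both are illustrative assumptions.

# Sketch of the v2 training-data construction: each spoken-form sentence
# becomes a positive example tagged O ... O eos, and the same sentence with
# its last word deleted becomes a negative example tagged all O.

import string

ALLOWED_PUNCT = set(".,?'")  # apostrophes occur inside words like "i'm"

def is_clean(sentence):
    """Keep only sentences whose punctuation is periods, commas, or ?."""
    return all(ch not in string.punctuation or ch in ALLOWED_PUNCT
               for ch in sentence)

def normalize_spoken(text):
    # Trivial stand-in for the paper's WFST-based written-to-spoken normalizer.
    return text.lower().rstrip(".?").replace(",", "")

def make_examples(sentence):
    """Return (words, tags) training pairs for one sentence, or [] if filtered."""
    if not sentence.endswith((".", "?")) or not is_clean(sentence):
        return []
    words = normalize_spoken(sentence).split()
    if len(words) < 2:
        return []
    positive = (words, ["O"] * (len(words) - 1) + ["eos"])   # e.g. row A1
    negative = (words[:-1], ["O"] * (len(words) - 1))        # e.g. row A2
    return [positive, negative]

print(make_examples("How is the weather in Seattle?"))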