SMART SPEECH SEGMENTATION USING ACOUSTO-LINGUISTIC FEATURES
WITH LOOK-AHEAD
Piyush Behre Naveen Parihar Sharman Tan Amy Shah Eva Sharma
Geoffrey Liu Shuangyu Chang Hosam Khalil Chris Basoglu Sayan Pathak
Microsoft Corporation
ABSTRACT
Segmentation for continuous Automatic Speech Recog-
nition (ASR) has traditionally used silence timeouts or voice
activity detectors (VADs), which are both limited to acoustic
features. This segmentation is often overly aggressive, given
that people naturally pause to think as they speak. Conse-
quently, segmentation happens mid-sentence, hindering both
punctuation and downstream tasks like machine translation
for which high-quality segmentation is critical. Model-based
segmentation methods that leverage acoustic features are
powerful, but without an understanding of the language itself,
these approaches are limited. We present a hybrid approach
that leverages both acoustic and language information to
improve segmentation. Furthermore, we show that including
one word as a look-ahead boosts segmentation quality. On average, our models improve the segmentation F0.5 score by 9.8%
over baseline. We show that this approach works for multiple
languages. For the downstream task of machine translation,
it improves the translation BLEU score by an average of 1.05
points.
Index Terms—Speech recognition, audio segmentation,
decoder segmentation, continuous recognition
1. INTRODUCTION
As Automatic Speech Recognition (ASR) quality has im-
proved, the focus has gradually shifted from short utter-
ance scenarios such as Voice Search and Voice Assistants
to long-utterance scenarios such as Voice Typing and Meeting Transcription. In short-utterance scenarios, speech end-pointing is important for user-perceived latency and the overall user experience. For Voice Search and Voice Assistants, the primary goal is task completion, and elements of written-form language such as punctuation are less critical; the ASR output is rarely revisited after the task completes.
For long-form scenarios, the primary goal is to generate highly readable, well-formatted transcripts. Voice Typing aims to replace keyboard typing for tasks such as composing e-mails or documents, which are more “permanent” than search queries. Punctuation and capitalization errors therefore become as important as recognition errors.
Recent research has demonstrated that ASR models suffer
from several problems in the context of long-form utterances,
such as lack of generalization from short to long utterances
[1] and high WER and deletion rates [2, 3, 4]. The common practice for long-form ASR is to segment the input stream. Segmentation quality is critical for WER and for punctuation, which in turn is critical for readability
[5]. Furthermore, segmentation directly impacts downstream
tasks such as machine translation. Prior works have demon-
strated that improvements in segmentation and punctuation
lead to significant BLEU gains in machine translation [6, 7].
Conventionally, simple silence timeouts or voice activ-
ity detectors (VADs) have been used to determine segment
boundaries [8, 9]. Over the years, researchers have taken
more complex and model-based approaches to predicting
end-of-segment boundaries [10, 11, 12]. However, a clear
drawback of VAD and many such model-based approaches
is that they leverage only acoustic information, forgoing potential gains from incorporating semantic information from text [13]. Many works have addressed this issue in end-of-
query prediction, combining the prediction task with ASR
via end-to-end models [14, 15, 16, 17, 18]. Similarly, [19]
leveraged both acoustic and textual features via an end-to-end
segmenter for long-form ASR.
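The conventional silence-timeout scheme discussed above can be sketched in a few lines. This toy illustration is not any particular production VAD; the frame energies, threshold, and timeout length are invented for illustration. A boundary is emitted once frame energy stays below a threshold for a fixed number of consecutive frames, which is exactly why a long mid-sentence thinking pause triggers an unwanted cut.

```python
# Toy silence-timeout segmenter: emit a segment boundary after a run of
# `timeout_frames` consecutive low-energy frames. All values illustrative.

def silence_timeout_segments(frame_energies, threshold=0.1, timeout_frames=5):
    """Return (start, end) frame-index pairs for detected speech segments."""
    segments = []
    start = None          # first speech frame of the current segment
    last_speech = None    # most recent frame above the energy threshold
    silent_run = 0        # consecutive sub-threshold frames seen so far
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_speech = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= timeout_frames:      # timeout: close the segment
                segments.append((start, last_speech + 1))
                start, silent_run = None, 0
    if start is not None:                          # flush a trailing segment
        segments.append((start, last_speech + 1))
    return segments
```

Note that a pause shorter than the timeout is absorbed into one segment, while a longer pause splits the stream regardless of whether the speaker has finished a sentence; the sketch has no access to linguistic context at all.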
Our main contributions are as follows:
• We demonstrate that linguistic features improve de-
coder segmentation decisions
• We use look-ahead to further improve segmentation de-
cisions by leveraging more surrounding context
• We extend our approach to other languages and estab-
lish BLEU score gains on the downstream task of ma-
chine translation
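The contributions above can be illustrated with a toy decision rule. This is a sketch of the general idea, not the authors' model: it combines an acoustic cue (pause length) with a linguistic cue (a hypothetical end-of-sentence probability from a language model) and defers the decision until one look-ahead word is available. All weights, thresholds, and scores are invented for illustration.

```python
# Toy hybrid boundary decision (illustrative only, not the paper's model):
# fuse an acoustic pause cue with a linguistic end-of-sentence cue, and use
# one look-ahead word of right context before committing to a boundary.

def boundary_score(pause_sec, eos_prob, w_acoustic=0.5, w_linguistic=0.5):
    """Weighted evidence that a segment boundary follows the current word."""
    acoustic = min(pause_sec / 1.0, 1.0)   # saturate pause evidence at 1 s
    return w_acoustic * acoustic + w_linguistic * eos_prob

def decide_with_lookahead(pause_sec, eos_prob, next_word_starts_sentence,
                          threshold=0.6):
    """Commit to a boundary only after observing one look-ahead word."""
    score = boundary_score(pause_sec, eos_prob)
    # The look-ahead word supplies right context: if the next word looks
    # sentence-initial (e.g. a discourse marker), it supports the boundary.
    if next_word_starts_sentence:
        score += 0.2
    return score >= threshold
```

Under this rule, a long pause after a clearly sentence-final word produces a boundary, while the same pause mid-sentence (low end-of-sentence probability, non-initial next word) does not, which is the failure mode of purely acoustic segmentation that the hybrid approach targets.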
2. METHODS
2.1. Models
We describe three end-pointing techniques, each progres-
sively improving upon the previous. A key contribution
arXiv:2210.14446v2 [cs.CL] 27 Oct 2022