
REAL-TIME SPEECH INTERRUPTION ANALYSIS: FROM CLOUD TO CLIENT DEPLOYMENT
Quchen Fu†, Szu-Wei Fu‡, Yaran Fan‡, Yu Wu‡, Zhuo Chen‡, Jayant Gupchup‡1, Ross Cutler‡
†Dept. of Computer Science, Vanderbilt University
‡Microsoft Corporation
ABSTRACT
Meetings are an essential form of communication for all types
of organizations, and remote collaboration systems have been
much more widely used since the COVID-19 pandemic. One
major issue with remote meetings is that it is challenging
for remote participants to interrupt and speak. We have re-
cently developed the first speech interruption analysis model
WavLM SI, which detects failed speech interruptions, shows
very promising performance, and is being deployed in the
cloud. To deliver this feature in a more cost-efficient and
environmentally friendly way, we reduced the model's complexity
and size to ship WavLM SI on client devices. In
this paper, we first describe how we successfully improved the
True Positive Rate (TPR) at a 1% False Positive Rate (FPR)
from 50.9% to 68.3% for the failed speech interruption detec-
tion model by training on a larger dataset and fine-tuning. We
then shrank the model size from 222.7 MB to 9.3 MB with
an acceptable loss in accuracy and reduced the complexity
from 31.2 GMACS (Giga Multiply-Accumulate Operations
per Second) to 4.3 GMACS. We also estimated the environmental
impact of the complexity reduction, which can serve as a
general guideline for large Transformer-based models, making
those models more accessible with less computational overhead.
Index Terms— Semi-Supervised Learning, Model Size Reduction, Speech Interruption Detection
1. INTRODUCTION
The inability of virtual meeting participants to interrupt and
speak has been identified as the largest impediment to more
inclusive online meetings [1]. Remote participants often
attempt to join a discussion but cannot get the floor while
other speakers are talking. We refer to such attempts as failed
interruptions.
Too many failed interruptions may alienate participants
and result in a less effective and inclusive meeting environ-
ment, and further impact the overall working environment and
employee retention at organizations [2]. Our previous work [2]
1
Work performed while at Microsoft, now affiliated with Uber technolo-
gies.
has explored the feasibility of mitigating this issue by creating
a failed interruption detection model that prompts the failed
interrupter to raise their virtual hands and gain attention, a
feature that is helpful to improve meeting inclusiveness but
rarely used.
The WavLM-based Speech Interruption Detection model
(WavLM SI) [2] was deployed in the Azure cloud and has
proven to be a useful feature. Client integration allows us
to reach a much wider range of customers without the cost
of cloud deployment, as well as lower environmental impact
since it does not have the overhead of a dedicated service.
However, the original model is based on a pre-trained speech
model, which is large and computationally expensive, and
therefore can not be run on client devices. To be deployed in
the client, the model needs to be small in size to meet memory
constraints, computationally lightweight to incorporate less
capable processors, and energy efficient to save battery life. A
demo video of WaveML SI is available here.
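One reason the memory constraint is attainable is weight quantization: storing weights as int8 with a per-tensor scale cuts storage roughly 4× relative to float32, before any pruning. The sketch below is a generic illustration of symmetric int8 quantization in plain NumPy, not the WavLM SI implementation, and shows that the reconstruction error stays bounded by one quantization step.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to float32 approximations."""
    return q.astype(np.float32) * scale

# A stand-in weight matrix (not real model weights).
w = np.random.default_rng(0).normal(0.0, 0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes // q.nbytes)  # 4: int8 takes a quarter of float32 storage
# Rounding error is at most half a quantization step (0.5 * scale).
print(float(np.abs(w - dequantize(q, scale)).max()) <= scale)
```

In practice, shipped models combine such weight quantization with pruning and per-channel scales; this per-tensor variant is the simplest form of the idea.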
This paper provides three contributions to the study of
speech interruption analysis. First, we increased the state-of-
the-art performance of the speech interruption detection task
from 50.9% to 68.3% TPR at a fixed 1% FPR. Second, we con-
ducted comprehensive work on model structure engineering,
including the bottleneck analysis for client integration, and cre-
ated a customized model under strict computation and memory
constraints. Third, we showed how pruning and quantization
can reduce the model size by 23× and the complexity by 9×
with acceptable performance; this trade-off between
size/complexity and accuracy is smooth and can accommodate a
wide variety of client devices.
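The operating point used throughout, TPR at a fixed 1% FPR, is obtained by choosing the score threshold that admits at most 1% of negative examples and measuring the fraction of positives above it. A minimal NumPy sketch on synthetic scores (not the paper's data) illustrates the computation:

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=0.01):
    """True-positive rate at the threshold whose FPR <= target_fpr."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    neg = np.sort(scores[~labels])[::-1]       # negative scores, descending
    k = int(np.floor(target_fpr * neg.size))   # negatives allowed above threshold
    thresh = neg[k] if k < neg.size else -np.inf
    return float(np.mean(scores[labels] > thresh))

# Synthetic scores: positives shifted above negatives.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)])
labels = np.concatenate([np.ones(1000, dtype=bool), np.zeros(1000, dtype=bool)])
print(tpr_at_fpr(scores, labels))
```

The same quantity can be read off a full ROC curve; computing it directly, as here, makes the fixed-FPR constraint explicit.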
The remainder of this paper is organized as follows: Section 2
summarizes prior work on deep learning model size reduction
and energy measurement. Section 3 introduces our base model
structure and discusses structural exploration and quantization.
Section 4 describes our experiment setup and analyzes memory
usage, energy consumption, and potential environmental
implications. Section 5 provides concluding remarks and
future work.
2. RELATED WORK
Improving meeting inclusiveness has significant financial in-
centives such as more effective meetings and higher employee
arXiv:2210.13334v1 [cs.CL] 24 Oct 2022