
REAL-TIME SPEECH INTERRUPTION ANALYSIS: FROM CLOUD TO CLIENT DEPLOYMENT
Quchen Fu†, Szu-Wei Fu‡, Yaran Fan‡, Yu Wu‡, Zhuo Chen‡, Jayant Gupchup‡1, Ross Cutler‡
†Dept. of Computer Science, Vanderbilt University
‡Microsoft Corporation
ABSTRACT
Meetings are an essential form of communication for all types
of organizations, and remote collaboration systems have been
much more widely used since the COVID-19 pandemic. One
major issue with remote meetings is that it is challenging
for remote participants to interrupt and speak. We have re-
cently developed the first speech interruption analysis model
WavLM SI, which detects failed speech interruptions, shows
very promising performance, and is being deployed in the
cloud. To deliver this feature in a more cost-efficient and
environmentally friendly way, we reduced the model's complexity
and size to ship WavLM SI on client devices. In
this paper, we first describe how we successfully improved the
True Positive Rate (TPR) at a 1% False Positive Rate (FPR)
from 50.9% to 68.3% for the failed speech interruption detec-
tion model by training on a larger dataset and fine-tuning. We
then shrank the model size from 222.7 MB to 9.3 MB with
an acceptable loss in accuracy and reduced the complexity
from 31.2 GMACS (Giga Multiply-Accumulate Operations
per Second) to 4.3 GMACS. We also estimated the environmental
impact of the complexity reduction, which can serve as a
general guideline for large Transformer-based models, making
those models more accessible with less computational overhead.
Index Terms— Semi-Supervised Learning, Model Size Reduction, Speech Interruption Detection
1. INTRODUCTION
The inability of virtual meeting participants to interrupt and
speak has been identified as the largest impediment to more
inclusive online meetings [1]. Remote participants often
attempt to join a discussion but cannot get the floor while
other speakers are talking. We refer to such attempts as failed
interruptions.
Too many failed interruptions may alienate participants
and result in a less effective and inclusive meeting environ-
ment, and further impact the overall working environment and
employee retention at organizations [2]. Our previous work [2]
1
Work performed while at Microsoft, now affiliated with Uber technolo-
gies.
has explored the feasibility of mitigating this issue by creating
a failed interruption detection model that prompts the failed
interrupter to raise their virtual hands and gain attention, a
feature that is helpful to improve meeting inclusiveness but
rarely used.
The WavLM-based Speech Interruption Detection model
(WavLM SI) [2] was deployed in the Azure cloud and has
proven to be a useful feature. Client integration allows us
to reach a much wider range of customers without the cost
of cloud deployment, as well as lower environmental impact
since it does not have the overhead of a dedicated service.
However, the original model is based on a pre-trained speech
model, which is large and computationally expensive, and
therefore can not be run on client devices. To be deployed in
the client, the model needs to be small in size to meet memory
constraints, computationally lightweight to incorporate less
capable processors, and energy efficient to save battery life. A
demo video of WaveML SI is available here.
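One reason the memory constraint is attainable is weight quantization: storing weights as int8 with a per-tensor scale cuts storage roughly 4× relative to float32, before any pruning. The sketch below is a generic illustration of symmetric int8 quantization in plain NumPy, not the WavLM SI implementation, and shows that the reconstruction error stays bounded by one quantization step.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to float32 approximations."""
    return q.astype(np.float32) * scale

# A stand-in weight matrix (not real model weights).
w = np.random.default_rng(0).normal(0.0, 0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes // q.nbytes)  # 4: int8 takes a quarter of float32 storage
# Rounding error is at most half a quantization step (0.5 * scale).
print(float(np.abs(w - dequantize(q, scale)).max()) <= scale)
```

In practice, shipped models combine such weight quantization with pruning and per-channel scales; this per-tensor variant is the simplest form of the idea.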
This paper provides three contributions to the study of
speech interruption analysis. First, we increased the state-of-
the-art performance of the speech interruption detection task
from 50.9% to 68.3% TPR at a fixed 1% FPR. Second, we con-
ducted comprehensive work on model structure engineering,
including the bottleneck analysis for client integration, and cre-
ated a customized model under strict computation and memory
constraints. Third, we showed how pruning and quantization
can reduce the model size by 23× and the complexity by 9×
with acceptable performance; this trade-off between
size/complexity and accuracy is smooth and can accommodate a
wide variety of client devices.
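The operating point used throughout, TPR at a fixed 1% FPR, is obtained by choosing the score threshold that admits at most 1% of negative examples and measuring the fraction of positives above it. A minimal NumPy sketch on synthetic scores (not the paper's data) illustrates the computation:

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=0.01):
    """True-positive rate at the threshold whose FPR <= target_fpr."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    neg = np.sort(scores[~labels])[::-1]       # negative scores, descending
    k = int(np.floor(target_fpr * neg.size))   # negatives allowed above threshold
    thresh = neg[k] if k < neg.size else -np.inf
    return float(np.mean(scores[labels] > thresh))

# Synthetic scores: positives shifted above negatives.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)])
labels = np.concatenate([np.ones(1000, dtype=bool), np.zeros(1000, dtype=bool)])
print(tpr_at_fpr(scores, labels))
```

The same quantity can be read off a full ROC curve; computing it directly, as here, makes the fixed-FPR constraint explicit.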
The remainder of this paper is organized as follows: Section 2
summarizes prior work on deep learning model size reduction
and energy measurement. Section 3 introduces our base model
structure and discusses structural exploration and quantization.
Section 4 describes our experiment setup and analyzes memory
usage, energy consumption, and potential environmental
implications. Section 5 provides concluding remarks and
future work.
2. RELATED WORK
Improving meeting inclusiveness has significant financial in-
centives such as more effective meetings and higher employee
arXiv:2210.13334v1 [cs.CL] 24 Oct 2022