A Novel Frame Structure for Cloud-Based
Audio-Visual Speech Enhancement in Multimodal
Hearing-aids
Abhijeet Bishnu∗, Ankit Gupta†, Mandar Gogate‡, Kia Dashtipour‡, Ahsan Adeel§, Amir Hussain‡,
Mathini Sellathurai†, and Tharmalingam Ratnarajah∗
∗School of Engineering, University of Edinburgh, Edinburgh, United Kingdom
Email: {abishnu,t.ratnarajah}@ed.ac.uk
†School of Engineering & Physical Sciences, Heriot-Watt University, Edinburgh, United Kingdom
Email: {ankit.gupta,m.sellathurai}@hwu.ac.uk
‡School of Computing, Edinburgh Napier University, Edinburgh, United Kingdom
Email: {m.gogate, k.dashtipour, a.hussain}@napier.ac.uk
§School of Mathematics & Computer Science, University of Wolverhampton, Wolverhampton, United Kingdom
Email: a.adeel@wlv.ac.uk
Abstract—In this paper, we design a first-of-its-kind transceiver (PHY layer) prototype for cloud-based audio-visual (AV) speech enhancement (SE) that complies with the high data rate and low latency requirements of future multimodal hearing-assistive technology. The innovative design must meet multiple challenging constraints, including up/downlink communications, transmission and signal-processing delay, and real-time AV SE model processing. The transceiver includes device detection, frame detection, frequency offset estimation, and channel estimation capabilities. We develop both uplink (hearing aid to the cloud) and downlink (cloud to the hearing aid) frame structures based on the data rate and latency requirements. Due to the varying nature of the uplink information (audio and lip-reading), the uplink channel supports multiple data-rate frame structures, while the downlink channel has a fixed data-rate frame structure. In addition, we evaluate the latency of the different PHY-layer blocks of the transceiver for the developed frame structures using LabVIEW NXG. This can be used with a software-defined radio (such as the Universal Software Radio Peripheral) for real-time demonstration scenarios.
Index Terms—Audio-Visual Speech Enhancement, Hearing
Technology, Downlink, Frame structure, Physical layer, Uplink.
I. INTRODUCTION
Hearing impairment is one of the major public health
issues affecting more than 20% of the global population.
Hearing aids are the most widely used devices to improve
the intelligibility of speech in noise for hearing-impaired listeners. However, even sophisticated hearing aids that use state-of-the-art multi-channel audio-only speech enhancement (SE) algorithms pose significant problems for people with hearing loss, as these listening devices often amplify sounds but do not restore speech intelligibility in busy social situations [1]. To address this issue,
researchers have proposed audio-visual (AV) SE algorithms [2]
[3] that exploit the multi-modal nature of speech for more
robust speech enhancement. In addition, various machine learning (ML) based SE methods have been proposed due to their ability to surpass conventional SE algorithms. These methods use an ML algorithm to reconstruct the clean audio signal from the noisy audio signal. ML-based approaches include sparse coding [4], robust principal component analysis [5], visually derived Wiener filters [6] [7], non-negative matrix factorisation [8] and deep neural networks [9] [10]. However, despite significant research in the area of AV SE, deploying real-time processing models on a small electronic device such as a hearing aid remains a formidable technical challenge.
Therefore, we propose a cloud-based framework that exploits
the power of 5G and cloud processing infrastructure for real-
time audio-visual speech enhancement.
We need a robust transceiver to send the raw or pre-processed data, which includes the AV signal, from the hearing-aid device to the cloud and to receive the clean signal back from the cloud. The transceiver should meet the latency and data-rate requirements for synchronization of the lip movement (visual) and the voice. To meet the stringent requirements of hearing aids, we developed both uplink (hearing aid to cloud) and downlink (cloud to hearing aid) transceivers, and in this paper our focus is on the physical layer of the proposed transceiver. Conventionally, the downlink channel has a higher data rate than the uplink channel [11]. In the hearing-aid scenario, however, the uplink channel has the higher data rate because it carries the AV signal, while the downlink channel has a lower data rate because it carries only the clean audio signal. Thus, the uplink channel supports a varying data rate: a low data rate for the audio-only signal and a high data rate for the AV signal. Transmission and reception are organised in frames, each of which contains a synchronization signal, a reference signal, a control channel and a shared channel. The synchronization signal is used for timing and frequency synchronization; the reference signal is used for estimating the wireless channel between the hearing aid and the cloud; the control channel carries the downlink or uplink control information; and the shared channel carries the downlink or uplink payload (data).
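The four frame components and the asymmetric data-rate rule above can be sketched as follows. This is a minimal illustration, not the paper's actual frame layout: the field types, the `shared_channel_symbols` helper, and the symbol counts (120 and 480) are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum

class Link(Enum):
    UPLINK = "hearing aid -> cloud"
    DOWNLINK = "cloud -> hearing aid"

@dataclass
class Frame:
    """One PHY-layer frame with the four components described above."""
    sync_signal: list       # timing and frequency synchronization
    reference_signal: list  # wireless channel estimation
    control_channel: list   # downlink/uplink control information
    shared_channel: list    # payload: AV data (uplink) or clean audio (downlink)

def shared_channel_symbols(link: Link, has_video: bool,
                           low_rate: int = 120, high_rate: int = 480) -> int:
    """Pick the payload size per frame (illustrative symbol counts).

    The uplink supports a varying data rate (audio-only vs. audio-visual),
    while the downlink rate is fixed (clean audio only).
    """
    if link is Link.DOWNLINK:
        return low_rate                       # fixed downlink data rate
    return high_rate if has_video else low_rate  # varying uplink data rate
```

A sketch like this captures the design choice that the payload capacity, not the frame layout, is what changes between the audio-only and AV uplink modes.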
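The paper does not specify how the synchronization signal is detected; assuming a standard correlation-based detector, timing synchronization can be sketched as sliding the known sequence across the received samples and picking the lag with the strongest match. The sequence values below are illustrative.

```python
def detect_frame_start(received, sync_seq):
    """Return the sample index where the known sync sequence best matches."""
    best_lag, best_metric = 0, float("-inf")
    for lag in range(len(received) - len(sync_seq) + 1):
        # Correlate the sync sequence against the window starting at `lag`.
        metric = sum(received[lag + i] * sync_seq[i]
                     for i in range(len(sync_seq)))
        if metric > best_metric:
            best_lag, best_metric = lag, metric
    return best_lag

sync = [1, -1, 1, 1, -1]
rx = [0, 0, 0] + sync + [0, 0]       # frame begins at sample index 3
print(detect_frame_start(rx, sync))  # -> 3
```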
arXiv:2210.13127v1 [eess.AS] 24 Oct 2022