A Novel Frame Structure for Cloud-Based Audio-Visual Speech Enhancement in Multimodal Hearing-aids
Abhijeet Bishnu, Ankit Gupta†, Mandar Gogate‡, Kia Dashtipour‡, Ahsan Adeel§, Amir Hussain‡,
Mathini Sellathurai†, and Tharmalingam Ratnarajah
School of Engineering, University of Edinburgh, Edinburgh, United Kingdom
Email: {abishnu, t.ratnarajah}@ed.ac.uk
†School of Engineering & Physical Sciences, Heriot-Watt University, Edinburgh, United Kingdom
Email: {ankit.gupta, m.sellathurai}@hwu.ac.uk
‡School of Computing, Edinburgh Napier University, Edinburgh, United Kingdom
Email: {m.gogate, k.dashtipour, a.hussain}@napier.ac.uk
§School of Mathematics & Computer Science, University of Wolverhampton, Wolverhampton, United Kingdom
Email: a.adeel@wlv.ac.uk
Abstract—In this paper, we design a first-of-its-kind transceiver (PHY layer) prototype for cloud-based audio-visual (AV) speech enhancement (SE) that complies with the high data rate and low latency requirements of future multimodal hearing assistive technology. The design must meet multiple challenging constraints, including uplink/downlink communication, transmission and signal-processing delay, and real-time AV SE model processing. The transceiver includes device detection, frame detection, frequency offset estimation, and channel estimation capabilities. We develop both uplink (hearing aid to the cloud) and downlink (cloud to the hearing aid) frame structures based on the data rate and latency requirements. Due to the varying nature of the uplink information (audio and lip-reading), the uplink channel supports a frame structure with multiple data rates, while the downlink channel has a fixed-data-rate frame structure. In addition, we evaluate the latency of the different PHY layer blocks of the transceiver for the developed frame structures using LabVIEW NXG, which can be used with software-defined radios (such as the Universal Software Radio Peripheral) for real-time demonstration scenarios.
Index Terms—Audio-Visual Speech Enhancement, Hearing
Technology, Downlink, Frame structure, Physical layer, Uplink.
I. INTRODUCTION
Hearing impairment is one of the major public health issues, affecting more than 20% of the global population. Hearing aids are the most widely used devices for improving the intelligibility of speech in noise for hearing-impaired listeners. However, even sophisticated hearing aids that use state-of-the-art multi-channel audio-only speech enhancement (SE) algorithms pose significant problems for people with hearing loss, as these listening devices often amplify sounds but do not restore speech intelligibility in busy social situations [1]. To address this issue, researchers have proposed audio-visual (AV) SE algorithms [2], [3] that exploit the multi-modal nature of speech for more robust speech enhancement. In addition, various machine learning (ML) based SE methods have been proposed, owing to their ability to surpass conventional SE algorithms. These methods use ML algorithms to reconstruct the clean audio signal from the noisy audio signal, and include sparse coding [4], robust principal component analysis [5], visually derived Wiener filtering [6], [7], non-negative matrix factorisation [8], and deep neural networks [9], [10]. However, despite significant research in the area of AV SE, deploying real-time processing models on a small electronic device such as a hearing aid remains a formidable technical challenge. Therefore, we propose a cloud-based framework that exploits the power of 5G and cloud processing infrastructure for real-time audio-visual speech enhancement.
A robust transceiver is needed to send the raw or pre-processed data, comprising the AV signal, from the hearing aid device to the cloud and to receive the clean signal back from the cloud. The transceiver should meet the latency and data rate requirements for synchronization of the lip movements (visual) with the voice (audio). To meet the stringent requirements of hearing aids, we develop both uplink (hearing aid to cloud) and downlink (cloud to hearing aid) transceivers; in this paper, our focus is on the physical layer of the proposed transceiver. Conventionally, the downlink channel has a higher data rate than the uplink channel [11]. In the hearing aid scenario, however, the uplink channel has the higher data rate because it carries the AV signal, while the downlink channel has a low data rate because it carries only the clean audio signal. Thus, the uplink channel supports a varying data rate: a low rate for the audio-only signal and a high rate for the AV signal. Transmission and reception of data are frame-based; each frame contains a synchronization signal, a reference signal, a control channel, and a shared channel. The synchronization signal is used for timing and frequency synchronization, the reference signal for estimating the wireless channel (between hearing aid and cloud), the control channel for transmitting downlink or uplink control information, and the shared channel for the downlink or uplink payload. A minimal sketch of this frame composition is given below.
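To make this composition concrete, the following minimal Python sketch models the four per-frame fields and the uplink rate selection described above; all names, types, and rates are illustrative placeholders rather than the paper's actual PHY-layer specification.

```python
from dataclasses import dataclass

# Illustrative model of the per-frame fields described above (placeholder
# names and types; not the actual PHY-layer bit layout).
@dataclass
class Frame:
    sync_signal: bytes       # timing and carrier-frequency synchronization
    reference_signal: bytes  # estimation of the hearing aid <-> cloud channel
    control_channel: bytes   # downlink/uplink control information
    shared_channel: bytes    # payload: AV data (uplink) or clean audio (downlink)

def uplink_rate_bps(av_mode: bool, audio_rate_bps: int, av_rate_bps: int) -> int:
    """Uplink rate selection: low rate for audio-only, high rate when video is included."""
    return av_rate_bps if av_mode else audio_rate_bps
```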
Fig. 1. Model of cloud-based audio-visual speech enhancement hearing aid
The rest of the paper is organised as follows. Section II describes the proposed model of cloud-based AV SE, the source encoding, and the proposed ML algorithm. Section III describes the downlink and uplink frame structures along with the proposed downlink and uplink control information. Section IV evaluates the latency of the various proposed blocks of the uplink and downlink frame structures, and Section V concludes the paper.
II. MODEL OF CLOUD-BASED HEARING AID
Fig. 1 shows the proposed model of a cloud-based AV speech enhancement hearing aid. In the proposed model, the hearing aid device, equipped with a small camera, captures the AV information, which is compressed by a source encoder. The compressed information is mapped into a frame structure for transmission over the wireless channel to the cloud. On the receiver side (cloud), the received AV signal is recovered from the frame structure and passed through the source decoder. To clean the AV signal, a machine learning algorithm is applied to the decoded AV signal, and the resulting clean signal is transmitted from the cloud back to the hearing aid device; this flow is sketched below.
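The following minimal, self-contained Python sketch traces the end-to-end flow of Fig. 1 with stub stages; every function here is an illustrative placeholder (a real system would use an audio codec such as OPUS/EVS and a trained AV SE network), not the paper's implementation.

```python
import numpy as np

def source_encode(signal: np.ndarray) -> bytes:
    # Placeholder "codec": raw float32 serialization stands in for OPUS/EVS.
    return signal.astype(np.float32).tobytes()

def source_decode(bits: bytes) -> np.ndarray:
    return np.frombuffer(bits, dtype=np.float32)

def av_se_model(noisy: np.ndarray) -> np.ndarray:
    # Placeholder for the cloud-side ML enhancement model.
    return noisy

def hearing_aid_uplink(av_capture: np.ndarray) -> bytes:
    return source_encode(av_capture)      # then framed and transmitted uplink

def cloud_processing(rx_bits: bytes) -> bytes:
    noisy = source_decode(rx_bits)        # frame decoding + source decoding
    clean = av_se_model(noisy)            # AV speech enhancement in the cloud
    return source_encode(clean)           # re-encoded for the downlink

def hearing_aid_downlink(rx_bits: bytes) -> np.ndarray:
    return source_decode(rx_bits)         # clean audio played to the listener

clean = hearing_aid_downlink(cloud_processing(hearing_aid_uplink(np.random.randn(160))))
```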
The next two subsections discuss the source encoder and the ML algorithm for AV speech enhancement.
A. Designing Low-Latency Audio Codecs
The wireless transceiver design employs a source coding technique for efficient data compression and decompression. Thus, efficient source codes, which map audio signals to small binary codewords and are referred to as audio codecs, need to be designed for hearing-aid applications. Audio codecs can be broadly classified as waveform codecs and parametric codecs. Waveform codecs aim to create codewords that allow faithful reconstruction on a sample-by-sample basis, making no prior assumptions about the input audio signal. Consequently, for general audio, waveform codecs produce very high-quality output at medium-to-high bit rates but suffer from coding artefacts at low bit rates. Parametric codecs aim to create codewords that yield a reconstruction perceptually similar to the original audio, making strong prior assumptions about the input audio signal. Thus, parametric codecs perform well at low bit rates.
Currently, OPUS [12] and enhanced voice services (EVS) [13] are the two most widely employed state-of-the-art audio codecs. The OPUS codec [12] is standardized by the Internet Engineering Task Force (IETF) and combines technologies from Skype's SILK codec and Xiph.Org's CELT codec. It provides efficient compression for the transmission of interactive speech/music over the Internet and is employed, for example, by YouTube. The EVS codec [13] is standardized by the 3GPP and combines traditional coding tools such as LPC, MDCT, and CELP. EVS is a next-generation version of the AMR-WB mobile HD voice codec and provides a highly efficient means of maintaining quality in wireless communication systems. Both the OPUS and EVS codecs provide high coding efficiency across varying audio signals, sampling rates, and bit rates, while enabling low-latency, real-time audio communication. We summarize the characteristics of the OPUS and EVS codecs in Table I.
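To see why frame-based codecs incur tens of milliseconds of delay, the following back-of-the-envelope sketch computes algorithmic latency as frame duration plus encoder look-ahead; the numeric figures are our own nominal assumptions for illustration, not values taken from Table I.

```python
# Algorithmic codec latency ~= frame duration + encoder look-ahead
# (decoding typically adds comparatively little).
def codec_latency_ms(frame_ms: float, lookahead_ms: float) -> float:
    return frame_ms + lookahead_ms

# Assumed nominal figures: OPUS at its default 20 ms frame with a 6.5 ms
# look-ahead; EVS at a 20 ms frame with roughly 12 ms of additional delay.
print(f"OPUS: {codec_latency_ms(20.0, 6.5):.1f} ms")   # ~26.5 ms
print(f"EVS:  {codec_latency_ms(20.0, 12.0):.1f} ms")  # ~32 ms
```

Either way, the result is on the order of the "around 30 ms" figure cited below, which motivates the search for lower-latency alternatives.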
However, neither the OPUS nor the EVS codec performs well at very low bit rates. Thus, data-driven deep learning (DL) approaches have emerged as a promising solution for efficiently compressing and enhancing audio signals. For example, SoundStream, a DL-based audio codec, was recently proposed by Google [14]. SoundStream is an end-to-end learned, autoencoder-based framework that provides efficient audio signal compression at low bit rates while maintaining latency similar to EVS/OPUS. Although the latency of the state-of-the-art audio codecs (OPUS, EVS, SoundStream) is low (around 30 ms), it remains far too large for the hearing-aid scenario, which targets a latency of 5-10 ms for end-to-end transmission, reception,