A Novel Frame Structure for Cloud-Based
Audio-Visual Speech Enhancement in Multimodal
Hearing-aids
Abhijeet Bishnu∗, Ankit Gupta†, Mandar Gogate‡, Kia Dashtipour‡, Ahsan Adeel§, Amir Hussain‡,
Mathini Sellathurai†, and Tharmalingam Ratnarajah∗
∗School of Engineering, University of Edinburgh, Edinburgh, United Kingdom
Email: {abishnu,t.ratnarajah}@ed.ac.uk
†School of Engineering & Physical Sciences, Heriot-Watt University, Edinburgh, United Kingdom
Email: {ankit.gupta,m.sellathurai}@hwu.ac.uk
‡School of Computing, Edinburgh Napier University, Edinburgh, United Kingdom
Email: {m.gogate, k.dashtipour, a.hussain}@napier.ac.uk
§School of Mathematics & Computer Science, University of Wolverhampton, Wolverhampton, United Kingdom
Email: a.adeel@wlv.ac.uk
Abstract—In this paper, we design a first-of-its-kind transceiver (PHY layer) prototype for cloud-based audio-visual (AV) speech enhancement (SE) that complies with the high data rate and low latency requirements of future multimodal hearing-assistive technology. The innovative design must meet multiple challenging constraints, including up/downlink communications, transmission and signal-processing delay, and real-time AV SE model processing. The transceiver includes device detection, frame detection, frequency offset estimation, and channel estimation capabilities. We develop both uplink (hearing aid to the cloud) and downlink (cloud to the hearing aid) frame structures based on the data rate and latency requirements. Due to the varying nature of the uplink information (audio and lip-reading), the uplink channel supports multiple data-rate frame structures, while the downlink channel has a fixed data-rate frame structure. In addition, we evaluate the latency of the different PHY-layer blocks of the transceiver for the developed frame structures using LabVIEW NXG. This can be used with a software-defined radio (such as the Universal Software Radio Peripheral) for real-time demonstration scenarios.
Index Terms—Audio-Visual Speech Enhancement, Hearing
Technology, Downlink, Frame structure, Physical layer, Uplink.
I. INTRODUCTION
Hearing impairment is one of the major public health
issues affecting more than 20% of the global population.
Hearing aids are the most widely used devices to improve
the intelligibility of speech in noise for hearing-impaired listeners. However, even sophisticated hearing aids that use state-of-the-art multi-channel audio-only speech enhancement (SE) algorithms pose significant problems for people with hearing loss, as these listening devices often amplify sounds but do not restore speech intelligibility in busy social situations [1]. To address this issue,
researchers have proposed audio-visual (AV) SE algorithms [2]
[3] that exploit the multi-modal nature of speech for more
robust speech enhancement. In addition, various machine learning (ML) based SE methods have been proposed due to their ability to surpass conventional SE algorithms. These methods use an ML algorithm to reconstruct the clean audio signal from the noisy audio signal. ML-based approaches include sparse coding [4], robust principal component analysis [5], visually derived Wiener filters [6] [7], non-negative matrix factorisation [8] and deep neural networks [9] [10]. However, despite significant research in the area of AV SE, deploying real-time processing models on a small electronic device such as a hearing aid remains a formidable technical challenge.
Therefore, we propose a cloud-based framework that exploits
the power of 5G and cloud processing infrastructure for real-
time audio-visual speech enhancement.
We need a robust transceiver to send the raw or pre-processed data, which includes the AV signal, from the hearing-aid device to the cloud and to receive the clean signal back from the cloud. The transceiver should meet the latency and data-rate requirements for synchronization of the lip movement (visual) and the voice. To meet the stringent requirements of hearing aids, we developed both uplink (hearing aid to cloud) and downlink (cloud to hearing aid) transceivers, and in this paper our focus is on the physical layer of the proposed transceiver. Conventionally, the downlink channel has a higher data rate than the uplink channel [11]. In the hearing-aid scenario, however, the uplink channel has the higher data rate because it carries the AV signal, while the downlink channel has a lower data rate because it carries only the clean audio signal. Thus, the uplink channel supports a varying data rate: a low data rate for the audio-only signal and a high data rate for the AV signal. Transmission and reception are organised in frames, each of which contains a synchronization signal, a reference signal, a control channel and a shared channel. The synchronization signal is used for timing and frequency synchronization; the reference signal is used for estimating the wireless channel between the hearing aid and the cloud; the control channel carries the downlink or uplink control information; and the shared channel carries the downlink or uplink payload (data).
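The four frame components and the asymmetric data-rate rule above can be sketched as follows. This is a minimal illustration, not the paper's actual frame layout: the field types, the `shared_channel_symbols` helper, and the symbol counts (120 and 480) are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum

class Link(Enum):
    UPLINK = "hearing aid -> cloud"
    DOWNLINK = "cloud -> hearing aid"

@dataclass
class Frame:
    """One PHY-layer frame with the four components described above."""
    sync_signal: list       # timing and frequency synchronization
    reference_signal: list  # wireless channel estimation
    control_channel: list   # downlink/uplink control information
    shared_channel: list    # payload: AV data (uplink) or clean audio (downlink)

def shared_channel_symbols(link: Link, has_video: bool,
                           low_rate: int = 120, high_rate: int = 480) -> int:
    """Pick the payload size per frame (illustrative symbol counts).

    The uplink supports a varying data rate (audio-only vs. audio-visual),
    while the downlink rate is fixed (clean audio only).
    """
    if link is Link.DOWNLINK:
        return low_rate                       # fixed downlink data rate
    return high_rate if has_video else low_rate  # varying uplink data rate
```

A sketch like this captures the design choice that the payload capacity, not the frame layout, is what changes between the audio-only and AV uplink modes.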
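The paper does not specify how the synchronization signal is detected; assuming a standard correlation-based detector, timing synchronization can be sketched as sliding the known sequence across the received samples and picking the lag with the strongest match. The sequence values below are illustrative.

```python
def detect_frame_start(received, sync_seq):
    """Return the sample index where the known sync sequence best matches."""
    best_lag, best_metric = 0, float("-inf")
    for lag in range(len(received) - len(sync_seq) + 1):
        # Correlate the sync sequence against the window starting at `lag`.
        metric = sum(received[lag + i] * sync_seq[i]
                     for i in range(len(sync_seq)))
        if metric > best_metric:
            best_lag, best_metric = lag, metric
    return best_lag

sync = [1, -1, 1, 1, -1]
rx = [0, 0, 0] + sync + [0, 0]       # frame begins at sample index 3
print(detect_frame_start(rx, sync))  # -> 3
```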
arXiv:2210.13127v1 [eess.AS] 24 Oct 2022