Vision Transformer for Adaptive Image
Transmission over MIMO Channels
Haotian Wu, Yulin Shao, Chenghong Bian, Krystian Mikolajczyk, and Deniz Gündüz
Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2BT, UK
Email: {haotian.wu17, y.shao, c.bian22, k.mikolajczyk, d.gunduz}@imperial.ac.uk
Abstract—This paper presents a vision transformer (ViT)
based joint source and channel coding (JSCC) scheme for wireless
image transmission over multiple-input multiple-output (MIMO)
systems, called ViT-MIMO. The proposed ViT-MIMO archi-
tecture, in addition to outperforming separation-based bench-
marks, can flexibly adapt to different channel conditions without
requiring retraining. Specifically, exploiting the self-attention
mechanism of the ViT enables the proposed ViT-MIMO model
to adaptively learn the feature mapping and power allocation
based on the source image and channel conditions. Numerical
experiments show that ViT-MIMO can significantly improve the
transmission quality across a large variety of scenarios, including
varying channel conditions, making it an attractive solution for
emerging semantic communication systems.
Index Terms—Joint source channel coding, vision transformer,
MIMO, image transmission, semantic communications.
I. INTRODUCTION
The design of efficient image communication systems over
wireless channels has recently attracted a lot of interest due
to the increasing number of Internet-of-things (IoT) and edge
intelligence applications [1], [2]. The traditional approach, motivated
by Shannon’s separation theorem, is to design source and channel
coding independently; however, the separation-based approach
is known to be sub-optimal in practice, which becomes partic-
ularly limiting in applications that impose strict latency con-
straints [3]. Despite the known theoretical benefits, designing
practical joint source channel coding (JSCC) schemes has been
an ongoing challenge for many decades. Significant progress
has been made in this direction in recent years, thanks to the
introduction of deep neural networks (DNNs) for the design
of JSCC schemes [4]–[8]. The first deep-learning-based JSCC
(DeepJSCC) scheme for wireless image transmission is presented
in [4], where it is shown to outperform the concatenation
of the state-of-the-art Better Portable Graphics (BPG) image
compression algorithm with LDPC codes. It was later extended to
transmission with adaptive channel bandwidth in [9] and to
the transmission over multipath fading channels in [10] and
[11].
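In such DeepJSCC schemes, a DNN encoder maps the image directly to power-constrained channel symbols, and a DNN decoder reconstructs the image from the noisy received symbols. A minimal sketch of that transmission chain, with untrained random linear maps standing in for the learned encoder and decoder (block sizes, bandwidth ratio, and SNR below are illustrative assumptions, not values from this paper):

```python
import numpy as np

rng = np.random.default_rng(1)

n_source, k_channel = 64, 16   # bandwidth ratio k/n = 0.25 (illustrative)
snr_db = 10.0                  # channel SNR in dB (illustrative)

# Stand-ins for the trained DNN encoder/decoder: random linear maps (assumption).
W_enc = rng.standard_normal((k_channel, n_source)) / np.sqrt(n_source)
W_dec = np.linalg.pinv(W_enc)

x = rng.standard_normal(n_source)                 # source block (e.g., image patch)
z = W_enc @ x                                     # joint source-channel encoding
z = z * np.sqrt(k_channel) / np.linalg.norm(z)    # power constraint: ||z||^2 = k

sigma = np.sqrt(10 ** (-snr_db / 10))             # noise std for unit signal power
y = z + sigma * rng.standard_normal(k_channel)    # AWGN channel

x_hat = W_dec @ y                                 # reconstruction at the receiver
```

The learned versions of `W_enc`/`W_dec` are nonlinear DNNs trained end-to-end through the (differentiable) channel, which is what allows the encoder to trade off compression and error protection jointly.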
To the best of our knowledge, all the existing papers on
DeepJSCC consider single-antenna transmitters and receivers.
While there is a growing literature successfully employing
DNNs for various multiple-input multiple-output (MIMO)
related tasks, such as detection, channel estimation, or channel
state feedback [12]–[16], no previous work has so far applied
DeepJSCC to the more challenging MIMO channels. MIMO
systems are known to boost the throughput and spectral
efficiency of wireless communications, providing significant
gains in capacity and reliability. JSCC over MIMO
channels is studied in [17] from a theoretical perspective. It
is challenging to design a practical JSCC scheme for MIMO
channels, where the model needs to retrieve coupled signals
from different antennas experiencing different channel gains.
A limited number of papers focus on DNN-based end-to-end
MIMO communication schemes. The first autoencoder (AE)-based
end-to-end MIMO communication method is introduced in
[18]. In [19], the authors establish symbol error rate bench-
marks for MIMO channels by evaluating several AE-based
models with channel state information (CSI). A singular-value
decomposition (SVD) based autoencoder is proposed in [20]
to achieve state-of-the-art bit error rates. However, these
MIMO schemes consider only the transmission of bits at a
fixed signal-to-noise ratio (SNR), ignoring the semantic content
of the source signal and lacking adaptability to varying channel conditions.
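The SVD-based design mentioned above relies on a classical fact: with CSI at both transmitter and receiver, precoding with the right singular vectors and combining with the left singular vectors turns the MIMO channel into independent parallel subchannels. A minimal NumPy sketch of this diagonalization (antenna counts and symbols are illustrative assumptions; noise is omitted for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)
nt, nr = 2, 2  # transmit/receive antennas (illustrative)

# Rayleigh-fading MIMO channel matrix
H = (rng.standard_normal((nr, nt)) + 1j * rng.standard_normal((nr, nt))) / np.sqrt(2)

# SVD: H = U @ diag(s) @ Vh
U, s, Vh = np.linalg.svd(H)

# Symbols to send, one per eigen-subchannel
x = np.array([1 + 1j, -1 + 1j]) / np.sqrt(2)

# Precode with V = Vh^H, pass through the channel, combine with U^H
y = U.conj().T @ (H @ (Vh.conj().T @ x))

# y_i = s_i * x_i: the MIMO link is diagonalized into parallel subchannels
```

Each subchannel gain `s[i]` then determines how much power (or how many source features) a scheme can usefully allocate to that subchannel, which is the kind of mapping the proposed ViT-MIMO learns implicitly rather than by explicit diagonalization.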
In this paper, we design an end-to-end unified DeepJSCC
scheme for MIMO image transmission. In particular, we
introduce a vision transformer (ViT) based DeepJSCC scheme
for MIMO systems with CSI, called ViT-MIMO. Inspired
by the success of the attention mechanism in the design of
flexible communication schemes [11], [21]–[23], we leverage
the self-attention mechanism of the ViT in wireless image
transmission. Specifically, we represent the channel conditions
with a channel heatmap, and adapt the JSCC encoding and
decoding parameters according to this heatmap. Our method
can learn global attention between the source image and
the channel conditions in all the intermediate layers of the
DeepJSCC encoder and decoder. Intuitively, we expect this
design to simultaneously learn feature mapping and power
allocation based on the source semantics and the channel
conditions. Our main contributions can be listed as follows:
• To the best of the authors’ knowledge, ViT-MIMO
is the first DeepJSCC-enabled MIMO communication
system for image transmission, in which a ViT is designed
to exploit the contextual semantic features of the image
together with the CSI in a self-attention fashion.
• Numerical results show that our ViT-MIMO model sig-
nificantly improves the transmission quality over a wide
range of channel conditions and bandwidth ratios, com-
pared with traditional separate source and channel
coding schemes adopting BPG image compression algo-
arXiv:2210.15347v1 [cs.IT] 27 Oct 2022