data $x$ is transformed by an analysis function $g_a(x;\theta)$ to generate a latent representation $\tilde{y}$ in the continuous domain. Next, $\tilde{y}$ is quantized to the latent representation $y$ in the discrete domain. Then, the arithmetic encoder encodes $y$ into a bitstream using the estimated distribution provided by the probability model. The encoder operation can be represented by the function
$y = Q(g_a(x;\theta))$.  (1)
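As a concrete reading of Eq. 1, the following is a minimal PyTorch sketch of the encoder-side operation. The convolutional architecture of AnalysisTransform is a hypothetical placeholder rather than the network of any cited work, and the arithmetic-coding step is omitted.

import torch
import torch.nn as nn

class AnalysisTransform(nn.Module):
    # Hypothetical g_a(x; theta): maps an image to a continuous latent.
    def __init__(self, channels=192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, x):
        return self.net(x)

g_a = AnalysisTransform()
x = torch.rand(1, 3, 256, 256)   # input image x
y_tilde = g_a(x)                 # continuous latent \tilde{y} = g_a(x; theta)
y = torch.round(y_tilde)         # quantization Q(.); y is then entropy-coded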
At the decoder side, an arithmetic decoder reconstructs $y$ from the bitstream with the help of the same probability model that is used at the encoder side. Next, $y$ is dequantized and a synthesis function $g_s(y;\phi)$ is used to generate $\hat{x}$ as the reconstructed input data. $g_a(\cdot;\theta)$ and $g_s(\cdot;\phi)$ are implemented using deep neural networks with parameters $\theta$ and $\phi$, respectively. The codec is trained by optimizing the RD loss function defined by
$L = R(y) + \lambda \cdot D(x,\hat{x})$  (2)
$\;\; = \mathbb{E}_{x \sim p(x)}[-\log p(y)] + \lambda \cdot \mathbb{E}_{x \sim p(x)}[d(x,\hat{x})]$.  (3)
In Eq. 2, $R(y)$ is the rate loss measuring the expected number of bits needed to encode $y$, $D(x,\hat{x})$ is the expected reconstruction loss measuring the quality of the reconstructed image, and $\lambda$ is the Lagrange multiplier that adjusts the weighting of the two loss terms to achieve different compression rates. In Eq. 3, $p(y)$, also known as the prior distribution, is the probability distribution of the latent representation $y$, and $d(x,\hat{x})$ is the distance function measuring the quality of the reconstructed data $\hat{x}$, where MSE or MS-SSIM is normally used as the reconstruction loss [4, 25]. The RD loss function is thus a weighted sum of the expected bitstream length and the reconstruction loss.
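To make Eqs. 2 and 3 concrete, the following is a minimal PyTorch sketch of the RD objective for one training batch. The callables analysis, synthesis, and likelihood are hypothetical placeholders for $g_a$, $g_s$, and the probability model, and hard rounding stands in for quantization (in practice a differentiable surrogate is typically used during training).

import torch
import torch.nn.functional as F

def rd_loss(x, analysis, synthesis, likelihood, lam):
    # Sketch of L = R(y) + lambda * D(x, x_hat) from Eqs. 2-3.
    y_tilde = analysis(x)              # \tilde{y} = g_a(x; theta)
    y = torch.round(y_tilde)           # quantization (surrogate needed for gradients)
    p_y = likelihood(y)                # per-element probability p(y) from the model
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate = (-torch.log2(p_y)).sum() / num_pixels   # R(y) in bits per pixel
    x_hat = synthesis(y)               # \hat{x} = g_s(y; phi)
    distortion = F.mse_loss(x_hat, x)  # d(x, \hat{x}), here MSE
    return rate + lam * distortion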
The rate loss $R(y)$ is the expected length of the bitstream needed to encode the latent representation $y$ when the input data $x$ is compressed. Let $q(y)$ be the true distribution of $y$. Although $y$ is deterministic given the random variable $x$ according to Eq. 1, $q(y)$ is still unknown since the true distribution of $x$ is unknown. To tackle this problem, an LIC codec uses a variational distribution $p(y)$, either a parametric model from a known distribution family [8, 9, 11, 18, 20, 22, 26, 27] or a non-parametric model [7], to replace $q(y)$ in calculating the rate loss. The rate loss is then calculated as the cross-entropy between $q(y)$ and $p(y)$, that is,
$R(y) = H(q(y), p(y))$  (4)
$\;\; = \mathbb{E}_{y \sim q(y)}[-\log p(y)]$  (5)
$\;\; = H(q(y)) + D_{\mathrm{KL}}(q(y)\,\|\,p(y))$.  (6)
Since $H(q(y))$ does not depend on $p(y)$, minimizing the rate loss also drives the variational distribution $p(y)$ toward the true distribution $q(y)$ by reducing the KL divergence.
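In practice $q(y)$ is unknown, so the expectation in Eq. 5 is typically approximated by a sample average over training images (a standard Monte-Carlo estimate, stated here as an assumption rather than the exact formulation of any cited system):
$R(y) \approx -\frac{1}{N}\sum_{n=1}^{N} \log p\big(y^{(n)}\big), \qquad y^{(n)} = Q\big(g_a(x^{(n)};\theta)\big),$
where $x^{(1)},\dots,x^{(N)}$ are samples drawn from the training data.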
In [7], the authors model the prior distribution $p(y)$ as fully factorized, where the distribution of each element is modeled non-parametrically using piece-wise linear functions. This simple model does not capture spatial dependencies in the latent representation. In [18], the authors introduced the scale hyperprior model, where the elements of the latent representation are modeled as independent zero-mean Gaussian distributions and the variance of each element is derived from side information $z$. $z$ is modeled using a different distribution model and transmitted separately. The loss function becomes
$L = R(y) + R(z) + \lambda \cdot D(x,\hat{x})$.  (7)
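To illustrate how the scale hyperprior contributes to the rate terms in Eq. 7, the sketch below evaluates the probability of each quantized element of $y$ under a zero-mean Gaussian whose scale is predicted from the side information $z$. The callables hyper_synthesis and z_likelihood are hypothetical placeholders, and integrating the Gaussian over the quantization bin is one common way to obtain a discrete probability; this is a sketch under those assumptions, not the exact formulation of [18].

import torch
from torch.distributions import Normal

def hyperprior_rate(y, z, hyper_synthesis, z_likelihood):
    # Sketch of R(y) + R(z) for a scale-hyperprior model (Eq. 7); y and z are quantized.
    sigma = hyper_synthesis(z).clamp(min=1e-6)     # per-element scale for y, derived from z
    gauss = Normal(torch.zeros_like(sigma), sigma)
    # probability mass of the quantization bin [y - 0.5, y + 0.5]
    p_y = (gauss.cdf(y + 0.5) - gauss.cdf(y - 0.5)).clamp(min=1e-9)
    rate_y = (-torch.log2(p_y)).sum()
    p_z = z_likelihood(z).clamp(min=1e-9)          # factorized model for the side information
    rate_z = (-torch.log2(p_z)).sum()
    return rate_y + rate_z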
In [11], the authors improved the scale hyperprior model in two aspects. First, both the mean and the variance of the Gaussian model are derived from the side information $z$, which is referred to as the mean-scale hyperprior model. Second, a context model is introduced to further exploit the spatial dependencies among the elements. The context model uses the elements that have already been decoded to improve the model accuracy for the current element. A similar technique was used in image generative models such as PixelCNN [28–30]. Many recent LIC systems are based on the hyperprior architecture and the context model. The authors in [8] use a mixture of Gaussian distributions instead of the single Gaussian distribution in the mean-scale hyperprior model. In [20], the authors further apply a mixture of Gaussian-Laplacian-Logistic (GLL) distributions to model the latent representation. In [26], the authors enhance the context model by exploiting the channel dependencies: the parameters of the distribution of the elements in $y$ are derived from the channels that have already been decoded, and a 3D masked CNN is used to improve computational throughput. In [9], the authors divide the channels into two groups, where the first group is decoded in the same way as the normal context model and the second group is decoded using the first group as its context. With this architecture, the elements in the second group are able to exploit the long-range correlations in $y$ since the first group is fully decoded. In [17], the authors proposed a method that partitions the elements of the latent representation into two groups along the spatial dimension in a checkerboard pattern, where the elements in the first group are used as the context for the elements in the second group.
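The basic mechanism shared by these PixelCNN-style context models can be illustrated with a masked convolution: the mask zeroes out the current position and all positions that come later in raster-scan order, so the prediction for an element depends only on already-decoded elements. The sketch below is a generic illustration under that assumption, not the exact network of any cited work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    # Raster-scan causal convolution as used in PixelCNN-style context models.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kh // 2, kw // 2:] = 0   # current element and the ones to its right
        mask[:, :, kh // 2 + 1:, :] = 0     # all rows below the current one
        self.register_buffer("mask", mask)

    def forward(self, x):
        # apply the causal mask so each output only sees already-decoded inputs
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# e.g., a context model predicting entropy parameters from the partially decoded latent y
context_model = MaskedConv2d(192, 384, kernel_size=5, padding=2)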
The context model, inspired by PixelCNN, further exploits the spatial and channel correlations. Although the encoding can be performed in a batch mode, the main issue of the PixelCNN-based context model is that the decoding has to be performed in sequential order, i.e., pixel by pixel, or even element by element if the channel dependency is exploited. According to the evaluation reported in [19], in an environment with GPUs, the average decoding time of the Cheng2020 model [8] is 5.9 seconds per image at a resolution of 512×768 and 45.9 seconds per image at an average resolution of 1913×1361. To achieve better compression performance, recent LIC systems are even multiple times slower than the Cheng2020 system [9, 20, 22]. Furthermore, to avoid excessive computational complexity, the context model is implemented with a small neural network with a limited receptive field, which greatly degrades the system performance. The hyperprior architecture also tries to capture the spatial correlation in the latent representation. However, this architecture significantly increases the system complexity; for example, a 4-layer context model network and a 5-layer hyperprior decoder network are used in [8].
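The sequential nature of the decoding can be seen in the following pseudocode-style sketch, in which the function names (entropy_params, arithmetic_decode) are hypothetical placeholders: the distribution of each element can only be computed after its causal context has been decoded, so the arithmetic decoder must be invoked position by position rather than in a batch.

import torch

def decode_latent(bitstream, context_model, entropy_params, arithmetic_decode, C, H, W):
    # Illustrative serial decoding loop implied by a spatial context model.
    y = torch.zeros(1, C, H, W)
    for h in range(H):                 # raster-scan order: one spatial position at a time
        for w in range(W):
            ctx = context_model(y)                         # sees already-decoded elements only
            mean, scale = entropy_params(ctx[:, :, h, w])  # distribution of the current element
            y[:, :, h, w] = arithmetic_decode(bitstream, mean, scale)
    return y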
In this paper, we propose an LIC framework using a novel probability model which significantly improves the compres-