Leveraging progressive model and overfitting for efficient learned image compression
Honglei Zhang, Francesco Cricri, Hamed Rezazadegan Tavakoli, Emre Aksu, Miska M. Hannuksela

Nokia Technologies
Finland
Abstract—Deep learning has been overwhelmingly dominant in computer vision and image/video processing over the last decade. However, for image and video compression, it lags behind the traditional techniques based on the discrete cosine transform (DCT) and linear filters. Built on top of an auto-encoder architecture, learned image compression (LIC) systems have drawn enormous attention in recent years. Nevertheless, the proposed LIC systems are still inferior to state-of-the-art traditional techniques, for example, the Versatile Video Coding (VVC/H.266) standard, in either compression performance or decoding complexity. Although claimed to outperform VVC/H.266 on a limited bit rate range, some proposed LIC systems take over 40 seconds to decode a 2K image on a GPU system. In this paper, we introduce a powerful and flexible LIC framework with a multi-scale progressive (MSP) probability model and a latent representation overfitting (LOF) technique. With different predefined profiles, the proposed framework can achieve various balance points between compression efficiency and computational complexity. Experiments show that the proposed framework achieves 2.5%, 1.0%, and 1.3% Bjøntegaard delta bit rate (BD-rate) reduction over the VVC/H.266 standard on three benchmark datasets over a wide bit rate range. More importantly, the decoding complexity is reduced from O(n) to O(1) compared to many other LIC systems, resulting in a speedup of over 20 times when decoding 2K images.
I. INTRODUCTION
Although deep learning-based technology has achieved tremendous success in most computer vision and image/video processing tasks, it has not been able to demonstrate superior performance over traditional technologies for image and video compression, in particular for practical usage. Traditional image compression techniques, such as JPEG/JPEG 2000 [1], High Efficiency Video Coding (HEVC) (all-intra mode) [2], and Versatile Video Coding (VVC/H.266) (all-intra mode) [3], apply carefully designed processing steps such as data transformation, quantization, and entropy coding to compress the image data while maintaining a certain quality for human perception [4–6]. End-to-end learned image compression (LIC) systems are based on deep learning technology and a data-driven paradigm [4, 7–17]. These systems normally adopt the variational auto-encoder architecture, as shown in Figure 1, comprising an encoder, a decoder, and a probability model implemented by deep convolutional neural networks (CNN) [7–14, 18]. LIC systems are trained on datasets with a large number of natural images by optimizing a rate-distortion (RD) loss function.
Fig. 1. LIC architecture. (Block diagram with the blocks Encoder, Quantizer, Arithmetic Encoder, Probability Model, Arithmetic Decoder, Dequantizer, and Decoder.)
Compared with state-of-the-art traditional image compression technologies, for example, VVC/H.266, most LIC systems do not provide better compression performance despite a much higher encoding and decoding complexity [4, 14, 19]. Recently, some proposed LIC systems have improved the compression performance to be on par with or slightly better than VVC/H.266 [9, 20–24] on a limited bit rate range. However, the decoding procedures of these systems are very inefficient, which prevents them from being used in practice.
In this paper, we propose a flexible and novel LIC framework that achieves various balance points between compression efficiency and computational complexity. A system based on the proposed framework outperforms VVC/H.266 on three benchmark datasets over a wide bit rate range. Compared to most other LIC systems, the proposed system not only improves the compression performance but also reduces the decoding complexity from O(n) to O(1) in a parallel computing environment. Our contributions are summarized as follows:
• We propose the multi-scale progressive (MSP) probability model for lossy image compression that efficiently exploits both spatial and channel correlation of the latent representation and significantly reduces decoding complexity.
• We present a greedy search method for applying the latent representation overfitting (LOF) technique and show that LOF can considerably improve the performance of LIC systems and mitigate the domain-shift problem.
II. LIC SYSTEM AND RELATED WORKS
arXiv:2210.04112v1 [cs.CV] 8 Oct 2022
In [7], the authors formulated the LIC codec, named the transform coding model, from the generative Bayesian model. Figure 1 shows the architecture of a typical LIC codec. Input data $x$ is transformed by an analysis function $g_a(x;\theta)$ to generate a latent representation $\tilde{y}$ in the continuous domain. Next, $\tilde{y}$ is quantized to the latent representation $y$ in the discrete domain. Then, the arithmetic encoder encodes $y$ into a bitstream using the estimated distribution provided by the probability model. The encoder operation can be represented by the function
$$y = Q(g_a(x;\theta)). \tag{1}$$
At the decoder side, an arithmetic decoder reconstructs $y$ from the bitstream with the help of the same probability model that is used at the encoder side. Next, $y$ is dequantized and a synthesis function $g_s(y;\phi)$ is used to generate $\hat{x}$ as the reconstructed input data. $g_a(\cdot;\theta)$ and $g_s(\cdot;\phi)$ are implemented using deep neural networks with parameters $\theta$ and $\phi$, respectively. The codec is trained by optimizing the RD loss function defined by
\begin{align}
L &= R(y) + \lambda \cdot D(x, \hat{x}) \tag{2} \\
  &= \mathbb{E}_{x \sim p(x)}[-\log p(y)] + \lambda \cdot \mathbb{E}_{x \sim p(x)}[d(x, \hat{x})] \tag{3}
\end{align}
In Eq. 2, $R(y)$ is the rate loss measuring the expected number of bits to encode $y$, $D(x, \hat{x})$ is the expected reconstruction loss measuring the quality of the reconstructed image, and $\lambda$ is the Lagrange multiplier that adjusts the weighting of the two loss terms to achieve different compression rates. In Eq. 3, $p(y)$, also known as the prior distribution, is the probability distribution of the latent representation $y$, and $d(x, \hat{x})$ is the distance function measuring the quality of the reconstructed data $\hat{x}$, for which MSE or MS-SSIM is normally used [4, 25]. The RD loss function is thus a weighted sum of the expected bitstream length and the reconstruction loss.
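To make the two terms of Eq. 2 concrete, the following minimal sketch evaluates the loss for a single toy sample rather than as an expectation over the data distribution. The function names, the toy probability table, and the choice of base-2 logarithms and MSE are illustrative assumptions, not the authors' implementation.

```python
import math

def rate_bits(symbols, pmf):
    """Code length in bits of the quantized latent: sum of -log2 p(y_i)."""
    return sum(-math.log2(pmf[s]) for s in symbols)

def mse(x, x_hat):
    """Mean squared error, used here as the distance d(x, x_hat)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def rd_loss(symbols, pmf, x, x_hat, lam):
    """Single-sample estimate of L = R(y) + lambda * D(x, x_hat)."""
    return rate_bits(symbols, pmf) + lam * mse(x, x_hat)

# Hypothetical probability model over three quantized values.
pmf = {-1: 0.25, 0: 0.5, 1: 0.25}
y = [0, 1, 0, -1]                  # quantized latent
x = [0.1, 0.9, -0.2, -1.1]         # original signal
x_hat = [0.0, 1.0, 0.0, -1.0]      # reconstruction

loss_low_lambda = rd_loss(y, pmf, x, x_hat, lam=1.0)
loss_high_lambda = rd_loss(y, pmf, x, x_hat, lam=100.0)
```

With a larger `lam`, the distortion term dominates the loss, so training is pushed toward higher quality at the cost of a higher bit rate.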
The rate loss $R(y)$ is the expected length of the bitstream that encodes the latent representation $y$ when input data $x$ is compressed. Let $q(y)$ be the true distribution of $y$. Although $y$ is a deterministic function of the random variable $x$ according to Eq. 1, $q(y)$ is still unknown since the true distribution of $x$ is unknown. To tackle this problem, an LIC codec uses a variational distribution $p(y)$, either a parametric model from a known distribution family [8, 9, 11, 18, 20, 22, 26, 27] or a non-parametric model [7], to replace $q(y)$ in calculating the rate loss. The rate loss is then calculated as the cross-entropy between $q(y)$ and $p(y)$:
\begin{align}
R(y) &= H(q(y), p(y)) \tag{4} \\
     &= \mathbb{E}_{y \sim q(y)}[-\log p(y)] \tag{5} \\
     &= H(q(y)) + D_{KL}(q(y) \,\|\, p(y)). \tag{6}
\end{align}
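The decomposition in Eqs. 4–6 can be checked numerically on a small discrete example. The sketch below uses two made-up distributions `q` and `p` over three symbols and base-2 logarithms (so all quantities are in bits). It shows that the cross-entropy exceeds the entropy of `q` by exactly the KL divergence, which is the rate penalty for coding with a mismatched model.

```python
import math

def entropy(q):
    """H(q) in bits."""
    return -sum(qi * math.log2(qi) for qi in q if qi > 0)

def cross_entropy(q, p):
    """H(q, p) = E_{y~q}[-log2 p(y)] in bits."""
    return -sum(qi * math.log2(pi) for qi, pi in zip(q, p) if qi > 0)

def kl_divergence(q, p):
    """D_KL(q || p) in bits."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.25, 0.25]   # hypothetical true distribution of y
p = [0.25, 0.25, 0.5]   # hypothetical model used by the codec

rate = cross_entropy(q, p)
gap = rate - entropy(q)  # extra bits caused by the model mismatch
```

Since $H(q(y))$ does not depend on the model, minimizing the rate loss over $p$ amounts to minimizing the KL term.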
In [7], the authors model the prior distribution $p(y)$ as fully factorized, where the distribution of each element is modeled non-parametrically using piece-wise linear functions. This simple model does not capture spatial dependencies in the latent representation. In [18], the authors introduced the scale hyperprior model, where the elements in the latent representation are modeled as independent zero-mean Gaussian distributions and the variance of each element is derived from side information $z$. $z$ is modeled using a different distribution model and transmitted separately. The loss function becomes
$$L = R(y) + R(z) + \lambda D(x, \hat{x}). \tag{7}$$
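As a rough illustration of why a per-element scale helps, the sketch below estimates the bits needed for an integer-quantized element under a zero-mean Gaussian. The probability mass of a symbol is taken as the Gaussian integrated over a unit-width bin around it. That discretization is a common convention and an assumption here, as are the function names and the example standard deviations (stand-ins for values that would be derived from $z$).

```python
import math

def gaussian_cdf(v, sigma):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(v / (sigma * math.sqrt(2.0))))

def symbol_prob(y, sigma):
    """Mass assigned to integer y: integral of the density over [y-0.5, y+0.5]."""
    return gaussian_cdf(y + 0.5, sigma) - gaussian_cdf(y - 0.5, sigma)

def symbol_bits(y, sigma):
    """Estimated code length of y in bits under the given scale."""
    return -math.log2(symbol_prob(y, sigma))

# A small scale makes zeros cheap and large values expensive; a large
# scale does the opposite.
bits_zero_small = symbol_bits(0, 0.5)
bits_zero_large = symbol_bits(0, 2.0)
bits_three_small = symbol_bits(3, 0.5)
bits_three_large = symbol_bits(3, 2.0)
```

When the scale predicted from the side information matches the local statistics of $y$, the elements are cheap to code, which is the gain that the additional rate term $R(z)$ in Eq. 7 has to pay for.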
In [11], the authors improved the scale hyperprior model in two respects. First, both the mean and the variance of the Gaussian model are derived from the side information $z$, giving the mean-scale hyperprior model. Second, a context model is introduced to further exploit the spatial dependencies of the elements. The context model uses the elements that have already been decoded to improve the model accuracy of the current element. A similar technique was used in image generative models such as PixelCNN [28–30]. Many recent LIC systems are based on the hyperprior architecture and the context model. The authors in [8] use a mixture of Gaussian distributions instead of the single Gaussian distribution in the mean-scale hyperprior model. In [20], the authors further apply a mixture of Gaussian-Laplacian-Logistic (GLL) distributions to model the latent representation. In [26], the authors enhance the context model by exploiting the channel dependencies. The parameters of the distribution function of the elements in $y$ are derived from the channels that have already been decoded. A 3D masked CNN is used to improve computational throughput. In [9], the authors divide the channels into two groups, where the first group is decoded in the same way as in the normal context model and the second group is decoded using the first group as its context. With this architecture, the pixels in the second group are able to use the long-range correlation in $y$ since the first group is fully decoded. In [17], the authors proposed a method that partitions the elements in the latent representation into two groups along the spatial dimension in a checkerboard pattern. The elements in the first group are used as the context for the elements in the second group.
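The checkerboard partition described for [17] can be sketched in a few lines. The code below splits the spatial positions of a latent grid into two groups by coordinate parity and collects the 4-connected neighbours of a position. Which parity forms the first group, the 4-neighbour context, and all function names are assumptions made for illustration.

```python
def checkerboard_groups(height, width):
    """Split grid positions into two groups by the parity of row + column."""
    first, second = [], []
    for r in range(height):
        for c in range(width):
            (first if (r + c) % 2 == 0 else second).append((r, c))
    return first, second

def neighbours(pos, height, width):
    """4-connected neighbours of pos that lie inside the grid."""
    r, c = pos
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < height and 0 <= j < width]

first, second = checkerboard_groups(4, 6)
first_set = set(first)

# Every neighbour of a second-group position belongs to the first group,
# so the whole second group can be predicted in one parallel pass once
# the first group has been decoded.
context_ok = all(
    n in first_set for pos in second for n in neighbours(pos, 4, 6)
)
```

Under this split, decoding needs two passes regardless of the grid size, in contrast to a raster-scan context model that visits positions one at a time.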
The context model, inspired by PixelCNN, exploits the spatial and channel correlation further. Although the encoding can be performed in a batch mode, the main issue of the PixelCNN-based context model is that the decoding has to be performed in sequential order, i.e., pixel by pixel, or even element by element if the channel dependency is exploited. According to the evaluation reported in [19], in an environment with GPUs, the average decoding time of the Cheng2020 model [8] is 5.9 seconds per image with a resolution of 512 × 768 and 45.9 seconds per image with an average resolution of 1913 × 1361. To achieve better compression performance, recent LIC systems are multiple times slower than the Cheng2020 system [9, 20, 22]. Furthermore, to avoid excessive computational complexity, the context model is implemented with a small neural network with a limited receptive field, which greatly degrades the system performance. The hyperprior architecture also tries to capture the spatial correlation in the latent representation. However, this architecture significantly increases the system complexity; for example, a 4-layer context model network and a 5-layer hyperprior decoder network are used in [8].
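The cost of sequential decoding can be illustrated by counting how many dependent steps a pixel-by-pixel context model needs for the two image sizes quoted above. The sketch assumes that the latent representation is downsampled by a factor of 16 in each spatial dimension. That factor is a typical value chosen for illustration and is not stated in the text, and the function name is hypothetical.

```python
import math

def sequential_steps(height, width, stride=16):
    """Number of latent positions a raster-scan context model must decode
    one after another (stride is an assumed downsampling factor)."""
    return math.ceil(height / stride) * math.ceil(width / stride)

small = sequential_steps(512, 768)     # 32 x 48 latent positions
large = sequential_steps(1913, 1361)   # 120 x 86 latent positions
growth = large / small                 # growth in dependent steps
```

The number of dependent steps grows in proportion to the number of latent positions, which matches the O(n) behaviour described above. A scheme whose number of passes is fixed would not grow with the image size when enough parallel hardware is available.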
In this paper, we propose an LIC framework using a novel probability model which significantly improves the compres-