data $x$ is transformed by an analysis function $g_a(x;\theta)$ to generate a latent representation $\tilde{y}$ in the continuous domain. Next, $\tilde{y}$ is quantized to the latent representation $y$ in the discrete domain. Then, the arithmetic encoder encodes $y$ into a bitstream using the estimated distribution provided by the probability model. The encoder operation can be represented by the function
$y = Q(g_a(x;\theta))$.  (1)
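As a concrete reading of Eq. 1, the following is a minimal PyTorch sketch of the encoder-side operation. The convolutional architecture of AnalysisTransform is a hypothetical placeholder rather than the network of any cited work, and the arithmetic-coding step is omitted.

import torch
import torch.nn as nn

class AnalysisTransform(nn.Module):
    # Hypothetical g_a(x; theta): maps an image to a continuous latent.
    def __init__(self, channels=192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, x):
        return self.net(x)

g_a = AnalysisTransform()
x = torch.rand(1, 3, 256, 256)   # input image x
y_tilde = g_a(x)                 # continuous latent \tilde{y} = g_a(x; theta)
y = torch.round(y_tilde)         # quantization Q(.); y is then entropy-coded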
At the decoder side, an arithmetic decoder reconstructs $y$ from the bitstream with the help of the same probability model that is used at the encoder side. Next, $y$ is dequantized and a synthesis function $g_s(y;\phi)$ is used to generate $\hat{x}$ as the reconstructed input data. $g_a(\cdot;\theta)$ and $g_s(\cdot;\phi)$ are implemented using deep neural networks with parameters $\theta$ and $\phi$, respectively. The codec is trained by optimizing the RD loss function defined by
$L = R(y) + \lambda \cdot D(x,\hat{x})$  (2)
$\;\; = \mathbb{E}_{x \sim p(x)}[-\log p(y)] + \lambda \cdot \mathbb{E}_{x \sim p(x)}[d(x,\hat{x})]$.  (3)
In Eq. 2, $R(y)$ is the rate loss measuring the expected number of bits needed to encode $y$, $D(x,\hat{x})$ is the expected reconstruction loss measuring the quality of the reconstructed image, and $\lambda$ is the Lagrange multiplier that adjusts the weighting of the two loss terms to achieve different compression rates. In Eq. 3, $p(y)$, also known as the prior distribution, is the probability distribution of the latent representation $y$, and $d(x,\hat{x})$ is the distance function measuring the quality of the reconstructed data $\hat{x}$, where MSE or MS-SSIM is normally used as the reconstruction loss [4, 25]. The RD loss function is thus a weighted sum of the expected bitstream length and the reconstruction loss.
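To make Eqs. 2 and 3 concrete, the following is a minimal PyTorch sketch of the RD objective for one training batch. The callables analysis, synthesis, and likelihood are hypothetical placeholders for $g_a$, $g_s$, and the probability model, and hard rounding stands in for quantization (in practice a differentiable surrogate is typically used during training).

import torch
import torch.nn.functional as F

def rd_loss(x, analysis, synthesis, likelihood, lam):
    # Sketch of L = R(y) + lambda * D(x, x_hat) from Eqs. 2-3.
    y_tilde = analysis(x)              # \tilde{y} = g_a(x; theta)
    y = torch.round(y_tilde)           # quantization (surrogate needed for gradients)
    p_y = likelihood(y)                # per-element probability p(y) from the model
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate = (-torch.log2(p_y)).sum() / num_pixels   # R(y) in bits per pixel
    x_hat = synthesis(y)               # \hat{x} = g_s(y; phi)
    distortion = F.mse_loss(x_hat, x)  # d(x, \hat{x}), here MSE
    return rate + lam * distortion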
The rate loss $R(y)$ is the expected length of the bitstream needed to encode the latent representation $y$ when the input data $x$ is compressed. Let $q(y)$ be the true distribution of $y$. Although $y$ is deterministic given the random variable $x$ according to Eq. 1, $q(y)$ is still unknown since the true distribution of $x$ is unknown. To tackle this problem, an LIC codec uses a variational distribution $p(y)$, either a parametric model from a known distribution family [8, 9, 11, 18, 20, 22, 26, 27] or a non-parametric model [7], to replace $q(y)$ in calculating the rate loss. The rate loss is then calculated as the cross-entropy between $q(y)$ and $p(y)$, that is,
$R(y) = H(q(y), p(y))$  (4)
$\;\; = \mathbb{E}_{y \sim q(y)}[-\log p(y)]$  (5)
$\;\; = H(q(y)) + D_{\mathrm{KL}}(q(y)\,\|\,p(y))$.  (6)
Since $H(q(y))$ does not depend on $p(y)$, minimizing the rate loss also drives the variational distribution $p(y)$ toward the true distribution $q(y)$ by reducing the KL divergence.
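In practice $q(y)$ is unknown, so the expectation in Eq. 5 is typically approximated by a sample average over training images (a standard Monte-Carlo estimate, stated here as an assumption rather than the exact formulation of any cited system):
$R(y) \approx -\frac{1}{N}\sum_{n=1}^{N} \log p\big(y^{(n)}\big), \qquad y^{(n)} = Q\big(g_a(x^{(n)};\theta)\big),$
where $x^{(1)},\dots,x^{(N)}$ are samples drawn from the training data.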
In [7], the authors model the prior distribution $p(y)$ as fully factorized, where the distribution of each element is modeled non-parametrically using piece-wise linear functions. This simple model does not capture spatial dependencies in the latent representation. In [18], the authors introduced the scale hyperprior model, where the elements of the latent representation are modeled as independent zero-mean Gaussian distributions and the variance of each element is derived from side information $z$. $z$ is modeled using a different distribution model and transmitted separately. The loss function becomes
$L = R(y) + R(z) + \lambda \cdot D(x,\hat{x})$.  (7)
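To illustrate how the scale hyperprior contributes to the rate terms in Eq. 7, the sketch below evaluates the probability of each quantized element of $y$ under a zero-mean Gaussian whose scale is predicted from the side information $z$. The callables hyper_synthesis and z_likelihood are hypothetical placeholders, and integrating the Gaussian over the quantization bin is one common way to obtain a discrete probability; this is a sketch under those assumptions, not the exact formulation of [18].

import torch
from torch.distributions import Normal

def hyperprior_rate(y, z, hyper_synthesis, z_likelihood):
    # Sketch of R(y) + R(z) for a scale-hyperprior model (Eq. 7); y and z are quantized.
    sigma = hyper_synthesis(z).clamp(min=1e-6)     # per-element scale for y, derived from z
    gauss = Normal(torch.zeros_like(sigma), sigma)
    # probability mass of the quantization bin [y - 0.5, y + 0.5]
    p_y = (gauss.cdf(y + 0.5) - gauss.cdf(y - 0.5)).clamp(min=1e-9)
    rate_y = (-torch.log2(p_y)).sum()
    p_z = z_likelihood(z).clamp(min=1e-9)          # factorized model for the side information
    rate_z = (-torch.log2(p_z)).sum()
    return rate_y + rate_z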
In [11], the authors improved the scale hyperprior model in two aspects. First, both the mean and the variance of the Gaussian model are derived from the side information $z$, which is referred to as the mean-scale hyperprior model. Second, a context model is introduced to further exploit the spatial dependencies among the elements. The context model uses the elements that have already been decoded to improve the model accuracy for the current element. A similar technique was used in image generative models such as PixelCNN [28–30]. Many recent LIC systems are based on the hyperprior architecture and the context model. The authors in [8] use a mixture of Gaussian distributions instead of the single Gaussian distribution in the mean-scale hyperprior model. In [20], the authors further apply a mixture of Gaussian-Laplacian-Logistic (GLL) distributions to model the latent representation. In [26], the authors enhance the context model by exploiting the channel dependencies: the parameters of the distribution of the elements in $y$ are derived from the channels that have already been decoded, and a 3D masked CNN is used to improve computational throughput. In [9], the authors divide the channels into two groups, where the first group is decoded in the same way as the normal context model and the second group is decoded using the first group as its context. With this architecture, the elements in the second group are able to exploit the long-range correlations in $y$ since the first group is fully decoded. In [17], the authors proposed a method that partitions the elements of the latent representation into two groups along the spatial dimension in a checkerboard pattern, where the elements in the first group are used as the context for the elements in the second group.
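The basic mechanism shared by these PixelCNN-style context models can be illustrated with a masked convolution: the mask zeroes out the current position and all positions that come later in raster-scan order, so the prediction for an element depends only on already-decoded elements. The sketch below is a generic illustration under that assumption, not the exact network of any cited work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    # Raster-scan causal convolution as used in PixelCNN-style context models.
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)
        mask[:, :, kh // 2, kw // 2:] = 0   # current element and the ones to its right
        mask[:, :, kh // 2 + 1:, :] = 0     # all rows below the current one
        self.register_buffer("mask", mask)

    def forward(self, x):
        # apply the causal mask so each output only sees already-decoded inputs
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# e.g., a context model predicting entropy parameters from the partially decoded latent y
context_model = MaskedConv2d(192, 384, kernel_size=5, padding=2)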
The context model, inspired by PixelCNN, further exploits the spatial and channel correlations. Although the encoding can be performed in a batch mode, the main issue of the PixelCNN-based context model is that the decoding has to be performed in sequential order, i.e., pixel by pixel, or even element by element if the channel dependency is exploited. According to the evaluation reported in [19], in an environment with GPUs, the average decoding time of the Cheng2020 model [8] is 5.9 seconds per image at a resolution of 512×768 and 45.9 seconds per image at an average resolution of 1913×1361. To achieve better compression performance, recent LIC systems are even multiple times slower than the Cheng2020 system [9, 20, 22]. Furthermore, to avoid excessive computational complexity, the context model is implemented with a small neural network with a limited receptive field, which greatly degrades the system performance. The hyperprior architecture also tries to capture the spatial correlation in the latent representation. However, this architecture significantly increases the system complexity; for example, a 4-layer context model network and a 5-layer hyperprior decoder network are used in [8].
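The sequential nature of the decoding can be seen in the following pseudocode-style sketch, in which the function names (entropy_params, arithmetic_decode) are hypothetical placeholders: the distribution of each element can only be computed after its causal context has been decoded, so the arithmetic decoder must be invoked position by position rather than in a batch.

import torch

def decode_latent(bitstream, context_model, entropy_params, arithmetic_decode, C, H, W):
    # Illustrative serial decoding loop implied by a spatial context model.
    y = torch.zeros(1, C, H, W)
    for h in range(H):                 # raster-scan order: one spatial position at a time
        for w in range(W):
            ctx = context_model(y)                         # sees already-decoded elements only
            mean, scale = entropy_params(ctx[:, :, h, w])  # distribution of the current element
            y[:, :, h, w] = arithmetic_decode(bitstream, mean, scale)
    return y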
In this paper, we propose an LIC framework using a novel probability model which significantly improves the compres-