Improving The Reconstruction Quality by Overfitted
Decoder Bias in Neural Image Compression
Oussama Jourairi
InterDigital, Inc.
Rennes, France
Muhammet Balcilar
InterDigital, Inc.
Rennes, France
Anne Lambert
InterDigital, Inc.
Rennes, France
François Schnitzler
InterDigital, Inc.
Rennes, France
Abstract—End-to-end trainable models have reached the performance of traditional handcrafted compression techniques on videos and images. Since the parameters of these models are learned over large training sets, they are not optimal for any given image to be compressed. In this paper, we propose an instance-based fine-tuning of a subset of the decoder's biases to improve the reconstruction quality in exchange for extra encoding time and a minor additional signaling cost. The proposed method is applicable to any end-to-end compression method and improves the state-of-the-art neural image compression BD-rate by 3–5%.
Keywords—Learning-based image coding, overfitting, fine-tuning.
I. INTRODUCTION
Image and video compression are an important part of our everyday life. These technologies have been refined over decades by experts. Nowadays, compression algorithms, such as those developed by MPEG, consist of finely tuned handcrafted techniques. Recently, deep learning models have been used to develop end-to-end trainable compression algorithms. State-of-the-art neural architectures now compete with traditional compression methods (H.266/VVC [1]), even in terms of peak signal-to-noise ratio (PSNR), for single image compression [2].
One of the main research directions for end-to-end compression focuses on Rate-Distortion Autoencoders [3], a particular type of Variational Autoencoder (VAE) [4]. Optimizing such a model amounts to minimizing the mean square error (MSE) of the decompressed image and the bitlength of the encoded latent values, estimated by their entropy w.r.t. their priors [5]. In practice, these latents are first quantized and then typically encoded by an entropy encoder such as range or arithmetic coding [6]. These encoders exploit the prior distributions over the encoded values (here, the latents) to achieve close-to-optimal compression rates. The priors are also trainable and can themselves have hyperpriors [5], [7]–[12].
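As a rough illustration of this rate estimate, the sketch below computes the ideal entropy-coding cost of rounded latents: an ideal entropy coder spends about $-\log_2 p(v)$ bits per symbol $v$. The Gaussian prior and the tensor shape are our assumptions, standing in for a learned prior; this is not the paper's model.

```python
# A minimal sketch (our illustration, not from the paper) of estimating
# the bitlength of quantized latents under an assumed Gaussian prior.
import torch
from torch.distributions import Normal

def estimated_bits(y: torch.Tensor, mu: float = 0.0, sigma: float = 1.0) -> float:
    prior = Normal(torch.tensor(mu), torch.tensor(sigma))
    y_hat = torch.round(y)                               # quantized latents
    # pmf of integer bin v = prior mass on [v - 0.5, v + 0.5]
    pmf = prior.cdf(y_hat + 0.5) - prior.cdf(y_hat - 0.5)
    return float(-torch.log2(pmf.clamp_min(1e-9)).sum())  # ideal coding cost

y = torch.randn(16, 16, 192)            # toy latent tensor (assumed shape)
print(f"~{estimated_bits(y):.0f} bits")
```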
As usual with deep learning, these models are typically trained on large datasets and then fixed, whereas traditional encoders can adapt to a particular image by, for example, optimizing the quadtree decomposition. Any resulting neural model is therefore likely to be suboptimal for any single image, a problem called the amortization gap [13]. In the compression context, this gap can be leveraged to improve the rate-distortion trade-off, for example by fine-tuning the encoder or the latent codes [14]–[17]. These approaches improve distortion without degrading the rate. Another class of methods fine-tunes the decoder and the entropy model, improving distortion further but degrading the rate, as the modified parameters must be transmitted as well [18]–[20]. Because of this added cost, these approaches have not been applicable to single image compression but only to sets of images [20] or videos [21], where the rate increase is amortized over many images. Another solution is to select one set of parameter values out of predefined sets [18]. This decreases encoding time and signaling cost but again yields limited gains compared to strong baselines.
In this paper, we achieve decoder fine-tuning that improves the reconstruction quality for single image compression, which the literature had so far considered infeasible. This is made possible by our three contributions: 1) the selection of a subset of parameters to be fine-tuned, 2) jointly learning the quantization parameter of the updates, and 3) a new loss function based on interpolating the baseline model's performance. In our experiments, we show a 3–5% BD-rate gain for any given baseline end-to-end image compression model in exchange for extra encoding complexity. A sketch of the per-image fine-tuning loop is given below.
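The sketch below illustrates the core of contribution 1): overfitting only the decoder's bias terms on the single image being encoded. It is our illustration under assumed names (`decoder`, `y_hat`); the actual method additionally quantizes the bias updates with a jointly learned quantization parameter and uses the interpolation-based loss, both omitted here.

```python
# A minimal sketch of per-image fine-tuning of decoder biases only:
# freeze all parameters, unfreeze bias terms, then minimize distortion
# on the one image being compressed. Illustration only; the paper's
# method also quantizes and signals the bias updates.
import torch
import torch.nn.functional as F

def finetune_decoder_biases(decoder: torch.nn.Module,
                            y_hat: torch.Tensor,
                            x: torch.Tensor,
                            steps: int = 100,
                            lr: float = 1e-3) -> None:
    for p in decoder.parameters():            # freeze everything ...
        p.requires_grad_(False)
    biases = [p for n, p in decoder.named_parameters() if n.endswith("bias")]
    for b in biases:                          # ... except bias terms
        b.requires_grad_(True)

    opt = torch.optim.Adam(biases, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x_rec = decoder(y_hat)                # decode the fixed latents
        F.mse_loss(x_rec, x).backward()       # distortion-only objective
        opt.step()
```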
II. NEURAL IMAGE COMPRESSION
An input image to be compressed, $x \in \mathbb{R}^{n \times n \times 3}$, is first processed by a deep encoder $y = g_a(x; \phi)$. $y \in \mathbb{R}^{m \times m \times o}$ is called the latent and is smaller than $x$. This latent is converted into a bitstream by going through a quantizer, $\hat{y} = Q(y)$, and then through an entropy coder exploiting a prior $p_f(\hat{y}|\Psi)$ in [8]. $p_f$ can also depend on some side information $z = h_a(y) \in \mathbb{R}^{k \times k \times f}$ to better model spatial dependencies. $h_a$, another neural network, is also trained. We denote by $\hat{z} = Q(z)$ the quantization of $z$. Both $\hat{y}$ and $\hat{z}$ are encoded, and the encoders respectively use the hyperprior $p_h(\hat{y}|\hat{z}; \Theta)$ and the prior $p_f(\hat{z}|\Psi)$. The latent can be processed by a deep decoder $\hat{x} = g_s(\hat{y}; \theta)$ to obtain the decompressed image $\hat{x}$. The parameters $\phi, \theta, \Psi, \Theta$ are trained using the following rate-distortion loss:
$$\mathcal{L} = \mathbb{E}_{x \sim p_x,\, \epsilon \sim U}\left[-\log(p_h(\hat{y}|\hat{z}, \Theta)) - \log(p_f(\hat{z}|\Psi)) + \lambda d(x, \hat{x})\right], \quad (1)$$
where $d(\cdot, \cdot)$ denotes a distortion loss such as MSE and $\lambda$ controls the trade-off between compression ratio and quality. Note that during training, $Q(\cdot)$ is relaxed into $Q(x) = x + \epsilon$, $\epsilon \sim U(-0.5, 0.5)$.
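A minimal sketch of Eq. (1) at training time is given below. It assumes the entropy models expose per-element likelihood tensors, here named `p_y` and `p_z` (our names, not the paper's); the uniform-noise relaxation keeps the quantizer differentiable.

```python
# A minimal sketch of the training objective in Eq. (1). The likelihood
# tensors p_y and p_z are assumed outputs of the entropy models.
import torch
import torch.nn.functional as F

def relaxed_quantize(y: torch.Tensor) -> torch.Tensor:
    # Training-time relaxation Q(y) = y + eps, eps ~ U(-0.5, 0.5).
    return y + torch.empty_like(y).uniform_(-0.5, 0.5)

def rd_loss(x, x_rec, p_y, p_z, lam):
    # Rate: -log p terms of Eq. (1); the log base only rescales lambda.
    rate = -(torch.log(p_y).sum() + torch.log(p_z).sum())
    dist = F.mse_loss(x_rec, x)               # distortion d(x, x_hat)
    return rate + lam * dist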
Typically, $p_f(\hat{z}|\Psi)$ is factorized into $f$ independent slices of size $k \times k$. Each slice has its own trainable cumulative distribution function (cdf), $\bar{p}^{(c)}_{\Psi}(\cdot)$, $c = 1 \ldots f$. From the cdf and for any value of $x$, the probability mass function (pmf) is obtained as the cdf mass of the quantization bin, $\bar{p}^{(c)}_{\Psi}(x + 0.5) - \bar{p}^{(c)}_{\Psi}(x - 0.5)$.
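As a small illustration of this cdf-to-pmf conversion, the sketch below uses a logistic cdf as a stand-in for the learned $\bar{p}^{(c)}_{\Psi}$ of one slice; the stand-in is our assumption, not the paper's trained model.

```python
# A minimal sketch of recovering a pmf from a slice's cdf: the mass of
# integer bin v is cdf(v + 0.5) - cdf(v - 0.5). torch.sigmoid stands in
# for the trained cdf of one slice.
import torch

def pmf(v: torch.Tensor) -> torch.Tensor:
    return torch.sigmoid(v + 0.5) - torch.sigmoid(v - 0.5)

v = torch.arange(-5, 6, dtype=torch.float32)   # integer bins -5 ... 5
print(pmf(v), pmf(v).sum())                    # bin masses; sum ~ 1
```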