
MULTI-RATE ADAPTIVE TRANSFORM CODING FOR VIDEO COMPRESSION
Lyndon R. Duong∗
lyndon.duong@nyu.edu
Center for Neural Science, NYU
New York, NY, USA
Bohan Li, Cheng Chen, Jingning Han
{bohanli, chengchen, jingning}@google.com
Open Codecs, Google LLC
Mountain View, CA, USA
ABSTRACT
Contemporary lossy image and video coding standards rely on
transform coding, the process through which pixels are mapped to
an alternative representation to facilitate efficient data compression.
Despite the impressive performance of end-to-end optimized compression
with deep neural networks, the high computational and space
demands of these models have prevented them from superseding
the relatively simple transform coding found in conventional video
codecs. In this study, we propose learned transforms and entropy
coding that may serve either as (non)linear drop-in replacements
for, or as enhancements to, the linear transforms in existing codecs. These trans-
forms can be multi-rate, allowing a single model to operate along the
entire rate-distortion curve. To demonstrate the utility of our frame-
work, we augmented the DCT with learned quantization matrices
and adaptive entropy coding to compress intra-frame AV1 block
prediction residuals. We report substantial BD-rate and perceptual
quality improvements over more complex nonlinear transforms at a
fraction of the computational cost.
Index Terms—video compression, transform coding, entropy
coding
1. INTRODUCTION
Transform coding is an integral component of image and video
coding [1]. In state-of-the-art video standards such as HEVC [2],
VVC [3], and AV1 [4], transform coding is used to map block pre-
diction residuals to a domain in which the statistics of the transform
coefficients facilitate more effective compression. These codecs use
linear transforms such as the discrete cosine transform (DCT) [5]
and the asymmetric discrete sine transform (ADST) [6], due to their
compression efficiency as well as low computational complexity.
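To make the role of the transform concrete, the following is a minimal sketch of a DCT-based transform coding round trip for a single residual block, assuming scipy for the transform; the block size, quantization step, and function names are illustrative, not the integer implementations used in the codecs above.

import numpy as np
from scipy.fft import dctn, idctn

def code_residual_block(residual, step=8.0):
    """Toy transform-coding round trip for one residual block.

    2D type-II DCT, uniform scalar quantization, inverse transform.
    Real codecs use integer transforms and entropy-code the
    quantized coefficients rather than reconstructing immediately.
    """
    coeffs = dctn(residual, norm="ortho")   # forward transform
    q = np.round(coeffs / step)             # quantized symbols (entropy-coded in practice)
    return idctn(q * step, norm="ortho")    # dequantize + inverse transform

block = np.random.randn(8, 8)               # stand-in prediction residual
print(np.abs(block - code_residual_block(block)).max())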
In recent years, impressive results from end-to-end optimized
codecs have pointed to a possible shift away from conventional codecs,
whose designs are largely based on heuristics and hand-engineered
components (see [7, 8] for reviews). Indeed, image compression
competitions based on rate-distortion (R-D) performance are now
dominated by nonlinear machine learning (ML) models [7]. However,
end-to-end ML codecs have yet to be standardized, primarily
because their time and space complexity far exceeds that of
conventional solutions. For example, many ML compression
approaches train an individual model for each point along the
R-D curve, requiring an entirely separate set
of neural network parameters for each R-D trade-off. This not only
dramatically increases the space needed to store such parameters, but
also limits the ability to fine-tune the R-D trade-off.
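Concretely, such codecs are typically trained to minimize a rate-distortion Lagrangian of the form L(θ) = R(θ) + λ D(θ), where R is the expected bit-rate, D is the distortion (e.g., mean squared error), and λ sets the trade-off; the per-point approach described above learns a separate parameter set θ for every value of λ.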
∗Work was performed at Open Codecs, Google LLC.
Fig. 1. Multi-rate model architecture. The analysis/synthesis trans-
forms (left column) can be linear or nonlinear, and use a fixed set of
λ-independent parameters (blue) along with a subset of λ-dependent
parameters to fine-tune the R-D trade-off (pink). A learned hyperprior
network (right column) enables forward-adaptive entropy coding by
conditioning the probability over transform coefficients, p(ŷ|Φ). AC
and AD denote arithmetic coding and decoding, respectively, and orange
boxes indicate quantization operations. See section 3 for details.
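As a rough illustration of how such conditioning determines the rate, the sketch below computes the estimated bit count of quantized coefficients under a conditional Gaussian entropy model, a common choice in hyperprior architectures [7]; the Gaussian assumption, array shapes, and names are illustrative stand-ins for the model described in section 3.

import numpy as np
from scipy.stats import norm

def estimated_bits(y_hat, mu, sigma):
    """Bits needed to code quantized coefficients y_hat when each
    coefficient is modeled as Gaussian with hyperprior-predicted
    mean mu and scale sigma, integrated over the quantization bin."""
    upper = norm.cdf((y_hat - mu + 0.5) / sigma)
    lower = norm.cdf((y_hat - mu - 0.5) / sigma)
    p = np.clip(upper - lower, 1e-9, 1.0)   # probability mass per coefficient
    return float(-np.log2(p).sum())         # total estimated code length

y_hat = np.round(np.random.randn(64) * 3)   # stand-in quantized coefficients
print(estimated_bits(y_hat, mu=np.zeros(64), sigma=np.full(64, 3.0)))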
In this study, we take steps towards addressing these issues, with
our contributions summarized as follows:
1. We trained transforms and conditional entropy models with
an R-D objective to compress intra-frame prediction residuals
collected from the AVM codec [9], the reference software for
the next codec from the Alliance for Open Media. These models
can be used as drop-in replacements for, or augmentations of,
existing transform coding modules in video codecs.
2. We used a family of architectures and a training procedure
that enable multi-rate compression via adaptive gain control
with context-adaptive entropy coding [7], allowing a single
trained model to operate at an arbitrary point along the R-D
curve (see the sketch following this list). This vastly reduces
parameter storage compared to training a separate model for each
R-D trade-off.
3. We augmented the DCT with ML components, yielding sig-
nificant improvements in BD-rate [10] and structural simi-
larity (SSIM) for intra-frame block prediction residuals com-
pared to learned nonlinear transforms of higher complexity.
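To make the multi-rate mechanism of contribution 2 concrete, here is a simplified sketch of λ-dependent gain vectors modulating shared DCT coefficients before quantization; the λ grid, gain values, and log-domain interpolation are assumptions for illustration, not the trained model's parameters.

import numpy as np
from scipy.fft import dctn

LAMBDAS = np.array([0.01, 0.05, 0.25])                        # hypothetical training grid of trade-offs
GAINS = np.stack([np.full(64, g) for g in (0.5, 1.0, 2.0)])   # stand-in learned per-coefficient gains

def encode_block(residual, lam):
    """Quantize DCT coefficients with a λ-dependent gain vector;
    gains for an unseen λ are interpolated between the nearest
    trained values, so one model spans the whole R-D curve."""
    i = np.clip(np.searchsorted(LAMBDAS, lam), 1, len(LAMBDAS) - 1)
    t = (np.log(lam) - np.log(LAMBDAS[i - 1])) / (np.log(LAMBDAS[i]) - np.log(LAMBDAS[i - 1]))
    gain = (1 - t) * GAINS[i - 1] + t * GAINS[i]
    coeffs = dctn(residual, norm="ortho").ravel()
    return np.round(coeffs * gain)                             # symbols passed to the entropy coder

print(encode_block(np.random.randn(8, 8), lam=0.1)[:8])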