
2 Related work
Analyzing which distributions coupling-based normalizing flows can approximate is an active area of
research. A general statement shows that a coupling-based normalizing flow which can approximate
an arbitrary invertible function can learn any probability density weakly [1]. This applies to affine
coupling flows [4, 5, 6], Flow++ [18], neural autoregressive flows [26], and SOS polynomial flows [27].
Affine coupling flows converge to arbitrary densities in Wasserstein distance [15]. Both
universality results, however, require that the couplings become ill-conditioned (i.e. the learnt
functions become increasingly discontinuous as the error decreases, whereas in practice one observes
that the learnt functions remain smooth). Also, they consider only a finite subspace of the data space. Even more
importantly, the convergence criterion employed in their proofs (weak convergence and convergence
in the Wasserstein metric, respectively) is critical: Those criteria do not imply convergence in the loss that is
employed in practice [15, Remark 3], the Kullback-Leibler divergence (equivalent to maximum
likelihood). An arbitrarily small distance in any of the above metrics can even result in an infinite
KL divergence, for instance when the model density vanishes on a small region where the data has positive probability.
In contrast to previous work on affine coupling flows, we work directly on the KL
divergence. We decompose it into two contributions and show the flow's convergence for one of the
parts.
Regarding when ill-conditioned flows need to arise to fit a distribution, [28] showed that well-conditioned
affine couplings can approximate log-concave padded distributions, again in terms of
Wasserstein distance. Lipschitz flows, on the other hand, cannot model arbitrary tail behavior, but this
can be fixed by adapting the latent distribution [29].
SOS polynomial flows converge in total variation to arbitrary probability densities [30], which also
does not imply convergence in KL divergence; zero-padded affine coupling flows converge weakly
[23], and so do Neural ODEs [31, 32].
Closely related to our work, 48 linear affine coupling blocks can represent any invertible linear
function $Ax + b$ with $\det(A) > 0$ [15, Theorem 2]. This also allows mapping any Gaussian
distribution $\mathcal{N}(m, \Sigma)$ to the standard normal $\mathcal{N}(0, I)$. We put this statement into context in terms
of the KL divergence: The loss is exactly composed of the divergence to the nearest Gaussian and
of that Gaussian to the standard normal. We then make strong statements about the convergence
of the latter, concluding that for typical flows a smaller number of layers is required for accurate
approximation than predicted by [15].
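To make this decomposition explicit, the identity we have in mind reads as follows (a sketch, under the assumption that the "nearest Gaussian" is the moment-matched Gaussian with mean $m$ and covariance $\Sigma$ of the latent code $q(z)$; it holds because the cross-entropy of $q(z)$ against a standard normal depends on $q$ only through these first two moments):
$$D_{\mathrm{KL}}\bigl(q(z) \,\|\, \mathcal{N}(0, I)\bigr) = D_{\mathrm{KL}}\bigl(q(z) \,\|\, \mathcal{N}(m, \Sigma)\bigr) + D_{\mathrm{KL}}\bigl(\mathcal{N}(m, \Sigma) \,\|\, \mathcal{N}(0, I)\bigr).$$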
3 Coupling-based normalizing flows
Normalizing flows learn an invertible function $f_\theta(x)$ that maps samples $x$ from some unknown
distribution $p(x)$, known only through samples, to latent variables $z = f_\theta(x)$ so that $z$ follows a simple distribution,
typically the standard normal. The function $f_\theta$ then yields an estimate $q(x)$ for the true data
distribution $p(x)$ via the change of variables formula (e.g. [5]):
$$q(x) = \mathcal{N}(f_\theta(x); 0, I)\,\lvert\det J\rvert, \qquad (1)$$
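As a concrete illustration of Eq. (1), the following Python snippet (a toy sketch of our own, not the architecture used in this paper) evaluates $q(x)$ for a simple invertible linear map $f_\theta(x) = Ax + b$, whose Jacobian is the constant matrix $A$:

import numpy as np

# Toy invertible flow: f_theta(x) = A x + b with a fixed invertible matrix A.
A = np.array([[1.5, 0.3],
              [0.0, 0.8]])        # det(A) = 1.2, hence invertible
b = np.array([0.5, -1.0])
d = 2

def log_q(x):
    """Log-density log q(x) from the change-of-variables formula, Eq. (1)."""
    z = A @ x + b                                    # latent code z = f_theta(x)
    log_std_normal = -0.5 * (z @ z) - 0.5 * d * np.log(2 * np.pi)
    log_abs_det_J = np.log(abs(np.linalg.det(A)))    # Jacobian of a linear map is A
    return log_std_normal + log_abs_det_J

x = np.array([0.2, -0.4])
print(np.exp(log_q(x)))           # q(x) = N(f_theta(x); 0, I) * |det J|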
where $J = \nabla f_\theta(x)$ is the Jacobian of $f_\theta(x)$. We can train a normalizing flow via the maximum
likelihood loss, which is equivalent to minimizing the Kullback-Leibler divergence between the
distribution of the latent code $q(z)$, as given by $z = f_\theta(x)$ when $x \sim p(x)$, and the standard normal:
$$\mathcal{L} = D_{\mathrm{KL}}\bigl(q(z) \,\|\, \mathcal{N}(0, I)\bigr) = \mathbb{E}_{x\sim p(x)}\Bigl[\tfrac{1}{2}\lVert f_\theta(x)\rVert^2 - \log\lvert\det J\rvert\Bigr] + \mathrm{const}. \qquad (2)$$
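For illustration, a minimal Monte Carlo estimate of the loss in Eq. (2), up to the additive constant, could look as follows; `flow_forward` is a hypothetical callable of our own naming that returns $z = f_\theta(x)$ and $\log\lvert\det J\rvert$ per sample, not an interface of any particular library:

import numpy as np

def nll_loss(flow_forward, x_batch):
    """Estimate Eq. (2) up to the additive constant from a batch x_batch ~ p(x)."""
    z, log_abs_det_J = flow_forward(x_batch)         # z: (n, d), log_abs_det_J: (n,)
    return np.mean(0.5 * np.sum(z**2, axis=1) - log_abs_det_J)

# Example with the toy linear flow f_theta(x) = A x + b from above:
A = np.array([[1.5, 0.3],
              [0.0, 0.8]])
b = np.array([0.5, -1.0])

def linear_flow(x_batch):
    z = x_batch @ A.T + b
    log_abs_det_J = np.full(len(x_batch), np.log(abs(np.linalg.det(A))))
    return z, log_abs_det_J

x_batch = np.random.default_rng(0).normal(size=(128, 2))
print(nll_loss(linear_flow, x_batch))                # minimized over theta during training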
The invertible architecture that makes up $f_\theta$ has to (i) be computationally easy to invert, (ii) be able
to represent complex transformations, and (iii) have a tractable Jacobian determinant $\lvert\det J\rvert$ [9].
Building such an architecture is an active area of research, see e.g. [2] for a review. In this work, we
focus on the family of coupling-based normalizing flows, first presented in the form of the NICE
architecture [4]. It is a deep architecture that consists of several blocks, each containing a rotation, a
coupling and an ActNorm layer [6]:
$$f_{\mathrm{block}}(x) = (f_{\mathrm{act}} \circ f_{\mathrm{cpl}} \circ f_{\mathrm{rot}})(x). \qquad (3)$$
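For concreteness, the following sketch applies the three layers of Eq. (3) in order. It assumes an affine (RealNVP-style) coupling as one instance of the family discussed here, and `subnet`, `s_act`, `b_act` are hypothetical placeholders of our own naming rather than components of a specific implementation:

import numpy as np

def coupling_block(x, Q, subnet, s_act, b_act):
    """Schematic block of Eq. (3): f_act o f_cpl o f_rot applied to a batch x."""
    # f_rot: fixed rotation so that different dimensions become passive/active per block
    x = x @ Q.T

    # f_cpl: split along the coordinate axis into passive p and active a parts;
    # the passive part is left unchanged, the active part is transformed with
    # parameters computed from the passive part (here: affine, RealNVP-style)
    d = x.shape[-1] // 2
    p, a = x[..., :d], x[..., d:]
    s, t = subnet(p)
    a = a * np.exp(s) + t
    x = np.concatenate([p, a], axis=-1)

    # f_act: ActNorm, a per-dimension affine normalization
    return x * s_act + b_act

# Usage with a 90-degree rotation and an identity coupling (zero scale and shift):
x = np.random.default_rng(0).normal(size=(4, 2))
Q = np.array([[0.0, -1.0],
              [1.0, 0.0]])
out = coupling_block(x, Q, lambda p: (np.zeros_like(p), np.zeros_like(p)), 1.0, 0.0)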
The coupling $f_{\mathrm{cpl}}$ splits an incoming vector $x_0$ into two parts along the coordinate axis: The first part
$p_0$, which we call passive, is left unchanged. The second part $a_0$, which we call active, is modified as