Whitening Convergence Rate of
Coupling-based Normalizing Flows
Felix Draxler
Heidelberg University
felix.draxler@iwr.uni-heidelberg.de
Christoph Schnörr
Heidelberg University
schnoerr@math.uni-heidelberg.de
Ullrich Köthe
Heidelberg University
ullrich.koethe@iwr.uni-heidelberg.de
Abstract
Coupling-based normalizing flows (e.g. RealNVP) are a popular family of normalizing flow architectures that work surprisingly well in practice. This calls for theoretical understanding. Existing work shows that such flows weakly converge to arbitrary data distributions [1]. However, they make no statement about the stricter convergence criterion used in practice, the maximum likelihood loss. For the first time, we make a quantitative statement about this kind of convergence: We prove that all coupling-based normalizing flows perform whitening of the data distribution (i.e. diagonalize the covariance matrix) and derive corresponding convergence bounds that show a linear convergence rate in the depth of the flow. Numerical experiments demonstrate the implications of our theory and point at open questions.
1 Introduction
Normalizing flows [2, 3] are among the most promising approaches to generative machine learning and have already demonstrated convincing performance in a wide variety of practical applications, ranging from image analysis [4, 5, 6, 7, 8] to astrophysics [9], mechanical engineering [10], causality [11], computational biology [12] and medicine [13]. As the name suggests, normalizing flows represent complex data distributions as bijective transformations (also known as flows or push-forwards) of standard normal or other well-understood distributions.
In this paper, we focus on a theoretical underpinning of coupling-based normalizing flows, a particularly effective class of normalizing flows in terms of invertible neural networks. All of the above applications are actually implemented using coupling-based normalizing flows. Their central building blocks are coupling layers, which decompose the space into two subspaces called active and passive subspace (see Section 3). Only the active dimensions are transformed, conditioned on the passive dimensions, which makes the mapping computationally easy to invert. In order to vary the assignment of dimensions to the active and passive subspaces, coupling layers are combined with preceding orthonormal transformation layers into coupling blocks. These blocks are arranged into deep networks such that the orthonormal transformations are sampled uniformly at random from the orthogonal matrices and the coupling layers are trained with the maximum likelihood objective, see Equation (2). Upon convergence of the training, the sequence of coupling blocks gradually transforms the probability density that generated the given training data into a standard normal distribution and vice versa.
Figure 1: (Left) The Maximum Likelihood Loss L (blue) can be split into the non-Gaussianity G (orange) [25] and the non-Standardness S (green) of the latent code z = f_θ(x): L = G + S (Proposition 1). For the latter, we give explicit guarantees as one more coupling block is added in Theorems 1 and 2 and show a global convergence rate in Theorem 3. (Right) Typical fit of EMNIST digits by a standard affine coupling flow for various depths. Our theory (Theorem 1) upper bounds the average S for L + 1 coupling blocks given a trained model with L coupling blocks (dotted green). We observe that our bound is predictive for how much end-to-end training reduces S.
Since the resulting normalizing flows deviate significantly from optimal transport flows [14] and the bulk of the mathematical literature is focusing on optimal transport, an analysis tailored to coupling architectures is lacking. In a landmark paper, [1] proved that sufficiently large affine coupling flows weakly converge to arbitrary data densities. The notion of weak convergence is critical here, as it does not imply convergence in maximum likelihood [15, Remark 3]. Maximum likelihood (or, equivalently, the Kullback-Leibler (KL) divergence) is the loss that is actually used in practice. It can be used for gradient descent and it guarantees not only convergence in samples (“x ∼ q(x) → x ∼ p(x)”) but also in density estimates (“q(x) → p(x)”). It is strong in the sense that the square root of the KL divergence upper bounds (up to a factor 2) the total variation metric, and hence also the Wasserstein metric if the underlying space is bounded [16]. Moreover, convergence under the KL divergence implies weak convergence, which is fundamental for robust statistics [17].
We take a first step towards showing that coupling blocks also converge in terms of maximum likelihood. To the best of our knowledge, our paper presents for the first time a quantitative convergence analysis of coupling-based normalizing flows based on this strong notion of convergence.

Specifically, we make the following contributions towards this goal:
- We utilize that the loss of a normalizing flow can be decomposed into two parts (Figure 1): the divergence to the nearest Gaussian (non-Gaussianity) plus the divergence of that Gaussian to the standard normal (non-Standardness).
- We analyze the effect of a single coupling layer on the non-Standardness in terms of matrix operations (Schur complement and scaling).
- We derive explicit bounds for the non-Standardness after a single coupling block in expectation over all orthonormal transformations.
- We use these results to prove that a sequence of coupling blocks whitens the data covariance and to derive linear convergence rates for this process.
Our results hold for all coupling architectures we are aware of (Appendix C), including: NICE [4], RealNVP [5], and GLOW [6]; Flow++ [18]; nonlinear-squared flow [19]; linear, quadratic [20], cubic [21], and rational quadratic splines [22]; neural autoregressive flows [23], and unconstrained monotonic neural networks [24]. We confirm our theoretical findings experimentally and identify directions for further improvement.
2 Related work
Analyzing which distributions coupling-based normalizing flows can approximate is an active area of research. A general statement shows that a coupling-based normalizing flow which can approximate an arbitrary invertible function can learn any probability density weakly [1]. This applies to affine coupling flows [4, 5, 6], Flow++ [18], neural autoregressive flows [26], and SOS polynomial flows [27]. Affine coupling flows converge to arbitrary densities in Wasserstein distance [15]. Both universality results, however, require that the couplings become ill-conditioned (i.e. the learnt functions become increasingly discontinuous as the error decreases, whereas in practice one observes that functions remain smooth). Also, they consider only a finite subspace of the data space. Even more importantly, the convergence criterion employed in their proofs (weak convergence resp. convergence under the Wasserstein metric) is critical: Those criteria do not imply convergence in the loss that is employed in practice [15, Remark 3], the Kullback-Leibler divergence (equivalent to maximum likelihood). An arbitrarily small distance in any of the above metrics can even result in an infinite KL divergence. In contrast to previous work on affine coupling flows, we work directly on the KL divergence. We decompose it into two contributions and show the flow's convergence for one of the parts.
Regarding when ill-conditioned flows need to arise to fit a distribution, [28] showed that well-conditioned affine couplings can approximate log-concave padded distributions, again in terms of Wasserstein distance. Lipschitz flows on the other hand cannot model arbitrary tail behavior, but this can be fixed by adapting the latent distribution [29].

SOS polynomial flows converge in total variation to arbitrary probability densities [30], which also does not imply convergence in KL divergence; zero-padded affine coupling flows converge weakly [23], and so do Neural ODEs [31, 32].
Closely related to our work, 48 linear affine coupling blocks can represent any invertible linear function Ax + b with det(A) > 0 [15, Theorem 2]. This also allows mapping any Gaussian distribution N(m, Σ) to the standard normal N(0, I). We put this statement into context in terms of the KL divergence: The loss is exactly composed of the divergence to the nearest Gaussian and of that Gaussian to the standard normal. We then make strong statements about the convergence of the latter, concluding that for typical flows a smaller number of layers is required for accurate approximation than predicted by [15].
3 Coupling-based normalizing flows
Normalizing flows learn an invertible function f_θ(x) that maps samples x from some unknown distribution p(x), given by samples, to latent variables z = f_θ(x) so that z follow a simple distribution, typically the standard normal. The function f_θ then yields an estimate q(x) for the true data distribution p(x) via the change of variables formula (e.g. [5]):

q(x) = \mathcal{N}(f_\theta(x); 0, I) \, |\det J|,   (1)
where J = ∇f_θ(x) is the Jacobian of f_θ(x). We can train a normalizing flow via the maximum likelihood loss, which is equivalent to minimizing the Kullback-Leibler divergence between the distribution of the latent code q(z), as given by z = f_θ(x) when x ∼ p(x), and the standard normal:

\mathcal{L} = D_{KL}(q(z) \,\|\, \mathcal{N}(0, I)) = \mathbb{E}_{x \sim p(x)}\left[ \tfrac{1}{2} \|f_\theta(x)\|^2 - \log |\det J| \right] + \text{const}.   (2)
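To make the objective concrete, here is a minimal numpy sketch that estimates the loss in Equation (2) from samples, assuming a toy affine-linear flow f(x) = Ax + b whose Jacobian is the constant matrix A; the flow, the data, and the names flow and nll_loss are illustrative and not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
A = 1.3 * np.linalg.qr(rng.normal(size=(D, D)))[0]  # some fixed invertible matrix
b = rng.normal(size=D)

def flow(x):
    """Toy affine-linear flow z = f(x) = A x + b, applied row-wise."""
    return x @ A.T + b

def nll_loss(x):
    """Monte-Carlo estimate of E[ 0.5*||f(x)||^2 - log|det J| ], up to a constant."""
    z = flow(x)
    log_det_J = np.linalg.slogdet(A)[1]             # Jacobian is constant here
    return np.mean(0.5 * np.sum(z**2, axis=1)) - log_det_J

x = rng.normal(size=(1000, D)) * np.array([2.0, 0.5, 1.0, 1.5])  # toy "data"
print(nll_loss(x))
```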
The invertible architecture that makes up f_θ has to (i) be computationally easy to invert, (ii) be able to represent complex transformations, and (iii) have a tractable Jacobian determinant |det J| [9]. Building such an architecture is an active area of research, see e.g. [2] for a review. In this work, we focus on the family of coupling-based normalizing flows, first presented in the form of the NICE architecture [4]. It is a deep architecture that consists of several blocks, each containing a rotation, a coupling and an ActNorm layer [6]:

f_{\text{block}}(x) = (f_{\text{act}} \circ f_{\text{cpl}} \circ f_{\text{rot}})(x).   (3)
The coupling f_cpl splits an incoming vector x_0 into two parts along the coordinate axis: The first part p_0, which we call passive, is left unchanged. The second part a_0, which we call active, is modified as a function of the passive dimensions:

f_{\text{cpl}}(x_0) = f_{\text{cpl}}\begin{pmatrix} p_0 \\ a_0 \end{pmatrix} = \begin{pmatrix} p_0 \\ c(a_0; p_0) \end{pmatrix} =: \begin{pmatrix} p_1 \\ a_1 \end{pmatrix}.   (4)
Here, the coupling function c : R^{D/2} × R^{D/2} → R^{D/2} has to be a function that is easy to invert when p_0 is given, i.e. it is easy to compute a_0 = c^{-1}(a_1; p_0) given p_0. This makes the coupling easy to invert: Call x_1 = (p_1; a_1) the output of the layer; then p_0 = p_1. Use this to invert a_1 = c(a_0; p_0). For example, RealNVP [5] proposes a simple affine transformation for c: a_1 = c(a_0; p_0) = a_0 ⊙ s(p_0) + t(p_0) (⊙ means element-wise multiplication). s(p_0) ∈ R^{D/2}_+ and t(p_0) ∈ R^{D/2} are computed by a feed-forward neural network. The coupling functions c of other architectures our theory applies to are listed in Appendix C.
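For illustration, the following numpy sketch implements the forward and inverse pass of an affine coupling as in Equation (4) with the RealNVP coupling function; the "networks" s and t are stand-in fixed functions, not trained feed-forward networks, and all helper names are our own.

```python
import numpy as np

def s(p):                                # positive scale "network" (stand-in)
    return np.exp(0.1 * p)

def t(p):                                # shift "network" (stand-in)
    return 0.5 * p

def coupling_forward(x):
    p0, a0 = np.split(x, 2, axis=-1)     # passive / active halves
    a1 = a0 * s(p0) + t(p0)              # only the active half is transformed
    return np.concatenate([p0, a1], axis=-1)

def coupling_inverse(x1):
    p1, a1 = np.split(x1, 2, axis=-1)    # p0 = p1, so s(p0) and t(p0) are known
    a0 = (a1 - t(p1)) / s(p1)
    return np.concatenate([p1, a0], axis=-1)

x = np.random.default_rng(1).normal(size=(3, 6))
assert np.allclose(coupling_inverse(coupling_forward(x)), x)  # invertibility check
```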
An Activation Normalization (ActNorm) layer [6] helps stabilize training and is implemented in practice like in the popular INN framework FrEIA [33]. It rescales and shifts each dimension:

f_{\text{act}}(x) = r \odot x + u,   (5)

given parameters r ∈ R^D_+ and u ∈ R^D. We include it as it simplifies our mathematical arguments.
If we were to concatenate several coupling layers, the entire network would never change the passive dimensions apart from the element-wise affine transformation in the ActNorm layer. Here, the rotation layers f_rot(x) = Qx come into play [6]. They multiply the data by an orthogonal matrix Q, changing which subspaces are passive respectively active. This matrix is typically fixed at random at initialization and then left unchanged during training.
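Putting the three layers together, a minimal sketch of one coupling block as in Equation (3) could look as follows; the random orthogonal Q, the ActNorm parameters r and u, and the toy coupling are illustrative choices, not the trained parameters of an actual model.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 6
Q = np.linalg.qr(rng.normal(size=(D, D)))[0]      # random orthogonal matrix, kept fixed
r = np.abs(rng.normal(size=D)) + 0.1              # ActNorm: element-wise positive scale
u = rng.normal(size=D)                            # ActNorm: element-wise shift

def coupling(x):
    """Toy affine coupling with fixed stand-in s and t."""
    p, a = np.split(x, 2, axis=-1)
    return np.concatenate([p, a * np.exp(0.1 * p) + 0.5 * p], axis=-1)

def block(x):
    x = x @ Q.T                                   # f_rot(x) = Q x
    x = coupling(x)                               # f_cpl
    return r * x + u                              # f_act(x) = r ⊙ x + u

print(block(rng.normal(size=(4, D))).shape)
```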
4 Coupling layers as whitening transformation
The central mathematical question we answer in this work is the following: How can a deep coupling-based normalizing flow whiten the data? As the latent distribution is a standard normal, whitening is a necessary condition for the flow to converge. This is a direct property of the loss:

Proposition 1 (Pythagorean Identity, Proof in Appendix B.1). Given data with distribution p(x) with mean m and covariance Σ. Then, the Kullback-Leibler divergence to a standard normal distribution decomposes as follows:

D_{KL}(p(x) \,\|\, \mathcal{N}(0, I)) = \underbrace{D_{KL}(p(x) \,\|\, \mathcal{N}(m, \Sigma))}_{\text{non-Gaussianity } G(p)} + \underbrace{D_{KL}(\mathcal{N}(m, \Sigma) \,\|\, \mathcal{N}(0, I))}_{\text{non-Standardness } S(p)}   (6)

and the non-Standardness again decomposes:

S(p) = \underbrace{D_{KL}(\mathcal{N}(m, \Sigma) \,\|\, \mathcal{N}(m, \mathrm{Diag}(\Sigma)))}_{\text{Correlation } C(p)} + \underbrace{D_{KL}(\mathcal{N}(m, \mathrm{Diag}(\Sigma)) \,\|\, \mathcal{N}(0, I))}_{\text{Diagonal non-Standardness}}.   (7)
This splits the transport from the data distribution to the latent standard normal into three parts: (i) From the data to the nearest Gaussian distribution N(m, Σ), measured by G. (ii) From that nearest Gaussian to the corresponding uncorrelated Gaussian N(m, Diag(Σ)), measured by C. (iii) From the uncorrelated Gaussian to the standard normal.

We do not make explicit use of the fact that the non-Standardness can again be decomposed, but we show it nevertheless to relate our result to the literature: The Pythagorean identity D_KL(p(x) ‖ N(m, Diag(Σ))) = G(p) + C(p) has been shown before by [25, Section 2.3]. Both their and our result are specific applications of the general [34, Theorem 3.8] from information geometry. Our proof is given in Appendix B.1.
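As a sanity check (ours, not part of the paper), the decomposition of the non-Standardness in Equation (7) can be verified numerically with the closed-form KL divergence between Gaussians; the helper gauss_kl and the random test covariance below are illustrative.

```python
import numpy as np

def gauss_kl(m0, S0, m1, S1):
    """Closed-form KL( N(m0, S0) || N(m1, S1) )."""
    D = len(m0)
    S1_inv = np.linalg.inv(S1)
    d = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - D
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

rng = np.random.default_rng(3)
D = 5
m = rng.normal(size=D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + np.eye(D)                       # some SPD covariance

S_total = gauss_kl(m, Sigma, np.zeros(D), np.eye(D))                     # S(p)
C = gauss_kl(m, Sigma, m, np.diag(np.diag(Sigma)))                       # Correlation C(p)
diag_part = gauss_kl(m, np.diag(np.diag(Sigma)), np.zeros(D), np.eye(D)) # diagonal part
assert np.isclose(S_total, C + diag_part)                                # Equation (7)
```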
Proposition 1 is visualized in Figure 1. In an experiment, we fit a set of Glow [6] coupling flows of increasing depths to the EMNIST digit dataset [35] using maximum likelihood loss and measure the capability of each flow in decreasing G and S (Details in Appendix A.1). The form of the non-Standardness S is given by the well-known KL divergence between the involved normal distributions, see Equation (30) in Appendix B.1. It is invariant under rotations Q and only depends on the first two moments m, Σ:

S(m, \Sigma) := S(p) = \tfrac{1}{2}\left(\|m\|^2 + \operatorname{tr} \Sigma - D - \log \det \Sigma\right) = S(Qm, Q \Sigma Q^T).   (8)
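A direct implementation of Equation (8), together with a numerical check of its rotation invariance, might look as follows; the function name non_standardness and the random test case are our own.

```python
import numpy as np

def non_standardness(m, Sigma):
    """S(m, Sigma) = 0.5 * (||m||^2 + tr(Sigma) - D - log det Sigma), Equation (8)."""
    D = len(m)
    return 0.5 * (m @ m + np.trace(Sigma) - D - np.linalg.slogdet(Sigma)[1])

rng = np.random.default_rng(4)
D = 5
m = rng.normal(size=D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + 0.5 * np.eye(D)                  # some SPD covariance
Q = np.linalg.qr(rng.normal(size=(D, D)))[0]       # random rotation

# Rotation invariance: S(Qm, Q Sigma Q^T) = S(m, Sigma)
assert np.isclose(non_standardness(m, Sigma),
                  non_standardness(Q @ m, Q @ Sigma @ Q.T))
```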
The non-Standardness S will be our measure of how far the covariance and mean have approached the standard normal in the latent space. We give explicit loss guarantees for S for a single coupling block in Theorems 1 and 2 and imply a linear convergence rate for a deep network in Theorem 3.

Deep Normalizing Flows are typically trained end-to-end, i.e. the entire stack of blocks is trained jointly. In this work, our ansatz is to consider the effect of a single coupling block on the non-Standardness S. Then, we combine the effect of many isolated blocks, disregarding potential further improvements to S due to joint, cooperative learning of all blocks. This simplifies the theoretical analysis of the network, but it is not a restriction on the model: Any function that is achieved in block-wise training could also be the solution of end-to-end training.
We aim to strongly reduce S while leaving room for a complementary theory explaining how the non-Gaussianity G is reduced in practice. Note that affine-linear functions Ax + b can never change G, because they jointly transform the distribution p(x) at hand and correspondingly the closest Gaussian to it (see Lemma 1 in Appendix B.2). Thus, if we restrict our coupling layers to be affine-linear functions, we are able to reduce S without increasing G in turn. This motivates considering affine-linear couplings of the following form, spelled out together with ActNorm as given by Equation (5). The results in this work apply to all coupling architectures, as they all can represent this coupling, see Appendix C.

\begin{pmatrix} p_1 \\ a_1 \end{pmatrix} = (f_{\text{act}} \circ f_{\text{cpl}})(Qx) = r \odot \left( \begin{pmatrix} I & 0 \\ T & I \end{pmatrix} \begin{pmatrix} p_0 \\ a_0 \end{pmatrix} \right) + u.   (9)
For future work considering G, we propose to lift the restriction to affine-linear layers while making sure that S behaves as described in what follows. As the convergence of G however will strongly depend on the coupling architecture and data p(x) at hand, this is beyond the scope of this work.
Our first result shows which mean m_1 and covariance Σ_1 a single affine-linear coupling as in Equation (9) yields to minimize S(m_1, Σ_1), given data with mean m and covariance Σ, rotated by Q:
Proposition 2 (Proof in Appendix B.2). Given D-dimensional data with mean m and covariance Σ and a rotation matrix Q. Split the covariance of the rotated data into four blocks, corresponding to the passive and active dimensions of the coupling layer:

Q \Sigma Q^T = \Sigma_0 = \begin{pmatrix} \Sigma_{0,pp} & \Sigma_{0,pa} \\ \Sigma_{0,ap} & \Sigma_{0,aa} \end{pmatrix}.   (10)

Then, the moments m_1, Σ_1 that can be reached by a coupling as in Equation (9) are:

m_1 = 0, \qquad \Sigma_1 = \begin{pmatrix} M(\Sigma_{0,pp}) & 0 \\ 0 & M(\Sigma_{0,aa} - \Sigma_{0,ap} \Sigma_{0,pp}^{-1} \Sigma_{0,pa}) \end{pmatrix}.   (11)

This minimizes S as given in Equation (8), and G does not increase.
The function M takes a matrix A and rescales its diagonal to 1 as follows. It is a well-known operation in numerics, called diagonal scaling or Jacobi preconditioning, so that M(A)_ii = 1:

M(A)_{ij} = \left(A_{ii} A_{jj}\right)^{-1/2} A_{ij} = \left(\mathrm{Diag}(A)^{-1/2} \, A \, \mathrm{Diag}(A)^{-1/2}\right)_{ij}.   (12)
Proposition 2 shows how the covariance can be brought closer to the identity. The new covariance has passive and active dimensions uncorrelated. In the active subspace, the covariance is the Schur complement Σ_{0,aa} − Σ_{0,ap} Σ_{0,pp}^{-1} Σ_{0,pa}. This coincides with the covariance of the Gaussian N(0, Σ_0) when it is conditioned on any passive value p. Afterwards, the diagonal is rescaled to one, matching the standard deviations of all dimensions with the desired latent code. The proof is based on a more general result on how a single layer maximally reduces the Maximum Likelihood Loss for arbitrary data [14], which we apply to the non-Standardness S (see Appendix B.2).
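The covariance update of Proposition 2 can be sketched in a few lines of numpy: rotate, split into blocks, take the Schur complement in the active block, and rescale both diagonals with M from Equation (12). The helper names are illustrative, and the final assertion only checks that the non-Standardness does not increase on this random example.

```python
import numpy as np

def M(A):
    """Diagonal scaling / Jacobi preconditioning, Equation (12)."""
    d = 1.0 / np.sqrt(np.diag(A))
    return A * d[:, None] * d[None, :]

def block_covariance(Sigma, Q):
    """Covariance reached by one affine-linear coupling block, Equation (11)."""
    Sigma0 = Q @ Sigma @ Q.T
    k = Sigma.shape[0] // 2                          # passive / active split
    Spp, Spa = Sigma0[:k, :k], Sigma0[:k, k:]
    Sap, Saa = Sigma0[k:, :k], Sigma0[k:, k:]
    schur = Saa - Sap @ np.linalg.inv(Spp) @ Spa     # conditional (Schur) covariance
    Sigma1 = np.zeros_like(Sigma0)
    Sigma1[:k, :k] = M(Spp)
    Sigma1[k:, k:] = M(schur)
    return Sigma1                                    # m_1 = 0

def non_standardness(Sigma):
    """S(0, Sigma) from Equation (8)."""
    D = Sigma.shape[0]
    return 0.5 * (np.trace(Sigma) - D - np.linalg.slogdet(Sigma)[1])

rng = np.random.default_rng(5)
D = 6
A = rng.normal(size=(D, D))
Sigma = A @ A.T + 0.1 * np.eye(D)
Q = np.linalg.qr(rng.normal(size=(D, D)))[0]

S_before = non_standardness(Sigma)
S_after = non_standardness(block_covariance(Sigma, Q))
assert S_after <= S_before + 1e-9                    # one block never increases S here
print(S_before, "->", S_after)
```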
Figure 2 shows an experiment in which a single affine-linear layer was trained to bring the covariance of EMNIST digits [35] as close to I as possible (Details in Appendix A.2). The experimental result coincides with the prediction by Proposition 2. Due to the finite batch size, a small difference between theory and experiment remains.
5 Explicit convergence rate
In Section 4, we showed how a single coupling layer acts on the first two moments of a given data
distribution to whiten it. We now explicitly demonstrate how much progress this means in terms