Whitening Convergence Rate of
Coupling-based Normalizing Flows
Felix Draxler
Heidelberg University
felix.draxler@iwr.uni-heidelberg.de
Christoph Schnörr
Heidelberg University
schnoerr@math.uni-heidelberg.de
Ullrich Köthe
Heidelberg University
ullrich.koethe@iwr.uni-heidelberg.de
Abstract
Coupling-based normalizing flows (e.g. RealNVP) are a popular family of normalizing flow architectures that work surprisingly well in practice. This calls for theoretical understanding. Existing work shows that such flows weakly converge to arbitrary data distributions [1]. However, they make no statement about the stricter convergence criterion used in practice, the maximum likelihood loss. For the first time, we make a quantitative statement about this kind of convergence: We prove that all coupling-based normalizing flows perform whitening of the data distribution (i.e. diagonalize the covariance matrix) and derive corresponding convergence bounds that show a linear convergence rate in the depth of the flow. Numerical experiments demonstrate the implications of our theory and point at open questions.
1 Introduction
Normalizing flows [2, 3] are among the most promising approaches to generative machine learning and have already demonstrated convincing performance in a wide variety of practical applications, ranging from image analysis [4, 5, 6, 7, 8] to astrophysics [9], mechanical engineering [10], causality [11], computational biology [12] and medicine [13]. As the name suggests, normalizing flows represent complex data distributions as bijective transformations (also known as flows or push-forwards) of standard normal or other well-understood distributions.
In this paper, we focus on a theoretical underpinning of coupling-based normalizing flows, a particularly effective class of normalizing flows in terms of invertible neural networks. All of the above applications are actually implemented using coupling-based normalizing flows. Their central building blocks are coupling layers, which decompose the space into two subspaces called active and passive subspace (see Section 3). Only the active dimensions are transformed, conditioned on the passive dimensions, which makes the mapping computationally easy to invert. In order to vary the assignment of dimensions to the active and passive subspaces, coupling layers are combined with preceding orthonormal transformation layers into coupling blocks. These blocks are arranged into deep networks such that the orthonormal transformations are sampled uniformly at random from the orthogonal matrices and the coupling layers are trained with the maximum likelihood objective, see Equation (2). Upon convergence of the training, the sequence of coupling blocks gradually transforms the probability density that generated the given training data into a standard normal distribution and vice versa.
Figure 1: (Left) The Maximum Likelihood Loss L (blue) can be split into the non-Gaussianity G (orange) [25] and the non-Standardness S (green) of the latent code z = f_θ(x): L = G + S (Proposition 1). For the latter, we give explicit guarantees as one more coupling block is added in Theorems 1 and 2 and show a global convergence rate in Theorem 3. (Right) Typical fit of EMNIST digits by a standard affine coupling flow for various depths. Our theory (Theorem 1) upper bounds the average S for L + 1 coupling blocks given a trained model with L coupling blocks (dotted green). We observe that our bound is predictive for how much end-to-end training reduces S.
Since the resulting normalizing flows deviate significantly from optimal transport flows [14] and the bulk of the mathematical literature is focusing on optimal transport, an analysis tailored to coupling architectures is lacking. In a landmark paper, [1] proved that sufficiently large affine coupling flows weakly converge to arbitrary data densities. The notion of weak convergence is critical here, as it does not imply convergence in maximum likelihood [15, Remark 3]. Maximum likelihood (or, equivalently, the Kullback-Leibler (KL) divergence) is the loss that is actually used in practice. It can be used for gradient descent and it guarantees not only convergence in samples (“x ∼ q(x) → x ∼ p(x)”) but also in density estimates (“q(x) → p(x)”). It is strong in the sense that the square root of the KL divergence upper bounds (up to a factor 2) the total variation metric, and hence also the Wasserstein metric if the underlying space is bounded [16]. Moreover, convergence under the KL divergence implies weak convergence, which is fundamental for robust statistics [17].
We take a first step towards showing that coupling blocks also converge in terms of maximum likelihood. To the best of our knowledge, our paper presents for the first time a quantitative convergence analysis of coupling-based normalizing flows based on this strong notion of convergence.

Specifically, we make the following contributions towards this goal:
- We utilize that the loss of a normalizing flow can be decomposed into two parts (Figure 1): the divergence to the nearest Gaussian (non-Gaussianity) plus the divergence of that Gaussian to the standard normal (non-Standardness).
- We analyze the effect of a single coupling layer on the non-Standardness in terms of matrix operations (Schur complement and scaling).
- We derive explicit bounds for the non-Standardness after a single coupling block in expectation over all orthonormal transformations.
- We use these results to prove that a sequence of coupling blocks whitens the data covariance and to derive linear convergence rates for this process.
Our results hold for all coupling architectures we are aware of (Appendix C), including: NICE [4], RealNVP [5], and GLOW [6]; Flow++ [18]; nonlinear-squared flow [19]; linear, quadratic [20], cubic [21], and rational quadratic splines [22]; neural autoregressive flows [23], and unconstrained monotonic neural networks [24]. We confirm our theoretical findings experimentally and identify directions for further improvement.
2 Related work
Analyzing which distributions coupling-based normalizing flows can approximate is an active area of research. A general statement shows that a coupling-based normalizing flow which can approximate an arbitrary invertible function can learn any probability density weakly [1]. This applies to affine coupling flows [4, 5, 6], Flow++ [18], neural autoregressive flows [26], and SOS polynomial flows [27]. Affine coupling flows converge to arbitrary densities in Wasserstein distance [15]. Both universality results, however, require that the couplings become ill-conditioned (i.e. the learnt functions become increasingly discontinuous as the error decreases, whereas in practice one observes that functions remain smooth). Also, they consider only a finite subspace of the data space. Even more importantly, the convergence criterion employed in their proofs (weak convergence resp. convergence under the Wasserstein metric) is critical: Those criteria do not imply convergence in the loss that is employed in practice [15, Remark 3], the Kullback-Leibler divergence (equivalent to maximum likelihood). An arbitrarily small distance in any of the above metrics can even result in an infinite KL divergence. In contrast to previous work on affine coupling flows, we work directly on the KL divergence. We decompose it into two contributions and show the flow's convergence for one of the parts.
Regarding when ill-conditioned flows need to arise to fit a distribution, [28] showed that well-conditioned affine couplings can approximate log-concave padded distributions, again in terms of Wasserstein distance. Lipschitz flows on the other hand cannot model arbitrary tail behavior, but this can be fixed by adapting the latent distribution [29].

SOS polynomial flows converge in total variation to arbitrary probability densities [30], which also does not imply convergence in KL divergence; zero-padded affine coupling flows converge weakly [23], and so do Neural ODEs [31, 32].
Closely related to our work, 48 linear affine coupling blocks can represent any invertible linear function Ax + b with det(A) > 0 [15, Theorem 2]. This also allows mapping any Gaussian distribution N(m, Σ) to the standard normal N(0, I). We put this statement into context in terms of the KL divergence: The loss is exactly composed of the divergence to the nearest Gaussian and of that Gaussian to the standard normal. We then make strong statements about the convergence of the latter, concluding that for typical flows a smaller number of layers is required for accurate approximation than predicted by [15].
3 Coupling-based normalizing flows
Normalizing flows learn an invertible function f_θ(x) that maps samples x from some unknown distribution p(x), given by samples, to latent variables z = f_θ(x) so that z follow a simple distribution, typically the standard normal. The function f_θ then yields an estimate q(x) for the true data distribution p(x) via the change of variables formula (e.g. [5]):

q(x) = \mathcal{N}(f_\theta(x); 0, I) \, |\det J|,   (1)
where J = ∇f_θ(x) is the Jacobian of f_θ(x). We can train a normalizing flow via the maximum likelihood loss, which is equivalent to minimizing the Kullback-Leibler divergence between the distribution of the latent code q(z), as given by z = f_θ(x) when x ∼ p(x), and the standard normal:

\mathcal{L} = D_{KL}(q(z) \,\|\, \mathcal{N}(0, I)) = \mathbb{E}_{x \sim p(x)}\left[ \tfrac{1}{2} \|f_\theta(x)\|^2 - \log |\det J| \right] + \text{const}.   (2)
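To make the objective concrete, here is a minimal numpy sketch that estimates the loss in Equation (2) from samples, assuming a toy affine-linear flow f(x) = Ax + b whose Jacobian is the constant matrix A; the flow, the data, and the names flow and nll_loss are illustrative and not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
A = 1.3 * np.linalg.qr(rng.normal(size=(D, D)))[0]  # some fixed invertible matrix
b = rng.normal(size=D)

def flow(x):
    """Toy affine-linear flow z = f(x) = A x + b, applied row-wise."""
    return x @ A.T + b

def nll_loss(x):
    """Monte-Carlo estimate of E[ 0.5*||f(x)||^2 - log|det J| ], up to a constant."""
    z = flow(x)
    log_det_J = np.linalg.slogdet(A)[1]             # Jacobian is constant here
    return np.mean(0.5 * np.sum(z**2, axis=1)) - log_det_J

x = rng.normal(size=(1000, D)) * np.array([2.0, 0.5, 1.0, 1.5])  # toy "data"
print(nll_loss(x))
```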
The invertible architecture that makes up f_θ has to (i) be computationally easy to invert, (ii) be able to represent complex transformations, and (iii) have a tractable Jacobian determinant |det J| [9]. Building such an architecture is an active area of research, see e.g. [2] for a review. In this work, we focus on the family of coupling-based normalizing flows, first presented in the form of the NICE architecture [4]. It is a deep architecture that consists of several blocks, each containing a rotation, a coupling and an ActNorm layer [6]:

f_{\text{block}}(x) = (f_{\text{act}} \circ f_{\text{cpl}} \circ f_{\text{rot}})(x).   (3)
The coupling f_cpl splits an incoming vector x_0 into two parts along the coordinate axis: The first part p_0, which we call passive, is left unchanged. The second part a_0, which we call active, is modified as a function of the passive dimensions:

f_{\text{cpl}}(x_0) = f_{\text{cpl}}\begin{pmatrix} p_0 \\ a_0 \end{pmatrix} = \begin{pmatrix} p_0 \\ c(a_0; p_0) \end{pmatrix} =: \begin{pmatrix} p_1 \\ a_1 \end{pmatrix}.   (4)
Here, the coupling function c : R^{D/2} × R^{D/2} → R^{D/2} has to be a function that is easy to invert when p_0 is given, i.e. it is easy to compute a_0 = c^{-1}(a_1; p_0) given p_0. This makes the coupling easy to invert: Call x_1 = (p_1; a_1) the output of the layer; then p_0 = p_1. Use this to invert a_1 = c(a_0; p_0). For example, RealNVP [5] proposes a simple affine transformation for c: a_1 = c(a_0; p_0) = a_0 ⊙ s(p_0) + t(p_0) (⊙ means element-wise multiplication). s(p_0) ∈ R^{D/2}_+ and t(p_0) ∈ R^{D/2} are computed by a feed-forward neural network. The coupling functions c of other architectures our theory applies to are listed in Appendix C.
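For illustration, the following numpy sketch implements the forward and inverse pass of an affine coupling as in Equation (4) with the RealNVP coupling function; the "networks" s and t are stand-in fixed functions, not trained feed-forward networks, and all helper names are our own.

```python
import numpy as np

def s(p):                                # positive scale "network" (stand-in)
    return np.exp(0.1 * p)

def t(p):                                # shift "network" (stand-in)
    return 0.5 * p

def coupling_forward(x):
    p0, a0 = np.split(x, 2, axis=-1)     # passive / active halves
    a1 = a0 * s(p0) + t(p0)              # only the active half is transformed
    return np.concatenate([p0, a1], axis=-1)

def coupling_inverse(x1):
    p1, a1 = np.split(x1, 2, axis=-1)    # p0 = p1, so s(p0) and t(p0) are known
    a0 = (a1 - t(p1)) / s(p1)
    return np.concatenate([p1, a0], axis=-1)

x = np.random.default_rng(1).normal(size=(3, 6))
assert np.allclose(coupling_inverse(coupling_forward(x)), x)  # invertibility check
```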
An Activation Normalization (ActNorm) layer [6] helps stabilize training and is implemented in practice like in the popular INN framework FrEIA [33]. It rescales and shifts each dimension:

f_{\text{act}}(x) = r \odot x + u,   (5)

given parameters r ∈ R^D_+ and u ∈ R^D. We include it as it simplifies our mathematical arguments.
If we were to concatenate several coupling layers, the entire network would never change the passive dimensions apart from the element-wise affine transformation in the ActNorm layer. Here, the rotation layers f_rot(x) = Qx come into play [6]. They multiply the data by an orthogonal matrix Q, changing which subspaces are passive respectively active. This matrix is typically fixed at random at initialization and then left unchanged during training.
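Putting the three layers together, a minimal sketch of one coupling block as in Equation (3) could look as follows; the random orthogonal Q, the ActNorm parameters r and u, and the toy coupling are illustrative choices, not the trained parameters of an actual model.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 6
Q = np.linalg.qr(rng.normal(size=(D, D)))[0]      # random orthogonal matrix, kept fixed
r = np.abs(rng.normal(size=D)) + 0.1              # ActNorm: element-wise positive scale
u = rng.normal(size=D)                            # ActNorm: element-wise shift

def coupling(x):
    """Toy affine coupling with fixed stand-in s and t."""
    p, a = np.split(x, 2, axis=-1)
    return np.concatenate([p, a * np.exp(0.1 * p) + 0.5 * p], axis=-1)

def block(x):
    x = x @ Q.T                                   # f_rot(x) = Q x
    x = coupling(x)                               # f_cpl
    return r * x + u                              # f_act(x) = r ⊙ x + u

print(block(rng.normal(size=(4, D))).shape)
```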
4 Coupling layers as whitening transformation
The central mathematical question we answer in this work is the following: How can a deep coupling-based normalizing flow whiten the data? As the latent distribution is a standard normal, whitening is a necessary condition for the flow to converge. This is a direct property of the loss:

Proposition 1 (Pythagorean Identity, Proof in Appendix B.1). Given data with distribution p(x) with mean m and covariance Σ. Then, the Kullback-Leibler divergence to a standard normal distribution decomposes as follows:

D_{KL}(p(x) \,\|\, \mathcal{N}(0, I)) = \underbrace{D_{KL}(p(x) \,\|\, \mathcal{N}(m, \Sigma))}_{\text{non-Gaussianity } G(p)} + \underbrace{D_{KL}(\mathcal{N}(m, \Sigma) \,\|\, \mathcal{N}(0, I))}_{\text{non-Standardness } S(p)}   (6)

and the non-Standardness again decomposes:

S(p) = \underbrace{D_{KL}(\mathcal{N}(m, \Sigma) \,\|\, \mathcal{N}(m, \mathrm{Diag}(\Sigma)))}_{\text{Correlation } C(p)} + \underbrace{D_{KL}(\mathcal{N}(m, \mathrm{Diag}(\Sigma)) \,\|\, \mathcal{N}(0, I))}_{\text{Diagonal non-Standardness}}.   (7)
This splits the transport from the data distribution to the latent standard normal into three parts: (i) From the data to the nearest Gaussian distribution N(m, Σ), measured by G. (ii) From that nearest Gaussian to the corresponding uncorrelated Gaussian N(m, Diag(Σ)), measured by C. (iii) From the uncorrelated Gaussian to the standard normal.

We do not make explicit use of the fact that the non-Standardness can again be decomposed, but we show it nevertheless to relate our result to the literature: The Pythagorean identity D_KL(p(x) ‖ N(m, Diag(Σ))) = G(p) + C(p) has been shown before by [25, Section 2.3]. Both their and our result are specific applications of the general [34, Theorem 3.8] from information geometry. Our proof is given in Appendix B.1.
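As a sanity check (ours, not part of the paper), the decomposition of the non-Standardness in Equation (7) can be verified numerically with the closed-form KL divergence between Gaussians; the helper gauss_kl and the random test covariance below are illustrative.

```python
import numpy as np

def gauss_kl(m0, S0, m1, S1):
    """Closed-form KL( N(m0, S0) || N(m1, S1) )."""
    D = len(m0)
    S1_inv = np.linalg.inv(S1)
    d = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - D
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

rng = np.random.default_rng(3)
D = 5
m = rng.normal(size=D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + np.eye(D)                       # some SPD covariance

S_total = gauss_kl(m, Sigma, np.zeros(D), np.eye(D))                     # S(p)
C = gauss_kl(m, Sigma, m, np.diag(np.diag(Sigma)))                       # Correlation C(p)
diag_part = gauss_kl(m, np.diag(np.diag(Sigma)), np.zeros(D), np.eye(D)) # diagonal part
assert np.isclose(S_total, C + diag_part)                                # Equation (7)
```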
Proposition 1 is visualized in Figure 1. In an experiment, we fit a set of Glow [6] coupling flows of increasing depths to the EMNIST digit dataset [35] using maximum likelihood loss and measure the capability of each flow in decreasing G and S (Details in Appendix A.1). The form of the non-Standardness S is given by the well-known KL divergence between the involved normal distributions, see Equation (30) in Appendix B.1. It is invariant under rotations Q and only depends on the first two moments m, Σ:

S(m, \Sigma) := S(p) = \tfrac{1}{2}\left(\|m\|^2 + \operatorname{tr} \Sigma - D - \log \det \Sigma\right) = S(Qm, Q \Sigma Q^T).   (8)
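A direct implementation of Equation (8), together with a numerical check of its rotation invariance, might look as follows; the function name non_standardness and the random test case are our own.

```python
import numpy as np

def non_standardness(m, Sigma):
    """S(m, Sigma) = 0.5 * (||m||^2 + tr(Sigma) - D - log det Sigma), Equation (8)."""
    D = len(m)
    return 0.5 * (m @ m + np.trace(Sigma) - D - np.linalg.slogdet(Sigma)[1])

rng = np.random.default_rng(4)
D = 5
m = rng.normal(size=D)
A = rng.normal(size=(D, D))
Sigma = A @ A.T + 0.5 * np.eye(D)                  # some SPD covariance
Q = np.linalg.qr(rng.normal(size=(D, D)))[0]       # random rotation

# Rotation invariance: S(Qm, Q Sigma Q^T) = S(m, Sigma)
assert np.isclose(non_standardness(m, Sigma),
                  non_standardness(Q @ m, Q @ Sigma @ Q.T))
```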
The non-Standardness S will be our measure of how far the covariance and mean have approached the standard normal in the latent space. We give explicit loss guarantees for S for a single coupling block in Theorems 1 and 2 and imply a linear convergence rate for a deep network in Theorem 3.

Deep Normalizing Flows are typically trained end-to-end, i.e. the entire stack of blocks is trained jointly. In this work, our ansatz is to consider the effect of a single coupling block on the non-Standardness S. Then, we combine the effect of many isolated blocks, disregarding potential further improvements to S due to joint, cooperative learning of all blocks. This simplifies the theoretical analysis of the network, but it is not a restriction on the model: Any function that is achieved in block-wise training could also be the solution of end-to-end training.
We aim to strongly reduce S while leaving room for a complementary theory explaining how the non-Gaussianity G is reduced in practice. Note that affine-linear functions Ax + b can never change G, because they jointly transform the distribution p(x) at hand and correspondingly the closest Gaussian to it (see Lemma 1 in Appendix B.2). Thus, if we restrict our coupling layers to be affine-linear functions, we are able to reduce S without increasing G in turn. This motivates considering affine-linear couplings of the following form, spelled out together with ActNorm as given by Equation (5). The results in this work apply to all coupling architectures, as they all can represent this coupling, see Appendix C.

\begin{pmatrix} p_1 \\ a_1 \end{pmatrix} = (f_{\text{act}} \circ f_{\text{cpl}})(Qx) = r \odot \left( \begin{pmatrix} I & 0 \\ T & I \end{pmatrix} \begin{pmatrix} p_0 \\ a_0 \end{pmatrix} \right) + u.   (9)
For future work considering G, we propose to lift the restriction to affine-linear layers while making sure that S behaves as described in what follows. As the convergence of G however will strongly depend on the coupling architecture and data p(x) at hand, this is beyond the scope of this work.
Our first result shows which mean m_1 and covariance Σ_1 a single affine-linear coupling as in Equation (9) yields to minimize S(m_1, Σ_1), given data with mean m and covariance Σ, rotated by Q:
Proposition 2 (Proof in Appendix B.2). Given D-dimensional data with mean m and covariance Σ and a rotation matrix Q. Split the covariance of the rotated data into four blocks, corresponding to the passive and active dimensions of the coupling layer:

Q \Sigma Q^T = \Sigma_0 = \begin{pmatrix} \Sigma_{0,pp} & \Sigma_{0,pa} \\ \Sigma_{0,ap} & \Sigma_{0,aa} \end{pmatrix}.   (10)

Then, the moments m_1, Σ_1 that can be reached by a coupling as in Equation (9) are:

m_1 = 0, \qquad \Sigma_1 = \begin{pmatrix} M(\Sigma_{0,pp}) & 0 \\ 0 & M(\Sigma_{0,aa} - \Sigma_{0,ap} \Sigma_{0,pp}^{-1} \Sigma_{0,pa}) \end{pmatrix}.   (11)

This minimizes S as given in Equation (8), and G does not increase.
The function M takes a matrix A and rescales its diagonal to 1 as follows. It is a well-known operation in numerics, called diagonal scaling or Jacobi preconditioning, so that M(A)_ii = 1:

M(A)_{ij} = \left(A_{ii} A_{jj}\right)^{-1/2} A_{ij} = \left(\mathrm{Diag}(A)^{-1/2} \, A \, \mathrm{Diag}(A)^{-1/2}\right)_{ij}.   (12)
Proposition 2 shows how the covariance can be brought closer to the identity. The new covariance has passive and active dimensions uncorrelated. In the active subspace, the covariance is the Schur complement Σ_{0,aa} − Σ_{0,ap} Σ_{0,pp}^{-1} Σ_{0,pa}. This coincides with the covariance of the Gaussian N(0, Σ_0) when it is conditioned on any passive value p. Afterwards, the diagonal is rescaled to one, matching the standard deviations of all dimensions with the desired latent code. The proof is based on a more general result on how a single layer maximally reduces the Maximum Likelihood Loss for arbitrary data [14], which we apply to the non-Standardness S (see Appendix B.2).
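The covariance update of Proposition 2 can be sketched in a few lines of numpy: rotate, split into blocks, take the Schur complement in the active block, and rescale both diagonals with M from Equation (12). The helper names are illustrative, and the final assertion only checks that the non-Standardness does not increase on this random example.

```python
import numpy as np

def M(A):
    """Diagonal scaling / Jacobi preconditioning, Equation (12)."""
    d = 1.0 / np.sqrt(np.diag(A))
    return A * d[:, None] * d[None, :]

def block_covariance(Sigma, Q):
    """Covariance reached by one affine-linear coupling block, Equation (11)."""
    Sigma0 = Q @ Sigma @ Q.T
    k = Sigma.shape[0] // 2                          # passive / active split
    Spp, Spa = Sigma0[:k, :k], Sigma0[:k, k:]
    Sap, Saa = Sigma0[k:, :k], Sigma0[k:, k:]
    schur = Saa - Sap @ np.linalg.inv(Spp) @ Spa     # conditional (Schur) covariance
    Sigma1 = np.zeros_like(Sigma0)
    Sigma1[:k, :k] = M(Spp)
    Sigma1[k:, k:] = M(schur)
    return Sigma1                                    # m_1 = 0

def non_standardness(Sigma):
    """S(0, Sigma) from Equation (8)."""
    D = Sigma.shape[0]
    return 0.5 * (np.trace(Sigma) - D - np.linalg.slogdet(Sigma)[1])

rng = np.random.default_rng(5)
D = 6
A = rng.normal(size=(D, D))
Sigma = A @ A.T + 0.1 * np.eye(D)
Q = np.linalg.qr(rng.normal(size=(D, D)))[0]

S_before = non_standardness(Sigma)
S_after = non_standardness(block_covariance(Sigma, Q))
assert S_after <= S_before + 1e-9                    # one block never increases S here
print(S_before, "->", S_after)
```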
Figure 2 shows an experiment in which a single affine-linear layer was trained to bring the covariance of EMNIST digits [35] as close to I as possible (Details in Appendix A.2). The experimental result coincides with the prediction by Proposition 2. Due to the finite batch size, a small difference between theory and experiment remains.
5 Explicit convergence rate
In Section 4, we showed how a single coupling layer acts on the first two moments of a given data
distribution to whiten it. We now explicitly demonstrate how much progress this means in terms