Statistical Efficiency of Score Matching:
The View from Isoperimetry
Frederic Koehler*, Alexander Heckett†, Andrej Risteski‡
December 26, 2022
Abstract
Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method whereby, instead of fitting the likelihood log p(x) for the training data, we instead fit the score function ∇_x log p(x) — obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it is unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood — which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper, and show a tight connection between the statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated — i.e. the Poincaré, log-Sobolev and isoperimetric constants — quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant — even for simple families of distributions like exponential families with rich enough sufficient statistics — score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite-sample regime and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.
1 Introduction
Energy-based models (EBMs) are deep generative models parametrized up to a constant of proportionality, namely p(x) ∝ exp(f(x)). The primary training challenge is the fact that evaluating the likelihood (and gradients thereof) requires evaluating the partition function of the model, which is generally computationally intractable — even when using relatively sophisticated MCMC techniques. Recent works, including the seminal paper of Song and Ermon [2019], circumvent this difficulty by instead fitting the score function of the model, that is ∇_x log p(x). Though it is not obvious how to evaluate this loss from training samples only, Hyvärinen [2005] showed this can be done via integration by parts, and the resulting estimator is consistent (that is, it converges to the correct value in the limit of infinite samples).
The maximum likelihood estimator is the de-facto choice for model-fitting for its well-known property of being
statistically optimal in the limit where the number of samples goes to infinity [Van der Vaart, 2000]. It is unclear how
much worse score matching can be — thus, it’s unclear how much statistical efficiency we sacrifice for the algorithmic
convenience of avoiding partition functions. In the seminal paper [Song and Ermon, 2019], it was conjectured that
multimodality, as well as low-dimensional manifold structure, may cause difficulties for score matching — which was
the reason the authors proposed annealing by convolving the input samples with a sequence of Gaussians with different
variance. Though the intuition for this is natural: having poor estimates for the score in “low probability” regions of the distribution can “propagate” into bad estimates for the likelihood once the score vector field is “integrated” — making this formal seems challenging.

*fkoehler@stanford.edu, Stanford University. Supported in part by NSF award CCF-1704417, NSF award IIS-1908774, and N. Anari’s Sloan Research Fellowship.
†aheckett@andrew.cmu.edu, Carnegie Mellon University.
‡aristesk@andrew.cmu.edu, Carnegie Mellon University. Supported in part by NSF award IIS-2211907 and an Amazon Research Award on “Causal + Deep Out-of-Distribution Learning”.
We show that the right mathematical tools to formalize, and substantially generalize, such intuitions are functional-analytic tools that characterize isoperimetric properties of the distribution in question. Namely, we show that three quantities, the Poincaré, log-Sobolev and isoperimetric constants (which are all in turn very closely related, see Section 2), tightly characterize how much worse the efficiency of score matching is compared to maximum likelihood. These quantities can be (equivalently) viewed as: (1) characterizing the mixing time of Langevin dynamics — a stochastic differential equation used to sample from a distribution p(x) ∝ exp(f(x)), given access to a gradient oracle for f; (2) characterizing “sparse cuts” in the distribution: that is, sets S for which the surface area of the set S can be much smaller than the volume of S. Notably, multimodal distributions with well-separated, deep modes have very big log-Sobolev/Poincaré/isoperimetric constants [Gayrard et al., 2004, 2005], as do distributions supported over manifolds with negative curvature [Hsu, 2002] (like hyperbolic manifolds). Since it is commonly thought that the complex, high-dimensional distributions deep generative models are trained to learn do in fact exhibit multimodal and low-dimensional manifold structure, our paper can be interpreted as showing that in many of these settings, score matching may be substantially less statistically efficient than maximum likelihood. Thus, our results can be thought of as a formal justification of the conjectured challenges for score matching in Song and Ermon [2019], as well as a vast generalization of the set of “problem cases” for score matching. This also shows that, surprisingly, the same obstructions to efficient inference (i.e. drawing samples from a trained model, which is usually done using Langevin dynamics for EBMs) are also an obstacle to efficient learning using score matching.
We roughly show the following results:
1. For a finite number of samples n, we show that if we are trying to estimate a distribution from a class with Rademacher complexity bounded by R_n, as well as log-Sobolev constant bounded by C_LS, achieving score matching loss at most ε implies that we have learned a distribution that is (roughly) no more than C_LS(R_n + ε) away from the data distribution in KL divergence. The main tool for this is showing that the score matching objective is at most a multiplicative factor of C_LS away from the KL divergence to the data distribution.
2. In the asymptotic limit (i.e. as the number of samples n → ∞), we focus on the special case of estimating the parameters θ of a probability distribution in an exponential family {p_θ(x) ∝ exp(⟨θ, F(x)⟩)} for some sufficient statistics F, using score matching. If the distribution p_θ we are estimating has Poincaré constant bounded by C_P, the score matching estimator and the maximum likelihood estimator have asymptotic efficiency that differs by at most a (roughly) C_P factor. Conversely, we show that if the family of sufficient statistics is sufficiently rich, and the distribution p_θ we are estimating has isoperimetric constant lower bounded by C_IS, then the score matching estimator is less efficient than the MLE by at least a (roughly) C_IS factor.
3. Based on our new conceptual framework, we identify a precise analogy between score matching in the continuous setting and pseudolikelihood methods in the discrete (and continuous) setting. This connection is made by replacing the Langevin dynamics with its natural analogue — the Glauber dynamics (Gibbs sampler). We show that the approximate tensorization of entropy inequality [Marton, 2013, Caputo et al., 2015], which guarantees rapid mixing of the Glauber dynamics, allows us to obtain finite-sample bounds for learning distributions in KL via pseudolikelihood in an identical way to the log-Sobolev inequality for score matching. A variant of this connection is also made for the related ratio matching estimator of Hyvärinen [2007b].
4. In Section 7, we perform several simulations which illustrate the close connection between isoperimetry and the
performance of score matching. We give examples both when fitting the parameters of an exponential family and
when the score function is fit using a neural network.
2 Preliminaries
Definition 1 (Score matching). Given a ground truth distribution p with sufficient decay at infinity and a smooth distribution q, the score matching loss (at the population level) is defined to be

J_p(q) := (1/2) E_{X∼p}[‖∇ log p(X) − ∇ log q(X)‖²] + K_p = E_{X∼p}[Tr ∇² log q(X) + (1/2)‖∇ log q(X)‖²]    (1)

where K_p is a constant independent of q. The last equality is due to Hyvärinen [2005]. Given samples from p, the training loss Ĵ_p(q) is defined by replacing the rightmost expectation with the average over the data.
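To make the training loss concrete, here is a minimal sketch (not from the paper) of the empirical objective Ĵ_p(q) in the integration-by-parts form of (1), instantiated for the illustrative model q = N(μ, σ²I), for which ∇ log q and Tr ∇² log q are available in closed form; the function name and the Gaussian choice are our own assumptions.

    import numpy as np

    def score_matching_loss(samples, mu, sigma2):
        """Empirical score matching loss in the integration-by-parts form of (1),
        J_hat = mean_i [ Tr grad^2 log q(X_i) + 0.5 * ||grad log q(X_i)||^2 ],
        for the illustrative model q = N(mu, sigma2 * I)."""
        n, d = samples.shape
        # For this q: grad log q(x) = -(x - mu) / sigma2 and Tr grad^2 log q(x) = -d / sigma2.
        grad_log_q = -(samples - mu) / sigma2
        trace_hess = -d / sigma2
        return np.mean(trace_hess + 0.5 * np.sum(grad_log_q ** 2, axis=1))

    # The loss is minimized (up to sampling error, and up to the constant K_p)
    # near the true parameters when the data really are Gaussian.
    rng = np.random.default_rng(0)
    X = rng.normal(loc=1.0, scale=2.0, size=(10_000, 3))       # data from p = N(1, 4*I)
    print(score_matching_loss(X, mu=1.0, sigma2=4.0))           # close to the minimum
    print(score_matching_loss(X, mu=0.0, sigma2=1.0))           # noticeably larger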
Functional and Isoperimetric Inequalities. Let q(x) be a smooth probability density over R^d. A key role in this work is played by the log-Sobolev, Poincaré, and isoperimetric constants of q — closely related geometric quantities, connected to the mixing of the Langevin dynamics, which have been deeply studied in probability theory and in geometric and functional analysis (see e.g. [Gross, 1975, Ledoux, 2000, Bakry et al., 2014]).
Definition 2. The log-Sobolev constant C_LS(q) ≥ 0 is the smallest constant so that for any probability density p(x),

KL(p, q) ≤ C_LS(q) I(p | q)    (2)

where KL(p, q) = E_{X∼p}[log(p(X)/q(X))] is the Kullback-Leibler divergence or relative entropy and the relative Fisher information I(p | q) is defined¹ as I(p | q) := E_q⟨∇ log(p/q), ∇(p/q)⟩.
The log-Sobolev inequality is equivalent to exponential ergodicity of the Langevin dynamics for q, a canonical Markov process which preserves q and is used for sampling from it, described by the stochastic differential equation dX_t = ∇ log q(X_t) dt + √2 dB_t. Precisely, if p_t is the distribution of the continuous-time Langevin dynamics² for q started from X_0 ∼ p, then I(p | q) = −(d/dt) KL(p_t, q)|_{t=0}, and so by integrating,

KL(p_t, q) ≤ e^{−t/C_LS} KL(p, q).    (3)

This holding for any p and t is an equivalent characterization of the log-Sobolev constant (Theorem 3.20 of Van Handel [2014]). For a class of distributions P, we can also define the restricted log-Sobolev constant C_LS(q, P) to be the smallest constant such that (2) holds under the additional restriction that p ∈ P — see e.g. Anari et al. [2021b]. For P an infinitesimal neighborhood of q, the restricted log-Sobolev constant of q becomes half of the Poincaré constant, or inverse spectral gap, C_P(q):
Definition 3. The Poincaré constant C_P(q) ≥ 0 is the smallest constant so that for any smooth function f,

Var_q(f) ≤ C_P(q) E_q‖∇f‖².    (4)

It is related to the log-Sobolev constant by C_P ≤ 2C_LS (Lemma 3.28 of Van Handel [2014]).
Similarly, the Poincaré inequality implies exponential ergodicity for the χ²-divergence:

χ²(p_t, q) ≤ e^{−2t/C_P} χ²(p, q).

This holding for every p and t is an equivalent characterization of the Poincaré constant (Theorem 2.18 of Van Handel [2014]). We can equivalently view the Langevin dynamics in a functional-analytic way through its definition as a Markov semigroup, which is equivalent to the SDE definition via the Fokker-Planck equation [Van Handel, 2014, Bakry et al., 2014]. From this perspective, we can write p_t = q · H_t(p/q), where H_t is the Langevin semigroup for q, so H_t = e^{tL} with generator

L f = ⟨∇ log q, ∇f⟩ + Δf.

In this case, the Poincaré constant has a direct interpretation in terms of the inverse spectral gap of L, i.e. the inverse of the gap between its two largest eigenvalues.
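For intuition only, the SDE above can be simulated with an Euler-Maruyama discretization (the unadjusted Langevin algorithm); the sketch below, with an illustrative step size and horizon of our choosing, is meant to make the "sampling from q given a gradient oracle for log q" picture concrete rather than to reproduce anything in the paper.

    import numpy as np

    def langevin_sample(grad_log_q, x0, step=0.05, n_steps=1_000, seed=0):
        """Euler-Maruyama discretization (unadjusted Langevin algorithm) of
        dX_t = grad log q(X_t) dt + sqrt(2) dB_t, started from x0."""
        rng = np.random.default_rng(seed)
        x = np.array(x0, dtype=float)
        for _ in range(n_steps):
            x = x + step * grad_log_q(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
        return x

    # Example: q = N(0, I) has grad log q(x) = -x and log-Sobolev constant 1/2,
    # so by (3) the dynamics mix quickly and the iterates are roughly standard normal.
    final_states = np.array([langevin_sample(lambda x: -x, np.zeros(2), seed=s) for s in range(300)])
    print(final_states.mean(axis=0), final_states.var(axis=0))  # approximately 0 and 1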
Both the Poincaré and log-Sobolev inequalities measure the isoperimetric properties of q from the perspective of functions; they are closely related to the isoperimetric constant:
¹ There are several alternative formulas for I(p | q); see Remark 3.26 of Van Handel [2014].
² See e.g. Vempala and Wibisono [2019] for more background and the connection to the discrete-time dynamics.
Definition 4. The isoperimetric constant C_IS(q) is the smallest constant such that, for every set S,

min(∫_S q(x) dx, ∫_{S^c} q(x) dx) ≤ C_IS(q) · liminf_{ε→0} [∫_{S_ε} q(x) dx − ∫_S q(x) dx] / ε    (5)

where S_ε = {x : d(x, S) ≤ ε} and d(x, S) denotes the (Euclidean) distance of x from the set S. The isoperimetric constant is related to the Poincaré constant by C_P ≤ 4C_IS² (Proposition 8.5.2 of Bakry et al. [2014]). Assuming S is chosen so that ∫_S q(x) dx < 1/2, the left hand side can be interpreted as the volume of S and the liminf term on the right hand side as the surface area of S, with respect to q.
A strengthened isoperimetric inequality (Bobkov inequality) upper bounds the log-Sobolev constant, see Ledoux
[2000], Bobkov [1997].
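As a small sanity check on Definition 4, and on the running theme that well-separated modes force these constants to be large, the following sketch evaluates the volume/surface-area ratio of (5) in closed form for the illustrative test set S = (−∞, 0) under a symmetric mixture of two unit-variance Gaussians; the mixture and the choice of S are ours, not the paper's.

    import numpy as np
    from scipy.stats import norm

    def isoperimetric_lower_bound(m):
        """Lower bound on C_IS(q) for q = 0.5*N(-m, 1) + 0.5*N(m, 1), using the test
        set S = (-inf, 0) in (5): min(q(S), q(S^c)) = 1/2 by symmetry, while the
        boundary measure of S is the density of q at 0, namely the standard normal
        pdf evaluated at m."""
        volume = 0.5
        surface_area = norm.pdf(m)      # q(0) = 0.5 * phi(m) + 0.5 * phi(m)
        return volume / surface_area

    for m in [1.0, 2.0, 4.0, 6.0]:
        print(m, isoperimetric_lower_bound(m))   # grows like exp(m**2 / 2): deep modes give huge C_IS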
Mollifiers. We recall the definition of one of the standard mollifiers/bump functions, as used in e.g. Hörmander [2015]. Mollifiers are smooth functions useful for approximating non-smooth functions: convolving a function with a mollifier makes it “smoother”, in the sense of the existence and size of the derivatives. Precisely, define the (infinitely differentiable) function ψ : R^d → R as

ψ(y) = (1/I_d) e^{−1/(1−‖y‖²)} for ‖y‖ < 1,    ψ(y) = 0 for ‖y‖ ≥ 1,

where I_d := ∫_{‖y‖<1} e^{−1/(1−‖y‖²)} dy. We will use the basic estimate 8^{−d} B_d < I_d < B_d, where B_d is the volume of the unit ball in R^d, which follows from the fact that e^{−1/(1−‖y‖²)} ≥ 1/4 for ‖y‖ ≤ 1/2 and e^{−1/(1−‖y‖²)} ≤ 1 everywhere. ψ is infinitely differentiable and its gradient is

∇_y ψ(y) = −(2/I_d) e^{−1/(1−‖y‖²)} y / (1 − ‖y‖²)² = −(2y / (1 − ‖y‖²)²) ψ(y).

It is straightforward to check that sup_y ‖∇_y ψ(y)‖ < 1/I_d. For γ > 0, we'll also define a “sharpening” of ψ, namely ψ_γ(y) = γ^{−d} ψ(y/γ), so that ∫ ψ_γ = 1 and (by the chain rule)

∇_y ψ_γ(y) = γ^{−d−1} (∇ψ)(y/γ) = −(2y/γ²) / (1 − ‖y/γ‖²)² · ψ_γ(y),

so in particular sup_y ‖∇_y ψ_γ(y)‖ ≤ 2γ^{−d−1}/I_d.
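A quick numerical check of these formulas in the one-dimensional case (d = 1, with an illustrative γ of our choosing): the sketch below estimates I_1 by quadrature, verifies that ψ_γ integrates to one, and compares the numerical gradient of ψ_γ against the bound 2γ^{−d−1}/I_d.

    import numpy as np
    from scipy.integrate import quad

    # One-dimensional instance of the mollifier above (d = 1); gamma is an arbitrary choice.
    bump = lambda y: np.exp(-1.0 / (1.0 - y ** 2)) if abs(y) < 1 else 0.0
    I_1, _ = quad(bump, -1, 1)                      # the normalizing constant I_d for d = 1
    psi = lambda y: bump(y) / I_1

    gamma = 0.1
    psi_gamma = lambda y: psi(y / gamma) / gamma    # psi_gamma(y) = gamma^{-d} psi(y / gamma)

    total, _ = quad(psi_gamma, -gamma, gamma)
    print(total)                                    # ~1.0: psi_gamma integrates to one

    # Crude numerical check of the gradient bound sup |psi_gamma'| <= 2 gamma^{-d-1} / I_d.
    ys = np.linspace(-gamma + 1e-6, gamma - 1e-6, 20_001)
    grad = np.gradient([psi_gamma(y) for y in ys], ys)
    print(np.abs(grad).max(), 2.0 / (gamma ** 2 * I_1))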
Glauber dynamics. The Glauber dynamics will become important in Section 5 as the natural analogue of the Langevin dynamics. The Glauber dynamics or Gibbs sampler for a distribution p is the standard sampler for discrete spin systems — it repeatedly selects a random coordinate i and then resamples the spin X_i according to the distribution p conditional on all of the other coordinates (i.e. conditional on X_{−i}). See e.g. Levin and Peres [2017]. This is the standard sampler for discrete systems, but it also applies to and has been extensively studied for continuous ones (see e.g. Marton [2013]).
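As a purely illustrative instance, the sketch below runs Glauber dynamics for an Ising-type distribution p(x) ∝ exp((1/2) xᵀJx) on {−1, +1}^d; one step picks a uniformly random coordinate and resamples it from its conditional given the rest. The coupling matrix and step count are arbitrary choices of ours.

    import numpy as np

    def glauber_step(x, J, rng):
        """One step of Glauber dynamics for p(x) proportional to exp(x^T J x / 2) on
        {-1,+1}^d: pick a coordinate i at random and resample x_i from p(. | x_{-i})."""
        d = len(x)
        i = rng.integers(d)
        field = J[i] @ x - J[i, i] * x[i]              # local field felt by coordinate i
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))    # P(x_i = +1 | x_{-i})
        x[i] = 1 if rng.random() < p_plus else -1
        return x

    d = 10
    rng = np.random.default_rng(0)
    J = 0.1 * np.ones((d, d))                          # weak ferromagnetic couplings: fast mixing
    np.fill_diagonal(J, 0.0)
    x = rng.choice([-1, 1], size=d)
    for _ in range(10_000):
        x = glauber_step(x, J, rng)
    print(x)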
Reach and Condition Number of a Manifold. For a smooth submanifold M of Euclidean space, the reach τ_M is the largest radius r such that every point with distance at most r to the manifold M has a unique nearest point on M [Federer, 1959]; the reach is guaranteed to be positive for compact manifolds. The reach has a few equivalent characterizations (see e.g. Niyogi et al. [2008]); a common terminology is that the condition number of a manifold is 1/τ_M.
Notation. For a random vector X, Σ_X := E[XXᵀ] − E[X]E[X]ᵀ denotes its covariance matrix.
3 Learning Distributions from Scores: Nonasymptotic Theory
Though consistency of the score matching estimator was proven in Hyvärinen [2005], it is unclear what one can conclude about the proximity of the learned distribution to the ground truth from a finite number of samples. Precisely, we would like a guarantee showing that if the training loss (i.e. the empirical estimate of (1)) is small, the learned distribution is close to the ground truth distribution (e.g. in the sense of KL divergence). However, this is not always true! We will see an illustrative example where this fails in Section 7, and we also establish a general negative result in Section 4.
In this section, we prove (Theorem 1) that minimizing the training loss does learn the true distribution, assuming that the class of distributions we are learning over has bounded complexity and a small log-Sobolev constant. First, we formalize the connection to the log-Sobolev constant:

Proposition 1. The log-Sobolev inequality for q is equivalent to the following inequality over all smooth probability densities p:

KL(p, q) ≤ 2C_LS(q) (J_p(q) − J_p(p)).    (6)

More generally, for a class of distributions P, the restricted log-Sobolev constant is the smallest constant such that KL(p, q) ≤ 2C_LS(q, P) (J_p(q) − J_p(p)) for all distributions p ∈ P.
Proof. This follows from the following equivalent form for the relative Fisher information (e.g. Shao et al. [2019], Vempala and Wibisono [2019]):

I(p | q) = E_q⟨∇(p/q), ∇ log(p/q)⟩ = E_p⟨(q/p) ∇(p/q), ∇ log(p/q)⟩ = E_p⟨∇ log(p/q), ∇ log(p/q)⟩ = E_p‖∇ log p − ∇ log q‖².    (7)

Using this and (1), the log-Sobolev inequality can be rewritten as KL(p, q) ≤ 2C_LS(J_p(q) − J_p(p)), which proves the first claim, and the same argument shows the second claim.
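As a quick numerical sanity check of (6) (not part of the paper), take p = N(0, 1) and q = N(μ, σ²) in one dimension: C_LS(q) = σ²/2 by the Bakry-Émery criterion, KL(p, q) has a closed form, and J_p(q) − J_p(p) = (1/2)E_p[(∇ log p − ∇ log q)²] can be estimated from samples of p.

    import numpy as np

    def check_lsi_bound(mu, sigma2, n=200_000, seed=0):
        """Check KL(p, q) <= 2 C_LS(q) (J_p(q) - J_p(p)) for p = N(0, 1) and
        q = N(mu, sigma2), where C_LS(q) = sigma2 / 2."""
        rng = np.random.default_rng(seed)
        x = rng.standard_normal(n)                        # samples from p
        score_p = -x                                      # grad log p(x)
        score_q = -(x - mu) / sigma2                      # grad log q(x)
        gap = 0.5 * np.mean((score_p - score_q) ** 2)     # estimate of J_p(q) - J_p(p)
        kl = 0.5 * (np.log(sigma2) + (1 + mu ** 2) / sigma2 - 1)   # KL(N(0,1), N(mu, sigma2))
        return kl, 2 * (sigma2 / 2) * gap

    print(check_lsi_bound(mu=1.0, sigma2=1.0))   # the two sides coincide (up to sampling error)
    print(check_lsi_bound(mu=0.0, sigma2=2.0))   # strict inequality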
Remark 1 (Interpretation of Score Matching). The left hand side of (6) is KL(p, q) = E_p[log p] − E_p[log q]. The first term is independent of q and the second term is the expected log-likelihood, the objective for Maximum Likelihood Estimation. So (6) shows that the score matching objective is a relaxation (within a multiplicative factor of C_LS(q)) of maximum likelihood via the log-Sobolev inequality. We discuss connections to other proposed interpretations in Appendix A.
Remark 2. Interestingly, the log-Sobolev constant which appears in the bound is that of q, and not of the ground truth distribution p. This is useful because q is known to the learner whereas p is only indirectly observed. If q is actually close to p, the log-Sobolev constants are comparable due to the Holley-Stroock perturbation principle (Proposition 5.1.6 of Bakry et al. [2014]).
The connection between the score matching loss and the relative Fisher information used in (7) is not new to this work — see the Related Work section for more discussion and references. The useful statistical implications which we discuss next are new to the best of our knowledge. Combining Proposition 1, bounds on log-Sobolev constants from the literature, and fundamental tools from generalization theory allows us to derive finite-sample guarantees for learning distributions in KL divergence via score matching.³
Theorem 1. Suppose that P is a class of probability distributions containing p, and define

C_LS(P, P) := sup_{q ∈ P} C_LS(q, P) ≤ sup_{q ∈ P} C_LS(q)

to be the worst-case (restricted) log-Sobolev constant in the class of distributions. (For example, if every distribution in P is α-strongly log-concave then C_LS ≤ 1/(2α) by Bakry-Émery theory [Bakry et al., 2014].) Let

R_n := E_{X_1,…,X_n, ε_1,…,ε_n} sup_{q ∈ P} (1/n) Σ_{i=1}^n ε_i [Tr ∇² log q(X_i) + (1/2)‖∇ log q(X_i)‖²]

(where ε_1, …, ε_n are i.i.d. Rademacher signs).
³ We use the simplest version of Rademacher complexity bounds to illustrate our techniques. The standard literature, e.g. Shalev-Shwartz and Ben-David [2014], Bartlett et al. [2005], contains more sophisticated versions, and our techniques readily generalize.