
the distribution can “propagate” into bad estimates for the likelihood once the score vector field is “integrated” —
making this formal seems challenging.
We show that the right mathematical tools to formalize, and substantially generalize, such intuitions are functional-analytic tools that characterize isoperimetric properties of the distribution in question. Namely, we show that three quantities, the Poincaré, log-Sobolev, and isoperimetric constants (which are all in turn very closely related, see Section 2), tightly characterize how much worse the efficiency of score matching is compared to maximum likelihood. These quantities can be (equivalently) viewed as: (1) characterizing the mixing time of Langevin dynamics — a stochastic differential equation used to sample from a distribution p(x) ∝ exp(f(x)), given access to a gradient oracle for f; (2) characterizing “sparse cuts” in the distribution: that is, sets S for which the surface area of S can be much smaller than its volume. Notably, multimodal distributions with well-separated, deep modes have very large log-Sobolev/Poincaré/isoperimetric constants [Gayrard et al., 2004, 2005], as do distributions supported over manifolds with negative curvature [Hsu, 2002] (like hyperbolic manifolds). Since it is commonly thought that the complex, high-dimensional distributions that deep generative models are trained to learn do in fact exhibit multimodal and low-dimensional manifold structure, our paper can be interpreted as showing that in many of these settings, score matching may be substantially less statistically efficient than maximum likelihood. Thus, our results can be thought of as a formal justification of the conjectured challenges for score matching in Song and Ermon [2019], as well as a vast generalization of the set of “problem cases” for score matching. This also shows that, surprisingly, the same obstructions to efficient inference (i.e., drawing samples from a trained model, which is usually done using Langevin dynamics for EBMs) are also an obstacle to efficient learning using score matching.
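As a concrete illustration of the Langevin dynamics referenced above, the following is a minimal sketch (ours, not code from the paper) of the unadjusted Langevin algorithm for sampling from p(x) ∝ exp(f(x)) given only a gradient oracle for f; the bimodal mixture target, step size, and iteration count are illustrative choices.

```python
import numpy as np

def langevin_sample(grad_f, x0, step_size=1e-3, n_steps=50_000, rng=None):
    """Unadjusted Langevin dynamics targeting p(x) ∝ exp(f(x)),
    using only a gradient oracle for f. Its mixing time is governed by the
    Poincaré/log-Sobolev constant of p -- the same quantities that control
    the statistical efficiency of score matching."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * grad_f(x) + np.sqrt(2.0 * step_size) * noise
    return x

# Illustrative target: an even mixture of N(-mu, 1) and N(+mu, 1).
# For large mu the modes are well separated, the isoperimetric-type
# constants blow up, and the chain takes a very long time to cross
# between modes.
def grad_log_mixture(x, mu=5.0):
    w = 1.0 / (1.0 + np.exp(-2.0 * mu * x))  # posterior weight of the +mu mode
    return w * (mu - x) + (1.0 - w) * (-mu - x)

sample = langevin_sample(grad_log_mixture, x0=0.0)
```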
We roughly show the following results:
1. For a finite number of samples n, we show that if we are trying to estimate a distribution from a class with Rademacher complexity bounded by R_n, as well as a log-Sobolev constant bounded by C_LS, achieving score matching loss at most ε implies that we have learned a distribution that is no more than C_LS · R_n away from the data distribution in KL divergence. The main tool for this is showing that the score matching objective is at most a multiplicative factor of C_LS away from the KL divergence to the data distribution (see the schematic inequality after this list).
2. In the asymptotic limit (i.e., as the number of samples n → ∞), we focus on the special case of estimating the parameters θ of a probability distribution from an exponential family {p_θ(x) ∝ exp(⟨θ, F(x)⟩)} for some sufficient statistics F using score matching. If the distribution p_θ we are estimating has Poincaré constant bounded by C_P, then score matching and maximum likelihood have asymptotic efficiency that differs by at most a factor of C_P. Conversely, we show that if the family of sufficient statistics is sufficiently rich, and the distribution p_θ we are estimating has isoperimetric constant lower bounded by C_IS, then the score matching estimator is less efficient than the MLE estimator by at least a factor of C_IS. (A numerical sketch of the score matching estimator for exponential families appears after this list.)
3. Based on our new conceptual framework, we identify a precise analogy between score matching in the continuous setting and pseudolikelihood methods in the discrete (and continuous) setting. This connection is made by replacing the Langevin dynamics with its natural analogue — the Glauber dynamics (Gibbs sampler). We show that the approximate tensorization of entropy inequality [Marton, 2013, Caputo et al., 2015], which guarantees rapid mixing of the Glauber dynamics, allows us to obtain finite-sample bounds for learning distributions in KL via pseudolikelihood in a way identical to how the log-Sobolev inequality is used for score matching. A variant of this connection is also made for the related ratio matching estimator of Hyvärinen [2007b].
4. In Section 7, we perform several simulations which illustrate the close connection between isoperimetry and the
performance of score matching. We give examples both when fitting the parameters of an exponential family and
when the score function is fit using a neural network.
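Schematically, the main tool behind item 1 is an inequality relating the KL divergence to the score matching objective via the log-Sobolev constant; up to constant-factor conventions, it takes the form
$$\mathrm{KL}(p \,\|\, q) \;\lesssim\; C_{LS} \cdot \mathbb{E}_{X\sim p}\big[\|\nabla \log p(X) - \nabla \log q(X)\|^2\big],$$
obtained by applying the log-Sobolev inequality for q to the density ratio p/q.

To make item 2 concrete, note that for an exponential family the score matching loss (Definition 1 below) is quadratic in θ, so the estimator is a linear solve. The following one-dimensional sketch is ours (the Gaussian family, helper names, and sample size are illustrative choices, not code from the paper):

```python
import numpy as np

def score_matching_exp_family(samples, grad_F, lap_F):
    """Closed-form score matching estimator for a 1-D exponential family
    p_theta(x) ∝ exp(<theta, F(x)>). Plugging log q = <theta, F(x)> - log Z
    into the score matching loss gives a quadratic in theta, minimized by
        theta_hat = - E_n[F'(x) F'(x)^T]^{-1} E_n[F''(x)],
    where E_n denotes the empirical average over the samples."""
    G = np.mean([np.outer(grad_F(x), grad_F(x)) for x in samples], axis=0)
    h = np.mean([lap_F(x) for x in samples], axis=0)
    return -np.linalg.solve(G, h)

# Gaussian family: F(x) = (x, -x^2/2), with natural parameters
# theta = (mu/sigma^2, 1/sigma^2).
grad_F = lambda x: np.array([1.0, -x])   # F'(x)
lap_F = lambda x: np.array([0.0, -1.0])  # F''(x)

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=5000)
theta_hat = score_matching_exp_family(samples, grad_F, lap_F)
# For mu = 1, sigma = 2 the true natural parameters are (0.25, 0.25).
```

For a unimodal target such as this Gaussian, the score matching estimate is close to the MLE; as the target acquires well-separated modes (large Poincaré/isoperimetric constants), the asymptotic variance of the score matching estimate degrades relative to the MLE, which is the phenomenon quantified in item 2 and illustrated in the simulations of Section 7.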
2 Preliminaries
Definition 1 (Score matching). Given a ground truth distribution p with sufficient decay at infinity and a smooth distribution q, the score matching loss (at the population level) is defined to be
$$J_p(q) := \frac{1}{2}\,\mathbb{E}_{X\sim p}\!\left[\|\nabla \log p(X) - \nabla \log q(X)\|^2\right] + K_p = \mathbb{E}_{X\sim p}\!\left[\operatorname{Tr}\nabla^2 \log q(X) + \frac{1}{2}\|\nabla \log q(X)\|^2\right] \qquad (1)$$