In particular, we address the impact of the potential function used in SMD on the parameters of the average ensemble
model. Introduced by Nemirovskii and Yudin [1983] for convex optimization problems with underlying geometric
structure, SMD is a generalization of stochastic gradient descent (SGD) in which the updates are mirrored to a dual
domain through a potential function. It has been shown that the choice of the potential function plays a pivotal role in
determining the implicit bias of SMD in the interpolating regime Azizan et al. [2019], Gunasekar et al. [2020a].
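As a point of reference, the SMD update with a differentiable, strictly convex potential $\psi$ and step size $\eta$ is commonly written (in our notation here, which may differ from that of the later sections) as
\[
\nabla \psi(\mathbf{w}_{k+1}) = \nabla \psi(\mathbf{w}_k) - \eta\, \nabla_{\mathbf{w}} \ell(\mathbf{w}_k; \mathbf{x}_k, y_k),
\]
so that the gradient step is taken in the dual (mirror) domain $\nabla\psi(\mathbf{w})$; choosing $\psi(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$ makes the mirror map the identity and recovers SGD.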
1.1 Contributions and Prior Work
In this paper, we develop a mean-field type result for average ensemble models trained with SMD. We consider a
general class of convex loss functions that includes the squared loss as a special case. In particular, we derive a partial
differential equation that describes the "time evolution" of the distribution of the parameters in the network as training
progresses. This distribution can be used to compute the risk, or generalization error, of the trained model. Our result
generalizes analogous mean-field results for SGD, the key difference being that the Hessian of the potential function
now appears in the dynamics. We further give a geometric interpretation at the "distributional" level by showing an equivalent formulation of
our result as a geometric flow on an appropriate Riemannian manifold. We finally provide numerical results to illustrate
the applicability of the method and to showcase the effect of the potential function on the performance and implicit bias
of the resulting trained network.
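For context, in the SGD case studied by Mei et al. [2018] the limiting object is a continuity equation for the parameter distribution $\rho_t$, which in that setting reads roughly
\[
\partial_t \rho_t = 2\xi(t)\, \nabla_{\theta} \cdot \big( \rho_t \, \nabla_{\theta} \Psi(\theta; \rho_t) \big), \qquad \Psi(\theta; \rho) = V(\theta) + \int U(\theta, \bar{\theta})\, \rho(d\bar{\theta}),
\]
where $\xi(t)$ is a step-size scaling and $V$, $U$ are data-dependent potentials. Informally, the generalization developed in this paper modifies this Euclidean divergence structure so that the Hessian of the mirror potential appears; the precise statement is deferred to the later sections.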
A mean-field characterization of two-layer neural networks trained with SGD under quadratic loss was studied by Mei
et al. [2018], Sirignano and Spiliopoulos [2020], and Rotskoff and Vanden-Eijnden [2019]. A similar line of work is
given by Chizat and Bach [2018], where more general loss functions are considered for the gradient descent algorithm. Mean-field
results with a similar flavor to ours have been given in Raginsky and Bouvrie [2012] and Borovykh et al. [2020]
in the context of distributed convex optimization. These papers consider the problem of minimizing a fixed objective
function with noisy information by many "interacting optimizers" coupled through an interaction matrix. While
insightful, the setting presented in these papers does not cover the case of learning with average ensemble models. An
important connection between mirror descent and natural gradient descent on the dual (Riemannian) space was established
by Raskutti and Mukherjee [2014]. Works by Gunasekar et al. [2020a] and Gunasekar et al. [2020b] advanced this
geometric viewpoint by showing a correspondence between continuous-time mirror descent and Riemannian gradient
flow on the primal domain governed by the Hessian matrix of the potential function.
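Concretely, this correspondence can be summarized as follows (in our notation): the continuous-time mirror descent dynamics with potential $\psi$ and objective $L$ satisfy
\[
\frac{d}{dt}\, \nabla \psi(\mathbf{w}(t)) = -\nabla L(\mathbf{w}(t)) \quad \Longleftrightarrow \quad \dot{\mathbf{w}}(t) = -\big(\nabla^2 \psi(\mathbf{w}(t))\big)^{-1} \nabla L(\mathbf{w}(t)),
\]
so the trajectory is a Riemannian gradient flow on the primal domain with metric tensor $\nabla^2 \psi(\mathbf{w})$.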
2 Preliminaries
2.1 Notations
We gather here the basic notations used. Vectors are presented with bold lower-case letters, and bold upper-case letters are
reserved for matrices. For a vector $\mathbf{v}$, $v_i$ denotes its $i$th entry, and $\|\mathbf{v}\|_p$ is its $\ell_p$ norm. $\mathbb{S}^d_{++}$ is the set of symmetric and
positive-definite matrices of dimension $d$. For a set $\mathcal{S}$ in $\mathbb{R}^d$ and $k \in \mathbb{N}$, $\mathcal{C}^k_b(\mathcal{S})$ denotes the space of bounded, $k$-times differentiable functions with continuous $k$th derivative, $\mathcal{P}(\mathcal{S})$ denotes the space of Borel probability measures on $\mathcal{S}$, and $\mathcal{P}_m(\mathcal{S})$ denotes the probability measures
with bounded $m$th moment. For a random vector $Z$ in $\mathbb{R}^p$, $\mu(dz)$ indicates its probability measure. $\delta_x$ is a Dirac mass at the point $x$. The integral of a
function $f : \mathcal{S} \to \mathbb{R}$ with respect to a measure $\rho$ on $\mathcal{S}$ is denoted as
\[
\langle \rho, f \rangle := \int f(x)\, \rho(dx). \qquad (1)
\]
Let $\{\rho_n\}_{n \in \mathbb{N}}$ be a sequence of probability measures defined on $\mathcal{S} \subseteq \mathbb{R}^d$. The sequence is said to converge weakly
to the measure $\rho^\star$ if and only if
\[
\lim_{n \to \infty} \big| \langle \rho_n, f \rangle - \langle \rho^\star, f \rangle \big| = 0, \quad \text{for all } f \in \mathcal{C}_b(\mathcal{S}).
\]
2.2 Problem Setup
Consider an online learning setting where we sequentially observe a fresh sample of data, i.e., $(\mathbf{x}_k, y_k)$ for $k = 1, 2, \ldots$.
Here $\mathbf{x}_k \in \mathcal{X} \subseteq \mathbb{R}^p$ is the feature vector, and $y_k \in \mathcal{Y} \subseteq \mathbb{R}$ is the label. As a shorthand, we will interchangeably use
$\mathbf{z} := (\mathbf{x}, y)$. Our goal is to learn a model that generates an estimate of the label given the feature vector as its input. A
common approach for modeling is known as the average ensemble, where the output is generated by taking the average
of simpler models that only differ in their parameters. This approach is advantageous from the bias-variance perspective.
Since simpler models have a low variance (yet potentially a high bias), the average ensemble generates a low-variance
estimate of the labels with a reasonably good bias.
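As an illustration (the notation here is ours and merely indicative), an average ensemble of $n$ base models $\sigma(\cdot\,; \mathbf{w})$, each with its own parameter vector $\mathbf{w}_i$, produces the estimate
\[
\hat{y}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \sigma(\mathbf{x}; \mathbf{w}_i),
\]
where the constituent models share the same architecture and differ only through their parameters $\mathbf{w}_1, \ldots, \mathbf{w}_n$.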