STOCHASTIC MIRROR DESCENT IN AVERAGE ENSEMBLE
MODELS
Taylan Kargin
Department of Electrical Engineering
California Institute of Technology
Pasadena, CA 91125
Fariborz Salehi
Google
Seattle, WA 98103
Babak Hassibi
Department of Electrical Engineering
California Institute of Technology
Pasadena, CA 91125
ABSTRACT
The stochastic mirror descent (SMD) algorithm is a general class of training algorithms, which
includes the celebrated stochastic gradient descent (SGD) as a special case. It utilizes a mirror
potential to influence the implicit bias of the training algorithm. In this paper we explore the
performance of the SMD iterates on mean-field ensemble models. Our results generalize earlier ones
obtained for SGD on such models. The evolution of the distribution of parameters is mapped to a
continuous time process in the space of probability distributions. Our main result gives a nonlinear
partial differential equation to which the continuous time process converges in the asymptotic regime
of large networks. The impact of the mirror potential appears through a multiplicative term that is
equal to the inverse of its Hessian and which can be interpreted as defining a gradient flow over an
appropriately defined Riemannian manifold. We provide numerical simulations which allow us to
study and characterize the effect of the mirror potential on the performance of networks trained with
SMD for some binary classification problems.
Keywords Stochastic Mirror Descent · Ensemble Models · Mean-field limit
1 Introduction
Machine learning models have been successfully used in a wide variety of applications ranging from spam detection, to
image classification, to natural language processing. Despite this great success, our theoretical understanding of why
various machine learning methods achieve the performance they do is still at an early stage.
To obtain some understanding of the behavior of machine learning algorithms we shall focus on the structure/architecture
of the model, as well as the training mechanism that generates a suitable model with respect to the available
data. With regard to the architecture, a common theme among most modern machine learning models (e.g., artificial/
convolutional/recurrent neural networks) is that they repetitively use simple blocks such as perceptrons in ANNs,
and convolution filters in CNNs. The idea of combining multiple simple models has its roots in classical statistics and
is known as boosting. The justification is that a simple model often possesses a low variance while its bias can be
potentially high; therefore, combining multiple such models can reduce the bias while keeping the variance low.
The optimization algorithm maps the training data to model parameters. Its role is especially critical in modern setups
where the number of parameters can be overwhelmingly larger than the number of inputs. In such overparameterized
settings, there are often many models that perfectly fit the training data. Thus, the specific choice of the optimization
algorithm determines the resulting model in the interpolating regime. In fact, understanding the connection between the
optimization algorithm and the generalization performance of the resulting model is an important open problem.
In this paper we attempt to study these aspects by analyzing the convergence behavior of the stochastic mirror descent
(SMD) iterates on a general class of models, known as average ensemble models where the output is generated by
taking an average of similar models that only differ in their choice of parameters. Two-layer neural networks, e.g., can
be viewed as a special case of the average ensemble.
In particular, we address the impact of the potential function used in SMD on the parameters of the average ensemble
model. Introduced by Nemirovskii and Yudin [1983] for convex optimization problems with underlying geometric
structure, SMD is a generalization of stochastic gradient descent (SGD) where the updates are mirrored to a dual
domain using a potential function. It has been shown that the choice of the potential function plays a pivotal role in
determining the implicit bias of SMD in the interpolating regime Azizan et al. [2019], Gunasekar et al. [2020a].
1.1 Contributions and Prior Work
In this paper, we develop a mean-field type result for average ensemble models trained with SMD. We consider a
general class of convex loss functions that includes the squared loss as a special case. In particular, we derive a partial
differential equation that describes the "time evolution" of the distribution of the parameters in the network as training
progresses. This distribution can be used to compute the risk, or generalization error, of the trained model. Our result
generalizes similar mean-field results for SGD, with the difference being that the inverse of the Hessian of the potential
function now appears in the limiting equation. We further give a geometric interpretation at the "distributional" level by showing an equivalent formulation of
our result as a geometric flow on an appropriate Riemannian manifold. We finally provide numerical results to illustrate
the applicability of the method and to showcase the effect of the potential function on the performance and implicit bias
of the resulting trained network.
A mean-field characterization of two-layer neural networks trained with SGD under quadratic loss was studied by Mei
et al. [2018], Sirignano and Spiliopoulos [2020], and Rotskoff and Vanden-Eijnden [2019]. A similar line of work is
given by Chizat and Bach [2018], where more general loss functions are considered for the gradient descent algorithm. Mean-field
results with a similar flavor to ours have been given in Raginsky and Bouvrie [2012] and Borovykh et al. [2020]
in the context of distributed convex optimization. These papers consider the problem of minimizing a fixed objective
function with noisy information by many "interacting optimizers" coupled through an interaction matrix. Despite being
insightful, the setting presented in these papers does not fit into the case of learning with average ensemble models. An
important connection between mirror descent and natural gradient descent on dual (Riemannian) space was established
by Raskutti and Mukherjee [2014]. Works by Gunasekar et al. [2020a] and Gunasekar et al. [2020b] advanced this
geometric viewpoint by showing a correspondence between continuous-time mirror descent and Riemannian gradient
flow on the primal domain governed by the Hessian matrix of the potential function.
2 Preliminaries
2.1 Notations
We gather here the basic notations used. Vectors are presented with bold lower-case letters, and bold upper-case letters are reserved for matrices. For a vector $\mathbf{v}$, $v_i$ denotes its $i$th entry, and $\|\mathbf{v}\|_p$ is its $\ell_p$ norm. $\mathbb{S}^d_{++}$ is the set of symmetric and positive-definite matrices with dimension $d$.

For a set $\mathcal{S}$ in $\mathbb{R}^d$ and $k \in \mathbb{N}$, $\mathcal{C}^k_b(\mathcal{S})$ denotes the space of bounded and $k$-times differentiable functions with continuous $k$th derivative, $\mathcal{P}(\mathcal{S})$ denotes the space of Borel probability measures on $\mathcal{S}$, and $\mathcal{P}_m(\mathcal{S})$ denotes probability measures with bounded $m$th moment.

For a random vector $Z$ in $\mathbb{R}^p$, $\mu(dz)$ indicates its probability measure. $\delta_x$ is a Dirac mass at point $x$. The integral of a function $f : \mathcal{S} \to \mathbb{R}$ with respect to a measure $\rho$ on $\mathcal{S}$ is denoted as
$$\langle \rho, f \rangle := \int f(x)\, \rho(dx). \tag{1}$$

Let $\{\rho_n\}_{n \in \mathbb{N}}$ be a sequence of probability measures defined on $\mathcal{S} \subseteq \mathbb{R}^d$. The sequence is said to be weakly converging to the measure $\rho^\star$ if and only if $\lim_{n \to \infty} |\langle \rho_n, f \rangle - \langle \rho^\star, f \rangle| = 0$ for all $f \in \mathcal{C}_b(\mathcal{S})$.
2.2 Problem Setup
Consider an online learning setting where we sequentially observe a fresh sample of data, i.e., $(\mathbf{x}_k, y_k)$ for $k = 1, 2, \ldots$. Here $\mathbf{x}_k \in \mathcal{X} \subseteq \mathbb{R}^p$ is the feature vector, and $y_k \in \mathcal{Y} \subseteq \mathbb{R}$ is the label. As a shorthand, we will interchangeably use $z := (\mathbf{x}, y)$. Our goal is to learn a model that generates an estimate of the label given the feature vector as its input. A common approach for modeling is known as the average ensemble, where the output is generated by taking the average of simpler models that only differ in their parameters. This approach is advantageous from the bias-variance perspective. Since simpler models have a low variance (yet potentially a high bias), the average ensemble generates a low-variance estimate of the labels with a reasonably good bias.
Given a parameter space $\mathcal{S} \subseteq \mathbb{R}^d$ and $\sigma : \mathcal{X} \times \mathcal{S} \to \mathcal{Y}$, consider the following parametric family of model functions,
$$\mathcal{H}_\sigma = \{\sigma(\cdot, \boldsymbol{\theta}) : \mathcal{X} \to \mathcal{Y};\ \boldsymbol{\theta} \in \mathcal{S}\}. \tag{2}$$
This parametric representation of model functions has applications in various settings. For instance, by setting $d = p$ and defining $\sigma(\mathbf{v}_1, \mathbf{v}_2) = \sigma_\star(\mathbf{v}_1^\mathsf{T} \mathbf{v}_2)$, the class $\mathcal{H}_\sigma$ represents the generalized linear models with nonlinearity $\sigma_\star : \mathbb{R} \to \mathbb{R}$.
The average ensemble model generates an estimate by taking the average of $n$ models in $\mathcal{H}_\sigma$. In other words, for $\mathbf{x} \in \mathcal{X}$ the estimate is computed as follows,
$$\hat{y} = h_n(\mathbf{x}; \boldsymbol{\Theta}) := \frac{1}{n} \sum_{i=1}^{n} \sigma(\mathbf{x}, \boldsymbol{\theta}_i), \tag{3}$$
where $\boldsymbol{\Theta} = (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_n) \in \mathcal{S}^n$.
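For illustration, here is a minimal NumPy sketch of the finite ensemble prediction (3), instantiated with the generalized linear model example above; the tanh nonlinearity, the dimensions, and the random parameters are placeholder choices.

```python
import numpy as np

def sigma(x, theta, sigma_star=np.tanh):
    # One member of H_sigma: a generalized linear model sigma(x, theta) = sigma_star(x^T theta)
    return sigma_star(x @ theta)

def ensemble_predict(x, Theta):
    # Finite average ensemble, Eq. (3): average the outputs of the n simple models
    return np.mean([sigma(x, theta_i) for theta_i in Theta])

# toy usage with placeholder dimensions and random parameters
rng = np.random.default_rng(0)
p, n = 5, 100
x = rng.normal(size=p)
Theta = rng.normal(size=(n, p))     # Theta = (theta_1, ..., theta_n)
y_hat = ensemble_predict(x, Theta)
```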
As explained above, average ensemble models are used in many practical settings. A popular instance of such models is the two-layer neural network, where $\boldsymbol{\theta}_i = \big(\bar{\boldsymbol{\theta}}_i, w_i\big)^\mathsf{T}$, and the output is computed as,
$$\hat{y}_{\mathrm{NN}} = \sum_{i=1}^{n} \frac{w_i}{n}\, \bar{\sigma}(\mathbf{x}, \bar{\boldsymbol{\theta}}_i), \tag{4}$$
where $\mathbf{w} \in \mathbb{R}^n$ represents the weights of the last layer of the network and $\bar{\sigma}(\cdot)$ is the activation function.
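A two-layer network thus fits the same template once each parameter vector packs both its first-layer weights and its output weight, $\boldsymbol{\theta}_i = (\bar{\boldsymbol{\theta}}_i, w_i)$. The sketch below illustrates this reparameterization; the tanh activation is again a placeholder choice.

```python
import numpy as np

def sigma_nn(x, theta_i, act=np.tanh):
    # theta_i packs the first-layer weights theta_bar_i and the output weight w_i
    theta_bar, w_i = theta_i[:-1], theta_i[-1]
    return w_i * act(x @ theta_bar)

def two_layer_predict(x, Theta):
    # Eq. (4): y_hat_NN = sum_i (w_i / n) * act(x^T theta_bar_i)
    return np.mean([sigma_nn(x, t) for t in Theta])

# toy usage: each row of Theta is (theta_bar_i, w_i), so d = p + 1
rng = np.random.default_rng(1)
p, n = 5, 100
x = rng.normal(size=p)
Theta = rng.normal(size=(n, p + 1))
y_hat_nn = two_layer_predict(x, Theta)
```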
Given a loss function $\ell : \mathbb{R}^2 \to \mathbb{R}$ and a training dataset $\{(\mathbf{x}_k, y_k)\}_{k=1}^{K}$, our goal is to minimize the empirical risk
$$\inf_{\boldsymbol{\Theta} \in \mathcal{S}^n} \Big\{ \hat{R}_n(\boldsymbol{\Theta}) := \frac{1}{K} \sum_{k=1}^{K} \ell\big(y_k, h_n(\mathbf{x}_k; \boldsymbol{\Theta})\big) \Big\}, \tag{5}$$
over the parameters $\boldsymbol{\Theta}$. We assume that the training data points are generated independently from a distribution, i.e., $(\mathbf{x}_k, y_k) \sim \mu(dz)$. The loss function $\ell : \mathbb{R}^2 \to \mathbb{R}$ is assumed to be convex w.r.t. its second argument. Important examples are the quadratic loss, $\ell(u, v) = \frac{1}{2}(u - v)^2$, and the logistic loss, $\ell(u, v) = \log\big(1 + \exp(-u \times v)\big)$. However, the problem is non-convex whenever $\sigma$ is. One can rewrite the average ensemble model as an integral over the empirical distribution of the parameters, $\hat{\rho}_n(d\boldsymbol{\theta}) := \sum_{i=1}^{n} \frac{1}{n} \delta_{\boldsymbol{\theta}_i}(d\boldsymbol{\theta})$, i.e.,
$$h_n(\mathbf{x}; \boldsymbol{\Theta}) = \langle \hat{\rho}_n, \sigma(\mathbf{x}, \cdot) \rangle. \tag{6}$$
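As a sanity check on (5), the following sketch evaluates the empirical risk of a finite ensemble under the quadratic loss on synthetic data; the GLM-type model, the toy labels, and the constants are purely illustrative and not the paper's experimental setup.

```python
import numpy as np

def quadratic_loss(u, v):
    # l(u, v) = (1/2) * (u - v)^2
    return 0.5 * (u - v) ** 2

def empirical_risk(Theta, X, y, sigma):
    # R_hat_n(Theta) = (1/K) * sum_k l(y_k, h_n(x_k; Theta)),  Eq. (5)
    preds = np.array([np.mean([sigma(x, t) for t in Theta]) for x in X])
    return np.mean(quadratic_loss(y, preds))

# illustrative synthetic data (placeholders)
rng = np.random.default_rng(2)
p, n, K = 5, 50, 200
sigma = lambda x, theta: np.tanh(x @ theta)   # GLM-type simple model
X = rng.normal(size=(K, p))
y = np.sign(X[:, 0])                          # a toy binary label
Theta = rng.normal(size=(n, p))
risk = empirical_risk(Theta, X, y, sigma)
```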
This new form leads us to define ensemble averages by any probability distribution on $\mathcal{S}$. Namely, we define
$$h(\mathbf{x}; \rho) = \langle \rho, \sigma(\mathbf{x}, \cdot) \rangle, \tag{7}$$
for any $\rho \in \mathcal{P}(\mathcal{S})$. We call this the mean-field ensemble model and the former the finite ensemble model. It is easy to check that $h_n(\mathbf{x}; \boldsymbol{\Theta}) = h(\mathbf{x}; \hat{\rho}_n)$. Note that whenever the loss function $\ell$ is convex, the mean-field ensemble model function $h(\mathbf{x}; \cdot) : \mathcal{P}(\mathcal{S}) \to \mathcal{Y}$ is also convex in the space of probability measures $\mathcal{P}(\mathcal{S})$, regardless of $\sigma(\cdot)$.
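Indeed, plugging the empirical measure $\hat{\rho}_n$ into (7) recovers the finite average (3):
$$h(\mathbf{x}; \hat{\rho}_n) = \langle \hat{\rho}_n, \sigma(\mathbf{x}, \cdot) \rangle = \int \sigma(\mathbf{x}, \boldsymbol{\theta})\, \frac{1}{n} \sum_{i=1}^{n} \delta_{\boldsymbol{\theta}_i}(d\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n} \sigma(\mathbf{x}, \boldsymbol{\theta}_i) = h_n(\mathbf{x}; \boldsymbol{\Theta}).$$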
2.2.1 Stochastic Mirror Descent Updates
Mirror descent algorithms, introduced by Nemirovskii and Yudin [1983], and their stochastic variants are commonly used iterative methods that exploit a potential function to impose certain attributes in the resulting model. Assume $\mathcal{X} = \mathbb{R}^p$, $\mathcal{Y} = \mathbb{R}$, and $\mathcal{S} = \mathbb{R}^d$. Here, we consider the mirror descent updates initialized at $\boldsymbol{\Theta}^0 \in \mathcal{S}^n$ where, for $i = 1, 2, \ldots, n$, the updates are defined as,
$$\nabla\psi(\boldsymbol{\theta}^{k+1}_i) = \nabla\psi(\boldsymbol{\theta}^{k}_i) - \frac{\tau}{n}\, \nabla_{\boldsymbol{\theta}_i} \ell\big(y_{k+1}, h_n(\mathbf{x}_{k+1}; \boldsymbol{\Theta}^k)\big), \tag{8}$$
and $k$ runs from $0$ to $K-1$. Here $\psi(\cdot)$ is a strongly convex and differentiable potential function, $\tau$ is a fixed scaling of the step-size, and $\boldsymbol{\Theta}^k$ denotes the parameters after the $k$th update. Strong convexity of $\psi(\cdot)$ ensures the invertibility of its gradient, which makes the updates well-defined (see Bubeck [2015]).
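To make the update (8) concrete, the sketch below performs one SMD step for all particles under the quadratic loss and the GLM-type model $\sigma(\mathbf{x}, \boldsymbol{\theta}) = \tanh(\mathbf{x}^\mathsf{T}\boldsymbol{\theta})$; the $q$-norm potential $\psi(\boldsymbol{\theta}) = \frac{1}{q}\|\boldsymbol{\theta}\|_q^q$ is an illustrative choice and not necessarily the potential used in the experiments reported later.

```python
import numpy as np

def grad_psi(Theta, q=3.0):
    # mirror map: gradient of the q-norm potential psi(theta) = (1/q) * ||theta||_q^q
    return np.sign(Theta) * np.abs(Theta) ** (q - 1)

def grad_psi_inv(U, q=3.0):
    # inverse mirror map; for q > 1, t -> sign(t)|t|^(q-1) is strictly increasing, hence invertible
    return np.sign(U) * np.abs(U) ** (1.0 / (q - 1))

def smd_step(Theta, x, y, tau, q=3.0):
    # One update of Eq. (8) for every particle, with l(u, v) = (1/2)(u - v)^2
    # and sigma(x, theta) = tanh(x^T theta).
    n = Theta.shape[0]
    acts = np.tanh(Theta @ x)                      # sigma(x, theta_i), shape (n,)
    h = acts.mean()                                # ensemble output h_n(x; Theta)
    # grad_{theta_i} l(y, h_n) = (h - y) * (1/n) * (1 - acts_i^2) * x
    grads = (h - y) * (1.0 - acts ** 2)[:, None] * x[None, :] / n
    duals = grad_psi(Theta, q) - (tau / n) * grads
    return grad_psi_inv(duals, q)
```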
After evaluating the gradient of the loss function (w.r.t. $\boldsymbol{\theta}_i$) and using the identity (6), we can rewrite the SMD updates for $i = 1, 2, \ldots, n$ and $k = 0, \ldots, K-1$ as
$$\begin{cases} \hat{\rho}^{k}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{\boldsymbol{\theta}^{k}_i}, \\[4pt] \nabla\psi(\boldsymbol{\theta}^{k+1}_i) = \nabla\psi(\boldsymbol{\theta}^{k}_i) + \frac{\tau}{n}\, F\big(\boldsymbol{\theta}^{k}_i, \hat{\rho}^{k}_n, z_{k+1}\big), \end{cases} \tag{9}$$
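Comparing (8) and (9), $F(\boldsymbol{\theta}^{k}_i, \hat{\rho}^{k}_n, z_{k+1})$ plays the role of the negative per-particle loss gradient expressed through the empirical measure $\hat{\rho}^{k}_n$. A minimal online training loop in this particle form, with the same illustrative potential and quadratic-loss choices as above and a synthetic teacher generating the data, might read as follows.

```python
import numpy as np

rng = np.random.default_rng(3)
p = d = 5
n, K, tau, q = 200, 1000, 0.5, 3.0

grad_psi = lambda T: np.sign(T) * np.abs(T) ** (q - 1)
grad_psi_inv = lambda U: np.sign(U) * np.abs(U) ** (1.0 / (q - 1))

Theta = 0.1 * rng.normal(size=(n, d))      # Theta^0; rho_hat^0_n is its empirical measure
w_true = rng.normal(size=d)                # a toy teacher that generates the labels

for k in range(K):                         # one fresh sample z_{k+1} = (x, y) per step
    x = rng.normal(size=p)
    y = np.tanh(x @ w_true)
    acts = np.tanh(Theta @ x)              # sigma(x, theta_i^k) for all particles
    h = acts.mean()                        # h_n(x; Theta^k) = <rho_hat^k_n, sigma(x, .)>
    grads = (h - y) * (1.0 - acts ** 2)[:, None] * x[None, :] / n
    # dual-space update as in Eq. (8)/(9); -grads corresponds to the F term here
    Theta = grad_psi_inv(grad_psi(Theta) - (tau / n) * grads)
```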