In particular, we address the impact of the potential function used in SMD on the parameters of the average ensemble
model. Introduced by Nemirovskii and Yudin [1983] for convex optimization problems with underlying geometric
structure, SMD is a generalization of stochastic gradient descent (SGD) in which the updates are mirrored to a dual
domain through a potential function. It has been shown that the choice of the potential function plays a pivotal role in
determining the implicit bias of SMD in the interpolating regime Azizan et al. [2019], Gunasekar et al. [2020a].
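As a point of reference, the SMD update with a differentiable, strictly convex potential $\psi$ and step size $\eta$ is commonly written (in our notation here, which may differ from that of the later sections) as
\[
\nabla \psi(\mathbf{w}_{k+1}) = \nabla \psi(\mathbf{w}_k) - \eta\, \nabla_{\mathbf{w}} \ell(\mathbf{w}_k; \mathbf{x}_k, y_k),
\]
so that the gradient step is taken in the dual (mirror) domain $\nabla\psi(\mathbf{w})$; choosing $\psi(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$ makes the mirror map the identity and recovers SGD.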
1.1 Contributions and Prior Work
In this paper, we develop a mean-field type result for average ensemble models trained with SMD. We consider a
general class of convex loss functions that includes the squared loss as a special case. In particular, we derive a partial
differential equation that describes the "time evolution" of the distribution of the parameters in the network as training
progresses. This distribution can be used to compute the risk, or generalization error, of the trained model. Our result
generalizes analogous mean-field results for SGD, the key difference being that the Hessian of the potential function
now appears in the dynamics. We further give a geometric interpretation at the "distributional" level by showing an equivalent formulation of
our result as a geometric flow on an appropriate Riemannian manifold. We finally provide numerical results to illustrate
the applicability of the method and to showcase the effect of the potential function on the performance and implicit bias
of the resulting trained network.
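For context, in the SGD case studied by Mei et al. [2018] the limiting object is a continuity equation for the parameter distribution $\rho_t$, which in that setting reads roughly
\[
\partial_t \rho_t = 2\xi(t)\, \nabla_{\theta} \cdot \big( \rho_t \, \nabla_{\theta} \Psi(\theta; \rho_t) \big), \qquad \Psi(\theta; \rho) = V(\theta) + \int U(\theta, \bar{\theta})\, \rho(d\bar{\theta}),
\]
where $\xi(t)$ is a step-size scaling and $V$, $U$ are data-dependent potentials. Informally, the generalization developed in this paper modifies this Euclidean divergence structure so that the Hessian of the mirror potential appears; the precise statement is deferred to the later sections.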
A mean-field characterization of two-layer neural networks trained with SGD under quadratic loss was studied by Mei
et al. [2018], Sirignano and Spiliopoulos [2020], and Rotskoff and Vanden-Eijnden [2019]. A similar line of work is
given by Chizat and Bach [2018], where more general loss functions are considered for the gradient descent algorithm. Mean-field
results with a similar flavor to ours have been given in Raginsky and Bouvrie [2012] and Borovykh et al. [2020]
in the context of distributed convex optimization. These papers consider the problem of minimizing a fixed objective
function with noisy information by many "interacting optimizers" coupled through an interaction matrix. While
insightful, the setting presented in these papers does not cover the case of learning with average ensemble models. An
important connection between mirror descent and natural gradient descent on the dual (Riemannian) space was established
by Raskutti and Mukherjee [2014]. Works by Gunasekar et al. [2020a] and Gunasekar et al. [2020b] advanced this
geometric viewpoint by showing a correspondence between continuous-time mirror descent and Riemannian gradient
flow on the primal domain governed by the Hessian matrix of the potential function.
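Concretely, this correspondence can be summarized as follows (in our notation): the continuous-time mirror descent dynamics with potential $\psi$ and objective $L$ satisfy
\[
\frac{d}{dt}\, \nabla \psi(\mathbf{w}(t)) = -\nabla L(\mathbf{w}(t)) \quad \Longleftrightarrow \quad \dot{\mathbf{w}}(t) = -\big(\nabla^2 \psi(\mathbf{w}(t))\big)^{-1} \nabla L(\mathbf{w}(t)),
\]
so the trajectory is a Riemannian gradient flow on the primal domain with metric tensor $\nabla^2 \psi(\mathbf{w})$.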
2 Preliminaries
2.1 Notations
We gather here the basic notations used. Vectors are presented with bold lower-case letters, and bold upper-case letters are
reserved for matrices. For a vector $\mathbf{v}$, $v_i$ denotes its $i$th entry, and $\|\mathbf{v}\|_p$ is its $\ell_p$ norm. $\mathbb{S}^d_{++}$ is the set of symmetric and
positive-definite matrices of dimension $d$. For a set $\mathcal{S}$ in $\mathbb{R}^d$ and $k \in \mathbb{N}$, $\mathcal{C}^k_b(\mathcal{S})$ denotes the space of bounded, $k$-times differentiable functions with continuous $k$th derivative, $\mathcal{P}(\mathcal{S})$ denotes the space of Borel probability measures on $\mathcal{S}$, and $\mathcal{P}_m(\mathcal{S})$ denotes the probability measures
with bounded $m$th moment. For a random vector $Z$ in $\mathbb{R}^p$, $\mu(dz)$ indicates its probability measure. $\delta_x$ is a Dirac mass at the point $x$. The integral of a
function $f : \mathcal{S} \to \mathbb{R}$ with respect to a measure $\rho$ on $\mathcal{S}$ is denoted as
\[
\langle \rho, f \rangle := \int f(x)\, \rho(dx). \qquad (1)
\]
Let $\{\rho_n\}_{n \in \mathbb{N}}$ be a sequence of probability measures defined on $\mathcal{S} \subseteq \mathbb{R}^d$. The sequence is said to converge weakly
to the measure $\rho^\star$ if and only if
\[
\lim_{n \to \infty} \big| \langle \rho_n, f \rangle - \langle \rho^\star, f \rangle \big| = 0, \quad \text{for all } f \in \mathcal{C}_b(\mathcal{S}).
\]
2.2 Problem Setup
Consider an online learning setting where we sequentially observe a fresh sample of data, i.e., $(\mathbf{x}_k, y_k)$ for $k = 1, 2, \ldots$.
Here $\mathbf{x}_k \in \mathcal{X} \subseteq \mathbb{R}^p$ is the feature vector, and $y_k \in \mathcal{Y} \subseteq \mathbb{R}$ is the label. As a shorthand, we will interchangeably use
$\mathbf{z} := (\mathbf{x}, y)$. Our goal is to learn a model that generates an estimate of the label given the feature vector as its input. A
common approach for modeling is known as the average ensemble, where the output is generated by taking the average
of simpler models that only differ in their parameters. This approach is advantageous from the bias-variance perspective.
Since simpler models have a low variance (yet potentially a high bias), the average ensemble generates a low-variance
estimate of the labels with a reasonably good bias.
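As an illustration (the notation here is ours and merely indicative), an average ensemble of $n$ base models $\sigma(\cdot\,; \mathbf{w})$, each with its own parameter vector $\mathbf{w}_i$, produces the estimate
\[
\hat{y}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \sigma(\mathbf{x}; \mathbf{w}_i),
\]
where the constituent models share the same architecture and differ only through their parameters $\mathbf{w}_1, \ldots, \mathbf{w}_n$.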