PAC-BAYESIAN LEARNING OF OPTIMIZATION ALGORITHMS

Michael Sucker
Department of Mathematics
University of Tübingen
michael.sucker@math.uni-tuebingen.de

Peter Ochs
Department of Mathematics
University of Tübingen
ochs@math.uni-tuebingen.de
ABSTRACT
We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-bounds) and an explicit trade-off between a high probability of convergence and a high convergence speed. Even in the limit case, where convergence is guaranteed, our learned optimization algorithms provably outperform related algorithms based on a (deterministic) worst-case analysis. Our results rely on PAC-Bayes bounds for general, unbounded loss-functions based on exponential families. By generalizing existing ideas, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility of finding a global minimum, which enables the algorithmic realization of the learning procedure. As a proof-of-concept, we learn hyperparameters of standard optimization algorithms to empirically underline our theory.
1 Introduction
Let $\ell(\cdot, \theta)$ be an instance of a class of functions $(\ell(\cdot, \theta))_{\theta \in \Theta}$. We consider the minimization problem:
$$\min_{x \in \mathbb{R}^n} \ell(x, \theta). \tag{1}$$
Our goal is the construction of an algorithm $\mathcal{A}(\alpha, \theta)$, depending on some hyperparameters $\alpha$, that is provably the best (on average) for the given class of problems. We contrast the majority of approaches in continuous optimization in two ways:
i) Classical optimization theory studies the worst-case behaviour, which guarantees the same convergence for all problems that arise:
$$\alpha^* \in \arg\min_{\alpha \in \mathcal{H}} \sup_{\theta \in \Theta} \ell(\mathcal{A}(\alpha, \theta), \theta).$$
This is often accompanied by rough estimates and ignores the fact that some problems are more likely to occur than others. On the other hand, by using the additional information that $\theta$ is a realization of some random variable $\mathfrak{S}$, we instead seek the average case in the form of the mean function, usually called the risk:
$$\alpha^* \in \arg\min_{\alpha \in \mathcal{H}} \mathbb{E}_{\mathfrak{S}}\big[\ell(\mathcal{A}(\alpha, \mathfrak{S}), \mathfrak{S})\big].$$
From an optimization perspective, this is a distinct approach leading to performance guarantees in expectation or with high probability over the draw of new problem instances. This allows us to exploit features of the considered class of problems beyond analytically accessible quantities such as the Lipschitz constant (of the gradient) or the strong convexity modulus, which are usually pessimistic and hard to compute.
ii) Instead of analytically constructing an algorithm driven by intricate worst-case estimates, we train our algorithm (by learning) to be the best one on some samples $\{\ell(\cdot, \theta_i)\}_{i=1}^N$ and prove that the performance generalizes, in a suitable sense (PAC-Bayes), to the random function $\ell(\cdot, \mathfrak{S})$. This type of problem, i.e., minimizing the expected loss, is naturally found in the whole area of machine learning and cannot be solved directly, since the mean function is generally unknown. Consequently, one typically solves an approximate problem like empirical risk minimization (see the sketch after this list) in the hope that the solution found there will transfer:
$$\alpha^* \in \arg\min_{\alpha \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \ell(\mathcal{A}(\alpha, \theta_i), \theta_i).$$
However, through this, one is left with the problem of generalization, which is one of the key problems in machine learning in general. Therefore, one of the main concerns of learning-to-optimize is generalization bounds. A famous framework to provide such bounds is the PAC-Bayes framework, which allows for giving high-probability bounds on the true risk relative to the empirical risk.
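To make the empirical risk minimization in ii) concrete, the following minimal sketch (our own illustration on a hypothetical family of one-dimensional quadratics, not the learning procedure proposed in this paper) selects a gradient-descent step-size by a grid search over the empirical risk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem class: ell(x, theta) = 0.5 * theta * x^2 with random curvature theta.
thetas = rng.uniform(0.5, 2.0, size=100)  # samples theta_1, ..., theta_N (stand-in for P_S)

def A(alpha, theta, x0=1.0, n_iters=20):
    """Gradient descent with step-size alpha on ell(x, theta), unrolled for n_iters steps."""
    x = x0
    for _ in range(n_iters):
        x -= alpha * theta * x  # gradient of 0.5 * theta * x^2 is theta * x
    return x

def empirical_risk(alpha):
    """(1/N) * sum_i ell(A(alpha, theta_i), theta_i)."""
    return np.mean([0.5 * t * A(alpha, t) ** 2 for t in thetas])

# Approximate the argmin over a grid of candidate step-sizes.
grid = np.linspace(0.05, 1.0, 50)
alpha_star = grid[np.argmin([empirical_risk(a) for a in grid])]
print(f"learned step-size: {alpha_star:.3f}")
```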
In this paper, we apply the PAC-Bayesian theory to the setting of learning-to-optimize. In doing so, we provide PAC-Bayesian generalization bounds for a general optimization algorithm on a general, unbounded loss function, and we show how one can trade off convergence guarantees for convergence speed. As a proof of concept, we illustrate our approach by learning, for example, the step-size $\tau$ and the inertial parameter $\beta$, i.e., $\alpha = (\tau, \beta)$, of a fixed number of iterations of the Heavy-ball update scheme given by:
$$x^{(k+1)} = \mathrm{HB}\big(x^{(k)}, x^{(k-1)}, \alpha, \theta\big) := x^{(k)} - \tau \nabla \ell(x^{(k)}, \theta) + \beta \big(x^{(k)} - x^{(k-1)}\big), \tag{2}$$
which generalizes Gradient Descent for $\beta = 0$.
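A direct transcription of update (2) into code, on an illustrative quadratic problem of our own choosing, might look as follows; setting beta = 0 recovers Gradient Descent:

```python
import numpy as np

def heavy_ball(grad, x0, tau, beta, n_iters):
    """Unrolls n_iters Heavy-ball steps:
    x^{(k+1)} = x^{(k)} - tau * grad(x^{(k)}) + beta * (x^{(k)} - x^{(k-1)})."""
    x_prev, x = x0, x0
    for _ in range(n_iters):
        x_prev, x = x, x - tau * grad(x) + beta * (x - x_prev)
    return x

# Illustrative instance: ell(x, theta) = 0.5 * ||x - theta||^2, so grad(x) = x - theta.
theta = np.array([1.0, -2.0])
x_hb = heavy_ball(lambda x: x - theta, x0=np.zeros(2), tau=0.3, beta=0.5, n_iters=50)
x_gd = heavy_ball(lambda x: x - theta, x0=np.zeros(2), tau=0.3, beta=0.0, n_iters=50)  # beta = 0: GD
print(x_hb, x_gd)
```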
1.1 Our Contributions

• We provide a general PAC-Bayes theorem for general, unbounded loss functions based on exponential families. In this framework, the role of the reference distribution (called the prior), the data dependence of the learned distribution (called the posterior), and the divergence term arise directly and naturally from the definition. Furthermore, this abstract approach allows for a unified implementation of the learning framework.
• We provide a principled way of excluding the case of the learned algorithm's divergence from the considerations, which in turn allows us to apply our PAC theorem under a modified objective. Based on this, we give a theoretically grounded way of ensuring a given (user-specified) convergence probability during learning. Taken together, this allows us to trade off convergence speed and the probability of convergence. To the best of our knowledge, both approaches are completely new and could also be very interesting for other learning approaches.
• We apply our PAC-Bayesian framework to the problem of learning-to-optimize and learn optimization algorithms by minimizing the PAC-Bayesian upper bound.
2 Related Work
The literature on both learning-to-optimize and the PAC-Bayes learning approach is vast. Hence, in the discussion of learning-to-optimize, we will mainly focus on approaches that provide certain theoretical guarantees. In particular, this excludes most model-free approaches, which replace the whole update step with a learnable mapping such as a neural network. Chen et al. (2021) provide a good overview of the variety of approaches in learning-to-optimize. Good introductory references for the PAC-Bayes approach are given by Guedj (2019) and Alquier (2021).
Learning-to-Optimize with Guarantees. Chen et al. (2021) point out that learned optimization methods may lack theoretical guarantees for the sake of convergence speed. That said, there are applications where a convergence guarantee is of the highest priority. To underline this problem, Moeller et al. (2019) provide an example where a purely learning-based approach fails to reconstruct the crucial details in medical image reconstruction. Also, they prove convergence of their method by restricting the output to descent directions, for which mathematical guarantees exist. The basic idea is to trace the learned object back to, or constrain it to, a mathematical object with convergence guarantees. Similarly, Sreehari et al. (2016) provide sufficient conditions under which the learned mapping is a proximal mapping. Related schemes under different assumptions and guarantees are given by Chan et al. (2016), Teodoro et al. (2017), Tirer and Giryes (2018), Buzzard et al. (2018), Ryu et al. (2019), Sun et al. (2019), Terris et al. (2021), and Cohen et al. (2021). A major advantage of these methods is the fact that the number of iterations is not restricted a priori. However, a major drawback is their restriction to specific algorithms and problems. Another approach, which limits the number of iterations, yet in principle can be applied to every iterative optimization algorithm, is unrolling, pioneered by Gregor and LeCun (2010). Xin et al. (2016) study the unrolled IHT algorithm and show that, under some assumptions, it is able to achieve a linear convergence rate. Likewise, Chen et al. (2018) establish a linear convergence rate for the unrolled ISTA. However, a difficulty in the theoretical analysis of unrolled algorithms is actually the notion of convergence itself, such that one rather has to consider the generalization performance. Only a few works have addressed this: either directly by means of Rademacher complexity (Chen et al., 2020), or indirectly in the form of a stability analysis (Kobler et al., 2020), as algorithmic stability is linked to generalization and learnability (Bousquet and Elisseeff, 2000, 2002; Shalev-Shwartz et al., 2010). We consider the model-based approach of unrolling a general iterative optimization algorithm and provide generalization guarantees in the form of PAC-bounds.
PAC-Bounds through Change-of-Measure. The PAC-Bayesian framework allows us to give high-probability bounds on the risk, either as an oracle bound or as an empirical bound. The key ingredients are so-called change-of-measure inequalities. The choice of such an inequality strongly influences the corresponding bound. The one used most often is based on a variational representation of the Kullback–Leibler divergence due to Donsker and Varadhan (1975), employed, for example, by Catoni (2004, 2007). Yet, not all bounds are based on a variational representation, i.e., holding uniformly over all posterior distributions (Rivasplata et al., 2020). However, most bounds involve the Kullback–Leibler divergence as a measure of proximity, e.g., those by McAllester (2003b,a), Seeger (2002), Langford and Shawe-Taylor (2002), or the general PAC-Bayes bound of Germain et al. (2009). More recently, other divergences have been used: Honorio and Jaakkola (2014) prove an inequality for the χ²-divergence, which is also used by London (2017). Bégin et al. (2016) and Alquier and Guedj (2018) use the Rényi divergence (α-divergence). Ohnishi and Honorio (2021) propose several PAC-bounds based on the general notion of f-divergences, which includes the Kullback–Leibler, α-, and χ²-divergences. We develop a general PAC theorem based on exponential families. In this general approach, the role of prior, posterior, divergence, and data dependence will be given naturally. Moreover, this approach allows us to implement a general learning framework that can be applied to a wide variety of algorithms.
Boundedness of the Loss Function. A major drawback of many of the existing PAC-Bayes bounds is the assumption of a bounded loss-function. However, this assumption is mainly there to apply some exponential moment inequality like the Hoeffding or Bernstein inequality (Rivasplata et al., 2020; Alquier, 2021). Several ways have been developed to solve this problem: Germain et al. (2009) explicitly include the exponential moment in the bound, Alquier et al. (2016) use so-called Hoeffding and Bernstein assumptions, and Catoni (2004) restricts to the sub-Gaussian or sub-Gamma case. Another possibility, of which we make use here, is to ensure the generalization or exponential moment bounds by properties of the algorithm in question. London (2017) uses algorithmic stability to provide PAC-Bayes bounds for SGD. We consider suitable properties of optimization algorithms aside from algorithmic stability to ensure the exponential moment bounds. To the best of our knowledge, this has not been done before.
Minimization of the PAC-Bound. The PAC-bound is a relative bound and relates the risk to other terms such as the empirical risk. Yet, it does not directly say anything about the actual numbers. Thus, one aims to minimize the bound: Langford and Caruana (2001) compute non-vacuous numerical generalization bounds through a combination of PAC-bounds with a sensitivity analysis. Dziugaite and Roy (2017) extend this by minimizing the PAC-bound directly. Pérez-Ortiz et al. (2021) also consider learning by minimizing the PAC-Bayes bound and provide very tight generalization bounds. Thiemann et al. (2017) are able to solve the minimization problem resulting from their PAC-bound by alternating minimization. Further, they provide sufficient conditions under which the resulting minimization problem is quasi-convex. We also follow this approach and consider learning as minimization of the PAC-bound, however, applied to the context of learning-to-optimize.
Choice of the Prior. A common difficulty in learning with PAC-Bayes bounds is the choice of the prior distribution, as it heavily influences the performance of the learned models and the generalization bound (Catoni, 2004; Dziugaite et al., 2021; Pérez-Ortiz et al., 2021). In part, this is due to the fact that the divergence term can dominate the bound, keeping the posterior close to the prior. This leads to the idea of choosing a data- or distribution-dependent prior (Seeger, 2002; Parrado-Hernández et al., 2012; Lever et al., 2013; Dziugaite and Roy, 2018; Pérez-Ortiz et al., 2021). As we also found the choice of the prior distribution to be crucial for the performance of our learned algorithms, we use a data-dependent prior. Further, we point out how the prior is essential in preserving necessary properties during learning. It is key to controlling the trade-off between convergence guarantee and convergence speed.
More Generalization Bounds. There are many areas of research that study generalization bounds which have not been discussed here. Importantly, the vast field of stochastic optimization (SO) provides generalization bounds for specific algorithms. The main differences from our setting are the learning approach and the assumptions made:

Learning approach: In most cases, the concrete algorithms studied in SO generate a single point by either minimizing the (regularized) empirical risk functional over a possibly large dataset, or by repeatedly updating the point estimate based on a newly drawn (small) batch of samples. Then, one studies the properties of this point in terms of the stationarity measure of the true risk functional (see, e.g., Bottou et al. (2018); Davis and Drusvyatskiy (2022); Bianchi et al. (2022)).

Assumptions: Since the setting in SO is more explicit, more assumptions have to be made. Typical assumptions in SO are (weak) convexity (Shalev-Shwartz et al., 2009; Davis and Drusvyatskiy, 2019), bounded gradients (Défossez et al., 2022), bounded noise (Davis and Drusvyatskiy, 2022), or at least smoothness (Kavis et al., 2022), just to name a few.
We provide a principled way to learn a distribution over the hyperparameters of an abstract algorithm under weak assumptions. Further, the methodology is independent of a concrete implementation and of the concrete choice of hyperparameters. Moreover, we explicitly go beyond analytically tractable quantities.
3 Preliminaries and Notation
If not further specified, we will endow every topological space $X$ with the corresponding Borel $\sigma$-algebra $\mathcal{B}(X)$. If we consider a product space $X \times Y$ of two measurable spaces $(X, \mathcal{A})$ and $(Y, \mathcal{B})$, we endow it with the product $\sigma$-algebra $\mathcal{A} \otimes \mathcal{B}$. We use Fraktur font to denote random variables. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, $\Theta$ be a Polish space, and
$$\mathfrak{S} : (\Omega, \mathcal{F}, \mathbb{P}) \longrightarrow \Theta$$
be a random variable. Its distribution is denoted by $\mathbb{P}_{\mathfrak{S}}$, following the general notation $\mathbb{P}_{\mathfrak{X}}$ for the distribution of a random variable $\mathfrak{X}$. Integration w.r.t. $\mathbb{P}_{\mathfrak{X}}$ is denoted by $\mathbb{E}_{\mathfrak{X}}[g] := \mathbb{E}_{\mathfrak{X}}[g(\mathfrak{X})] := \int g(x)\, \mathbb{P}_{\mathfrak{X}}(dx)$. Finally, $\mathbb{1}_A$ denotes the indicator function of a set $A$, which is one for $x \in A$ and zero otherwise, and $\log$ denotes the natural logarithm.
Definition 3.1. Let $N \in \mathbb{N}$. Further, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $\mathfrak{S}_i : (\Omega, \mathcal{F}, \mathbb{P}) \longrightarrow \Theta$, $i = 1, \ldots, N$, be random variables. A measurable function
$$\mathfrak{D}_N : (\Omega, \mathcal{F}, \mathbb{P}) \longrightarrow \Big( \prod_{i=1}^N \Theta,\ \bigotimes_{i=1}^N \mathcal{B}(\Theta) \Big), \qquad \omega \mapsto \prod_{i=1}^N \mathfrak{S}_i(\omega),$$
is called a dataset. If the induced distribution $\mathbb{P}_{\mathfrak{D}_N}$ factorizes into the product of the marginals, i.e., if it satisfies $\mathbb{P}_{\mathfrak{D}_N} = \bigotimes_{i=1}^N \mathbb{P}_{\mathfrak{S}_i}$, it is called independent, and if, additionally, it satisfies $\mathbb{P}_{\mathfrak{D}_N} = \bigotimes_{i=1}^N \mathbb{P}_{\mathfrak{S}}$, it is called i.i.d.
Notation 3.2. The space $(\prod_{i=1}^N \Theta, \bigotimes_{i=1}^N \mathcal{B}(\Theta))$ will be denoted by $(\mathcal{D}_N, \mathcal{B}(\mathcal{D}_N))$. Since $\bigotimes_{i=1}^N \mathcal{B}(\Theta)$ is indeed the Borel $\sigma$-algebra of $\mathcal{D}_N$, it will not be mentioned anymore.
In the PAC-Bayesian framework, generalization bounds typically involve a so-called posterior distribution, which in turn is referred to as a data-dependent distribution. Often, this term is left unspecified. However, as also pointed out by Rivasplata et al. (2020), it is an instance of a Markov kernel. Another commonly used name is regular conditional probability, following the definition of a regular conditional distribution (Catoni, 2004; Alquier, 2008).
Definition 3.3. Let $\mathfrak{D}_N : (\Omega, \mathcal{F}, \mathbb{P}) \longrightarrow \mathcal{D}_N$ be a dataset and $\mathcal{H}$ a Polish space. A Markov kernel from $\mathcal{D}_N$ to $\mathcal{H}$ is called a data-dependent distribution.
Remark 3.4. The assumption of a Polish space is not very restrictive (for practical considerations) and sufficient to
ensure the existence of such Markov kernels. Both definitions can be found in the supplementary material C.1.
The following theory will be based on exponential families, which are a special class of probability distributions with
a specific, mathematically convenient form.
Definition 3.5. Let $\Lambda \subset \mathbb{R}^k$. A family of probability measures $(\mathbb{Q}_\lambda)_{\lambda \in \Lambda}$ on a measurable space $(\mathcal{H}, \mathcal{B}(\mathcal{H}))$ is called an exponential family, if there is a dominating probability measure $\mathbb{P}_{\mathfrak{H}}$, measurable functions $\eta_1, \ldots, \eta_k : \Lambda \to \mathbb{R}$, a measurable function $A : \Lambda \to \mathbb{R}_{>0}$, and measurable functions $T_1, \ldots, T_k : \mathcal{H} \to \mathbb{R}$ and $h : \mathcal{H} \to \mathbb{R}_{>0}$, such that every $\mathbb{Q}_\lambda$ has a $\mathbb{P}_{\mathfrak{H}}$-density of the form:
$$\frac{d\mathbb{Q}_\lambda}{d\mathbb{P}_{\mathfrak{H}}}(\alpha) = h(\alpha)\, A(\lambda) \exp\big(\langle \eta(\lambda), T(\alpha) \rangle\big), \quad \mathbb{P}_{\mathfrak{H}}\text{-a.s.},$$
where $\eta := (\eta_1, \ldots, \eta_k)$ and $T := (T_1, \ldots, T_k)$.
In the PAC-Bayesian setting, the dominating measure $\mathbb{P}_{\mathfrak{H}}$ is usually referred to as the prior, and every distribution $\mathbb{Q} \ll \mathbb{P}_{\mathfrak{H}}$ is referred to as a posterior. Note that this deviates from the standard definitions of prior and posterior in Bayesian statistics, which are linked through the likelihood. We use a similar notation as in Barndorff-Nielsen (2014) and denote
$$c(\lambda) := \int_{\mathcal{H}} h(\alpha) \exp\big(\langle \eta(\lambda), T(\alpha) \rangle\big)\, \mathbb{P}_{\mathfrak{H}}(d\alpha), \qquad \kappa(\lambda) := \log\big(c(\lambda)\big), \tag{3}$$
or, in short, $\kappa = \log(c)$. It holds that $A(\lambda) = c(\lambda)^{-1}$.
Remark 3.6. In the case $h = 1$ and $\eta(\lambda) = \lambda$, $c$ is the Laplace transform (moment-generating function) of the push-forward measure $\mathbb{P}_{\mathfrak{H}} \circ T^{-1}$ and $\kappa$ the corresponding log-Laplace transform (cumulant-generating function). Further, if $\eta(\lambda)$ actually describes a lower-dimensional manifold or curve in $\mathbb{R}^k$, $(\mathbb{Q}_\lambda)_{\lambda \in \Lambda}$ is sometimes also called a curved exponential family (Efron, 1975).
Remark 3.7. In the following, we will consider data-dependent exponential families, i.e., the sufficient statistic $T$ additionally depends on a dataset $\mathfrak{D}_N$. Hence, $c$ and $\kappa$ also depend on $\mathfrak{D}_N$. Thus, we will assume that $T : \mathcal{H} \times \mathcal{D}_N \to \mathbb{R}^k$ is measurable. In this case, $\mathbb{Q}_\lambda$ is indeed a data-dependent distribution.
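For concreteness, a classical instance of such a data-dependent family (widely used in PAC-Bayes, e.g., by Catoni (2007), though not necessarily the specific choice made later in this paper) is the Gibbs posterior, obtained with $k = 1$, $h \equiv 1$, $\eta(\lambda) = -\lambda$, and $T(\alpha) = \hat{R}_{\mathfrak{D}_N}(\alpha)$, the empirical risk:
$$\frac{d\mathbb{Q}_\lambda}{d\mathbb{P}_{\mathfrak{H}}}(\alpha) = \frac{\exp\big(-\lambda\, \hat{R}_{\mathfrak{D}_N}(\alpha)\big)}{c(\lambda)}, \qquad \hat{R}_{\mathfrak{D}_N}(\alpha) := \frac{1}{N} \sum_{i=1}^N \ell\big(\mathcal{A}(\alpha, \mathfrak{S}_i), \mathfrak{S}_i\big),$$
so that hyperparameters with small empirical risk receive exponentially more posterior mass as $\lambda$ grows.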
Notation 3.8. For notational simplicity, we will omit the dependence of $\mathbb{Q}_\lambda$, $T$, $c$, and $\kappa$ on the dataset $\mathfrak{D}_N$.
For the rest of the paper, we assume that we are given an exponential family $(\mathbb{Q}_\lambda)_{\lambda \in \Lambda}$ w.r.t. $\mathbb{P}_{\mathfrak{H}}$ of the form:
$$\frac{d\mathbb{Q}_\lambda}{d\mathbb{P}_{\mathfrak{H}}}(\alpha) = \frac{h(\alpha)}{c(\lambda)} \exp\big(\langle \eta(\lambda), T(\alpha) \rangle\big). \tag{4}$$
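Since $c(\lambda)$ is an integral against the prior, it can be approximated by sampling from $\mathbb{P}_{\mathfrak{H}}$. The following minimal sketch (our own illustration, not code from the paper) uses a standard Gaussian prior on a scalar hyperparameter, $h \equiv 1$, $\eta(\lambda) = \lambda$, and a hypothetical bounded sufficient statistic $T$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) sufficient statistic T(alpha), e.g. a clipped
# negative empirical risk; bounded so the exponential moment is finite.
def T(alpha):
    return -np.minimum(alpha**2, 10.0)

def estimate_c_kappa(lam, n_samples=100_000):
    """Monte Carlo estimate of c(lam) = E_{alpha ~ P_H}[exp(lam * T(alpha))] and kappa = log c."""
    alphas = rng.standard_normal(n_samples)  # samples from the prior P_H = N(0, 1)
    c_hat = np.mean(np.exp(lam * T(alphas)))
    return c_hat, np.log(c_hat)

c_hat, kappa_hat = estimate_c_kappa(lam=0.5)
print(f"c(0.5) ~ {c_hat:.4f}, kappa(0.5) ~ {kappa_hat:.4f}")
```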
Finally, since the loss-function is neither assumed to be bounded nor to satisfy any self-bounding or bounded-difference
property, the following result will be needed. It states that non-negative random variables with finite second moment
satisfy a one-sided sub-Gaussian inequality. It can be found as Exercise 2.9 on page 47 in the book by Boucheron et al.
(2013).
Lemma 3.9. Let $\mathfrak{X}$ be a non-negative random variable with finite second moment. Then, for every $\lambda > 0$ it holds:
$$\mathbb{E}\Big[\exp\big(-\lambda(\mathfrak{X} - \mathbb{E}[\mathfrak{X}])\big)\Big] \le \exp\Big(\frac{\lambda^2}{2}\, \mathbb{E}[\mathfrak{X}^2]\Big).$$
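As a quick numerical sanity check (our own illustration, plugging in empirical moments), the inequality can be verified by Monte Carlo for an exponentially distributed $\mathfrak{X}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of Lemma 3.9 for X ~ Exp(1), which is non-negative with E[X] = 1, E[X^2] = 2.
x = rng.exponential(scale=1.0, size=1_000_000)
for lam in [0.1, 0.5, 1.0, 2.0]:
    lhs = np.mean(np.exp(-lam * (x - x.mean())))      # E[exp(-lam * (X - E[X]))], empirical
    rhs = np.exp(0.5 * lam**2 * np.mean(x**2))        # exp(lam^2 / 2 * E[X^2]), empirical
    print(f"lambda={lam}: LHS={lhs:.4f} <= RHS={rhs:.4f} -> {lhs <= rhs}")
```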
4 Problem Setup
As described in the introduction, we aim to solve the following minimization problem with a random objective function $\ell$ under Assumption 1:
$$\min_{x \in \mathbb{R}^n} \ell(x, \mathfrak{S}).$$
Assumption 1. $\Theta$ is a Polish space, $\mathfrak{S} : (\Omega, \mathcal{F}, \mathbb{P}) \longrightarrow \Theta$ is a random variable, and $\ell : \mathbb{R}^n \times \Theta \to \mathbb{R}$ is measurable and non-negative.
Remark 4.1. The non-negativity assumption is not restrictive, as any lower-bounded function $f$ can be rescaled according to $\ell(x, \theta) := f(x, \theta) - \inf_{x \in \mathbb{R}^n} f(x, \theta)$.
To actually solve this problem for a concrete realization $\theta$, we apply an optimization algorithm $\mathcal{A}$ to $\ell$. For this, we consider a similar setting as in London (2017), i.e., randomized algorithms are considered as deterministic algorithms with randomized hyperparameters.

Definition 4.2. Let $\mathcal{H}$ be a Polish space. A measurable function
$$\mathcal{A} : \mathcal{H} \times \mathbb{R}^n \times \Theta \to \mathbb{R}^n$$
is called a parametric algorithm, and $\mathcal{H}$ is called the hyperparameter space of $\mathcal{A}$. A random variable
$$\mathfrak{H} : (\Omega, \mathcal{F}, \mathbb{P}) \to \mathcal{H}$$
is called a hyperparameter of $\mathcal{A}$.

Remark 4.3. $\mathcal{H}$ corresponds to the hyperparameters of the algorithm, $\mathbb{R}^n$ to the initialization, and $\Theta$ to the parameters of the problem instance.
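In code, a parametric algorithm is simply a deterministic map $(\alpha, x^{(0)}, \theta) \mapsto x \in \mathbb{R}^n$, and a randomized algorithm arises by sampling the hyperparameter. A minimal sketch (our own illustration, reusing the unrolled Heavy-ball scheme (2) on a quadratic test problem) could look as follows:

```python
import numpy as np

rng = np.random.default_rng(2)

def algorithm(alpha, x0, theta, n_iters=50):
    """Parametric algorithm A : H x R^n x Theta -> R^n.
    Here: n_iters unrolled Heavy-ball steps on ell(x, theta) = 0.5 * ||x - theta||^2."""
    tau, beta = alpha
    x_prev, x = x0, x0
    for _ in range(n_iters):
        x_prev, x = x, x - tau * (x - theta) + beta * (x - x_prev)
    return x

# Randomized algorithm = deterministic algorithm + randomized hyperparameters:
# draw H = (tau, beta) from a distribution Q over the hyperparameter space H.
theta, x0 = np.array([1.0, -2.0]), np.zeros(2)
alpha_sample = (rng.uniform(0.05, 0.5), rng.uniform(0.0, 0.9))  # one draw of H ~ Q
print(algorithm(alpha_sample, x0, theta))
```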
Learning now refers to learning a distribution $\mathbb{Q}$ on $\mathcal{H}$. For this, one needs a reference distribution:

Assumption 2. $\mathcal{A}$ is a parametric optimization algorithm with hyperparameter space $\mathcal{H}$. The prior $\mathbb{P}_{\mathfrak{H}}$ is induced by hyperparameters $\mathfrak{H} : (\Omega, \mathcal{F}, \mathbb{P}) \to \mathcal{H}$ that are independent of the dataset $\mathfrak{D}_N$ and $\mathfrak{S}$. The initialization $x^{(0)} \in \mathbb{R}^n$ is given and fixed.
The initialization and the probability space $(\Omega, \mathcal{F}, \mathbb{P})$ will not be mentioned anymore. We define the risk of a randomized parametric algorithm in the usual way, as the expected loss over the draw of a problem instance:
$$\mathbb{E}_{\mathfrak{S}}\big[\ell\big(\mathcal{A}(\alpha, \mathfrak{S}), \mathfrak{S}\big)\big].$$