PAC-BAYESIAN LEARNING OF OPTIMIZATION ALGORITHMS

Michael Sucker
Department of Mathematics
University of Tübingen
michael.sucker@math.uni-tuebingen.de

Peter Ochs
Department of Mathematics
University of Tübingen
ochs@math.uni-tuebingen.de
ABSTRACT
We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-bounds) and an explicit trade-off between a high probability of convergence and a high convergence speed. Even in the limit case, where convergence is guaranteed, our learned optimization algorithms provably outperform related algorithms based on a (deterministic) worst-case analysis. Our results rely on PAC-Bayes bounds for general, unbounded loss-functions based on exponential families. By generalizing existing ideas, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility of finding a global minimum, which enables the algorithmic realization of the learning procedure. As a proof-of-concept, we learn hyperparameters of standard optimization algorithms to empirically underline our theory.
1 Introduction
Let $\ell(\cdot, \theta)$ be an instance of a class of functions $(\ell(\cdot, \theta))_{\theta \in \Theta}$. We consider the minimization problem:
$$\min_{x \in \mathbb{R}^n} \ell(x, \theta). \tag{1}$$
Our goal is the construction of an algorithm $\mathcal{A}(\alpha, \theta)$, depending on some hyperparameters $\alpha$, that is provably the best (on average) for the given class of problems. We contrast the majority of approaches in continuous optimization in two ways:
i) Classical optimization theory studies the worst-case behaviour, which guarantees the same convergence for all problems that arise:
$$\alpha^* \in \arg\min_{\alpha \in \mathcal{H}} \sup_{\theta \in \Theta} \ell(\mathcal{A}(\alpha, \theta), \theta).$$
This is often accompanied by rough estimates and ignores the fact that some problems are more likely to occur than others. On the other hand, by using the additional information that $\theta$ is a realization of some random variable $\mathfrak{S}$, we instead seek the average case in the form of the mean function, usually called the risk:
$$\alpha^* \in \arg\min_{\alpha \in \mathcal{H}} \mathbb{E}_{\mathfrak{S}}\big[\ell(\mathcal{A}(\alpha, \mathfrak{S}), \mathfrak{S})\big].$$
From an optimization perspective, this is a distinct approach leading to performance guarantees in expectation or with high probability over the draw of new problem instances. This allows us to exploit features of the considered class of problems beyond analytically accessible quantities such as the Lipschitz constant (of the gradient) or the strong convexity modulus, which are usually pessimistic and hard to compute.
ii) Instead of analytically constructing an algorithm driven by intricate worst-case estimates, we train our algorithm (by learning) to be the best one on some samples $\{\ell(\cdot, \theta_i)\}_{i=1}^N$ and prove that the performance generalizes, in a suitable sense (PAC-Bayes), to the random function $\ell(\cdot, \mathfrak{S})$. This type of problem, i.e., minimizing the expected loss, is naturally found in the whole area of machine learning and cannot be solved directly, since the mean function is generally unknown. Consequently, one typically solves an approximate problem like empirical risk minimization (see the sketch after this list) in the hope that the solution found there will transfer:
$$\alpha^* \in \arg\min_{\alpha \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \ell(\mathcal{A}(\alpha, \theta_i), \theta_i).$$
However, through this, one is left with the problem of generalization, which is one of the key problems in machine learning in general. Therefore, one of the main concerns of learning-to-optimize is generalization bounds. A famous framework to provide such bounds is the PAC-Bayes framework, which allows for giving high-probability bounds on the true risk relative to the empirical risk.
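To make the empirical risk minimization in ii) concrete, the following minimal sketch (our own illustration on a hypothetical family of one-dimensional quadratics, not the learning procedure proposed in this paper) selects a gradient-descent step-size by a grid search over the empirical risk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem class: ell(x, theta) = 0.5 * theta * x^2 with random curvature theta.
thetas = rng.uniform(0.5, 2.0, size=100)  # samples theta_1, ..., theta_N (stand-in for P_S)

def A(alpha, theta, x0=1.0, n_iters=20):
    """Gradient descent with step-size alpha on ell(x, theta), unrolled for n_iters steps."""
    x = x0
    for _ in range(n_iters):
        x -= alpha * theta * x  # gradient of 0.5 * theta * x^2 is theta * x
    return x

def empirical_risk(alpha):
    """(1/N) * sum_i ell(A(alpha, theta_i), theta_i)."""
    return np.mean([0.5 * t * A(alpha, t) ** 2 for t in thetas])

# Approximate the argmin over a grid of candidate step-sizes.
grid = np.linspace(0.05, 1.0, 50)
alpha_star = grid[np.argmin([empirical_risk(a) for a in grid])]
print(f"learned step-size: {alpha_star:.3f}")
```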
In this paper, we apply the PAC-Bayesian theory to the setting of learning-to-optimize. In doing so, we provide PAC-Bayesian generalization bounds for a general optimization algorithm on a general, unbounded loss function, and we show how one can trade off convergence guarantees for convergence speed. As a proof of concept, we illustrate our approach by learning, for example, the step-size $\tau$ and the inertial parameter $\beta$, i.e., $\alpha = (\tau, \beta)$, of a fixed number of iterations of the Heavy-ball update scheme given by:
$$x^{(k+1)} = \mathrm{HB}\big(x^{(k)}, x^{(k-1)}, \alpha, \theta\big) := x^{(k)} - \tau \nabla \ell(x^{(k)}, \theta) + \beta \big(x^{(k)} - x^{(k-1)}\big), \tag{2}$$
which generalizes Gradient Descent for $\beta = 0$.
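A direct transcription of update (2) into code, on an illustrative quadratic problem of our own choosing, might look as follows; setting beta = 0 recovers Gradient Descent:

```python
import numpy as np

def heavy_ball(grad, x0, tau, beta, n_iters):
    """Unrolls n_iters Heavy-ball steps:
    x^{(k+1)} = x^{(k)} - tau * grad(x^{(k)}) + beta * (x^{(k)} - x^{(k-1)})."""
    x_prev, x = x0, x0
    for _ in range(n_iters):
        x_prev, x = x, x - tau * grad(x) + beta * (x - x_prev)
    return x

# Illustrative instance: ell(x, theta) = 0.5 * ||x - theta||^2, so grad(x) = x - theta.
theta = np.array([1.0, -2.0])
x_hb = heavy_ball(lambda x: x - theta, x0=np.zeros(2), tau=0.3, beta=0.5, n_iters=50)
x_gd = heavy_ball(lambda x: x - theta, x0=np.zeros(2), tau=0.3, beta=0.0, n_iters=50)  # beta = 0: GD
print(x_hb, x_gd)
```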
1.1 Our Contributions

• We provide a general PAC-Bayes theorem for general, unbounded loss functions based on exponential families. In this framework, the role of the reference distribution (called the prior), the data dependence of the learned distribution (called the posterior), and the divergence term arise directly and naturally from the definition. Furthermore, this abstract approach allows for a unified implementation of the learning framework.
• We provide a principled way of excluding the case of the learned algorithm's divergence from the considerations, which in turn allows us to apply our PAC theorem under a modified objective. Based on this, we give a theoretically grounded way of ensuring a given (user-specified) convergence probability during learning. Taken together, this allows us to trade off convergence speed and the probability of convergence. To the best of our knowledge, both approaches are completely new and could also be very interesting for other learning approaches.
• We apply our PAC-Bayesian framework to the problem of learning-to-optimize and learn optimization algorithms by minimizing the PAC-Bayesian upper bound.
2 Related Work
The literature on both learning-to-optimize and the PAC-Bayes learning approach is vast. Hence, in the discussion of learning-to-optimize, we will mainly focus on approaches that provide certain theoretical guarantees. In particular, this excludes most model-free approaches, which replace the whole update step with a learnable mapping such as a neural network. Chen et al. (2021) provide a good overview of the variety of approaches in learning-to-optimize. Good introductory references for the PAC-Bayes approach are given by Guedj (2019) and Alquier (2021).
Learning-to-Optimize with Guarantees. Chen et al. (2021) point out that learned optimization methods may lack theoretical guarantees for the sake of convergence speed. That said, there are applications where a convergence guarantee is of the highest priority. To underline this problem, Moeller et al. (2019) provide an example where a purely learning-based approach fails to reconstruct the crucial details in medical image reconstruction. Also, they prove convergence of their method by restricting the output to descent directions, for which mathematical guarantees exist. The basic idea is to trace the learned object back to, or constrain it to, a mathematical object with convergence guarantees. Similarly, Sreehari et al. (2016) provide sufficient conditions under which the learned mapping is a proximal mapping. Related schemes under different assumptions and guarantees are given by Chan et al. (2016), Teodoro et al. (2017), Tirer and Giryes (2018), Buzzard et al. (2018), Ryu et al. (2019), Sun et al. (2019), Terris et al. (2021), and Cohen et al. (2021). A major advantage of these methods is the fact that the number of iterations is not restricted a priori. However, a major drawback is their restriction to specific algorithms and problems. Another approach, which limits the number of iterations, yet in principle can be applied to every iterative optimization algorithm, is unrolling, pioneered by Gregor and LeCun (2010). Xin et al. (2016) study the unrolled IHT algorithm and show that, under some assumptions, it is able to achieve a linear convergence rate. Likewise, Chen et al. (2018) establish a linear convergence rate for the unrolled ISTA. However, a difficulty in the theoretical analysis of unrolled algorithms is actually the notion of convergence itself, such that one rather has to consider the generalization performance. Only a few works have addressed this: either directly by means of Rademacher complexity (Chen et al., 2020), or indirectly in the form of a stability analysis (Kobler et al., 2020), as algorithmic stability is linked to generalization and learnability (Bousquet and Elisseeff, 2000, 2002; Shalev-Shwartz et al., 2010). We consider the model-based approach of unrolling a general iterative optimization algorithm and provide generalization guarantees in the form of PAC-bounds.
PAC-Bounds through Change-of-Measure. The PAC-Bayesian framework allows us to give high-probability bounds on the risk, either as an oracle bound or as an empirical bound. The key ingredients are so-called change-of-measure inequalities. The choice of such an inequality strongly influences the corresponding bound. The one used most often is based on a variational representation of the Kullback–Leibler divergence due to Donsker and Varadhan (1975), employed, for example, by Catoni (2004, 2007). Yet, not all bounds are based on a variational representation, i.e., holding uniformly over all posterior distributions (Rivasplata et al., 2020). However, most bounds involve the Kullback–Leibler divergence as a measure of proximity, e.g., those by McAllester (2003b,a), Seeger (2002), Langford and Shawe-Taylor (2002), or the general PAC-Bayes bound of Germain et al. (2009). More recently, other divergences have been used: Honorio and Jaakkola (2014) prove an inequality for the χ²-divergence, which is also used by London (2017). Bégin et al. (2016) and Alquier and Guedj (2018) use the Rényi divergence (α-divergence). Ohnishi and Honorio (2021) propose several PAC-bounds based on the general notion of f-divergences, which includes the Kullback–Leibler, α-, and χ²-divergences. We develop a general PAC theorem based on exponential families. In this general approach, the role of prior, posterior, divergence, and data dependence will be given naturally. Moreover, this approach allows us to implement a general learning framework that can be applied to a wide variety of algorithms.
Boundedness of the Loss Function. A major drawback of many of the existing PAC-Bayes bounds is the assumption of a bounded loss-function. However, this assumption is mainly there to apply some exponential moment inequality like the Hoeffding or Bernstein inequality (Rivasplata et al., 2020; Alquier, 2021). Several ways have been developed to solve this problem: Germain et al. (2009) explicitly include the exponential moment in the bound, Alquier et al. (2016) use so-called Hoeffding and Bernstein assumptions, and Catoni (2004) restricts to the sub-Gaussian or sub-Gamma case. Another possibility, of which we make use here, is to ensure the generalization or exponential moment bounds by properties of the algorithm in question. London (2017) uses algorithmic stability to provide PAC-Bayes bounds for SGD. We consider suitable properties of optimization algorithms aside from algorithmic stability to ensure the exponential moment bounds. To the best of our knowledge, this has not been done before.
Minimization of the PAC-Bound. The PAC-bound is a relative bound and relates the risk to other terms such as the empirical risk. Yet, it does not directly say anything about the actual numbers. Thus, one aims to minimize the bound: Langford and Caruana (2001) compute non-vacuous numerical generalization bounds through a combination of PAC-bounds with a sensitivity analysis. Dziugaite and Roy (2017) extend this by minimizing the PAC-bound directly. Pérez-Ortiz et al. (2021) also consider learning by minimizing the PAC-Bayes bound and provide very tight generalization bounds. Thiemann et al. (2017) are able to solve the minimization problem resulting from their PAC-bound by alternating minimization. Further, they provide sufficient conditions under which the resulting minimization problem is quasi-convex. We also follow this approach and consider learning as minimization of the PAC-bound, however, applied to the context of learning-to-optimize.
Choice of the Prior. A common difficulty in learning with PAC-Bayes bounds is the choice of the prior distribution, as it heavily influences the performance of the learned models and the generalization bound (Catoni, 2004; Dziugaite et al., 2021; Pérez-Ortiz et al., 2021). In part, this is due to the fact that the divergence term can dominate the bound, keeping the posterior close to the prior. This leads to the idea of choosing a data- or distribution-dependent prior (Seeger, 2002; Parrado-Hernández et al., 2012; Lever et al., 2013; Dziugaite and Roy, 2018; Pérez-Ortiz et al., 2021). As we also found the choice of the prior distribution to be crucial for the performance of our learned algorithms, we use a data-dependent prior. Further, we point out how the prior is essential in preserving necessary properties during learning. It is key to controlling the trade-off between convergence guarantee and convergence speed.
More Generalization Bounds. There are many areas of research that study generalization bounds which have not been discussed here. Importantly, the vast field of stochastic optimization (SO) provides generalization bounds for specific algorithms. The main differences from our setting are the learning approach and the assumptions made:

Learning approach: In most cases, the concrete algorithms studied in SO generate a single point by either minimizing the (regularized) empirical risk functional over a possibly large dataset, or by repeatedly updating the point estimate based on a newly drawn (small) batch of samples. Then, one studies the properties of this point in terms of the stationarity measure of the true risk functional (see, e.g., Bottou et al. (2018); Davis and Drusvyatskiy (2022); Bianchi et al. (2022)).

Assumptions: Since the setting in SO is more explicit, more assumptions have to be made. Typical assumptions in SO are (weak) convexity (Shalev-Shwartz et al., 2009; Davis and Drusvyatskiy, 2019), bounded gradients (Défossez et al., 2022), bounded noise (Davis and Drusvyatskiy, 2022), or at least smoothness (Kavis et al., 2022), just to name a few.
We provide a principled way to learn a distribution over the hyperparameters of an abstract algorithm under weak assumptions. Further, the methodology is independent of a concrete implementation and of the concrete choice of hyperparameters. Moreover, we explicitly go beyond analytically tractable quantities.
3 Preliminaries and Notation
If not further specified, we will endow every topological space $X$ with the corresponding Borel $\sigma$-algebra $\mathcal{B}(X)$. If we consider a product space $X \times Y$ of two measurable spaces $(X, \mathcal{A})$ and $(Y, \mathcal{B})$, we endow it with the product $\sigma$-algebra $\mathcal{A} \otimes \mathcal{B}$. We use Fraktur font to denote random variables. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, $\Theta$ be a Polish space, and
$$\mathfrak{S} : (\Omega, \mathcal{F}, \mathbb{P}) \longrightarrow \Theta$$
be a random variable. Its distribution is denoted by $\mathbb{P}_{\mathfrak{S}}$, following the general notation $\mathbb{P}_{\mathfrak{X}}$ for the distribution of a random variable $\mathfrak{X}$. Integration w.r.t. $\mathbb{P}_{\mathfrak{X}}$ is denoted by $\mathbb{E}_{\mathfrak{X}}[g] := \mathbb{E}_{\mathfrak{X}}[g(\mathfrak{X})] := \int g(x)\, \mathbb{P}_{\mathfrak{X}}(dx)$. Finally, $\mathbb{1}_A$ denotes the indicator function of a set $A$, which is one for $x \in A$ and zero otherwise, and $\log$ denotes the natural logarithm.
Definition 3.1. Let $N \in \mathbb{N}$. Further, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $\mathfrak{S}_i : (\Omega, \mathcal{F}, \mathbb{P}) \longrightarrow \Theta$, $i = 1, \ldots, N$, be random variables. A measurable function
$$\mathfrak{D}_N : (\Omega, \mathcal{F}, \mathbb{P}) \longrightarrow \Big( \prod_{i=1}^N \Theta,\ \bigotimes_{i=1}^N \mathcal{B}(\Theta) \Big), \qquad \omega \mapsto \prod_{i=1}^N \mathfrak{S}_i(\omega),$$
is called a dataset. If the induced distribution $\mathbb{P}_{\mathfrak{D}_N}$ factorizes into the product of the marginals, i.e., if it satisfies $\mathbb{P}_{\mathfrak{D}_N} = \bigotimes_{i=1}^N \mathbb{P}_{\mathfrak{S}_i}$, it is called independent, and if, additionally, it satisfies $\mathbb{P}_{\mathfrak{D}_N} = \bigotimes_{i=1}^N \mathbb{P}_{\mathfrak{S}}$, it is called i.i.d.
Notation 3.2. The space $(\prod_{i=1}^N \Theta, \bigotimes_{i=1}^N \mathcal{B}(\Theta))$ will be denoted by $(\mathcal{D}_N, \mathcal{B}(\mathcal{D}_N))$. Since $\bigotimes_{i=1}^N \mathcal{B}(\Theta)$ is indeed the Borel $\sigma$-algebra of $\mathcal{D}_N$, it will not be mentioned anymore.
In the PAC-Bayesian framework, generalization bounds typically involve a so-called posterior distribution, which in turn is referred to as a data-dependent distribution. Often, this term is left unspecified. However, as also pointed out by Rivasplata et al. (2020), it is an instance of a Markov kernel. Another commonly used name is regular conditional probability, following the definition of a regular conditional distribution (Catoni, 2004; Alquier, 2008).
Definition 3.3. Let $\mathfrak{D}_N : (\Omega, \mathcal{F}, \mathbb{P}) \longrightarrow \mathcal{D}_N$ be a dataset and $\mathcal{H}$ a Polish space. A Markov kernel from $\mathcal{D}_N$ to $\mathcal{H}$ is called a data-dependent distribution.
Remark 3.4. The assumption of a Polish space is not very restrictive (for practical considerations) and sufficient to
ensure the existence of such Markov kernels. Both definitions can be found in the supplementary material C.1.
The following theory will be based on exponential families, which are a special class of probability distributions with
a specific, mathematically convenient form.
Definition 3.5. Let $\Lambda \subset \mathbb{R}^k$. A family of probability measures $(\mathbb{Q}_\lambda)_{\lambda \in \Lambda}$ on a measurable space $(\mathcal{H}, \mathcal{B}(\mathcal{H}))$ is called an exponential family, if there is a dominating probability measure $\mathbb{P}_{\mathfrak{H}}$, measurable functions $\eta_1, \ldots, \eta_k : \Lambda \to \mathbb{R}$, a measurable function $A : \Lambda \to \mathbb{R}_{>0}$, and measurable functions $T_1, \ldots, T_k : \mathcal{H} \to \mathbb{R}$ and $h : \mathcal{H} \to \mathbb{R}_{>0}$, such that every $\mathbb{Q}_\lambda$ has a $\mathbb{P}_{\mathfrak{H}}$-density of the form:
$$\frac{d\mathbb{Q}_\lambda}{d\mathbb{P}_{\mathfrak{H}}}(\alpha) = h(\alpha)\, A(\lambda) \exp\big(\langle \eta(\lambda), T(\alpha) \rangle\big), \quad \mathbb{P}_{\mathfrak{H}}\text{-a.s.},$$
where $\eta := (\eta_1, \ldots, \eta_k)$ and $T := (T_1, \ldots, T_k)$.
In the PAC-Bayesian setting, the dominating measure $\mathbb{P}_{\mathfrak{H}}$ is usually referred to as the prior, and every distribution $\mathbb{Q} \ll \mathbb{P}_{\mathfrak{H}}$ is referred to as a posterior. Note that this deviates from the standard definitions of prior and posterior in Bayesian statistics, which are linked through the likelihood. We use a similar notation as in Barndorff-Nielsen (2014) and denote
$$c(\lambda) := \int_{\mathcal{H}} h(\alpha) \exp\big(\langle \eta(\lambda), T(\alpha) \rangle\big)\, \mathbb{P}_{\mathfrak{H}}(d\alpha), \qquad \kappa(\lambda) := \log\big(c(\lambda)\big), \tag{3}$$
or, in short, $\kappa = \log(c)$. It holds that $A(\lambda) = c(\lambda)^{-1}$.
Remark 3.6. In the case $h = 1$ and $\eta(\lambda) = \lambda$, $c$ is the Laplace transform (moment-generating function) of the push-forward measure $\mathbb{P}_{\mathfrak{H}} \circ T^{-1}$ and $\kappa$ the corresponding log-Laplace transform (cumulant-generating function). Further, if $\eta(\lambda)$ actually describes a lower-dimensional manifold or curve in $\mathbb{R}^k$, $(\mathbb{Q}_\lambda)_{\lambda \in \Lambda}$ is sometimes also called a curved exponential family (Efron, 1975).
Remark 3.7. In the following, we will consider data-dependent exponential families, i.e., the sufficient statistic $T$ additionally depends on a dataset $\mathfrak{D}_N$. Hence, $c$ and $\kappa$ also depend on $\mathfrak{D}_N$. Thus, we will assume that $T : \mathcal{H} \times \mathcal{D}_N \to \mathbb{R}^k$ is measurable. In this case, $\mathbb{Q}_\lambda$ is indeed a data-dependent distribution.
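For concreteness, a classical instance of such a data-dependent family (widely used in PAC-Bayes, e.g., by Catoni (2007), though not necessarily the specific choice made later in this paper) is the Gibbs posterior, obtained with $k = 1$, $h \equiv 1$, $\eta(\lambda) = -\lambda$, and $T(\alpha) = \hat{R}_{\mathfrak{D}_N}(\alpha)$, the empirical risk:
$$\frac{d\mathbb{Q}_\lambda}{d\mathbb{P}_{\mathfrak{H}}}(\alpha) = \frac{\exp\big(-\lambda\, \hat{R}_{\mathfrak{D}_N}(\alpha)\big)}{c(\lambda)}, \qquad \hat{R}_{\mathfrak{D}_N}(\alpha) := \frac{1}{N} \sum_{i=1}^N \ell\big(\mathcal{A}(\alpha, \mathfrak{S}_i), \mathfrak{S}_i\big),$$
so that hyperparameters with small empirical risk receive exponentially more posterior mass as $\lambda$ grows.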
Notation 3.8. For notational simplicity, we will omit the dependence of $\mathbb{Q}_\lambda$, $T$, $c$, and $\kappa$ on the dataset $\mathfrak{D}_N$.
For the rest of the paper, we assume that we are given an exponential family $(\mathbb{Q}_\lambda)_{\lambda \in \Lambda}$ w.r.t. $\mathbb{P}_{\mathfrak{H}}$ of the form:
$$\frac{d\mathbb{Q}_\lambda}{d\mathbb{P}_{\mathfrak{H}}}(\alpha) = \frac{h(\alpha)}{c(\lambda)} \exp\big(\langle \eta(\lambda), T(\alpha) \rangle\big). \tag{4}$$
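Since $c(\lambda)$ is an integral against the prior, it can be approximated by sampling from $\mathbb{P}_{\mathfrak{H}}$. The following minimal sketch (our own illustration, not code from the paper) uses a standard Gaussian prior on a scalar hyperparameter, $h \equiv 1$, $\eta(\lambda) = \lambda$, and a hypothetical bounded sufficient statistic $T$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) sufficient statistic T(alpha), e.g. a clipped
# negative empirical risk; bounded so the exponential moment is finite.
def T(alpha):
    return -np.minimum(alpha**2, 10.0)

def estimate_c_kappa(lam, n_samples=100_000):
    """Monte Carlo estimate of c(lam) = E_{alpha ~ P_H}[exp(lam * T(alpha))] and kappa = log c."""
    alphas = rng.standard_normal(n_samples)  # samples from the prior P_H = N(0, 1)
    c_hat = np.mean(np.exp(lam * T(alphas)))
    return c_hat, np.log(c_hat)

c_hat, kappa_hat = estimate_c_kappa(lam=0.5)
print(f"c(0.5) ~ {c_hat:.4f}, kappa(0.5) ~ {kappa_hat:.4f}")
```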
Finally, since the loss-function is neither assumed to be bounded nor to satisfy any self-bounding or bounded-difference
property, the following result will be needed. It states that non-negative random variables with finite second moment
satisfy a one-sided sub-Gaussian inequality. It can be found as Exercise 2.9 on page 47 in the book by Boucheron et al.
(2013).
Lemma 3.9. Let $\mathfrak{X}$ be a non-negative random variable with finite second moment. Then, for every $\lambda > 0$ it holds:
$$\mathbb{E}\Big[\exp\big(-\lambda(\mathfrak{X} - \mathbb{E}[\mathfrak{X}])\big)\Big] \le \exp\Big(\frac{\lambda^2}{2}\, \mathbb{E}[\mathfrak{X}^2]\Big).$$
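As a quick numerical sanity check (our own illustration, plugging in empirical moments), the inequality can be verified by Monte Carlo for an exponentially distributed $\mathfrak{X}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of Lemma 3.9 for X ~ Exp(1), which is non-negative with E[X] = 1, E[X^2] = 2.
x = rng.exponential(scale=1.0, size=1_000_000)
for lam in [0.1, 0.5, 1.0, 2.0]:
    lhs = np.mean(np.exp(-lam * (x - x.mean())))      # E[exp(-lam * (X - E[X]))], empirical
    rhs = np.exp(0.5 * lam**2 * np.mean(x**2))        # exp(lam^2 / 2 * E[X^2]), empirical
    print(f"lambda={lam}: LHS={lhs:.4f} <= RHS={rhs:.4f} -> {lhs <= rhs}")
```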
4 Problem Setup
As described in the introduction, we aim to solve the following minimization problem with a random objective function $\ell$ under Assumption 1:
$$\min_{x \in \mathbb{R}^n} \ell(x, \mathfrak{S}).$$
Assumption 1. $\Theta$ is a Polish space, $\mathfrak{S} : (\Omega, \mathcal{F}, \mathbb{P}) \longrightarrow \Theta$ is a random variable, and $\ell : \mathbb{R}^n \times \Theta \to \mathbb{R}$ is measurable and non-negative.
Remark 4.1. The non-negativity assumption is not restrictive, as any lower-bounded function $f$ can be rescaled according to $\ell(x, \theta) := f(x, \theta) - \inf_{x \in \mathbb{R}^n} f(x, \theta)$.
To actually solve this problem for a concrete realization $\theta$, we apply an optimization algorithm $\mathcal{A}$ to $\ell$. For this, we consider a similar setting as in London (2017), i.e., randomized algorithms are considered as deterministic algorithms with randomized hyperparameters.

Definition 4.2. Let $\mathcal{H}$ be a Polish space. A measurable function
$$\mathcal{A} : \mathcal{H} \times \mathbb{R}^n \times \Theta \to \mathbb{R}^n$$
is called a parametric algorithm, and $\mathcal{H}$ is called the hyperparameter space of $\mathcal{A}$. A random variable
$$\mathfrak{H} : (\Omega, \mathcal{F}, \mathbb{P}) \to \mathcal{H}$$
is called a hyperparameter of $\mathcal{A}$.

Remark 4.3. $\mathcal{H}$ corresponds to the hyperparameters of the algorithm, $\mathbb{R}^n$ to the initialization, and $\Theta$ to the parameters of the problem instance.
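In code, a parametric algorithm is simply a deterministic map $(\alpha, x^{(0)}, \theta) \mapsto x \in \mathbb{R}^n$, and a randomized algorithm arises by sampling the hyperparameter. A minimal sketch (our own illustration, reusing the unrolled Heavy-ball scheme (2) on a quadratic test problem) could look as follows:

```python
import numpy as np

rng = np.random.default_rng(2)

def algorithm(alpha, x0, theta, n_iters=50):
    """Parametric algorithm A : H x R^n x Theta -> R^n.
    Here: n_iters unrolled Heavy-ball steps on ell(x, theta) = 0.5 * ||x - theta||^2."""
    tau, beta = alpha
    x_prev, x = x0, x0
    for _ in range(n_iters):
        x_prev, x = x, x - tau * (x - theta) + beta * (x - x_prev)
    return x

# Randomized algorithm = deterministic algorithm + randomized hyperparameters:
# draw H = (tau, beta) from a distribution Q over the hyperparameter space H.
theta, x0 = np.array([1.0, -2.0]), np.zeros(2)
alpha_sample = (rng.uniform(0.05, 0.5), rng.uniform(0.0, 0.9))  # one draw of H ~ Q
print(algorithm(alpha_sample, x0, theta))
```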
Learning now refers to learning a distribution $\mathbb{Q}$ on $\mathcal{H}$. For this, one needs a reference distribution:

Assumption 2. $\mathcal{A}$ is a parametric optimization algorithm with hyperparameter space $\mathcal{H}$. The prior $\mathbb{P}_{\mathfrak{H}}$ is induced by hyperparameters $\mathfrak{H} : (\Omega, \mathcal{F}, \mathbb{P}) \to \mathcal{H}$ that are independent of the dataset $\mathfrak{D}_N$ and $\mathfrak{S}$. The initialization $x^{(0)} \in \mathbb{R}^n$ is given and fixed.
The initialization and the probability space $(\Omega, \mathcal{F}, \mathbb{P})$ will not be mentioned anymore. We define the risk of a randomized parametric algorithm in the usual way, as the expected loss over the draw of a problem instance:
$$\mathbb{E}_{\mathfrak{S}}\big[\ell\big(\mathcal{A}(\alpha, \mathfrak{S}), \mathfrak{S}\big)\big].$$