PAC-Bounds through Change-of-Measure. The PAC-Bayesian framework allows us to give high-probability
bounds on the risk, either as an oracle bound or as an empirical bound. The key ingredients are so-called change-
of-measure inequalities. The choice of such an inequality strongly influences the corresponding bound. The one used
most often is based on a variational representation of the Kullback–Leibler divergence due to Donsker and Varadhan
(1975), employed, for example, by Catoni (2004, 2007). Yet, not all bounds are based on a variational representation,
i.e., not all of them hold uniformly over all posterior distributions (Rivasplata et al., 2020). However, most bounds involve the
Kullback–Leibler divergence as a measure of proximity, e.g. those by McAllester (2003a,b), Seeger (2002), Langford
and Shawe-Taylor (2002), or the general PAC-Bayes bound of Germain et al. (2009). More recently, other divergences
have been used: Honorio and Jaakkola (2014) prove an inequality for the χ2-divergence, which is also used by
London (2017). Bégin et al. (2016) and Alquier and Guedj (2018) use the Rényi divergence (α-divergence). Ohnishi
and Honorio (2021) propose several PAC-bounds based on the general notion of f-divergences, which includes the
Kullback–Leibler, α-, and χ2-divergences. We develop a general PAC theorem based on exponential families. In this
general approach, the role of prior, posterior, divergence and data dependence will be given naturally. Moreover, this
approach allows us to implement a general learning framework that can be applied to a wide variety of algorithms.
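For concreteness, the change-of-measure inequality underlying many of these bounds is the Donsker–Varadhan variational representation: for probability measures $Q \ll P$,
$$\mathrm{KL}(Q \,\|\, P) = \sup_{h} \Big\{ \mathbb{E}_{\theta \sim Q}[h(\theta)] - \log \mathbb{E}_{\theta \sim P}\big[e^{h(\theta)}\big] \Big\},$$
where the supremum is taken over bounded measurable functions $h$. In particular, $\mathbb{E}_{\theta \sim Q}[h(\theta)] \le \mathrm{KL}(Q \,\|\, P) + \log \mathbb{E}_{\theta \sim P}[e^{h(\theta)}]$ holds simultaneously for all posteriors $Q$; instantiating $h$ as a scaled risk discrepancy and controlling the exponential moment on the right-hand side yields the classical PAC-Bayes bounds.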
Boundedness of the Loss Function. A major drawback of many of the existing PAC-Bayes bounds is the assumption
of a bounded loss function. However, this assumption mainly serves to apply an exponential moment inequality
such as the Hoeffding or Bernstein inequality (Rivasplata et al., 2020; Alquier, 2021). Several ways have been developed
to address this problem: Germain et al. (2009) explicitly include the exponential moment in the bound, Alquier et al.
(2016) use so-called Hoeffding and Bernstein assumptions, and Catoni (2004) restricts to the sub-Gaussian or sub-Gamma
case. Another possibility, of which we make use here, is to ensure the generalization or exponential moment bounds
by properties of the algorithm in question. London (2017) uses algorithmic stability to provide PAC-Bayes bounds
for SGD. We consider suitable properties of optimization algorithms aside from algorithmic stability to ensure the
exponential moment bounds. To the best of our knowledge, this has not been done before.
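To make the role of boundedness explicit: if the loss $\ell$ takes values in $[a, b]$, Hoeffding's lemma controls the exponential moment arising after the change of measure, namely
$$\log \mathbb{E}\big[e^{\lambda(\ell - \mathbb{E}[\ell])}\big] \le \frac{\lambda^2 (b - a)^2}{8} \qquad \text{for all } \lambda \in \mathbb{R}.$$
Without boundedness, this estimate must be replaced, e.g., by sub-Gaussian or sub-Gamma assumptions, or, as in our approach, by properties of the optimization algorithm itself.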
Minimization of the PAC-Bound. The PAC-bound is a relative bound: it relates the risk to other terms such as
the empirical risk. Yet, it does not by itself guarantee small numerical values. Thus, one aims to minimize the
bound: Langford and Caruana (2001) compute non-vacuous numerical generalization bounds through a combination
of PAC-bounds with a sensitivity analysis. Dziugaite and Roy (2017) extend this by minimizing the PAC-bound
directly. Pérez-Ortiz et al. (2021) also consider learning by minimizing the PAC-Bayes bound and provide very tight
generalization bounds. Thiemann et al. (2017) are able to solve the minimization problem resulting from their PAC-
bound by alternating minimization. Further, they provide sufficient conditions under which the resulting minimization
problem is quasi-convex. We also follow this approach and consider learning as minimization of the PAC bound,
here applied to the context of learning-to-optimize.
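To illustrate what learning by bound minimization can look like, the following is a minimal sketch, not the procedure of any of the cited works: it assumes a loss clipped to $[0, 1]$, an isotropic Gaussian prior and posterior, and minimizes a McAllester-style objective, empirical risk plus $\sqrt{(\mathrm{KL}(Q\,\|\,P) + \log(2\sqrt{n}/\delta))/(2n)}$, over the posterior parameters via the reparameterization trick.

```python
import math
import torch

# Sketch of PAC-Bayes bound minimization (illustrative assumptions: loss in
# [0, 1], prior P = N(w0, s0^2 I), posterior Q = N(mu, diag(exp(2*rho))),
# one posterior sample per step via the reparameterization trick).

def kl_gaussians(mu, rho, w0, s0):
    """KL(Q || P) for diagonal Gaussians; rho is the posterior log-std."""
    var_q = torch.exp(2 * rho)
    return 0.5 * torch.sum(
        (var_q + (mu - w0) ** 2) / s0**2 - 1 + 2 * (math.log(s0) - rho)
    )

def pac_bayes_objective(emp_risk, kl, n, delta):
    """Empirical risk plus a McAllester-style complexity term."""
    return emp_risk + torch.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

# Toy setup: linear model with a clipped (hence bounded) squared loss.
torch.manual_seed(0)
n, d, delta, s0 = 200, 5, 0.05, 1.0
X, y = torch.randn(n, d), torch.randn(n)
w0 = torch.zeros(d)                                 # prior mean
mu = w0.clone().requires_grad_(True)                # posterior mean
rho = torch.full((d,), -3.0, requires_grad=True)    # posterior log-std
opt = torch.optim.Adam([mu, rho], lr=1e-2)

for step in range(500):
    opt.zero_grad()
    w = mu + torch.exp(rho) * torch.randn(d)             # sample from Q
    emp_risk = ((X @ w - y) ** 2).clamp(max=1.0).mean()  # loss bounded in [0, 1]
    loss = pac_bayes_objective(emp_risk, kl_gaussians(mu, rho, w0, s0), n, delta)
    loss.backward()
    opt.step()
```

In practice, one would average several posterior samples per step and may prefer a tighter objective, e.g. the λ-parameterized bound of Thiemann et al. (2017), which admits alternating minimization; the sketch only shows the mechanics.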
Choice of the Prior. A common difficulty in learning with PAC-Bayes bounds is the choice of the prior distribution,
as it heavily influences the performance of the learned models and the generalization bound (Catoni, 2004; Dziugaite
et al., 2021; Pérez-Ortiz et al., 2021). In part, this is due to the fact that the divergence term can dominate the bound,
keeping the posterior close to the prior. This leads to the idea of choosing a data- or distribution-dependent prior (Seeger,
2002; Parrado-Hernández et al., 2012; Lever et al., 2013; Dziugaite and Roy, 2018; Pérez-Ortiz et al., 2021). As we
also found the choice of the prior distribution to be crucial for the performance of our learned algorithms, we use
a data-dependent prior. Further, we point out how the prior is essential in preserving necessary properties during
learning, and how it is key to controlling the trade-off between convergence guarantees and convergence speed.
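Continuing the illustrative sketch from above, a common way to obtain such a data-dependent prior while keeping the bound valid is sample splitting: the prior is fit on one part of the data, and the PAC-Bayes objective is evaluated on the held-out part only, so the prior stays independent of the data entering the bound.

```python
# Continuation of the sketch above (same assumptions): a data-dependent prior
# via sample splitting. The prior may depend on (X_prior, y_prior) only; the
# PAC-Bayes objective is then evaluated on (X_bound, y_bound) with n_bound.
n_prior = n // 2
X_prior, y_prior = X[:n_prior], y[:n_prior]
X_bound, y_bound = X[n_prior:], y[n_prior:]
n_bound = n - n_prior

# Hypothetical choice: a least-squares fit on the prior split as prior mean.
w0 = torch.linalg.lstsq(X_prior, y_prior.unsqueeze(-1)).solution.squeeze(-1)
```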
More Generalization Bounds. There are many areas of research that study generalization bounds and have not been
discussed here. Importantly, the vast field of stochastic optimization (SO) provides generalization bounds for specific
algorithms. The main differences from our setting are the learning approach and the assumptions made:
• Learning approach: In most cases, the concrete algorithms studied in SO generate a single point by either
minimizing the (regularized) empirical risk functional over a possibly large dataset, or by repeatedly updating the
point estimate based on a newly drawn (small) batch of samples. Then, one studies the properties of this point in
terms of the stationarity measure of the true risk functional (see e.g. Bottou et al. (2018); Davis and Drusvyatskiy
(2022); Bianchi et al. (2022)).
• Assumptions: Since the setting in SO is more explicit, more assumptions have to be made. Typical assumptions in
SO are (weak) convexity (Shalev-Shwartz et al., 2009; Davis and Drusvyatskiy, 2019), bounded gradients (Défossez
et al., 2022), bounded noise (Davis and Drusvyatskiy, 2022), or at least smoothness (Kavis et al., 2022), just to name
a few.