
exploited by [17], which derived samplewise versions of these bounds.¹ Intriguingly, these bounds are both easier to evaluate than their hypothesis-based counterparts, due to the lower dimensionality of the random variables involved, and quantitatively tighter for deep neural networks. In particular, while bounds involving information measures based on the hypothesis space tend to increase as training progresses [11, 18], the bound of [17] remains stable.
The results that have so far been derived for the CMI setting pertain only to the (weighted) difference between the population loss and the training loss, or to its squared or absolute value [8, 9, 14, 16, 17, 18]. In the PAC-Bayesian literature, other types of discrepancy measures have been considered and shown to result in tighter bounds on the population loss. For example, [19, 20, 21, 22] consider the binary KL divergence (i.e., the KL divergence between two Bernoulli distributions) with parameters given by the training and population loss, respectively, while [23, 24] consider arbitrary jointly convex functions. Finally, [25] allows for arbitrary functions. It should be noted that, in all of these results, a moment-generating function that depends on the selected function has to be controlled for the bound to be computable.
Recently, the e-CMI framework was proven to be expressive enough to allow one to rederive known results in learning theory, e.g., generalization bounds expressed in terms of algorithmic stability, VC dimension, and related complexity measures [17, 26]. Tightening and extending e-CMI bounds, which is the main objective of this paper, has the potential to further increase the unifying nature of the e-CMI framework.
Contributions
Leveraging a basic inequality involving a generic convex function of two random variables (Lemma 1), we establish several novel disintegrated, samplewise e-CMI bounds on the average generalization error. Specifically, we present i) a square-root bound (Theorem 1) on the generalization error, together with a mean-squared-error extension, which tightens the bound recently reported in [17]; ii) a linear bound (Theorem 3), which tightens the bound given in [18]; and iii) a binary KL bound (Theorem 4), which is a natural extension to the e-CMI setting of a well-known bound in the PAC-Bayes literature [22]. While the derivation of the first two bounds involves an adaptation of results available in the literature, to obtain the binary KL bound we need a novel concentration inequality involving independent but not identically distributed random variables (Lemma 2). As an additional contribution, we show how to adapt the techniques presented in the paper to obtain high-probability (rather than average) e-CMI bounds (Theorem 7). Furthermore, we illustrate the expressiveness of the e-CMI framework by using our bounds to recover average and high-probability generalization bounds for multiclass classification with finite Natarajan dimension (Theorem 8). Finally, we conduct numerical experiments on MNIST and CIFAR10, which reveal that, for some deep learning scenarios, the binary KL bound yields a tighter characterization of the population loss than the square-root and linear bounds.
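To give some intuition for why a binary KL bound can be tighter than a square-root bound in the low-training-loss regime, the following Python sketch compares two schematic estimates of the population loss. The bound shapes, the constant c (standing in for a generic per-sample information budget), and the loss values are illustrative assumptions and do not reproduce Theorems 1, 3, or 4.

    import numpy as np

    def kl_bin(p, q, eps=1e-12):
        # Binary KL divergence d(p||q) between Bernoulli(p) and Bernoulli(q).
        p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def kl_inverse(p_hat, c, tol=1e-9):
        # Largest q >= p_hat with d(p_hat||q) <= c, found by bisection
        # (d(p_hat||q) is increasing in q on [p_hat, 1]).
        lo, hi = p_hat, 1.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if kl_bin(p_hat, mid) <= c else (lo, mid)
        return lo

    # Illustrative numbers (assumed, not taken from the paper): a small
    # training loss and an information budget c per sample.
    train_loss, c = 0.02, 0.01
    sqrt_style = train_loss + np.sqrt(c / 2)  # schematic square-root-type estimate
    kl_style = kl_inverse(train_loss, c)      # schematic binary-KL-inversion estimate
    print(f"square-root-style: {sqrt_style:.3f}, binary-KL-style: {kl_style:.3f}")

With these illustrative numbers, the KL-inversion estimate is noticeably smaller than the square-root-type estimate; the gap shrinks as the training loss grows, which is consistent with KL-type bounds being most advantageous when the training loss is small.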
Preliminaries and Notation
We let D(P||Q) denote the KL divergence between the two probability measures P and Q. This quantity is well-defined if P is absolutely continuous with respect to Q. When P and Q are Bernoulli distributions with parameters p and q, respectively, we let D(P||Q) = d(p||q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)), and we refer to d(p||q) as the binary KL divergence. For two random variables X and Y with joint distribution P_{XY} and respective marginals P_X and P_Y, we let I(X;Y) = D(P_{XY}||P_X P_Y) denote their mutual information. Throughout the paper, we use uppercase letters to denote random variables and lowercase letters to denote their realizations. We denote the conditional joint distribution of X and Y given an instance Z = z by P_{XY|Z=z}, and the corresponding conditional distribution for the product of marginals by P_{X|Z=z} P_{Y|Z=z}. Furthermore, we let I^z(X;Y) = D(P_{XY|Z=z}||P_{X|Z=z} P_{Y|Z=z}), which is referred to as the disintegrated mutual information [11]. Its expectation is the conditional mutual information I(X;Y|Z) = E_Z[I^Z(X;Y)].
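As a concrete numerical illustration of these definitions, the sketch below (using made-up toy distributions that play no role in the paper) computes I(X;Y) as D(P_{XY}||P_X P_Y) for discrete variables, evaluates the disintegrated mutual information I^z(X;Y) for each value z of a conditioning variable, and averages it to obtain I(X;Y|Z).

    import numpy as np

    def kl(P, Q, eps=1e-12):
        # KL divergence D(P||Q) for discrete distributions stored as arrays.
        P, Q = np.asarray(P, float), np.asarray(Q, float)
        mask = P > 0
        return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

    def mutual_information(P_xy):
        # I(X;Y) = D(P_XY || P_X P_Y) for a joint pmf given as a 2-D array.
        P_x = P_xy.sum(axis=1, keepdims=True)
        P_y = P_xy.sum(axis=0, keepdims=True)
        return kl(P_xy, P_x * P_y)

    # Toy conditional joint pmfs of (X, Y) given Z = z (assumed values):
    # given Z = 0, X and Y are correlated; given Z = 1, they are independent.
    P_z = np.array([0.5, 0.5])
    P_xy_given_z = [np.array([[0.4, 0.1], [0.1, 0.4]]),
                    np.array([[0.25, 0.25], [0.25, 0.25]])]

    # Disintegrated mutual information I^z(X;Y) and its average I(X;Y|Z).
    I_z = np.array([mutual_information(P) for P in P_xy_given_z])
    print("I^z(X;Y):", I_z, " I(X;Y|Z):", float(P_z @ I_z))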
CMI Setting
Let Z be the sample space and let D denote the data-generating distribution. Consider a supersample Z̃ ∈ Z^{n×2}, where each entry is generated independently from D. For convenience, we index the columns of Z̃ starting from 0 and the rows starting from 1. Furthermore, we denote the ith
¹ While [17] considers the information stored in predictions rather than losses, referred to as the f-CMI, the derivations therein can be adapted to obtain bounds that depend on the losses, as we clarify in Section 2.2.