A New Family of Generalization Bounds
Using Samplewise Evaluated CMI
Fredrik Hellström
Chalmers University of Technology
Gothenburg, Sweden
frehells@chalmers.se
Giuseppe Durisi
Chalmers University of Technology
Gothenburg, Sweden
durisi@chalmers.se
Abstract
We present a new family of information-theoretic generalization bounds, in which the training loss and the population loss are compared through a jointly convex function. This function is upper-bounded in terms of the disintegrated, samplewise, evaluated conditional mutual information (CMI), an information measure that depends on the losses incurred by the selected hypothesis, rather than on the hypothesis itself, as is common in probably approximately correct (PAC)-Bayesian results. We demonstrate the generality of this framework by recovering and extending previously known information-theoretic bounds. Furthermore, using the evaluated CMI, we derive a samplewise, average version of Seeger's PAC-Bayesian bound, where the convex function is the binary KL divergence. In some scenarios, this novel bound results in a tighter characterization of the population loss of deep neural networks than previous bounds. Finally, we derive high-probability versions of some of these average bounds. We demonstrate the unifying nature of the evaluated CMI bounds by using them to recover average and high-probability generalization bounds for multiclass classification with finite Natarajan dimension.
1 Introduction
Information-theoretic generalization bounds, i.e., generalization bounds that are expressed in terms of information-theoretic metrics such as the Kullback-Leibler (KL) divergence, mutual information, and conditional mutual information (CMI), have emerged as useful tools to obtain an accurate characterization of the performance of deep neural networks. To obtain such bounds, one compares a posterior, that is, the distribution over the hypotheses induced by the learning algorithm, to a reference distribution called the prior, an approach first introduced to provide probably approximately correct (PAC) bounds for Bayesian classifiers [1, 2, 3, 4]. The connection between these results and classic information-theoretic metrics was clarified in [5, 6], where the generalization gap, averaged with respect to the joint posterior and data distribution, is bounded in terms of the mutual information between the training data and the hypothesis. These bounds have since been extended in several ways [7, 8, 9, 10, 11, 12, 13, 14, 15].
A major step was taken in [16], in which a setting where the training set is randomly selected from a larger supersample is considered. We refer to this setup as the CMI setting, as it leads to generalization bounds in terms of the CMI between the hypothesis and the training set selection, given the supersample. It turns out that these bounds can be further tightened by observing that the selected hypothesis enters the derivation only through the loss that it induces on the supersample. Using this observation, [16] also derives bounds in terms of the information encoded in these losses rather than in the hypothesis itself, a quantity referred to as the evaluated CMI (e-CMI). Due to the data-processing inequality, these bounds are always tighter than the regular CMI bounds. This observation was recently further exploited by [17], which derived samplewise versions of these bounds.¹ Intriguingly, these bounds are both easier to evaluate than their hypothesis-based counterparts, due to the lower dimensionality of the random variables involved, and quantitatively tighter for deep neural networks. In particular, while bounds involving information measures based on the hypothesis space tend to increase as training progresses [11, 18], the bound of [17] remains stable.

¹While [17] considers the information stored in predictions rather than losses, referred to as the f-CMI, the derivations therein can be adapted to obtain bounds that depend on the losses, as we clarify in Section 2.2.
The results that have so far been derived for the CMI setting pertain only to the (weighted) difference between population loss and training loss, or to its squared or absolute value [8, 9, 14, 16, 17, 18]. In the PAC-Bayesian literature, other types of discrepancy measures have been considered, and shown to result in tighter bounds on the population loss. For example, [19, 20, 21, 22] consider the binary KL divergence (i.e., the KL divergence between two Bernoulli distributions) with parameters given by the training and population loss, respectively, while [23, 24] consider arbitrary jointly convex functions. Finally, [25] allows for arbitrary functions. It should be noted that in all of these results, a moment-generating function that depends on the selected function has to be controlled for the bound to be computable.
Recently, the e-CMI framework was proven to be expressive enough to allow one to rederive known results in learning theory, e.g., generalization bounds expressed in terms of algorithmic stability, VC dimension, and related complexity measures [17, 26]. Tightening and extending e-CMI bounds, which is the main objective of this paper, has the potential to further increase the unifying nature of the e-CMI framework.
Contributions
Leveraging a basic inequality involving a generic convex function of two random variables (Lemma 1), we establish several novel disintegrated, samplewise, e-CMI bounds on the average generalization error. Specifically, we present i) a square-root bound (Theorem 1) on the generalization error, together with a mean-squared error extension, which tightens the bound recently reported in [17]; ii) a linear bound (Theorem 3) that tightens the bound given in [18]; iii) a binary KL bound (Theorem 4), which is a natural extension to the e-CMI setting of a well-known bound in the PAC-Bayes literature [22]. While the derivation of the first two bounds involves an adaptation of results available in the literature, to obtain the binary KL bound we need a novel concentration inequality involving independent but not identically distributed random variables (Lemma 2). As an additional contribution, we show how to adapt the techniques presented in the paper to obtain high-probability (rather than average) e-CMI bounds (Theorem 7). Furthermore, we illustrate the expressiveness of the e-CMI framework by using our bounds to recover average and high-probability generalization bounds for multiclass classification with finite Natarajan dimension (Theorem 8). Finally, we conduct numerical experiments on MNIST and CIFAR10, which reveal that the binary KL bound results in a tighter characterization of the population loss compared to the square-root bound and the linear bound for some deep learning scenarios.
Preliminaries and Notation
We let $D(P\|Q)$ denote the KL divergence between the two probability measures $P$ and $Q$. This quantity is well-defined if $P$ is absolutely continuous with respect to $Q$. When $P$ and $Q$ are Bernoulli distributions with parameters $p$ and $q$ respectively, we let $D(P\|Q) = d(p\|q) = p\log(p/q) + (1-p)\log((1-p)/(1-q))$, and we refer to $d(p\|q)$ as the binary KL divergence. For two random variables $X$ and $Y$ with joint distribution $P_{XY}$ and respective marginals $P_X$ and $P_Y$, we let $I(X;Y) = D(P_{XY}\|P_X P_Y)$ denote their mutual information. Throughout the paper, we use uppercase letters to denote random variables and lowercase letters to denote their realizations. We denote the conditional joint distribution of $X$ and $Y$ given an instance $Z=z$ by $P_{XY|Z=z}$ and the corresponding conditional distribution for the product of marginals by $P_{X|Z=z}P_{Y|Z=z}$. Furthermore, we let $I^z(X;Y) = D(P_{XY|Z=z}\|P_{X|Z=z}P_{Y|Z=z})$, which is referred to as the disintegrated mutual information [11]. Its expectation is the conditional mutual information $I(X;Y|Z) = \mathbb{E}_Z\big[I^Z(X;Y)\big]$.
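To make the quantities above concrete, the following minimal Python sketch computes the binary KL divergence $d(p\|q)$ and the mutual information $I(X;Y) = D(P_{XY}\|P_X P_Y)$ for a small discrete joint distribution. The specific numbers are arbitrary illustrations, not quantities from the paper.

```python
import numpy as np

def binary_kl(p, q):
    # d(p||q) = p log(p/q) + (1-p) log((1-p)/(1-q)), with the 0 log 0 = 0 convention.
    out = 0.0
    if p > 0:
        out += p * np.log(p / q)
    if p < 1:
        out += (1 - p) * np.log((1 - p) / (1 - q))
    return out

def mutual_information(p_xy):
    # I(X;Y) = D(P_XY || P_X P_Y) for a discrete joint distribution given as a matrix.
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

print(binary_kl(0.1, 0.4))            # binary KL divergence d(0.1 || 0.4)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])         # a correlated joint distribution on {0,1}^2
print(mutual_information(p_xy))       # strictly positive, since X and Y are dependent
```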
CMI Setting
Let $\mathcal{Z}$ be the sample space and let $\mathcal{D}$ denote the data-generating distribution. Consider a supersample $\tilde Z \in \mathcal{Z}^{n\times 2}$, where each entry is generated independently from $\mathcal{D}$. For convenience, we index the columns of $\tilde Z$ starting from $0$ and the rows starting from $1$. Furthermore, we denote the $i$th row of $\tilde Z$ as $\tilde Z_i$. Let $S=(S_1,\dots,S_n)$ denote a membership vector, with entries generated independently according to a $\mathrm{Bern}(1/2)$ distribution, independently of $\tilde Z$. Let $\bar S = (1-S_1,\dots,1-S_n)$ denote the modulo-2 complement of $S$. We refer to $S$ as a membership vector because it is used to divide the supersample into an $n$-dimensional training vector $\tilde Z_S$ with entries $[\tilde Z_S]_i = \tilde Z_{i,S_i}$ and an $n$-dimensional test vector $\tilde Z_{\bar S}$ with entries $[\tilde Z_{\bar S}]_i = \tilde Z_{i,\bar S_i}$. To be able to handle arbitrary learning settings, we consider learning algorithms as maps $\mathcal{A}: \mathcal{Z}^n \times \mathcal{R} \to \mathcal{F}$, where $R\in\mathcal{R}$ is a random variable (independent of $\tilde Z$ and $S$) that captures the stochasticity of the algorithm and $\mathcal{F}$ is a space, for instance, a parameter space or function space. For a fixed $R=r$ and $\tilde Z_S = \tilde z_s$, $\mathcal{A}(\tilde z_s, r)$ is a deterministic function of $\tilde z_s$. The quality of the learning algorithm's output is evaluated using a bounded loss function $\ell: \mathcal{F}\times\mathcal{Z} \to [0,1]$. We let $U$ be a vector of size $m$, the elements of which are sampled without replacement uniformly at random from $(1,\dots,n)$. For a given realization $U=u=(u_1,\dots,u_m)$, we let $\tilde z_u$ denote the $m\times 2$ matrix obtained by stacking the vectors $\tilde z_{u_i}$ for $i=1,\dots,m$. Similarly, $\ell(\mathcal{A}(\tilde z_s, R), \tilde z_u)$ denotes the $m\times 2$ matrix of losses obtained by applying $\ell(\mathcal{A}(\tilde z_s, R), \cdot)$ elementwise to $\tilde z_u$. We denote the population loss as $L_\mathcal{D}(\mathcal{A},\tilde z_s, r) = \mathbb{E}_{Z'}[\ell(\mathcal{A}(\tilde z_s, r), Z')]$, where $Z' \sim \mathcal{D}$. The training loss is $L_{\tilde z_s}(\mathcal{A},\tilde z_s, r) = \frac{1}{n}\sum_{i=1}^n \ell(\mathcal{A}(\tilde z_s, r), [\tilde z_s]_i)$. In general, for any $p$-dimensional vector of data points $\hat z$, we let $L_{\hat z}(\mathcal{A},\tilde z_s, r) = \frac{1}{p}\sum_{i=1}^p \ell(\mathcal{A}(\tilde z_s, r), \hat z_i)$. Since $\mathcal{A}(\tilde z_s, r)$ does not depend on $\tilde z_{\bar s}$, $L_{\tilde z_{\bar s}}(\mathcal{A},\tilde z_s, r)$ is a test loss. Finally, we let $L_\mathcal{D} = \mathbb{E}_{\tilde Z, S, R}[L_\mathcal{D}(\mathcal{A},\tilde Z_S, R)]$ denote the average population loss and $\hat L = \mathbb{E}_{\tilde Z, S, R}[L_{\tilde Z_S}(\mathcal{A},\tilde Z_S, R)]$ denote the average training loss.
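The following sketch simulates one draw of the CMI setting for a toy instantiation: scalar Gaussian samples, a "mean of the training samples" algorithm, and a squared error clipped to $[0,1]$. All of these modelling choices are assumptions made purely for illustration, not the paper's experimental setup; the sketch only shows how the supersample, the membership vector, the $n\times 2$ loss matrix, and the training and test losses fit together.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

# Hypothetical instantiation: Z is the real line, D = N(mu, 1), and the "learning
# algorithm" A returns the average of its n training samples (R plays no role here).
mu = 0.3
z_tilde = rng.normal(mu, 1.0, size=(n, 2))   # supersample, rows i = 1..n, columns 0/1
S = rng.integers(0, 2, size=n)               # membership vector, entries Bern(1/2)
S_bar = 1 - S                                # modulo-2 complement

z_train = z_tilde[np.arange(n), S]           # [z_tilde_S]_i     = z_tilde_{i, S_i}
z_test = z_tilde[np.arange(n), S_bar]        # [z_tilde_Sbar]_i  = z_tilde_{i, 1 - S_i}

w = z_train.mean()                           # A(z_tilde_S, r)

def loss(w, z):
    # Bounded loss in [0,1]: squared error clipped at 1 (an arbitrary choice for the demo).
    return np.minimum((w - z) ** 2, 1.0)

loss_matrix = loss(w, z_tilde)               # n x 2 matrix of losses on the supersample
train_loss = loss(w, z_train).mean()         # L_{z_tilde_S}
test_loss = loss(w, z_test).mean()           # L_{z_tilde_Sbar}, a proxy for the population loss
print(loss_matrix.shape, train_loss, test_loss)
```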
2 Average Generalization Bounds
In this section, we present the main results of this paper: a new family of disintegrated samplewise
e-CMI bounds for the average population loss. High-probability versions of these bounds are given
in Section 3.
2.1 Main Lemma
We now present the generic inequality upon which the bounds in this section are based. This inequality, which is similar in nature to the ones provided in [23] and [25], gives us a generic framework to derive generalization bounds, as it allows for a wide choice of functions for comparing training loss and population loss. A crucial difference compared to [23] and [25] is that our focus in this section is on average rather than PAC-Bayesian generalization bounds, and on bounds based on the disintegrated, samplewise e-CMI, rather than traditional KL-based bounds.
Lemma 1. Let $f_\gamma: [0,1]^2 \to \mathbb{R}$ be a function that is jointly convex in its arguments and is parameterized by $\gamma$. Let $X$ and $Y$ be two random variables, and let $Y'$ be a random variable with the same marginal distribution as $Y$ such that $Y'$ and $X$ are independent. Assume that the joint distribution of $X, Y$ is absolutely continuous with respect to the joint distribution of $X, Y'$. Let $g_1(X,Y)$ and $g_2(X,Y)$ be measurable functions with range $[0,1]$ and finite first moments, such that, for all $\gamma$, $\mathbb{E}_{X,Y}[f_\gamma(g_1(X,Y), g_2(X,Y))]$ is finite. Let
$$\xi_\gamma = \log \mathbb{E}_{X,Y'}\!\left[ e^{f_\gamma(g_1(X,Y'),\, g_2(X,Y'))} \right] \tag{1}$$
and assume that $\xi_\gamma$ is finite. Then,
$$\sup_\gamma \Big\{ f_\gamma\big(\mathbb{E}_{X,Y}[g_1(X,Y)],\, \mathbb{E}_{X,Y}[g_2(X,Y)]\big) - \xi_\gamma \Big\} \le \sup_\gamma \Big\{ \mathbb{E}_{X,Y}[f_\gamma(g_1(X,Y), g_2(X,Y))] - \xi_\gamma \Big\} \le I(X;Y). \tag{2}$$
The proof, which is an application of Jensen's inequality and Donsker-Varadhan's variational representation of the KL divergence, is given in Appendix A, along with the proofs of all other results provided in this paper.
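As a sanity check of Lemma 1, the sketch below evaluates both sides of (2) exactly for a small discrete pair $(X,Y)$, with hypothetical choices $g_1(x,y)=(x+y)/2$, $g_2(x,y)=\mathbf{1}\{x\ne y\}$, and the jointly convex $f_\gamma(a,b)=\gamma(a-b)^2$; the supremum over $\gamma$ is approximated on a grid. These choices are not from the paper, only one concrete instance of the lemma's assumptions.

```python
import numpy as np

p_xy = np.array([[0.35, 0.15],
                 [0.15, 0.35]])                  # joint of (X, Y) on {0,1}^2
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def g1(x, y): return (x + y) / 2.0               # range [0,1]
def g2(x, y): return float(x != y)               # range [0,1]
def f(gamma, a, b): return gamma * (a - b) ** 2  # jointly convex in (a, b)

# Enumerate outcomes with both the joint and the product-of-marginals weights.
vals = [(x, y, p_xy[x, y], p_x[x] * p_y[y]) for x in (0, 1) for y in (0, 1)]
mi = sum(pj * np.log(pj / pm) for _, _, pj, pm in vals)        # I(X;Y), exact

def terms(gamma):
    # xi_gamma uses (X, Y') ~ P_X P_Y, i.e., the product-of-marginals weights.
    xi = np.log(sum(pm * np.exp(f(gamma, g1(x, y), g2(x, y))) for x, y, _, pm in vals))
    e_f = sum(pj * f(gamma, g1(x, y), g2(x, y)) for x, y, pj, _ in vals)
    e_g1 = sum(pj * g1(x, y) for x, y, pj, _ in vals)
    e_g2 = sum(pj * g2(x, y) for x, y, pj, _ in vals)
    return f(gamma, e_g1, e_g2) - xi, e_f - xi

grid = np.linspace(0.01, 5.0, 200)
left = max(terms(g)[0] for g in grid)            # sup_gamma f(E g1, E g2) - xi_gamma
mid = max(terms(g)[1] for g in grid)             # sup_gamma E[f(g1, g2)] - xi_gamma
print(left, "<=", mid, "<=", mi)
assert left <= mid + 1e-12 and mid <= mi + 1e-12
```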
Note that we are allowed to optimize $\gamma$ because, in this section, we only consider average bounds. In contrast, when we derive high-probability bounds in Section 3, we need to fix $\gamma$. Optimizing $\gamma$ over a set of candidate values in the high-probability scenario incurs a union bound cost [27, Sec. 1.2.2].
To apply Lemma 1, we need to identify functions $f_\gamma(\cdot,\cdot)$ for which the moment-generating function in (1) can be controlled. A theme throughout this section is that we identify functions $f_\gamma(\cdot,\cdot)$ for which there exist concentration results that imply that $\xi_\gamma \le 0$. This allows us to loosen the bound in (2) by discarding $\xi_\gamma$.
Throughout this section, we assume that the loss is bounded, so that $\ell(\cdot,\cdot) \in [0,1]$. Note that the reported results can be extended to more general losses by use of scaling. In particular, assume that there is a function $K: \mathcal{F} \to \mathbb{R}_+$ such that, for all $f \in \mathcal{F}$, $\sup_z \ell(f,z) \le K(f)$ (referred to as the hypothesis-dependent range condition in [28]). Then, all of the results that we present for bounded losses also hold for the scaled loss $\ell(f,z)/K(f)$.
2.2 Extending (f)-CMI Bounds to e-CMI
We now apply Lemma 1 to recover the average bounds of [17, 18]. These works derive bounds using the information contained either in the hypothesis itself or the resulting predictions, i.e., the CMI or the $f$-CMI. We instead derive bounds in terms of the information captured by the matrix of losses, i.e., the e-CMI. For parametric supervised learning algorithms, we can recover the original bounds via the data-processing inequality. This is formalized in the following remark.
Remark 1. Consider a parametric supervised learning setting, where $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$ and $\mathcal{A}(\tilde z_S, R) = W \in \mathcal{F}$ are the parameters of a function $\phi_W: \mathcal{X}\to\mathcal{Y}$. Let $\tilde x_u$ denote the $m\times 2$ matrix obtained by projecting each element of $\tilde z_u$ onto $\mathcal{X}$, i.e., the matrix of unlabeled examples. Let $\phi_W(\tilde x_u)$ denote the matrix of predictions obtained by elementwise application of $\phi_W$ to $\tilde x_u$. Then, for any fixed $\tilde z$ and $u$,
$$I^{\tilde z, u}\big(\ell(W, \tilde z_u); S_u\big) \le I^{\tilde z, u}\big(\phi_W(\tilde x_u); S_u\big) \le I^{\tilde z, u}\big(W; S_u\big). \tag{3}$$
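A minimal numerical check of the ordering in (3): for a fixed pair $(\tilde z, u)$, the chain $S_u \to W \to \phi_W(\tilde x_u) \to \ell$ is a Markov chain, so the three mutual informations must be nested. The specific hypothesis, prediction, and loss maps below are arbitrary toy choices, made only so that such a chain exists; they are not the paper's model.

```python
import numpy as np
from itertools import product

def mi(pairs_with_probs):
    # Exact mutual information (in nats) from a list of ((a, b), prob) entries.
    pa, pb, pab = {}, {}, {}
    for (a, b), p in pairs_with_probs:
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
        pab[(a, b)] = pab.get((a, b), 0.0) + p
    return sum(p * np.log(p / (pa[a] * pb[b])) for (a, b), p in pab.items() if p > 0)

dist = []
for s, b in product((0, 1), (0, 1)):            # S_u ~ Bern(1/2), algorithm noise B ~ Bern(1/2)
    prob = 0.25
    w = 2 * s + b                               # hypothesis depends on S_u and on B
    pred = 0 if w == 0 else (1 if w in (1, 2) else 2)   # phi_W(x_u): deterministic in W
    loss = int(pred == 0)                       # 0-1 loss: deterministic in the prediction
    dist.append((s, w, pred, loss, prob))

i_w = mi([((w, s), p) for s, w, _, _, p in dist])
i_pred = mi([((pred, s), p) for s, _, pred, _, p in dist])
i_loss = mi([((loss, s), p) for s, _, _, loss, p in dist])
print(i_loss, "<=", i_pred, "<=", i_w)          # e-CMI term <= f-CMI term <= CMI term
```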
We now use Lemma 1 to recover some of the results reported in [17]. Let $\Delta = L_{\tilde z_{i,S_i'}}(\mathcal{A},\tilde z_s, R) - L_{\tilde z_{i,\bar S_i'}}(\mathcal{A},\tilde z_s, R)$, where $S_i'$ is an independent copy of $S_i$. For a fixed $\tilde z$, symmetry implies that $\mathbb{E}_{S_i'}[\Delta] = 0$. Furthermore, since $\ell(\cdot,\cdot) \in [0,1]$, we have that $\Delta \in [-1,1]$. Thus, $\Delta$ is $1$-sub-Gaussian. By using properties of sub-Gaussian random variables, we can control $\xi_\gamma$ in Lemma 1 when choosing $f_\gamma(\cdot,\cdot)$ as $\gamma\Delta$ or $\gamma\Delta^2$. The resulting bounds are given in the following theorem.
Theorem 1 (Square-root bound and squared bound). Consider the CMI setting. Then,
$$L_\mathcal{D} - \hat L \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\tilde Z}\left[\sqrt{2\, I^{\tilde Z}\big(\ell(\mathcal{A}(\tilde Z_S, R), \tilde Z_i); S_i\big)}\right] \tag{4}$$
$$\le \frac{1}{n}\sum_{i=1}^n \sqrt{2\, I\big(\ell(\mathcal{A}(\tilde Z_S, R), \tilde Z_i); S_i \,\big|\, \tilde Z\big)}. \tag{5}$$
Furthermore,
$$\mathbb{E}_{\tilde Z, R, S}\left[\Big(L_\mathcal{D}(\mathcal{A},\tilde Z_S,R) - L_{\tilde Z_S}(\mathcal{A},\tilde Z_S,R)\Big)^{2}\right] \le \frac{8}{m}\, I\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_U); S_U \,\big|\, \tilde Z, U\big) + \frac{2}{m}. \tag{6}$$

In (4), the expectation over $\tilde Z$ is taken outside of the square root, and the generalization bound is in terms of the disintegrated mutual information. By Jensen's inequality, this is tighter than (5).
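In practice, the samplewise e-CMI terms in (4) must be estimated from samples. The sketch below does this for a toy setup (binary labels, a majority-vote "learner", and the 0-1 loss, all assumptions made purely for illustration): for one fixed supersample, it draws many membership vectors, forms a plug-in estimate of $I^{\tilde z}\big(\ell(\mathcal{A}(\tilde z_S, R), \tilde z_i); S_i\big)$ from the empirical joint distribution of the loss pair and $S_i$, and compares the resulting quantity with the empirical test-train gap. The theorem averages over $\tilde Z$, so conditioning on a single supersample is only a heuristic illustration, and the plug-in estimate is biased for a finite number of draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n, draws = 20, 5000
z_tilde = rng.binomial(1, 0.6, size=(n, 2))     # toy supersample of binary labels

def learn(train_labels):
    # Hypothetical majority-vote classifier: predict the most common training label.
    return int(train_labels.mean() >= 0.5)

def plug_in_mi(pairs, bits):
    # Plug-in estimate (in nats) of I(loss pair; S_i) from empirical frequencies.
    joint = {}
    for p, b in zip(pairs, bits):
        joint[(p, b)] = joint.get((p, b), 0) + 1
    N = len(bits)
    est = 0.0
    for (p, b), c in joint.items():
        pj = c / N
        pp = sum(v for (q, _), v in joint.items() if q == p) / N
        pb = sum(v for (_, d), v in joint.items() if d == b) / N
        est += pj * np.log(pj / (pp * pb))
    return max(est, 0.0)

loss_pairs = [[] for _ in range(n)]
bits = [[] for _ in range(n)]
gaps = []
for _ in range(draws):
    S = rng.integers(0, 2, size=n)
    train = z_tilde[np.arange(n), S]
    test = z_tilde[np.arange(n), 1 - S]
    w = learn(train)                             # deterministic learner, so R plays no role
    losses = (z_tilde != w).astype(float)        # n x 2 matrix of 0-1 losses on the supersample
    for i in range(n):
        loss_pairs[i].append((losses[i, 0], losses[i, 1]))
        bits[i].append(S[i])
    gaps.append((test != w).mean() - (train != w).mean())

bound = np.mean([np.sqrt(2 * plug_in_mi(loss_pairs[i], bits[i])) for i in range(n)])
print(f"empirical test-train gap for this supersample : {np.mean(gaps):.4f}")
print(f"plug-in estimate of the inner term of (4)      : {bound:.4f}")
```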
As shown in Appendix A, one can further generalize (4) to obtain an upper bound of the form
$$\mathbb{E}_{\tilde Z, U}\left[\sqrt{2\, I^{\tilde Z, U}\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_U); S_U\big)}\right]. \tag{7}$$
However, since (7) increases with the size $m$ of the random subset $U$ [17, Prop. 1], we focus only on the case $m=1$, as this leads to the tightest bound. The same holds for (5), as well as for all of the bounds that we present in the remainder of this section, where we also focus only on $m=1$. In contrast, the choice of $m=1$ in the squared bound in (6) is suboptimal and actually yields a vacuous bound due to the $2/m$ term.
It is possible to obtain a bound similar to (4), but where an expectation over $R$ is taken outside the square root and the mutual information is also conditioned on $R$. We present this bound in the following theorem.
Theorem 2 (R-conditioned square-root bound). Consider the CMI setting. Then,
$$L_\mathcal{D} - \hat L \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\tilde Z, R}\left[\sqrt{2\, I^{\tilde Z, R}\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_i); S_i\big)}\right] \tag{8}$$
$$\le \frac{1}{n}\sum_{i=1}^n \sqrt{2\, I\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_i); S_i \,\big|\, \tilde Z, R\big)}. \tag{9}$$
A similar bound, but given in terms of mutual information rather than e-CMI, is reported in [11, Thm. 2.4]. The randomness of the learning algorithm can reduce the mutual information between its output and the selection variable $S_i$. Specifically, one can show that
$$I^{\tilde z}\big(\ell(\mathcal{A}(\tilde z_S, R), \tilde z_i); S_i\big) \le I^{\tilde z}\big(\ell(\mathcal{A}(\tilde z_S, R), \tilde z_i); S_i \,\big|\, R\big), \tag{10}$$
which implies that the square-root bound in (5) is tighter than the $R$-conditioned square-root bound in (9). However, the ordering between the disintegrated square-root bound in (4) and the disintegrated $R$-conditioned square-root bound in (8) is unclear.
Next, we generalize the linear bounds of [18], which are samplewise extensions of the bounds in [16], to the e-CMI framework. We consider two scenarios. For the first one, we assume that $\gamma = (\gamma_1, \gamma_2)$ are positive constants that satisfy a certain constraint, and let $f_\gamma(\hat L, L_\mathcal{D}) = \gamma_1 n (L_\mathcal{D} - \gamma_2 \hat L)$. For the second one, we assume that $\hat L = 0$, the so-called interpolating setting, and let $f_\gamma(\hat L, L_\mathcal{D}) = n\log(2)\, L_\mathcal{D}$. For both of these scenarios, it can be shown that $\xi_\gamma \le 0$, yielding the following.
Theorem 3 (Linear bound and interpolation bound). Consider the CMI setting. Let $\Gamma \subset \mathbb{R}_+^2$ denote the set of parameters $\gamma = (\gamma_1,\gamma_2)$ that satisfy $\gamma_1(1-\gamma_2) + (e^{\gamma_1} - 1 - \gamma_1)(1 + \gamma_2^2) \le 0$. Then,
$$L_\mathcal{D} \le \min_{\gamma\in\Gamma}\left\{ \gamma_2 \hat L + \sum_{i=1}^n \frac{I\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_i); S_i \,\big|\, \tilde Z\big)}{\gamma_1 n} \right\}. \tag{11}$$
Furthermore, if $\hat L = 0$,
$$L_\mathcal{D} \le \sum_{i=1}^n \frac{I\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_i); S_i \,\big|\, \tilde Z\big)}{n\log(2)}. \tag{12}$$

The interpolation bound in (12) improves on the linear bound in (11) when $\hat L = 0$, as the constraint implies that $\gamma_1^2 - 4(e^{\gamma_1} - 1)(e^{\gamma_1} - 1 - \gamma_1) \ge 0$. This means that $\gamma_1 < 0.37$, whereas $\log 2 > 0.69$.
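Evaluating the linear bound (11) amounts to a search over the feasible set $\Gamma$. The sketch below does a simple grid search; the training loss and the sum of e-CMI terms are hypothetical placeholder numbers, not measured quantities, and a finer grid or a proper solver could of course be substituted.

```python
import numpy as np

n = 1000
train_loss = 0.05       # placeholder value for L_hat
cmi_sum = 30.0          # placeholder value for sum_i I(loss pair_i ; S_i | Z_tilde), in nats

def feasible(g1, g2):
    # Constraint defining Gamma in Theorem 3.
    return g1 * (1 - g2) + (np.exp(g1) - 1 - g1) * (1 + g2 ** 2) <= 0

best = np.inf
for g1 in np.linspace(0.01, 0.6, 300):
    for g2 in np.linspace(1.0, 3.0, 400):
        if feasible(g1, g2):
            best = min(best, g2 * train_loss + cmi_sum / (g1 * n))

print("linear bound (11), grid search:", best)
print("interpolation bound (12), valid only if the training loss is 0:",
      cmi_sum / (n * np.log(2)))
```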
2.3 Binary KL Bound with Samplewise e-CMI
We now derive bounds in terms of the binary KL divergence between the training loss and the test loss. To this end, similar to [27, 29], we define
$$d_\gamma(q\|p) = \gamma q - \log\!\big(1 - p + p e^{\gamma}\big). \tag{13}$$
An important property of this function is that
$$\sup_\gamma d_\gamma(q\|p) = d(q\|p). \tag{14}$$
Note that both $d_\gamma(\cdot\|\cdot)$ and $d(\cdot\|\cdot)$ are jointly convex in their arguments. For our next result, we need the following lemma.
Lemma 2. For $i=1,\dots,n$, let $X_i \sim P_{X_i}$, $\mathbb{E}[X_i] = \mu_i$, $\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$, and $\bar\mu = \frac{1}{n}\sum_{i=1}^n \mu_i$. Assume that $X_i \in [0,1]$ almost surely and that all $X_i$ are independent. Then, for every fixed $\gamma > 0$,
$$\mathbb{E}\big[\exp\!\big(n\, d_\gamma(\hat\mu \,\|\, \bar\mu)\big)\big] \le 1. \tag{15}$$
This inequality, which, to the best of our knowledge, has previously been reported only for identically distributed random variables, follows by combining [30, Lemma 1], [31, Thm. 3], and [29, Eq. (17)] (which is a generalization of [27, Lemma 1.1.1] from binary to bounded random variables).²

²A similar result, with $d(\cdot\|\cdot)$ instead of $d_\gamma(\cdot\|\cdot)$, is established in [32, Lemma 2].
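The sketch below numerically checks the two ingredients of this subsection: the identity (14), by maximizing $d_\gamma(q\|p)$ over a grid of $\gamma$, and Lemma 2, by a Monte Carlo estimate of $\mathbb{E}[\exp(n\, d_\gamma(\hat\mu\|\bar\mu))]$ for independent but non-identically distributed Beta variables (an arbitrary choice of bounded distributions made only for this demonstration).

```python
import numpy as np

rng = np.random.default_rng(2)

def binary_kl(q, p):
    # d(q||p), with the 0 log 0 = 0 convention; assumes p in (0,1).
    out = 0.0
    if q > 0:
        out += q * np.log(q / p)
    if q < 1:
        out += (1 - q) * np.log((1 - q) / (1 - p))
    return out

def d_gamma(gamma, q, p):
    # d_gamma(q||p) = gamma*q - log(1 - p + p*exp(gamma)), as in (13).
    return gamma * q - np.log(1 - p + p * np.exp(gamma))

# Check (14): the supremum of d_gamma(q||p) over gamma matches the binary KL d(q||p).
q, p = 0.1, 0.4
grid = np.linspace(-10, 10, 40001)
print(max(d_gamma(g, q, p) for g in grid), "approx equals", binary_kl(q, p))

# Monte Carlo sanity check of Lemma 2 with independent, non-identically distributed X_i in [0,1].
n, trials, gamma = 20, 200000, 0.5
alphas = rng.uniform(0.5, 3.0, size=n)
betas = rng.uniform(0.5, 3.0, size=n)
mu_bar = np.mean(alphas / (alphas + betas))          # average of the means mu_i
x = rng.beta(alphas, betas, size=(trials, n))
mu_hat = x.mean(axis=1)                              # sample averages, one per trial
print(np.mean(np.exp(n * d_gamma(gamma, mu_hat, mu_bar))))   # should be <= 1, up to MC error
```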