A New Family of Generalization Bounds
Using Samplewise Evaluated CMI
Fredrik Hellström
Chalmers University of Technology
Gothenburg, Sweden
frehells@chalmers.se
Giuseppe Durisi
Chalmers University of Technology
Gothenburg, Sweden
durisi@chalmers.se
Abstract
We present a new family of information-theoretic generalization bounds, in which the training loss and the population loss are compared through a jointly convex function. This function is upper-bounded in terms of the disintegrated, samplewise, evaluated conditional mutual information (CMI), an information measure that depends on the losses incurred by the selected hypothesis, rather than on the hypothesis itself, as is common in probably approximately correct (PAC)-Bayesian results. We demonstrate the generality of this framework by recovering and extending previously known information-theoretic bounds. Furthermore, using the evaluated CMI, we derive a samplewise, average version of Seeger's PAC-Bayesian bound, where the convex function is the binary KL divergence. In some scenarios, this novel bound results in a tighter characterization of the population loss of deep neural networks than previous bounds. Finally, we derive high-probability versions of some of these average bounds. We demonstrate the unifying nature of the evaluated CMI bounds by using them to recover average and high-probability generalization bounds for multiclass classification with finite Natarajan dimension.
1 Introduction
Information-theoretic generalization bounds, i.e., generalization bounds that are expressed in terms of information-theoretic metrics such as the Kullback-Leibler (KL) divergence, mutual information, and conditional mutual information (CMI), have emerged as useful tools to obtain an accurate characterization of the performance of deep neural networks. To obtain such bounds, one compares a posterior, that is, the distribution over the hypotheses induced by the learning algorithm, to a reference distribution called the prior, an approach first introduced to provide probably approximately correct (PAC) bounds for Bayesian classifiers [1, 2, 3, 4]. The connection between these results and classic information-theoretic metrics was clarified in [5, 6], where the generalization gap, averaged with respect to the joint posterior and data distribution, is bounded in terms of the mutual information between the training data and the hypothesis. These bounds have since been extended in several ways [7, 8, 9, 10, 11, 12, 13, 14, 15].
A major step was taken in [16], in which a setting where the training set is randomly selected from a larger supersample is considered. We refer to this setup as the CMI setting, as it leads to generalization bounds in terms of the CMI between the hypothesis and the training set selection, given the supersample. It turns out that these bounds can be further tightened by observing that the selected hypothesis enters the derivation only through the loss that it induces on the supersample. Using this observation, [16] also derives bounds in terms of the information encoded in these losses rather than in the hypothesis itself, a quantity referred to as the evaluated CMI (e-CMI). Due to the data-processing inequality, these bounds are always tighter than the regular CMI bounds. This observation was recently further exploited by [17], which derived samplewise versions of these bounds.¹ Intriguingly, these bounds are both easier to evaluate than their hypothesis-based counterparts, due to the lower dimensionality of the random variables involved, and quantitatively tighter for deep neural networks. In particular, while bounds involving information measures based on the hypothesis space tend to increase as training progresses [11, 18], the bound of [17] remains stable.

¹While [17] considers the information stored in predictions rather than losses, referred to as the f-CMI, the derivations therein can be adapted to obtain bounds that depend on the losses, as we clarify in Section 2.2.
The results that have so far been derived for the CMI setting pertain only to the (weighted) difference between population loss and training loss, or to its squared or absolute value [8, 9, 14, 16, 17, 18]. In the PAC-Bayesian literature, other types of discrepancy measures have been considered, and shown to result in tighter bounds on the population loss. For example, [19, 20, 21, 22] consider the binary KL divergence (i.e., the KL divergence between two Bernoulli distributions) with parameters given by the training and population loss, respectively, while [23, 24] consider arbitrary jointly convex functions. Finally, [25] allows for arbitrary functions. It should be noted that in all of these results, a moment-generating function that depends on the selected function has to be controlled for the bound to be computable.
Recently, the e-CMI framework was proven to be expressive enough to allow one to rederive known results in learning theory, e.g., generalization bounds expressed in terms of algorithmic stability, VC dimension, and related complexity measures [17, 26]. Tightening and extending e-CMI bounds, which is the main objective of this paper, has the potential to further increase the unifying nature of the e-CMI framework.
Contributions
Leveraging a basic inequality involving a generic convex function of two random variables (Lemma 1), we establish several novel disintegrated, samplewise, e-CMI bounds on the average generalization error. Specifically, we present i) a square-root bound (Theorem 1) on the generalization error, together with a mean-squared error extension, which tightens the bound recently reported in [17]; ii) a linear bound (Theorem 3) that tightens the bound given in [18]; iii) a binary KL bound (Theorem 4), which is a natural extension to the e-CMI setting of a well-known bound in the PAC-Bayes literature [22]. While the derivation of the first two bounds involves an adaptation of results available in the literature, to obtain the binary KL bound we need a novel concentration inequality involving independent but not identically distributed random variables (Lemma 2). As an additional contribution, we show how to adapt the techniques presented in the paper to obtain high-probability (rather than average) e-CMI bounds (Theorem 7). Furthermore, we illustrate the expressiveness of the e-CMI framework by using our bounds to recover average and high-probability generalization bounds for multiclass classification with finite Natarajan dimension (Theorem 8). Finally, we conduct numerical experiments on MNIST and CIFAR10, which reveal that the binary KL bound results in a tighter characterization of the population loss compared to the square-root bound and the linear bound for some deep learning scenarios.
Preliminaries and Notation
We let $D(P\|Q)$ denote the KL divergence between the two probability measures $P$ and $Q$. This quantity is well-defined if $P$ is absolutely continuous with respect to $Q$. When $P$ and $Q$ are Bernoulli distributions with parameters $p$ and $q$ respectively, we let $D(P\|Q) = d(p\|q) = p\log(p/q) + (1-p)\log((1-p)/(1-q))$, and we refer to $d(p\|q)$ as the binary KL divergence. For two random variables $X$ and $Y$ with joint distribution $P_{XY}$ and respective marginals $P_X$ and $P_Y$, we let $I(X;Y) = D(P_{XY}\|P_X P_Y)$ denote their mutual information. Throughout the paper, we use uppercase letters to denote random variables and lowercase letters to denote their realizations. We denote the conditional joint distribution of $X$ and $Y$ given an instance $Z=z$ by $P_{XY|Z=z}$ and the corresponding conditional distribution for the product of marginals by $P_{X|Z=z}P_{Y|Z=z}$. Furthermore, we let $I^z(X;Y) = D(P_{XY|Z=z}\|P_{X|Z=z}P_{Y|Z=z})$, which is referred to as the disintegrated mutual information [11]. Its expectation is the conditional mutual information $I(X;Y|Z) = \mathbb{E}_Z\big[I^Z(X;Y)\big]$.
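To make the quantities above concrete, the following minimal Python sketch computes the binary KL divergence $d(p\|q)$ and the mutual information $I(X;Y) = D(P_{XY}\|P_X P_Y)$ for a small discrete joint distribution. The specific numbers are arbitrary illustrations, not quantities from the paper.

```python
import numpy as np

def binary_kl(p, q):
    # d(p||q) = p log(p/q) + (1-p) log((1-p)/(1-q)), with the 0 log 0 = 0 convention.
    out = 0.0
    if p > 0:
        out += p * np.log(p / q)
    if p < 1:
        out += (1 - p) * np.log((1 - p) / (1 - q))
    return out

def mutual_information(p_xy):
    # I(X;Y) = D(P_XY || P_X P_Y) for a discrete joint distribution given as a matrix.
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

print(binary_kl(0.1, 0.4))            # binary KL divergence d(0.1 || 0.4)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])         # a correlated joint distribution on {0,1}^2
print(mutual_information(p_xy))       # strictly positive, since X and Y are dependent
```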
CMI Setting
Let $\mathcal{Z}$ be the sample space and let $\mathcal{D}$ denote the data-generating distribution. Consider a supersample $\tilde Z \in \mathcal{Z}^{n\times 2}$, where each entry is generated independently from $\mathcal{D}$. For convenience, we index the columns of $\tilde Z$ starting from $0$ and the rows starting from $1$. Furthermore, we denote the $i$th row of $\tilde Z$ as $\tilde Z_i$. Let $S=(S_1,\dots,S_n)$ denote a membership vector, with entries generated independently according to a $\mathrm{Bern}(1/2)$ distribution, independently of $\tilde Z$. Let $\bar S = (1-S_1,\dots,1-S_n)$ denote the modulo-2 complement of $S$. We refer to $S$ as a membership vector because it is used to divide the supersample into an $n$-dimensional training vector $\tilde Z_S$ with entries $[\tilde Z_S]_i = \tilde Z_{i,S_i}$ and an $n$-dimensional test vector $\tilde Z_{\bar S}$ with entries $[\tilde Z_{\bar S}]_i = \tilde Z_{i,\bar S_i}$. To be able to handle arbitrary learning settings, we consider learning algorithms as maps $\mathcal{A}: \mathcal{Z}^n \times \mathcal{R} \to \mathcal{F}$, where $R\in\mathcal{R}$ is a random variable (independent of $\tilde Z$ and $S$) that captures the stochasticity of the algorithm and $\mathcal{F}$ is a space, for instance, a parameter space or function space. For a fixed $R=r$ and $\tilde Z_S = \tilde z_s$, $\mathcal{A}(\tilde z_s, r)$ is a deterministic function of $\tilde z_s$. The quality of the learning algorithm's output is evaluated using a bounded loss function $\ell: \mathcal{F}\times\mathcal{Z} \to [0,1]$. We let $U$ be a vector of size $m$, the elements of which are sampled without replacement uniformly at random from $(1,\dots,n)$. For a given realization $U=u=(u_1,\dots,u_m)$, we let $\tilde z_u$ denote the $m\times 2$ matrix obtained by stacking the vectors $\tilde z_{u_i}$ for $i=1,\dots,m$. Similarly, $\ell(\mathcal{A}(\tilde z_s, R), \tilde z_u)$ denotes the $m\times 2$ matrix of losses obtained by applying $\ell(\mathcal{A}(\tilde z_s, R), \cdot)$ elementwise to $\tilde z_u$. We denote the population loss as $L_\mathcal{D}(\mathcal{A},\tilde z_s, r) = \mathbb{E}_{Z'}[\ell(\mathcal{A}(\tilde z_s, r), Z')]$, where $Z' \sim \mathcal{D}$. The training loss is $L_{\tilde z_s}(\mathcal{A},\tilde z_s, r) = \frac{1}{n}\sum_{i=1}^n \ell(\mathcal{A}(\tilde z_s, r), [\tilde z_s]_i)$. In general, for any $p$-dimensional vector of data points $\hat z$, we let $L_{\hat z}(\mathcal{A},\tilde z_s, r) = \frac{1}{p}\sum_{i=1}^p \ell(\mathcal{A}(\tilde z_s, r), \hat z_i)$. Since $\mathcal{A}(\tilde z_s, r)$ does not depend on $\tilde z_{\bar s}$, $L_{\tilde z_{\bar s}}(\mathcal{A},\tilde z_s, r)$ is a test loss. Finally, we let $L_\mathcal{D} = \mathbb{E}_{\tilde Z, S, R}[L_\mathcal{D}(\mathcal{A},\tilde Z_S, R)]$ denote the average population loss and $\hat L = \mathbb{E}_{\tilde Z, S, R}[L_{\tilde Z_S}(\mathcal{A},\tilde Z_S, R)]$ denote the average training loss.
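The following sketch simulates one draw of the CMI setting for a toy instantiation: scalar Gaussian samples, a "mean of the training samples" algorithm, and a squared error clipped to $[0,1]$. All of these modelling choices are assumptions made purely for illustration, not the paper's experimental setup; the sketch only shows how the supersample, the membership vector, the $n\times 2$ loss matrix, and the training and test losses fit together.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

# Hypothetical instantiation: Z is the real line, D = N(mu, 1), and the "learning
# algorithm" A returns the average of its n training samples (R plays no role here).
mu = 0.3
z_tilde = rng.normal(mu, 1.0, size=(n, 2))   # supersample, rows i = 1..n, columns 0/1
S = rng.integers(0, 2, size=n)               # membership vector, entries Bern(1/2)
S_bar = 1 - S                                # modulo-2 complement

z_train = z_tilde[np.arange(n), S]           # [z_tilde_S]_i     = z_tilde_{i, S_i}
z_test = z_tilde[np.arange(n), S_bar]        # [z_tilde_Sbar]_i  = z_tilde_{i, 1 - S_i}

w = z_train.mean()                           # A(z_tilde_S, r)

def loss(w, z):
    # Bounded loss in [0,1]: squared error clipped at 1 (an arbitrary choice for the demo).
    return np.minimum((w - z) ** 2, 1.0)

loss_matrix = loss(w, z_tilde)               # n x 2 matrix of losses on the supersample
train_loss = loss(w, z_train).mean()         # L_{z_tilde_S}
test_loss = loss(w, z_test).mean()           # L_{z_tilde_Sbar}, a proxy for the population loss
print(loss_matrix.shape, train_loss, test_loss)
```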
2 Average Generalization Bounds
In this section, we present the main results of this paper: a new family of disintegrated samplewise
e-CMI bounds for the average population loss. High-probability versions of these bounds are given
in Section 3.
2.1 Main Lemma
We now present the generic inequality upon which the bounds in this section are based. This inequality, which is similar in nature to the ones provided in [23] and [25], gives us a generic framework to derive generalization bounds, as it allows for a wide choice of functions for comparing training loss and population loss. A crucial difference compared to [23] and [25] is that our focus in this section is on average rather than PAC-Bayesian generalization bounds, and on bounds based on the disintegrated, samplewise e-CMI, rather than traditional KL-based bounds.
Lemma 1. Let $f_\gamma: [0,1]^2 \to \mathbb{R}$ be a function that is jointly convex in its arguments and is parameterized by $\gamma$. Let $X$ and $Y$ be two random variables, and let $Y'$ be a random variable with the same marginal distribution as $Y$ such that $Y'$ and $X$ are independent. Assume that the joint distribution of $X, Y$ is absolutely continuous with respect to the joint distribution of $X, Y'$. Let $g_1(X,Y)$ and $g_2(X,Y)$ be measurable functions with range $[0,1]$ and finite first moments, such that, for all $\gamma$, $\mathbb{E}_{X,Y}[f_\gamma(g_1(X,Y), g_2(X,Y))]$ is finite. Let
$$\xi_\gamma = \log \mathbb{E}_{X,Y'}\!\left[ e^{f_\gamma(g_1(X,Y'),\, g_2(X,Y'))} \right] \tag{1}$$
and assume that $\xi_\gamma$ is finite. Then,
$$\sup_\gamma \Big\{ f_\gamma\big(\mathbb{E}_{X,Y}[g_1(X,Y)],\, \mathbb{E}_{X,Y}[g_2(X,Y)]\big) - \xi_\gamma \Big\} \le \sup_\gamma \Big\{ \mathbb{E}_{X,Y}[f_\gamma(g_1(X,Y), g_2(X,Y))] - \xi_\gamma \Big\} \le I(X;Y). \tag{2}$$
The proof, which is an application of Jensen's inequality and Donsker-Varadhan's variational representation of the KL divergence, is given in Appendix A, along with the proofs of all other results provided in this paper.
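As a sanity check of Lemma 1, the sketch below evaluates both sides of (2) exactly for a small discrete pair $(X,Y)$, with hypothetical choices $g_1(x,y)=(x+y)/2$, $g_2(x,y)=\mathbf{1}\{x\ne y\}$, and the jointly convex $f_\gamma(a,b)=\gamma(a-b)^2$; the supremum over $\gamma$ is approximated on a grid. These choices are not from the paper, only one concrete instance of the lemma's assumptions.

```python
import numpy as np

p_xy = np.array([[0.35, 0.15],
                 [0.15, 0.35]])                  # joint of (X, Y) on {0,1}^2
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def g1(x, y): return (x + y) / 2.0               # range [0,1]
def g2(x, y): return float(x != y)               # range [0,1]
def f(gamma, a, b): return gamma * (a - b) ** 2  # jointly convex in (a, b)

# Enumerate outcomes with both the joint and the product-of-marginals weights.
vals = [(x, y, p_xy[x, y], p_x[x] * p_y[y]) for x in (0, 1) for y in (0, 1)]
mi = sum(pj * np.log(pj / pm) for _, _, pj, pm in vals)        # I(X;Y), exact

def terms(gamma):
    # xi_gamma uses (X, Y') ~ P_X P_Y, i.e., the product-of-marginals weights.
    xi = np.log(sum(pm * np.exp(f(gamma, g1(x, y), g2(x, y))) for x, y, _, pm in vals))
    e_f = sum(pj * f(gamma, g1(x, y), g2(x, y)) for x, y, pj, _ in vals)
    e_g1 = sum(pj * g1(x, y) for x, y, pj, _ in vals)
    e_g2 = sum(pj * g2(x, y) for x, y, pj, _ in vals)
    return f(gamma, e_g1, e_g2) - xi, e_f - xi

grid = np.linspace(0.01, 5.0, 200)
left = max(terms(g)[0] for g in grid)            # sup_gamma f(E g1, E g2) - xi_gamma
mid = max(terms(g)[1] for g in grid)             # sup_gamma E[f(g1, g2)] - xi_gamma
print(left, "<=", mid, "<=", mi)
assert left <= mid + 1e-12 and mid <= mi + 1e-12
```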
Note that we are allowed to optimize $\gamma$ because, in this section, we only consider average bounds. In contrast, when we derive high-probability bounds in Section 3, we need to fix $\gamma$. Optimizing $\gamma$ over a set of candidate values in the high-probability scenario incurs a union bound cost [27, Sec. 1.2.2].
To apply Lemma 1, we need to identify functions $f_\gamma(\cdot,\cdot)$ for which the moment-generating function in (1) can be controlled. A theme throughout this section is that we identify functions $f_\gamma(\cdot,\cdot)$ for which there exist concentration results that imply that $\xi_\gamma \le 0$. This allows us to loosen the bound in (2) by discarding $\xi_\gamma$.
Throughout this section, we assume that the loss is bounded, so that $\ell(\cdot,\cdot) \in [0,1]$. Note that the reported results can be extended to more general losses by use of scaling. In particular, assume that there is a function $K: \mathcal{F} \to \mathbb{R}_+$ such that, for all $f \in \mathcal{F}$, $\sup_z \ell(f,z) \le K(f)$ (referred to as the hypothesis-dependent range condition in [28]). Then, all of the results that we present for bounded losses also hold for the scaled loss $\ell(f,z)/K(f)$.
2.2 Extending (f)-CMI Bounds to e-CMI
We now apply Lemma 1 to recover the average bounds of [17, 18]. These works derive bounds using the information contained either in the hypothesis itself or the resulting predictions, i.e., the CMI or the $f$-CMI. We instead derive bounds in terms of the information captured by the matrix of losses, i.e., the e-CMI. For parametric supervised learning algorithms, we can recover the original bounds via the data-processing inequality. This is formalized in the following remark.
Remark 1. Consider a parametric supervised learning setting, where $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$ and $\mathcal{A}(\tilde z_S, R) = W \in \mathcal{F}$ are the parameters of a function $\phi_W: \mathcal{X}\to\mathcal{Y}$. Let $\tilde x_u$ denote the $m\times 2$ matrix obtained by projecting each element of $\tilde z_u$ onto $\mathcal{X}$, i.e., the matrix of unlabeled examples. Let $\phi_W(\tilde x_u)$ denote the matrix of predictions obtained by elementwise application of $\phi_W$ to $\tilde x_u$. Then, for any fixed $\tilde z$ and $u$,
$$I^{\tilde z, u}\big(\ell(W, \tilde z_u); S_u\big) \le I^{\tilde z, u}\big(\phi_W(\tilde x_u); S_u\big) \le I^{\tilde z, u}\big(W; S_u\big). \tag{3}$$
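A minimal numerical check of the ordering in (3): for a fixed pair $(\tilde z, u)$, the chain $S_u \to W \to \phi_W(\tilde x_u) \to \ell$ is a Markov chain, so the three mutual informations must be nested. The specific hypothesis, prediction, and loss maps below are arbitrary toy choices, made only so that such a chain exists; they are not the paper's model.

```python
import numpy as np
from itertools import product

def mi(pairs_with_probs):
    # Exact mutual information (in nats) from a list of ((a, b), prob) entries.
    pa, pb, pab = {}, {}, {}
    for (a, b), p in pairs_with_probs:
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
        pab[(a, b)] = pab.get((a, b), 0.0) + p
    return sum(p * np.log(p / (pa[a] * pb[b])) for (a, b), p in pab.items() if p > 0)

dist = []
for s, b in product((0, 1), (0, 1)):            # S_u ~ Bern(1/2), algorithm noise B ~ Bern(1/2)
    prob = 0.25
    w = 2 * s + b                               # hypothesis depends on S_u and on B
    pred = 0 if w == 0 else (1 if w in (1, 2) else 2)   # phi_W(x_u): deterministic in W
    loss = int(pred == 0)                       # 0-1 loss: deterministic in the prediction
    dist.append((s, w, pred, loss, prob))

i_w = mi([((w, s), p) for s, w, _, _, p in dist])
i_pred = mi([((pred, s), p) for s, _, pred, _, p in dist])
i_loss = mi([((loss, s), p) for s, _, _, loss, p in dist])
print(i_loss, "<=", i_pred, "<=", i_w)          # e-CMI term <= f-CMI term <= CMI term
```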
We now use Lemma 1 to recover some of the results reported in [17]. Let $\Delta = L_{\tilde z_{i,S_i'}}(\mathcal{A},\tilde z_s, R) - L_{\tilde z_{i,\bar S_i'}}(\mathcal{A},\tilde z_s, R)$, where $S_i'$ is an independent copy of $S_i$. For a fixed $\tilde z$, symmetry implies that $\mathbb{E}_{S_i'}[\Delta] = 0$. Furthermore, since $\ell(\cdot,\cdot) \in [0,1]$, we have that $\Delta \in [-1,1]$. Thus, $\Delta$ is $1$-sub-Gaussian. By using properties of sub-Gaussian random variables, we can control $\xi_\gamma$ in Lemma 1 when choosing $f_\gamma(\cdot,\cdot)$ as $\gamma\Delta$ or $\gamma\Delta^2$. The resulting bounds are given in the following theorem.
Theorem 1 (Square-root bound and squared bound). Consider the CMI setting. Then,
$$L_\mathcal{D} - \hat L \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\tilde Z}\left[\sqrt{2\, I^{\tilde Z}\big(\ell(\mathcal{A}(\tilde Z_S, R), \tilde Z_i); S_i\big)}\right] \tag{4}$$
$$\le \frac{1}{n}\sum_{i=1}^n \sqrt{2\, I\big(\ell(\mathcal{A}(\tilde Z_S, R), \tilde Z_i); S_i \,\big|\, \tilde Z\big)}. \tag{5}$$
Furthermore,
$$\mathbb{E}_{\tilde Z, R, S}\left[\Big(L_\mathcal{D}(\mathcal{A},\tilde Z_S,R) - L_{\tilde Z_S}(\mathcal{A},\tilde Z_S,R)\Big)^{2}\right] \le \frac{8}{m}\, I\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_U); S_U \,\big|\, \tilde Z, U\big) + \frac{2}{m}. \tag{6}$$

In (4), the expectation over $\tilde Z$ is taken outside of the square root, and the generalization bound is in terms of the disintegrated mutual information. By Jensen's inequality, this is tighter than (5).
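In practice, the samplewise e-CMI terms in (4) must be estimated from samples. The sketch below does this for a toy setup (binary labels, a majority-vote "learner", and the 0-1 loss, all assumptions made purely for illustration): for one fixed supersample, it draws many membership vectors, forms a plug-in estimate of $I^{\tilde z}\big(\ell(\mathcal{A}(\tilde z_S, R), \tilde z_i); S_i\big)$ from the empirical joint distribution of the loss pair and $S_i$, and compares the resulting quantity with the empirical test-train gap. The theorem averages over $\tilde Z$, so conditioning on a single supersample is only a heuristic illustration, and the plug-in estimate is biased for a finite number of draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n, draws = 20, 5000
z_tilde = rng.binomial(1, 0.6, size=(n, 2))     # toy supersample of binary labels

def learn(train_labels):
    # Hypothetical majority-vote classifier: predict the most common training label.
    return int(train_labels.mean() >= 0.5)

def plug_in_mi(pairs, bits):
    # Plug-in estimate (in nats) of I(loss pair; S_i) from empirical frequencies.
    joint = {}
    for p, b in zip(pairs, bits):
        joint[(p, b)] = joint.get((p, b), 0) + 1
    N = len(bits)
    est = 0.0
    for (p, b), c in joint.items():
        pj = c / N
        pp = sum(v for (q, _), v in joint.items() if q == p) / N
        pb = sum(v for (_, d), v in joint.items() if d == b) / N
        est += pj * np.log(pj / (pp * pb))
    return max(est, 0.0)

loss_pairs = [[] for _ in range(n)]
bits = [[] for _ in range(n)]
gaps = []
for _ in range(draws):
    S = rng.integers(0, 2, size=n)
    train = z_tilde[np.arange(n), S]
    test = z_tilde[np.arange(n), 1 - S]
    w = learn(train)                             # deterministic learner, so R plays no role
    losses = (z_tilde != w).astype(float)        # n x 2 matrix of 0-1 losses on the supersample
    for i in range(n):
        loss_pairs[i].append((losses[i, 0], losses[i, 1]))
        bits[i].append(S[i])
    gaps.append((test != w).mean() - (train != w).mean())

bound = np.mean([np.sqrt(2 * plug_in_mi(loss_pairs[i], bits[i])) for i in range(n)])
print(f"empirical test-train gap for this supersample : {np.mean(gaps):.4f}")
print(f"plug-in estimate of the inner term of (4)      : {bound:.4f}")
```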
As shown in Appendix A, one can further generalize (4) to obtain an upper bound of the form
$$\mathbb{E}_{\tilde Z, U}\left[\sqrt{2\, I^{\tilde Z, U}\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_U); S_U\big)}\right]. \tag{7}$$
However, since (7) increases with the size $m$ of the random subset $U$ [17, Prop. 1], we focus only on the case $m=1$, as this leads to the tightest bound. The same holds for (5), as well as for all of the bounds that we present in the remainder of this section, where we also focus only on $m=1$. In contrast, the choice of $m=1$ in the squared bound in (6) is suboptimal and actually yields a vacuous bound due to the $2/m$ term.
It is possible to obtain a bound similar to (4), but where an expectation over $R$ is taken outside the square root and the mutual information is also conditioned on $R$. We present this bound in the following theorem.
Theorem 2 (R-conditioned square-root bound). Consider the CMI setting. Then,
$$L_\mathcal{D} - \hat L \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{\tilde Z, R}\left[\sqrt{2\, I^{\tilde Z, R}\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_i); S_i\big)}\right] \tag{8}$$
$$\le \frac{1}{n}\sum_{i=1}^n \sqrt{2\, I\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_i); S_i \,\big|\, \tilde Z, R\big)}. \tag{9}$$
A similar bound, but given in terms of mutual information rather than e-CMI, is reported in [11, Thm. 2.4]. The randomness of the learning algorithm can reduce the mutual information between its output and the selection variable $S_i$. Specifically, one can show that
$$I^{\tilde z}\big(\ell(\mathcal{A}(\tilde z_S, R), \tilde z_i); S_i\big) \le I^{\tilde z}\big(\ell(\mathcal{A}(\tilde z_S, R), \tilde z_i); S_i \,\big|\, R\big), \tag{10}$$
which implies that the square-root bound in (5) is tighter than the $R$-conditioned square-root bound in (9). However, the ordering between the disintegrated square-root bound in (4) and the disintegrated $R$-conditioned square-root bound in (8) is unclear.
Next, we generalize the linear bounds of [18], which are samplewise extensions of the bounds in [16], to the e-CMI framework. We consider two scenarios. For the first one, we assume that $\gamma = (\gamma_1, \gamma_2)$ are positive constants that satisfy a certain constraint, and let $f_\gamma(\hat L, L_\mathcal{D}) = \gamma_1 n (L_\mathcal{D} - \gamma_2 \hat L)$. For the second one, we assume that $\hat L = 0$, the so-called interpolating setting, and let $f_\gamma(\hat L, L_\mathcal{D}) = n\log(2)\, L_\mathcal{D}$. For both of these scenarios, it can be shown that $\xi_\gamma \le 0$, yielding the following.
Theorem 3 (Linear bound and interpolation bound). Consider the CMI setting. Let $\Gamma \subset \mathbb{R}_+^2$ denote the set of parameters $\gamma = (\gamma_1,\gamma_2)$ that satisfy $\gamma_1(1-\gamma_2) + (e^{\gamma_1} - 1 - \gamma_1)(1 + \gamma_2^2) \le 0$. Then,
$$L_\mathcal{D} \le \min_{\gamma\in\Gamma}\left\{ \gamma_2 \hat L + \sum_{i=1}^n \frac{I\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_i); S_i \,\big|\, \tilde Z\big)}{\gamma_1 n} \right\}. \tag{11}$$
Furthermore, if $\hat L = 0$,
$$L_\mathcal{D} \le \sum_{i=1}^n \frac{I\big(\ell(\mathcal{A}(\tilde Z_S,R), \tilde Z_i); S_i \,\big|\, \tilde Z\big)}{n\log(2)}. \tag{12}$$

The interpolation bound in (12) improves on the linear bound in (11) when $\hat L = 0$, as the constraint implies that $\gamma_1^2 - 4(e^{\gamma_1} - 1)(e^{\gamma_1} - 1 - \gamma_1) \ge 0$. This means that $\gamma_1 < 0.37$, whereas $\log 2 > 0.69$.
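Evaluating the linear bound (11) amounts to a search over the feasible set $\Gamma$. The sketch below does a simple grid search; the training loss and the sum of e-CMI terms are hypothetical placeholder numbers, not measured quantities, and a finer grid or a proper solver could of course be substituted.

```python
import numpy as np

n = 1000
train_loss = 0.05       # placeholder value for L_hat
cmi_sum = 30.0          # placeholder value for sum_i I(loss pair_i ; S_i | Z_tilde), in nats

def feasible(g1, g2):
    # Constraint defining Gamma in Theorem 3.
    return g1 * (1 - g2) + (np.exp(g1) - 1 - g1) * (1 + g2 ** 2) <= 0

best = np.inf
for g1 in np.linspace(0.01, 0.6, 300):
    for g2 in np.linspace(1.0, 3.0, 400):
        if feasible(g1, g2):
            best = min(best, g2 * train_loss + cmi_sum / (g1 * n))

print("linear bound (11), grid search:", best)
print("interpolation bound (12), valid only if the training loss is 0:",
      cmi_sum / (n * np.log(2)))
```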
2.3 Binary KL Bound with Samplewise e-CMI
We now derive bounds in terms of the binary KL divergence between the training loss and the test loss. To this end, similar to [27, 29], we define
$$d_\gamma(q\|p) = \gamma q - \log\!\big(1 - p + p e^{\gamma}\big). \tag{13}$$
An important property of this function is that
$$\sup_\gamma d_\gamma(q\|p) = d(q\|p). \tag{14}$$
Note that both $d_\gamma(\cdot\|\cdot)$ and $d(\cdot\|\cdot)$ are jointly convex in their arguments. For our next result, we need the following lemma.
Lemma 2. For $i=1,\dots,n$, let $X_i \sim P_{X_i}$, $\mathbb{E}[X_i] = \mu_i$, $\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$, and $\bar\mu = \frac{1}{n}\sum_{i=1}^n \mu_i$. Assume that $X_i \in [0,1]$ almost surely and that all $X_i$ are independent. Then, for every fixed $\gamma > 0$,
$$\mathbb{E}\big[\exp\!\big(n\, d_\gamma(\hat\mu \,\|\, \bar\mu)\big)\big] \le 1. \tag{15}$$
This inequality, which, to the best of our knowledge, has previously been reported only for identically distributed random variables, follows by combining [30, Lemma 1], [31, Thm. 3], and [29, Eq. (17)] (which is a generalization of [27, Lemma 1.1.1] from binary to bounded random variables).²

²A similar result, with $d(\cdot\|\cdot)$ instead of $d_\gamma(\cdot\|\cdot)$, is established in [32, Lemma 2].
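The sketch below numerically checks the two ingredients of this subsection: the identity (14), by maximizing $d_\gamma(q\|p)$ over a grid of $\gamma$, and Lemma 2, by a Monte Carlo estimate of $\mathbb{E}[\exp(n\, d_\gamma(\hat\mu\|\bar\mu))]$ for independent but non-identically distributed Beta variables (an arbitrary choice of bounded distributions made only for this demonstration).

```python
import numpy as np

rng = np.random.default_rng(2)

def binary_kl(q, p):
    # d(q||p), with the 0 log 0 = 0 convention; assumes p in (0,1).
    out = 0.0
    if q > 0:
        out += q * np.log(q / p)
    if q < 1:
        out += (1 - q) * np.log((1 - q) / (1 - p))
    return out

def d_gamma(gamma, q, p):
    # d_gamma(q||p) = gamma*q - log(1 - p + p*exp(gamma)), as in (13).
    return gamma * q - np.log(1 - p + p * np.exp(gamma))

# Check (14): the supremum of d_gamma(q||p) over gamma matches the binary KL d(q||p).
q, p = 0.1, 0.4
grid = np.linspace(-10, 10, 40001)
print(max(d_gamma(g, q, p) for g in grid), "approx equals", binary_kl(q, p))

# Monte Carlo sanity check of Lemma 2 with independent, non-identically distributed X_i in [0,1].
n, trials, gamma = 20, 200000, 0.5
alphas = rng.uniform(0.5, 3.0, size=n)
betas = rng.uniform(0.5, 3.0, size=n)
mu_bar = np.mean(alphas / (alphas + betas))          # average of the means mu_i
x = rng.beta(alphas, betas, size=(trials, n))
mu_hat = x.mean(axis=1)                              # sample averages, one per trial
print(np.mean(np.exp(n * d_gamma(gamma, mu_hat, mu_bar))))   # should be <= 1, up to MC error
```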