Bagging in overparameterized learning:
Risk characterization and risk monotonization
Pratik Patil†‡
pratikpatil@berkeley.edu
Jin-Hong Du†§¶
jinhongd@andrew.cmu.edu
Arun Kumar Kuchibhotla§
arunku@cmu.edu
October 26, 2023
Abstract
Bagging is a commonly used ensemble technique in statistics and machine learning to improve the
performance of prediction procedures. In this paper, we study the prediction risk of variants of bagged
predictors under the proportional asymptotics regime, in which the ratio of the number of features to the
number of observations converges to a constant. Specifically, we propose a general strategy to analyze
the prediction risk under squared error loss of bagged predictors using classical results on simple random
sampling. Specializing the strategy, we derive the exact asymptotic risk of the bagged ridge and ridgeless
predictors with an arbitrary number of bags under a well-specified linear model with arbitrary feature
covariance matrices and signal vectors. Furthermore, we prescribe a generic cross-validation procedure
to select the optimal subsample size for bagging and discuss its utility to eliminate the non-monotonic
behavior of the limiting risk in the sample size (i.e., double or multiple descents). In demonstrating
the proposed procedure for bagged ridge and ridgeless predictors, we thoroughly investigate the oracle
properties of the optimal subsample size and provide an in-depth comparison between different bagging
variants.
Contents
1 Introduction 3
1.1 Summary of main results ....................................... 5
1.2 Related work .............................................. 6
1.3 Organization .............................................. 7
2 Bagging general predictors 7
2.1 Conditional risk decompositions ................................... 7
2.2 General reduction strategy ...................................... 11
3 Bagging ridge and ridgeless predictors 11
3.1 Predictors and assumptions ...................................... 12
3.2 Subagging with replacement ..................................... 13
3.3 Splagging without replacement .................................... 18
4 Risk profile monotonization 20
4.1 Bagged general predictors ....................................... 20
4.2 Bagged ridge and ridgeless predictors ................................ 23
4.3 Optimal subagging versus optimal splagging ............................ 23
5 Illustrations and insights 26
6 Discussion 29
†Equal contribution.
‡Department of Statistics, University of California, Berkeley, CA 94720, USA.
§Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
¶Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
arXiv:2210.11445v3 [math.ST] 24 Oct 2023
S.0 Background on simple random sampling 37
S.1 Proofs in Section 2 (general bagged predictors) 38
S.1.1 Proof of Proposition 2.2 ....................................... 38
S.1.2 Proof of Proposition 2.3 ....................................... 40
S.1.3 Proof of Proposition 2.6 ....................................... 43
S.1.4 Proof of Lemma 2.8 .......................................... 44
S.1.5 Proof of Theorem 2.9 ......................................... 45
S.2 Proof of Theorem 3.1 (subagging with replacement, ridge predictor) 45
S.2.1 Proof assembly ............................................. 45
S.2.2 Component concentrations ...................................... 47
S.2.3 Component deterministic approximations .............................. 49
S.2.4 Boundary case: diverging subsample aspect ratio .......................... 55
S.3 Proof of Theorem 3.1 (subagging with replacement, ridgeless predictor) 56
S.3.1 Proof assembly ............................................. 56
S.3.2 Component concentrations ...................................... 58
S.3.3 Component deterministic approximations .............................. 59
S.3.4 Boundary case: diverging subsample aspect ratio .......................... 63
S.4 Proof of Theorem 3.6 (splagging without replacement, ridge, ridgeless predictors) 65
S.5 Proofs related to bagged risk properties 66
S.5.1 Proof of Proposition 3.5 ....................................... 66
S.5.2 Proof of Proposition 3.10 ....................................... 66
S.5.3 Proof of Theorem 4.1 ......................................... 67
S.5.4 Proof of Theorem 4.5 ......................................... 67
S.5.5 Proof of Proposition 4.6 ....................................... 68
S.5.6 Proof of Proposition 4.7 ....................................... 69
S.6 Proofs in Section 5 (isotropic features) 72
S.6.1 Proof of Corollary 5.1 ......................................... 72
S.6.2 Proof of Proposition 5.2 ....................................... 73
S.6.3 Proof of Theorem 5.3 ......................................... 75
S.7 Auxiliary asymptotic equivalency results 75
S.7.1 Preliminaries .............................................. 75
S.7.2 Conditioning and calculus ....................................... 77
S.7.3 Standard ridge resolvents and extensions .............................. 78
S.7.4 Analytic properties of associated fixed-point equations ....................... 82
S.8 Helper concentration results 87
S.8.1 Size of the intersection of randomly sampled datasets ....................... 87
S.8.2 Convergence of random linear and quadratic forms ......................... 89
S.8.3 Convergence of Cesàro-type mean and max for triangular array .................. 89
S.9 Additional numerical illustrations 92
S.9.1 Additional illustrations for Theorem 3.1 ............................... 92
S.9.2 Additional illustrations for Theorem 3.6 ............................... 96
S.9.3 Additional illustrations for Theorem 4.1 ............................... 99
S.9.4 Additional illustrations in Section 5 ................................. 102
1 Introduction
Modern machine learning models often use a large number of parameters relative to the number of observa-
tions. In this regime, several commonly used procedures exhibit a peculiar risk behavior, which is referred to
as double or multiple descents in the risk profile (Belkin et al., 2019; Zhang et al., 2017, 2021). The precise
nature of the double or multiple descent behavior in the generalization error has been studied for various
procedures: e.g., linear regression (Belkin et al., 2020; Muthukumar et al., 2020; Hastie et al., 2022), logistic
regression (Deng et al., 2022), random features regression (Mei and Montanari, 2022), kernel regression (Liu
et al., 2021), among others. We refer the readers to the survey papers by Bartlett et al. (2021); Belkin (2021);
Dar et al. (2021) for a more comprehensive review and other related references. In these cases, the asymptotic
prediction risk behavior is often studied as a function of the data aspect ratio (the ratio of the number of
parameters/features to the number of observations). The double descent behavior refers to the phenomenon
where the (asymptotic) risk of a sequence of predictors first increases as a function of the aspect ratio, peaks
at a certain point (or diverges to infinity), and then decreases with the aspect ratio. From a traditional
statistical point of view, what the desirable behavior of the risk as a function of the aspect ratio should be is not immediately obvious. We
can, however, reformulate this behavior as a function of ϕ = p/n in terms of the observation size n with a
fixed p; imagine a large but fixed p and n changing from 1 to ∞. In this reformulation, the double descent
behavior translates to a pattern in which the risk first decreases as n increases, then increases, peaks at a
certain point, and then decreases again with n. This is a rather counter-intuitive and sub-optimal behavior
for a prediction procedure. The least one would expect from a good prediction procedure is that it yields
better performance with more information (i.e., more data). However, the aforementioned works show that
many commonly used predictors may not exhibit such “good” behavior. Simply put, the non-monotonicity
of the asymptotic risk as a function of the number of observations or the limiting aspect ratio implies that
more data may hurt generalization (Nakkiran, 2019).
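To fix ideas with a special case drawn from the works cited above, consider ridgeless (minimum-ℓ2-norm) least squares with isotropic features, limiting signal energy ρ², and noise level σ². Hastie et al. (2022) show that its limiting excess prediction risk, as a function of the limiting aspect ratio ϕ = p/n, is
\[
R(\phi) =
\begin{cases}
\sigma^2 \dfrac{\phi}{1-\phi}, & \phi < 1, \\[2mm]
\rho^2 \Bigl(1 - \dfrac{1}{\phi}\Bigr) + \sigma^2 \dfrac{1}{\phi - 1}, & \phi > 1.
\end{cases}
\]
(The symbols ρ² and σ² are our notation for this illustration; see Hastie et al. (2022) for the precise assumptions.) This risk diverges at the interpolation threshold ϕ = 1 and, for fixed p with n increasing and an appreciable signal-to-noise ratio ρ²/σ², it first decreases, then rises and blows up as n approaches p, and then decreases again: precisely the non-monotone-in-n pattern described above.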
Several ad hoc regularization techniques have been proposed in the literature to mitigate the double/multiple
descent behaviors. Most of these methods are trial-and-error in nature in the sense that they do not directly
target monotonizing the asymptotic risk but instead try a modification and check that it yields a monotonic
risk. The recent work of Patil et al. (2022b) introduces a generic cross-validation framework that directly
addresses the problem and yields a modification of any given prediction procedure that provably monotonizes
the risk. In a nutshell, the method works by training the predictor on subsets of the full data (with different
subset sizes) and picking the optimal subset size based on the estimated prediction risk computed using
testing data. Intuitively, it is clear that this yields a prediction procedure whose risk is a decreasing function
of the observation size. In the proportional asymptotic regime, where p/n → ϕ as n, p → ∞, the paper
proves that this strategy returns a prediction procedure whose asymptotic risk is monotonically increasing in
ϕ. The paper theoretically analyzes the case where only one subset is used for each subset size and illustrates
via numerical simulations that using multiple subsets of the data of the same size (i.e., subsampling) can
yield better prediction performance in addition to monotonizing the risk profile. Note that averaging a
predictor computed on M different subsets of the data of the same size is referred to in the literature as
subagging, a variant of the classical bagging (bootstrap aggregation) proposed by Breiman (1996). The focus
of the current paper is to analyze the properties of bagged predictors in two directions (in the proportional
asymptotics regime): (1) what is the asymptotic predictive risk of the bagged predictors with M bags as
a function of M, and (2) does the cross-validated bagged predictor provably yield improvements over the
predictor computed on full data and does it have a monotone risk profile (i.e., the asymptotic risk is a
monotonic function of ϕ)?
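To make the object of study concrete, the following minimal sketch (our own illustrative code, not the authors' implementation) forms an M-bagged, or subagged, ridgeless predictor: the minimum-ℓ2-norm least squares fit, computed via the pseudoinverse (equivalently, the λ → 0+ limit of ridge), is trained on M random subsamples of size k, drawn independently across bags here, and the resulting predictions are averaged. Whether the observations within a bag are drawn with or without replacement is a design choice analyzed later.

```python
import numpy as np


def ridgeless_fit(X, y):
    """Minimum-l2-norm least squares (ridgeless) coefficients via the pseudoinverse."""
    return np.linalg.pinv(X) @ y


def subagged_predict(X, y, X_test, k, M, replace=False, seed=0):
    """Average ridgeless predictions over M random subsamples of size k.

    Each bag is a simple random sample of the n observations; `replace`
    toggles sampling with or without replacement within a bag.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    preds = np.zeros(X_test.shape[0])
    for _ in range(M):
        idx = rng.choice(n, size=k, replace=replace)
        preds += X_test @ ridgeless_fit(X[idx], y[idx])
    return preds / M
```

Taking k = n and M = 1 recovers the predictor trained on the full data; the cross-validation idea above amounts to searching over the subsample size k (equivalently, over the subsample aspect ratio p/k) using held-out data.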
In this paper, we investigate several variants of bagging, including subagging as a special case. The second
variant of bagging, which we call splagging (that stands for split-aggregating), is the same as the divide-
and-conquer or the data-splitting approach (Rosenblatt and Nadler, 2016; Banerjee et al., 2019). The divide-
and-conquer approach is widely used in distributed learning, although not commonly featured in the bagging
literature (Dobriban and Sheng, 2020, 2021; Mücke et al., 2022). Formally, splagging splits the data into
non-overlapping parts of equal size and averages the predictors trained on these non-overlapping parts. We
refer to the common size of each part of the data as the subsample size; for simplicity, we use the same terminology for subagging
as well. Using classical results from survey sampling and some simple lemmas about
Figure 1: Overview of optimal bagging over both the subsample aspect ratio and the number of bags. (a) Optimal
asymptotic excess risk curves for ridgeless predictors with and without bagging, under model (M-AR1-LI) when
ρ_ar1 = 0.25 and σ² = 1. The excess risk is the difference between the prediction risk and the noise level σ². The risk
for the unbagged ridgeless predictor is represented by a blue dashed line, and the null risk is marked as a gray dotted
line. (b) The corresponding optimal limiting subsample aspect ratio ϕ_s = p/k versus the data aspect ratio ϕ = p/n
for bagged ridgeless predictors. The line ϕ_s = ϕ is colored in green. The optimal subsample aspect ratios are larger
than one (above the horizontal red dashed line). See Section 4.3 for more details on the setup and further related
discussion.
almost sure convergence, we are able to analyze the behavior of subagged and splagged predictors¹ with
M bags for arbitrary prediction procedures and general M ≥ 1. In fact, we show that the asymptotic risk
of bagged predictors for general M ≥ 1 (or simply, the M-bagged predictor) can be written in terms of the
asymptotic risks of bagged predictors with M = 1 and M = 2. Rather interestingly, we prove that the
M-bagged predictor’s finite sample predictive risk is uniformly close to its asymptotic limit over all M ≥ 1.
These results are established in a model-agnostic setting and do not require the proportional asymptotic
regime. Deriving the asymptotic risk behavior of bagged predictors with M = 1 and M = 2 has to be done
on a case-by-case basis, which we perform for ridge and ridgeless prediction procedures. In the context of
bagging for general predictors, we further analyze the cross-validation procedure with M-bagged predictors
for arbitrary M ≥ 1 to select the “best” subsample size for both subagging and splagging. These results show
that subagging and splagging for any M ≥ 1 outperform the predictor computed on the full data. We further
present conditions under which the cross-validated predictor with M-bagged predictors has an asymptotic
risk monotone in the aspect ratio. Specializing these results to the ridge and ridgeless predictors leads to
somewhat surprising results connecting subagging to optimal ridge regression as well as the advantages of
interpolation.
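For contrast with subagging, the following minimal sketch (again ours, purely illustrative) implements splagging: the observations are permuted, split into ⌊n/k⌋ non-overlapping parts of size k, and the predictors fitted on these parts are averaged. Unlike subagging, the number of bags here is tied to the subsample size.

```python
import numpy as np


def splagged_predict(X, y, X_test, k, fit_fn, seed=0):
    """Split-aggregate (splag): average fit_fn over floor(n/k) disjoint parts of size k."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    assert 1 <= k <= n
    perm = rng.permutation(n)
    n_parts = n // k  # leftover observations (if any) are not used in this sketch
    preds = np.zeros(X_test.shape[0])
    for j in range(n_parts):
        idx = perm[j * k:(j + 1) * k]
        preds += X_test @ fit_fn(X[idx], y[idx])
    return preds / n_parts
```

Here fit_fn is any map from a subsample to coefficient estimates (for instance, the ridgeless fit from the earlier sketch); how leftover observations are handled when k does not divide n is a convention we do not pin down here.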
Before proceeding to discuss our specific contributions, we pause to highlight the two most significant take-
away messages from our work. These messages hold under a well-specified linear model, where the features
possess an arbitrary covariance structure, and the response depends on an arbitrary signal vector, both of
which are subject to certain bounded norm regularity constraints.
(T1) Subagging and splagging (the data-splitting approach) of the ridge and ridgeless predictors, when prop-
erly tuned, can significantly improve the prediction risks of these standalone predictors trained on the
full data. This improvement is most pronounced near the interpolation threshold. Importantly, sub-
agging always outperforms splagging. See the left panel of Figure 1 for a numerical illustration and
Proposition 4.6 for a formal statement of this result.
(T2) A model-agnostic algorithm exists to tune the subsample size for subagging. This algorithm produces
a predictor whose risk matches that of the oracle-tuned subagged predictor. Notably, the oracle-tuned
subsample size for the ridgeless predictor is always smaller than the number of features. As a result,
subagged ridgeless interpolators always outperform subagged least squares, even when the full data
has more observations than the number of features. The same observation holds true for splagging
whenever it provides an improvement. See the right panel of Figure 1 for numerical illustrations and
Proposition 4.7 for formal statements of this result.
¹A note on terminology for the paper: when referring to subagging and splagging together, we use the generic term bagging.
Similarly, when referring to subagged and splagged predictors together, we simply say bagged predictors.
Intuitively, although bagging may induce bias due to subsampling, it can significantly reduce the prediction
risk by reducing the variance for a suitably chosen subsample size that is smaller than the feature size. This
tradeoff arises because of the different rates at which the bias and variance of the ridgeless predictor increase
near the interpolation threshold. This advantage of interpolation or overparameterization is distinct from
other benefits discussed in the literature, such as self-induced regularization (Bartlett et al., 2021).
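In the isotropic special case recalled earlier in this introduction (and again under our illustrative notation, ignoring the additional variance reduction from averaging over bags), this can be read off from the two terms of the risk at subsample aspect ratio ϕ_s > 1:
\[
\frac{d}{d\phi_s}\Bigl[\rho^2\Bigl(1 - \frac{1}{\phi_s}\Bigr)\Bigr] = \frac{\rho^2}{\phi_s^2},
\qquad
\frac{d}{d\phi_s}\Bigl[\frac{\sigma^2}{\phi_s - 1}\Bigr] = -\frac{\sigma^2}{(\phi_s - 1)^2},
\]
so just above the interpolation threshold the variance term falls off much faster than the bias term grows, and subsampling to an aspect ratio strictly above one can reduce their sum. This is only a heuristic reading of the isotropic formula; the precise statements for general covariance and signal appear in Sections 3 and 4.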
1.1 Summary of main results
Below we provide a summary of the main results of this paper.
1. General predictors. In Section 2, we formulate a generic strategy for analyzing the limiting squared
data conditional risk (expected squared error on a future data point, conditional on the full data) of
general M-bagged predictors, showing that the existence of the limiting risk for M = 1 and M = 2
implies the existence of the limiting risk for every M ≥ 1. Moreover, we show that the limiting risk
of the M-bagged predictor can be written as a linear combination of the limiting risks of M-bagged
predictors with M = 1 and M = 2. Interestingly, the same strategy also works for analyzing the limit
of the subsample conditional risk, which considers conditioning on both the full data and the randomly
drawn subsamples. See Theorem 2.9 for a formal statement. In this general framework, Theorem 2.9
implies that both the data conditional and subsample conditional risks are asymptotically monotone
in the number of bags M. Moreover, for general strongly convex and smooth loss functions, we can
sandwich the risks between quantities of the form C1 + C2/M, for some fixed random variables C1 and
C2 (Proposition 2.6); a heuristic version of this reduction is sketched after this list.
2. Ridge and ridgeless predictors. In Section 3, we specialize the aforementioned general strategy
to characterize the data conditional and subsample conditional risks of M-bagged ridge and ridgeless
predictors. The results are formalized in Theorem 3.1 for subagging with and without replacement,
and Theorem 3.6 for splagging without replacement. All these results assume a well-specified linear
model, with an arbitrary covariance matrix for the features and an arbitrary signal vector. Notably, we
assume neither Gaussian features nor isotropic features nor a randomly generated signal. These results
reveal that for the three aforementioned bagging strategies, the bias and variance risk components are
non-increasing in the number of bags M.
3. Cross-validation. In Section 4, we develop a generic cross-validation strategy to select the optimal
subsample or split size (or equivalently, the subsample aspect ratio) and present a general result to
understand the limiting risks of cross-validated predictors. Our theoretical results provide a way to
verify the monotonicity of the limiting risk of the cross-validated predictor in terms of the limiting data
aspect ratio ϕ (Theorem 4.1). In Section 4.2, we specialize to the cross-validated ridge and ridgeless
predictors to obtain the optimal subsample aspect ratio for every M (Theorem 4.5). Moreover, when
optimizing over both the subsample aspect ratio and the number of bags, we show that optimal subagging
always outperforms optimal splagging (Proposition 4.6). Rather surprisingly, in our investigation of the
oracle choice of the subsample size for optimal subagging with M = ∞, we find that the optimal subsample
aspect ratio is always larger than one (Proposition 4.7). In Section 5, we also show that the optimally tuned
subagged ridgeless predictor yields the same prediction risk as the optimal ridge predictor for isotropic
features (Theorem 5.3). (A schematic code sketch of this cross-validation step follows this list.)
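As promised in item 1 above, here is the heuristic behind the reduction to M = 1 and M = 2 (our notation, suppressing conditioning and limits). Writing the M-bagged predictor as the average of exchangeable base predictors f̂_1, …, f̂_M fitted on randomly drawn subsamples, the squared prediction error at a test point (x_0, y_0) expands as
\[
\mathbb{E}\Bigl[\Bigl(y_0 - \frac{1}{M}\sum_{m=1}^{M} \hat f_m(x_0)\Bigr)^{2}\Bigr]
= \frac{1}{M}\, a + \Bigl(1 - \frac{1}{M}\Bigr) b,
\qquad
a = \mathbb{E}\bigl[(y_0 - \hat f_1(x_0))^2\bigr],
\quad
b = \mathbb{E}\bigl[(y_0 - \hat f_1(x_0))(y_0 - \hat f_2(x_0))\bigr].
\]
Since the risk at M = 1 is a and the risk at M = 2 is (a + b)/2, the risk at any M equals (2/M − 1) R_1 + 2(1 − 1/M) R_2; this is of the form C_1 + C_2/M with C_1 = 2R_2 − R_1 and C_2 = 2(R_1 − R_2), and it is non-increasing in M because b ≤ a by the Cauchy–Schwarz inequality. The formal results (Proposition 2.6 and Theorem 2.9) make the conditioning, exchangeability, and passage to the limit precise.

Turning to the cross-validation step in item 3, the following schematic sketch (ours, not the paper's algorithm; in particular, the risk estimate here is a plain hold-out average rather than the specific estimator developed in Section 4) selects the subsample size by minimizing an estimated prediction risk over a grid of candidates.

```python
import numpy as np


def cv_subsample_size(X_tr, y_tr, X_val, y_val, k_grid, M, bagged_predict):
    """Pick the subsample size k minimizing an estimated squared prediction risk.

    `bagged_predict(X_tr, y_tr, X_val, k, M)` should return the M-bagged
    predictions on X_val, e.g., the subagged predictor sketched earlier.
    """
    risks = []
    for k in k_grid:
        preds = bagged_predict(X_tr, y_tr, X_val, k, M)
        risks.append(np.mean((y_val - preds) ** 2))  # hold-out risk estimate
    best = int(np.argmin(risks))
    return k_grid[best], risks[best]
```

Under suitable conditions, the theory in Section 4 shows that a predictor tuned this way has limiting risk matching the oracle choice of the subsample size, and hence a risk profile that is monotone in the data aspect ratio.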
From a technical perspective, during the course of our risk analysis of the bagged ridge and ridgeless predic-
tors, we derive novel deterministic equivalents for ridge resolvents with random Tikhonov-type regularization.
We extend ideas of conditional asymptotic equivalents and related calculus, which may be of independent
interest. See Appendix S.7, and in particular Appendix S.7.3.2.