Bagging in overparameterized learning:
Risk characterization and risk monotonization
Pratik Patil†‡
pratikpatil@berkeley.edu
Jin-Hong Du†§¶
jinhongd@andrew.cmu.edu
Arun Kumar Kuchibhotla§
arunku@cmu.edu
October 26, 2023
Abstract
Bagging is a commonly used ensemble technique in statistics and machine learning to improve the
performance of prediction procedures. In this paper, we study the prediction risk of variants of bagged
predictors under the proportional asymptotics regime, in which the ratio of the number of features to the
number of observations converges to a constant. Specifically, we propose a general strategy to analyze
the prediction risk under squared error loss of bagged predictors using classical results on simple random
sampling. Specializing the strategy, we derive the exact asymptotic risk of the bagged ridge and ridgeless
predictors with an arbitrary number of bags under a well-specified linear model with arbitrary feature
covariance matrices and signal vectors. Furthermore, we prescribe a generic cross-validation procedure
to select the optimal subsample size for bagging and discuss its utility to eliminate the non-monotonic
behavior of the limiting risk in the sample size (i.e., double or multiple descents). In demonstrating
the proposed procedure for bagged ridge and ridgeless predictors, we thoroughly investigate the oracle
properties of the optimal subsample size and provide an in-depth comparison between different bagging
variants.
Contents
1 Introduction 3
1.1 Summary of main results ....................................... 5
1.2 Related work .............................................. 6
1.3 Organization .............................................. 7
2 Bagging general predictors 7
2.1 Conditional risk decompositions ................................... 7
2.2 General reduction strategy ...................................... 11
3 Bagging ridge and ridgeless predictors 11
3.1 Predictors and assumptions ...................................... 12
3.2 Subagging with replacement ..................................... 13
3.3 Splagging without replacement .................................... 18
4 Risk profile monotonization 20
4.1 Bagged general predictors ....................................... 20
4.2 Bagged ridge and ridgeless predictors ................................ 23
4.3 Optimal subagging versus optimal splagging ............................ 23
5 Illustrations and insights 26
6 Discussion 29
†Equal contribution.
‡Department of Statistics, University of California, Berkeley, CA 94720, USA.
§Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
¶Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
arXiv:2210.11445v3 [math.ST] 24 Oct 2023
S.0 Background on simple random sampling 37
S.1 Proofs in Section 2 (general bagged predictors) 38
S.1.1 Proof of Proposition 2.2 ....................................... 38
S.1.2 Proof of Proposition 2.3 ....................................... 40
S.1.3 Proof of Proposition 2.6 ....................................... 43
S.1.4 Proof of Lemma 2.8 .......................................... 44
S.1.5 Proof of Theorem 2.9 ......................................... 45
S.2 Proof of Theorem 3.1 (subagging with replacement, ridge predictor) 45
S.2.1 Proof assembly ............................................. 45
S.2.2 Component concentrations ...................................... 47
S.2.3 Component deterministic approximations .............................. 49
S.2.4 Boundary case: diverging subsample aspect ratio .......................... 55
S.3 Proof of Theorem 3.1 (subagging with replacement, ridgeless predictor) 56
S.3.1 Proof assembly ............................................. 56
S.3.2 Component concentrations ...................................... 58
S.3.3 Component deterministic approximations .............................. 59
S.3.4 Boundary case: diverging subsample aspect ratio .......................... 63
S.4 Proof of Theorem 3.6 (splagging without replacement, ridge, ridgeless predictors) 65
S.5 Proofs related to bagged risk properties 66
S.5.1 Proof of Proposition 3.5 ....................................... 66
S.5.2 Proof of Proposition 3.10 ....................................... 66
S.5.3 Proof of Theorem 4.1 ......................................... 67
S.5.4 Proof of Theorem 4.5 ......................................... 67
S.5.5 Proof of Proposition 4.6 ....................................... 68
S.5.6 Proof of Proposition 4.7 ....................................... 69
S.6 Proofs in Section 5 (isotropic features) 72
S.6.1 Proof of Corollary 5.1 ......................................... 72
S.6.2 Proof of Proposition 5.2 ....................................... 73
S.6.3 Proof of Theorem 5.3 ......................................... 75
S.7 Auxiliary asymptotic equivalency results 75
S.7.1 Preliminaries .............................................. 75
S.7.2 Conditioning and calculus ....................................... 77
S.7.3 Standard ridge resolvents and extensions .............................. 78
S.7.4 Analytic properties of associated fixed-point equations ....................... 82
S.8 Helper concentration results 87
S.8.1 Size of the intersection of randomly sampled datasets ....................... 87
S.8.2 Convergence of random linear and quadratic forms ......................... 89
S.8.3 Convergence of Cesàro-type mean and max for triangular array .................. 89
S.9 Additional numerical illustrations 92
S.9.1 Additional illustrations for Theorem 3.1 ............................... 92
S.9.2 Additional illustrations for Theorem 3.6 ............................... 96
S.9.3 Additional illustrations for Theorem 4.1 ............................... 99
S.9.4 Additional illustrations in Section 5 ................................. 102
1 Introduction
Modern machine learning models often use a large number of parameters relative to the number of observa-
tions. In this regime, several commonly used procedures exhibit a peculiar risk behavior, which is referred to
as double or multiple descents in the risk profile (Belkin et al., 2019; Zhang et al., 2017, 2021). The precise
nature of the double or multiple descent behavior in the generalization error has been studied for various
procedures: e.g., linear regression (Belkin et al., 2020; Muthukumar et al., 2020; Hastie et al., 2022), logistic
regression (Deng et al., 2022), random features regression (Mei and Montanari, 2022), kernel regression (Liu
et al., 2021), among others. We refer the readers to the survey papers by Bartlett et al. (2021); Belkin (2021);
Dar et al. (2021) for a more comprehensive review and other related references. In these cases, the asymptotic
prediction risk behavior is often studied as a function of the data aspect ratio (the ratio of the number of
parameters/features to the number of observations). The double descent behavior refers to the phenomenon
where the (asymptotic) risk of a sequence of predictors first increases as a function of the aspect ratio, peaks
at a certain point (or diverges to infinity), and then decreases with the aspect ratio. From a traditional
statistical point of view, what the desirable behavior of the risk as a function of the aspect ratio should be is not immediately obvious. We
can, however, reformulate this behavior as a function of ϕ = p/n in terms of the observation size n with a
fixed p; imagine a large but fixed p and n changing from 1 to ∞. In this reformulation, the double descent
behavior translates to a pattern in which the risk first decreases as n increases, then increases, peaks at a
certain point, and then decreases again with n. This is a rather counter-intuitive and sub-optimal behavior
for a prediction procedure. The least one would expect from a good prediction procedure is that it yields
better performance with more information (i.e., more data). However, the aforementioned works show that
many commonly used predictors may not exhibit such “good” behavior. Simply put, the non-monotonicity
of the asymptotic risk as a function of the number of observations or the limiting aspect ratio implies that
more data may hurt generalization (Nakkiran, 2019).
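To fix ideas with a special case drawn from the works cited above, consider ridgeless (minimum-ℓ2-norm) least squares with isotropic features, limiting signal energy ρ², and noise level σ². Hastie et al. (2022) show that its limiting excess prediction risk, as a function of the limiting aspect ratio ϕ = p/n, is
\[
R(\phi) =
\begin{cases}
\sigma^2 \dfrac{\phi}{1-\phi}, & \phi < 1, \\[2mm]
\rho^2 \Bigl(1 - \dfrac{1}{\phi}\Bigr) + \sigma^2 \dfrac{1}{\phi - 1}, & \phi > 1.
\end{cases}
\]
(The symbols ρ² and σ² are our notation for this illustration; see Hastie et al. (2022) for the precise assumptions.) This risk diverges at the interpolation threshold ϕ = 1 and, for fixed p with n increasing and an appreciable signal-to-noise ratio ρ²/σ², it first decreases, then rises and blows up as n approaches p, and then decreases again: precisely the non-monotone-in-n pattern described above.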
Several ad hoc regularization techniques have been proposed in the literature to mitigate the double/multiple
descent behaviors. Most of these methods are trial-and-error in nature in the sense that they do not directly
target monotonizing the asymptotic risk but instead try a modification and check that it yields a monotonic
risk. The recent work of Patil et al. (2022b) introduces a generic cross-validation framework that directly
addresses the problem and yields a modification of any given prediction procedure that provably monotonizes
the risk. In a nutshell, the method works by training the predictor on subsets of the full data (with different
subset sizes) and picking the optimal subset size based on the estimated prediction risk computed using
testing data. Intuitively, it is clear that this yields a prediction procedure whose risk is a decreasing function
of the observation size. In the proportional asymptotic regime, where p/n → ϕ as n, p → ∞, the paper
proves that this strategy returns a prediction procedure whose asymptotic risk is monotonically increasing in
ϕ. The paper theoretically analyzes the case where only one subset is used for each subset size and illustrates
via numerical simulations that using multiple subsets of the data of the same size (i.e., subsampling) can
yield better prediction performance in addition to monotonizing the risk profile. Note that averaging a
predictor computed on M different subsets of the data of the same size is referred to in the literature as
subagging, a variant of the classical bagging (bootstrap aggregation) proposed by Breiman (1996). The focus
of the current paper is to analyze the properties of bagged predictors in two directions (in the proportional
asymptotics regime): (1) what is the asymptotic predictive risk of the bagged predictors with M bags as
a function of M, and (2) does the cross-validated bagged predictor provably yield improvements over the
predictor computed on full data and does it have a monotone risk profile (i.e., the asymptotic risk is a
monotonic function of ϕ)?
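To make the object of study concrete, the following minimal sketch (our own illustrative code, not the authors' implementation) forms an M-bagged, or subagged, ridgeless predictor: the minimum-ℓ2-norm least squares fit, computed via the pseudoinverse (equivalently, the λ → 0+ limit of ridge), is trained on M random subsamples of size k, drawn independently across bags here, and the resulting predictions are averaged. Whether the observations within a bag are drawn with or without replacement is a design choice analyzed later.

```python
import numpy as np


def ridgeless_fit(X, y):
    """Minimum-l2-norm least squares (ridgeless) coefficients via the pseudoinverse."""
    return np.linalg.pinv(X) @ y


def subagged_predict(X, y, X_test, k, M, replace=False, seed=0):
    """Average ridgeless predictions over M random subsamples of size k.

    Each bag is a simple random sample of the n observations; `replace`
    toggles sampling with or without replacement within a bag.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    preds = np.zeros(X_test.shape[0])
    for _ in range(M):
        idx = rng.choice(n, size=k, replace=replace)
        preds += X_test @ ridgeless_fit(X[idx], y[idx])
    return preds / M
```

Taking k = n and M = 1 recovers the predictor trained on the full data; the cross-validation idea above amounts to searching over the subsample size k (equivalently, over the subsample aspect ratio p/k) using held-out data.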
In this paper, we investigate several variants of bagging, including subagging as a special case. The second
variant of bagging, which we call splagging (that stands for split-aggregating), is the same as the divide-
and-conquer or the data-splitting approach (Rosenblatt and Nadler, 2016; Banerjee et al., 2019). The divide-
and-conquer approach is widely used in distributed learning, although not commonly featured in the bagging
literature (Dobriban and Sheng, 2020, 2021; Mücke et al., 2022). Formally, splagging splits the data into
non-overlapping parts of equal size and averages the predictors trained on these non-overlapping parts. We
refer to the common size of each part of the data as the subsample size; for simplicity, we use the same terminology for subagging
as well. Using classical results from survey sampling and some simple lemmas about
Figure 1: Overview of optimal bagging over both the subsample aspect ratio and the number of bags. (a) Optimal
asymptotic excess risk curves for ridgeless predictors with and without bagging, under model (M-AR1-LI) when
ρ_ar1 = 0.25 and σ² = 1. The excess risk is the difference between the prediction risk and the noise level σ². The risk
for the unbagged ridgeless predictor is represented by a blue dashed line, and the null risk is marked as a gray dotted
line. (b) The corresponding optimal limiting subsample aspect ratio ϕ_s = p/k versus the data aspect ratio ϕ = p/n
for bagged ridgeless predictors. The line ϕ_s = ϕ is colored in green. The optimal subsample aspect ratios are larger
than one (above the horizontal red dashed line). See Section 4.3 for more details on the setup and further related
discussion.
almost sure convergence, we are able to analyze the behavior of subagged and splagged predictors¹ with
M bags for arbitrary prediction procedures and general M ≥ 1. In fact, we show that the asymptotic risk
of bagged predictors for general M ≥ 1 (or simply, the M-bagged predictor) can be written in terms of the
asymptotic risks of bagged predictors with M = 1 and M = 2. Rather interestingly, we prove that the
M-bagged predictor’s finite sample predictive risk is uniformly close to its asymptotic limit over all M ≥ 1.
These results are established in a model-agnostic setting and do not require the proportional asymptotic
regime. Deriving the asymptotic risk behavior of bagged predictors with M = 1 and M = 2 has to be done
on a case-by-case basis, which we perform for ridge and ridgeless prediction procedures. In the context of
bagging for general predictors, we further analyze the cross-validation procedure with M-bagged predictors
for arbitrary M ≥ 1 to select the “best” subsample size for both subagging and splagging. These results show
that subagging and splagging for any M ≥ 1 outperform the predictor computed on the full data. We further
present conditions under which the cross-validated predictor with M-bagged predictors has an asymptotic
risk monotone in the aspect ratio. Specializing these results to the ridge and ridgeless predictors leads to
somewhat surprising results connecting subagging to optimal ridge regression as well as the advantages of
interpolation.
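For contrast with subagging, the following minimal sketch (again ours, purely illustrative) implements splagging: the observations are permuted, split into ⌊n/k⌋ non-overlapping parts of size k, and the predictors fitted on these parts are averaged. Unlike subagging, the number of bags here is tied to the subsample size.

```python
import numpy as np


def splagged_predict(X, y, X_test, k, fit_fn, seed=0):
    """Split-aggregate (splag): average fit_fn over floor(n/k) disjoint parts of size k."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    assert 1 <= k <= n
    perm = rng.permutation(n)
    n_parts = n // k  # leftover observations (if any) are not used in this sketch
    preds = np.zeros(X_test.shape[0])
    for j in range(n_parts):
        idx = perm[j * k:(j + 1) * k]
        preds += X_test @ fit_fn(X[idx], y[idx])
    return preds / n_parts
```

Here fit_fn is any map from a subsample to coefficient estimates (for instance, the ridgeless fit from the earlier sketch); how leftover observations are handled when k does not divide n is a convention we do not pin down here.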
Before proceeding to discuss our specific contributions, we pause to highlight the two most significant take-
away messages from our work. These messages hold under a well-specified linear model, where the features
possess an arbitrary covariance structure, and the response depends on an arbitrary signal vector, both of
which are subject to certain bounded norm regularity constraints.
(T1) Subagging and splagging (the data-splitting approach) of the ridge and ridgeless predictors, when prop-
erly tuned, can significantly improve the prediction risks of these standalone predictors trained on the
full data. This improvement is most pronounced near the interpolation threshold. Importantly, sub-
agging always outperforms splagging. See the left panel of Figure 1 for a numerical illustration and
Proposition 4.6 for a formal statement of this result.
(T2) A model-agnostic algorithm exists to tune the subsample size for subagging. This algorithm produces
a predictor whose risk matches that of the oracle-tuned subagged predictor. Notably, the oracle-tuned
subsample size for the ridgeless predictor is always smaller than the number of features. As a result,
subagged ridgeless interpolators always outperform subagged least squares, even when the full data
has more observations than the number of features. The same observation holds true for splagging
whenever it provides an improvement. See the right panel of Figure 1 for numerical illustrations and
Proposition 4.7 for formal statements of this result.
¹A note on terminology for the paper: when referring to subagging and splagging together, we use the generic term bagging.
Similarly, when referring to subagged and splagged predictors together, we simply say bagged predictors.
Intuitively, although bagging may induce bias due to subsampling, it can significantly reduce the prediction
risk by reducing the variance for a suitably chosen subsample size that is smaller than the feature size. This
tradeoff arises because of the different rates at which the bias and variance of the ridgeless predictor increase
near the interpolation threshold. This advantage of interpolation or overparameterization is distinct from
other benefits discussed in the literature, such as self-induced regularization (Bartlett et al., 2021).
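In the isotropic special case recalled earlier in this introduction (and again under our illustrative notation, ignoring the additional variance reduction from averaging over bags), this can be read off from the two terms of the risk at subsample aspect ratio ϕ_s > 1:
\[
\frac{d}{d\phi_s}\Bigl[\rho^2\Bigl(1 - \frac{1}{\phi_s}\Bigr)\Bigr] = \frac{\rho^2}{\phi_s^2},
\qquad
\frac{d}{d\phi_s}\Bigl[\frac{\sigma^2}{\phi_s - 1}\Bigr] = -\frac{\sigma^2}{(\phi_s - 1)^2},
\]
so just above the interpolation threshold the variance term falls off much faster than the bias term grows, and subsampling to an aspect ratio strictly above one can reduce their sum. This is only a heuristic reading of the isotropic formula; the precise statements for general covariance and signal appear in Sections 3 and 4.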
1.1 Summary of main results
Below we provide a summary of the main results of this paper.
1. General predictors. In Section 2, we formulate a generic strategy for analyzing the limiting squared
data conditional risk (expected squared error on a future data point, conditional on the full data) of
general M-bagged predictors, showing that the existence of the limiting risk for M = 1 and M = 2
implies the existence of the limiting risk for every M ≥ 1. Moreover, we show that the limiting risk
of the M-bagged predictor can be written as a linear combination of the limiting risks of M-bagged
predictors with M = 1 and M = 2. Interestingly, the same strategy also works for analyzing the limit
of the subsample conditional risk, which considers conditioning on both the full data and the randomly
drawn subsamples. See Theorem 2.9 for a formal statement. In this general framework, Theorem 2.9
implies that both the data conditional and subsample conditional risks are asymptotically monotone
in the number of bags M. Moreover, for general strongly convex and smooth loss functions, we can
sandwich the risks between quantities of the form C1 + C2/M, for some fixed random variables C1 and
C2 (Proposition 2.6); a heuristic version of this reduction is sketched after this list.
2. Ridge and ridgeless predictors. In Section 3, we specialize the aforementioned general strategy
to characterize the data conditional and subsample conditional risks of M-bagged ridge and ridgeless
predictors. The results are formalized in Theorem 3.1 for subagging with and without replacement,
and Theorem 3.6 for splagging without replacement. All these results assume a well-specified linear
model, with an arbitrary covariance matrix for the features and an arbitrary signal vector. Notably, we
assume neither Gaussian features nor isotropic features nor a randomly generated signal. These results
reveal that for the three aforementioned bagging strategies, the bias and variance risk components are
non-increasing in the number of bags M.
3. Cross-validation. In Section 4, we develop a generic cross-validation strategy to select the optimal
subsample or split size (or equivalently, the subsample aspect ratio) and present a general result to
understand the limiting risks of cross-validated predictors. Our theoretical results provide a way to
verify the monotonicity of the limiting risk of the cross-validated predictor in terms of the limiting data
aspect ratio ϕ (Theorem 4.1). In Section 4.2, we specialize to the cross-validated ridge and ridgeless
predictors to obtain the optimal subsample aspect ratio for every M (Theorem 4.5). Moreover, when
optimizing over both the subsample aspect ratio and the number of bags, we show that optimal subagging
always outperforms optimal splagging (Proposition 4.6). Rather surprisingly, in our investigation of the
oracle choice of the subsample size for optimal subagging with M = ∞, we find that the optimal subsample
aspect ratio is always larger than one (Proposition 4.7). In Section 5, we also show that the optimally tuned
subagged ridgeless predictor yields the same prediction risk as the optimal ridge predictor for isotropic
features (Theorem 5.3). (A schematic code sketch of this cross-validation step follows this list.)
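As promised in item 1 above, here is the heuristic behind the reduction to M = 1 and M = 2 (our notation, suppressing conditioning and limits). Writing the M-bagged predictor as the average of exchangeable base predictors f̂_1, …, f̂_M fitted on randomly drawn subsamples, the squared prediction error at a test point (x_0, y_0) expands as
\[
\mathbb{E}\Bigl[\Bigl(y_0 - \frac{1}{M}\sum_{m=1}^{M} \hat f_m(x_0)\Bigr)^{2}\Bigr]
= \frac{1}{M}\, a + \Bigl(1 - \frac{1}{M}\Bigr) b,
\qquad
a = \mathbb{E}\bigl[(y_0 - \hat f_1(x_0))^2\bigr],
\quad
b = \mathbb{E}\bigl[(y_0 - \hat f_1(x_0))(y_0 - \hat f_2(x_0))\bigr].
\]
Since the risk at M = 1 is a and the risk at M = 2 is (a + b)/2, the risk at any M equals (2/M − 1) R_1 + 2(1 − 1/M) R_2; this is of the form C_1 + C_2/M with C_1 = 2R_2 − R_1 and C_2 = 2(R_1 − R_2), and it is non-increasing in M because b ≤ a by the Cauchy–Schwarz inequality. The formal results (Proposition 2.6 and Theorem 2.9) make the conditioning, exchangeability, and passage to the limit precise.

Turning to the cross-validation step in item 3, the following schematic sketch (ours, not the paper's algorithm; in particular, the risk estimate here is a plain hold-out average rather than the specific estimator developed in Section 4) selects the subsample size by minimizing an estimated prediction risk over a grid of candidates.

```python
import numpy as np


def cv_subsample_size(X_tr, y_tr, X_val, y_val, k_grid, M, bagged_predict):
    """Pick the subsample size k minimizing an estimated squared prediction risk.

    `bagged_predict(X_tr, y_tr, X_val, k, M)` should return the M-bagged
    predictions on X_val, e.g., the subagged predictor sketched earlier.
    """
    risks = []
    for k in k_grid:
        preds = bagged_predict(X_tr, y_tr, X_val, k, M)
        risks.append(np.mean((y_val - preds) ** 2))  # hold-out risk estimate
    best = int(np.argmin(risks))
    return k_grid[best], risks[best]
```

Under suitable conditions, the theory in Section 4 shows that a predictor tuned this way has limiting risk matching the oracle choice of the subsample size, and hence a risk profile that is monotone in the data aspect ratio.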
From a technical perspective, during the course of our risk analysis of the bagged ridge and ridgeless predic-
tors, we derive novel deterministic equivalents for ridge resolvents with random Tikhonov-type regularization.
We extend ideas of conditional asymptotic equivalents and related calculus, which may be of independent
interest. See Appendix S.7, and in particular Appendix S.7.3.2.