1 Introduction
Modern machine learning models often use a large number of parameters relative to the number of observations. In this regime, several commonly used procedures exhibit a peculiar risk behavior, which is referred to as double or multiple descents in the risk profile (Belkin et al., 2019; Zhang et al., 2017, 2021). The precise nature of the double or multiple descent behavior in the generalization error has been studied for various procedures: e.g., linear regression (Belkin et al., 2020; Muthukumar et al., 2020; Hastie et al., 2022), logistic regression (Deng et al., 2022), random features regression (Mei and Montanari, 2022), kernel regression (Liu et al., 2021), among others. We refer the readers to the survey papers by Bartlett et al. (2021); Belkin (2021); Dar et al. (2021) for a more comprehensive review and other related references. In these cases, the asymptotic
prediction risk behavior is often studied as a function of the data aspect ratio (the ratio of the number of
parameters/features to the number of observations). The double descent behavior refers to the phenomenon
where the (asymptotic) risk of a sequence of predictors first increases as a function of the aspect ratio, peaks
at a certain point (or diverges to infinity), and then decreases with the aspect ratio. From a traditional
statistical point of view, it is not immediately obvious what risk behavior one should desire as a function of the aspect ratio. We can, however, reformulate this behavior in terms of the observation size n with a fixed p; imagine a large but fixed p and n changing from 1 to ∞, so that ϕ = p/n varies accordingly. In this reformulation, the double descent behavior translates to a pattern in which the risk first decreases as n increases, then increases, peaks at a certain point, and then decreases again with n. This is a rather counter-intuitive and sub-optimal behavior
for a prediction procedure. The least one would expect from a good prediction procedure is that it yields
better performance with more information (i.e., more data). However, the aforementioned works show that
many commonly used predictors may not exhibit such “good” behavior. Simply put, the non-monotonicity
of the asymptotic risk as a function of the number of observations or the limiting aspect ratio implies that
more data may hurt generalization (Nakkiran, 2019).
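To make this non-monotonicity concrete, consider the known asymptotic risk of ridgeless (minimum-norm) least squares with isotropic features (Hastie et al., 2022). The short Python sketch below evaluates this risk formula with p held fixed while n grows; the particular signal energy and noise level are illustrative assumptions, chosen only to display the pattern described above.

```python
# Asymptotic risk of ridgeless (minimum-norm) least squares with isotropic
# features (Hastie et al., 2022), as a function of phi = p/n:
#   sigma^2 * phi / (1 - phi)                    for phi < 1,
#   r^2 * (1 - 1/phi) + sigma^2 / (phi - 1)      for phi > 1.
# The signal energy r^2 and noise level sigma^2 below are illustrative only.

def ridgeless_risk(phi, r2=5.0, sigma2=1.0):
    if phi < 1:
        return sigma2 * phi / (1 - phi)
    return r2 * (1 - 1 / phi) + sigma2 / (phi - 1)

p = 100  # fixed number of features
for n in [25, 50, 90, 110, 200, 400]:  # growing sample size (skipping n = p)
    phi = p / n
    print(f"n = {n:3d}, phi = {phi:4.2f}, risk = {ridgeless_risk(phi):6.2f}")

# With p fixed, the risk first decreases in n, then rises and peaks near the
# interpolation threshold n = p, and only then decreases again: more data can hurt.
```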
Several ad hoc regularization techniques have been proposed in the literature to mitigate the double/multiple
descent behaviors. Most of these methods are trial-and-error in nature in the sense that they do not directly
target monotonizing the asymptotic risk but instead try a modification and check that it yields a monotonic
risk. The recent work of Patil et al. (2022b) introduces a generic cross-validation framework that directly
addresses the problem and yields a modification of any given prediction procedure that provably monotonizes
the risk. In a nutshell, the method works by training the predictor on subsets of the full data (with different
subset sizes) and picking the optimal subset size based on the estimated prediction risk computed using
testing data. Intuitively, it is clear that this yields a prediction procedure whose risk is a decreasing function
of the observation size. In the proportional asymptotic regime, where p/n → ϕ as n, p → ∞, the paper
proves that this strategy returns a prediction procedure whose asymptotic risk is monotonically increasing in
ϕ. The paper theoretically analyzes the case where only one subset is used for each subset size and illustrates
via numerical simulations that using multiple subsets of the data of the same size (i.e., subsampling) can
yield better prediction performance in addition to monotonizing the risk profile. Note that averaging a
predictor computed on M different subsets of the data of the same size is referred to in the literature as
subagging, a variant of the classical bagging (bootstrap aggregation) proposed by Breiman (1996). The focus
of the current paper is to analyze the properties of bagged predictors in two directions (in the proportional
asymptotics regime): (1) what is the asymptotic predictive risk of the bagged predictors with M bags as
a function of M, and (2) does the cross-validated bagged predictor provably yield improvements over the
predictor computed on full data and does it have a monotone risk profile (i.e., the asymptotic risk is a
monotonic function of ϕ)?
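As a point of reference for the procedures studied below, the following is a minimal Python sketch of cross-validated subagging: for each candidate subsample size, a base predictor is averaged over M random subsets of that size, and the subsample size with the smallest estimated risk on held-out data is selected. The ridge base predictor, the candidate sizes, and the squared-error risk estimate are illustrative assumptions, not the exact construction analyzed in Patil et al. (2022b) or in this paper.

```python
# A schematic sketch of cross-validated subagging: average a base predictor
# over M random subsets of each candidate subsample size k, estimate the
# prediction risk on held-out data, and keep the best k.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def subagged_predictor(X, y, k, M, base=Ridge):
    """Average the base predictor over M random subsets of size k."""
    models = []
    for _ in range(M):
        idx = rng.choice(len(y), size=k, replace=False)  # one random subset
        models.append(base().fit(X[idx], y[idx]))
    return lambda X_new: np.mean([m.predict(X_new) for m in models], axis=0)

def cv_subagging(X_train, y_train, X_val, y_val, sizes, M=10):
    """Pick the subsample size whose subagged predictor has the smallest
    estimated (validation) squared-error risk."""
    best = None
    for k in sizes:
        pred = subagged_predictor(X_train, y_train, k, M)
        risk = np.mean((y_val - pred(X_val)) ** 2)  # estimated prediction risk
        if best is None or risk < best[1]:
            best = (k, risk, pred)
    return best  # (chosen subsample size, its estimated risk, predictor)

# Toy usage with p comparable to n, as in the proportional regime.
n, p = 200, 150
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + rng.standard_normal(n)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]
k_hat, risk_hat, f_hat = cv_subagging(X_tr, y_tr, X_va, y_va,
                                      sizes=[25, 50, 75, 100, 150])
```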
In this paper, we investigate several variants of bagging, including subagging as a special case. The second
variant of bagging, which we call splagging (short for split-aggregating), is the same as the divide-and-conquer or data-splitting approach (Rosenblatt and Nadler, 2016; Banerjee et al., 2019). The divide-and-conquer approach is widely used in distributed learning, although it is not commonly featured in the bagging literature (Dobriban and Sheng, 2020, 2021; Mücke et al., 2022). Formally, splagging splits the data into
non-overlapping parts of equal size and averages the predictors trained on these non-overlapping parts. We
refer to the common size of each part of the data as the subsample size. For simplicity, we use the same terminology for subagging as well. Using classical results from survey sampling and some simple lemmas about