
(MSE), comprising the bias and variance, can be characterized in terms of the effective ranks of the spectrum
of the data distribution. The main insight is that, contrary to traditional wisdom, perfect interpolation of the
data may not have a harmful effect on the generalization error in highly overparameterized models. In the
context of these advances, we identify the principal impact of DA as spectral manipulation which directly
modifies the effective ranks, thus either improving or worsening generalization. We build in particular on the
work of [68], who provide non-asymptotic characterizations of generalization error for general sub-Gaussian
design, with some additional technical assumptions that also carry over to our framework2.
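To make the notion of spectral manipulation concrete, the following illustrative numpy sketch computes the two effective ranks that appear in this line of work (the definitions follow the benign-overfitting literature as we use it; the particular spectrum and the tail shift are placeholder assumptions, not quantities from our analysis). Adding an isotropic component to every eigenvalue, which is the expected effect of Gaussian noise augmentation on the covariance, flattens the tail and inflates both effective ranks.

import numpy as np

def effective_ranks(eigs, k):
    """Effective ranks of the tail of a covariance spectrum (eigenvalues sorted
    in decreasing order), in the style of the benign-overfitting literature:
    r_k = (sum of tail eigenvalues) / (largest tail eigenvalue),
    R_k = (sum of tail eigenvalues)^2 / (sum of squared tail eigenvalues)."""
    eigs = np.sort(np.asarray(eigs, dtype=float))[::-1]
    tail = eigs[k:]
    return tail.sum() / tail[0], tail.sum() ** 2 / (tail ** 2).sum()

# Illustrative spectrum: a few dominant directions plus a slowly decaying tail.
p = 500
eigs = np.concatenate(([10.0, 5.0, 2.0], 1.0 / np.arange(1, p - 2)))

# Isotropic Gaussian noise augmentation with variance sigma^2 adds sigma^2 to
# every eigenvalue of the expected augmented covariance, flattening the tail.
sigma2 = 0.5
augmented = eigs + sigma2

print("original :", effective_ranks(eigs, k=3))
print("augmented:", effective_ranks(augmented, k=3))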
Subsequently, this type of “harmless interpolation” was shown to occur for classification tasks [52, 14, 70,
17, 63, 25, 50]. In particular, [52, 63] showed that classification can be significantly easier than regression
due to the relative benignness of the 0-1 test loss. Our analysis also compares classification and regression
and shows that the potentially harmful biases generated by DA are frequently nullified with the 0-1 metric.
As a result, we identify several beneficial scenarios for DA in classification tasks. At a technical level, we
generalize the analysis of [52] to sub-Gaussian design. We also believe that our framework can be combined
with the alternative mixture model (where covariates are generated from discrete labels [17, 70, 14]), but we
do not formally explore this path in this paper.
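As a toy illustration of why the 0-1 metric can nullify biases that are harmful for regression (a hypothetical example for intuition only, not the mixture-model setting or the formal analysis of [52]): a multiplicatively shrunk copy of the optimal linear rule incurs substantial squared error, yet makes exactly the same sign predictions and hence the same classification decisions.

import numpy as np

rng = np.random.default_rng(0)
p, n_test = 20, 100_000
theta_star = rng.normal(size=p)
theta_star /= np.linalg.norm(theta_star)

# Hypothetical biased estimator: a shrunk copy of the optimal direction,
# standing in for the kind of multiplicative bias an augmentation can induce.
theta_hat = 0.3 * theta_star

X = rng.normal(size=(n_test, p))             # isotropic Gaussian test covariates
y_reg = X @ theta_star                       # noiseless regression targets
y_cls = np.sign(y_reg)                       # labels given by the sign of the same rule

mse = np.mean((X @ theta_hat - y_reg) ** 2)        # = ||theta_hat - theta_star||^2, about 0.49
err01 = np.mean(np.sign(X @ theta_hat) != y_cls)   # = 0: the sign is unaffected by shrinkage
print(mse, err01)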
Generalized ℓ2 regularizer analysis: Our framework extends the analyses of least squares and ridge regression to estimators with general Tikhonov regularization, i.e., a penalty of the form θ⊤Mθ for arbitrary
positive definite matrix M. A closely related work is [72], which analyzes the regression generalization error
of general Tikhonov regularization. However, our work differs from theirs in three key respects. First, the
analysis of [72] is based on the proportional asymptotic limit (where the sample size n and data dimension p increase proportionally with a fixed ratio) and provides sharp asymptotic formulas for regression error that are exact, but not closed-form and not easily interpretable. On the other hand, our framework is non-asymptotic, and we generally consider p ≫ n or p ≪ n; our expressions are closed-form, match up to universal constants
and are easily interpretable. Second, our analysis allows for a more general class of random regularizers that
themselves depend on the training data; a key technical innovation involves showing that the additional effect
of this randomness is, in fact, minimal. Third, we do not explicitly consider the problem of determining an
optimal regularizer; instead, we compare and contrast the generalization characteristics of various types of
practical augmentations and discuss which characteristics lead to favorable performance.
In addition to explicitly regularized estimators, [72] also analyze the ridgeless limit for these regularizers,
which can be interpreted as the minimum-Mahalanobis-norm interpolator. In Section 6.1 we show that such
estimators can also be realized in the limit of minimal DA.
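For concreteness, the following numpy sketch contrasts the two estimators discussed here: the generalized Tikhonov solution for a penalty θ⊤Mθ and its ridgeless limit, the minimum-Mahalanobis-norm interpolator (the closed forms are standard linear algebra written in our notation; the data and the matrix M are arbitrary placeholders).

import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 100                        # overparameterized regime, p >> n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# An arbitrary positive definite regularizer M (placeholder choice).
A = rng.normal(size=(p, p))
M = A @ A.T + np.eye(p)

def generalized_tikhonov(X, y, M, lam):
    """argmin_theta ||y - X theta||^2 + lam * theta^T M theta."""
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)

# Minimum-Mahalanobis-norm interpolator: argmin theta^T M theta s.t. X theta = y.
Minv_Xt = np.linalg.solve(M, X.T)
theta_interp = Minv_Xt @ np.linalg.solve(X @ Minv_Xt, y)

# The Tikhonov solution approaches the interpolator as lam -> 0 (ridgeless limit).
for lam in (1e-1, 1e-3, 1e-6):
    theta_lam = generalized_tikhonov(X, y, M, lam)
    print(lam, np.linalg.norm(theta_lam - theta_interp))
print("interpolation residual:", np.linalg.norm(X @ theta_interp - y))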
The role of explicit regularization and hyperparameter tuning: Research on harmless interpolation and double descent [7] has challenged conventional thinking about regularization and overfitting for
overparameterized models; in particular, good performance can be achieved with weak (or even negative)
explicit regularization [40, 68], and models trained by gradient descent to interpolation can sometimes beat ridge regression [60]. These results show that the scale of ridge regularization significantly affects model
generalization; consequently, recent work strives to estimate the optimal scale of ridge regularization using
cross-validation techniques [56, 55].
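For reference, such tuning can be carried out with a generic k-fold sweep over the ridge scale, as in the following synthetic-data sketch (a plain illustration, not the specific procedures of [56, 55]).

import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 300
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) / np.sqrt(p) + 0.5 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge solution for the objective ||y - X theta||^2 + lam ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

folds = np.array_split(rng.permutation(n), 5)

def cv_error(lam):
    """Average held-out squared error of ridge(lam) over the fixed folds."""
    errs = []
    for val in folds:
        train = np.setdiff1d(np.arange(n), val)
        theta = ridge(X[train], y[train], lam)
        errs.append(np.mean((X[val] @ theta - y[val]) ** 2))
    return np.mean(errs)

lambdas = 10.0 ** np.arange(-4, 4)
scores = [cv_error(lam) for lam in lambdas]
print("selected lambda:", lambdas[int(np.argmin(scores))])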
As shown in classical work [9], ridge regularization is equivalent to augmentation with (isotropic) Gaussian
noise, and the scale of regularization naturally maps to the variance of Gaussian noise augmentation. Our
work links DA to a much more flexible class of regularizers and shows that some types of DA induce an
implicit regularization that yields much more robust performance across the hyperparameter(s) dictating the
“strength” of the augmentation. In particular, our experiments in Section 6.2 show that random mask [35], cutout [26], and random rotation (a new augmentation proposed in this work) yield comparable generalization error across a wide range of hyperparameters (masking probability, cutout width, and rotation angle, respectively); random rotation also frequently beats both ridge regularization and interpolation. Thus, our flexible framework enables the discovery of DA with appealing robustness properties
not present in the more basic methodology of ridge regularization.
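For completeness, the following sketch numerically checks the classical equivalence [9] invoked above: least squares on many Gaussian-noise-augmented copies of the data approaches the ridge solution whose scale matches the noise variance (the normalization λ = nσ² corresponds to the unnormalized sum-of-squares objective used in this sketch; the data are synthetic placeholders).

import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 20, 50, 0.5                    # overparameterized: p > n
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) / np.sqrt(p) + 0.1 * rng.normal(size=n)

# Ridge at the scale predicted by the classical equivalence: for the objective
# ||y - X theta||^2 + lam ||theta||^2, isotropic input noise of variance
# sigma^2 corresponds to lam = n * sigma^2.
lam = n * sigma ** 2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Empirical augmentation: least squares on K noisy copies of every training point.
K = 2000
X_aug = np.repeat(X, K, axis=0) + sigma * rng.normal(size=(n * K, p))
y_aug = np.repeat(y, K)
theta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# The gap is small and shrinks as K grows.
print(np.linalg.norm(theta_aug - theta_ridge) / np.linalg.norm(theta_ridge))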
2 As remarked at various points throughout the paper, we believe that the subsequent and very recent work of [47], which weakens these assumptions further, can also be plugged into our analysis framework; we will explore this in the sequel.