The good, the bad and the ugly sides of data augmentation:
An implicit spectral regularization perspective
Chi-Heng Lin1,4, Chiraag Kaushik1, Eva L Dyer∗,1,2, Vidya Muthukumar∗,1,3
Abstract
Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning.
Specific augmentations like translations and scaling in computer vision are traditionally believed to
improve generalization by generating new (artificial) data from the same distribution. However, this
traditional viewpoint does not explain the success of prevalent augmentations in modern machine
learning (e.g., randomized masking, cutout, mixup) that greatly alter the training data distribution.
In this work, we develop a new theoretical framework to characterize the impact of a general class
of DA on underparameterized and overparameterized linear model generalization. Our framework
reveals that DA induces implicit spectral regularization through a combination of two distinct effects:
a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-
dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through
ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of
phenomena, including discrepancies in generalization between over-parameterized and under-parameterized
regimes and differences between regression and classification tasks. Our framework highlights the nuanced
and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation
design.
1 Introduction
Data augmentation (DA), or the transformation of data samples before or during learning, is a workhorse
of both supervised [65, 37, 46] and self-supervised approaches [29, 19, 31, 3, 76] for machine learning (ML).
It is critical to the success of modern ML in multiple domains, e.g., computer vision [65], natural language
processing [28], time series data [71], and neuroscience [42, 3, 45]. This is especially true in settings where
data and/or labels are scarce or in other cases where algorithms are prone to overfitting [77]. While DA is
perhaps one of the most widely used tools for regularization, most augmentations are applied in an ad hoc
manner, and it is often unclear exactly how, why, and when a DA strategy will work for a given dataset [21,
59, 4].
Recent theoretical studies have provided insights into the effect of DA on learning and generalization when
augmented samples lie close to the original data distribution [23, 18]. However, state-of-the-art augmentations
that are used in practice (e.g. data masking [35], cutout [26], mixup [78]) are stochastic and can significantly
alter the distribution of the data [30, 35, 75]. Despite many efforts to explain the success of DA in the
literature [9, 16, 18, 23, 73], there is still a lack of a comprehensive platform to compare different types of
augmentations at a quantitative level.
In this paper, we address this challenge by proposing a simple yet flexible theoretical framework that
precisely characterizes the impact of DA on generalization. Our framework enables generalization analysis
for: 1. general stochastic augmentations, 2. the classical underparameterized regime [34] and the modern
∗Both senior authors contributed equally.
1School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA
2Department of Biomedical Engineering, Georgia Institute of Technology, GA, USA
3School of Industrial and Systems Engineering, Georgia Institute of Technology, GA, USA
4Samsung Research America
arXiv:2210.05021v3 [cs.LG] 27 Feb 2024
overparameterized regime, 3. regression and classification tasks, and 4. strong and weak distributional-shift
augmentations. To do this, we borrow and build on finite-sample analysis techniques that simultaneously
operate in the underparameterized and overparameterized regime for linear and kernel models [6, 68, 52, 53].
We find that DA induces two types of implicit, training-data-dependent regularization: manipulation of the spectrum (i.e., eigenvalues) of the data covariance matrix, and the addition of explicit ℓ2-type regularization to avoid noise overfitting. The first effect of spectral manipulation can either make or break generalization by introducing helpful or harmful biases. In contrast, the explicit ℓ2 regularization effect always improves generalization by preventing possibly harmful overfitting of noise.
Our theory reveals good, bad, and ugly sides to DA depending on the setting, nature of task and type
of augmentation. We find that on one hand, DA improves generalization when it is designed in a targeted
manner to reduce variance while preserving bias (for any setting/task) or if the reduction in variance
outweighs increase in bias (for classification or underparameterized regression). On the other hand, DA
is more unforgiving for overparameterized regression; here, we find that popular augmentations frequently
induce a large increase in both bias and distribution shift between training and test data. We also identify
several ugly (i.e. subtle/nuanced) features to DA depending on whether the task is regression or classification,
the model is underparameterized or overparameterized, and the augmentations are pre-computed or applied
on-the-fly.
1.1 Main contributions
Below, we outline and provide a roadmap of the main contributions of this work.
We propose a new framework for studying non-asymptotic generalization with data augmentation for
linear models by building on the recent literature on the theory of overparameterized learning [6, 68, 52].
We provide natural definitions of the augmentation mean and covariance operators that capture the
impact of change in data distribution on model generalization in Section 3.1, and sharply characterize
the ensuing performance for both regression and classification tasks in Sections 4.3 and 4.4, respectively.
In Section 5, we apply our theory to provide new interpretations of a broad class of randomized DA
strategies used in practice; e.g., random-masking [35], cutout [26], noise injection [9], and group-invariant
augmentations [18]. An example is as follows: while the classical noise injection augmentation [9] causes
only a constant shift in the spectrum, data masking [35, 2], cutout [26] and distribution-preserving
augmentations [18] tend to isotropize the equivalent data spectrum. We also use our framework as
a testbed for new approaches by designing a new augmentation method, inspired by isometries in
random feature rotation (Section 7.1). We show that this augmentation achieves smaller bias than the
least-squares estimator and variance reduction on the order of the ridge estimator.
In Section 6 we empirically examine the influence of DA in conjunction with data and model family on
generalization. We compare our closed-form expression with augmented stochastic gradient descent
(SGD) [23, 18, 19] and pre-computed augmentations [73, 64]. In addition to verifying our theoretical
insights, our experiments reveal phenomena of independent interest, including surprising distinctions
between pre-computed DA and augmented SGD and varying robustness to augmentation hyperparameter
tuning between regression and classification tasks.
We conclude in Section 7 with an extended discussion of the “good, bad and ugly” sides of DA. In
Section 7.1 we discuss how cleverly designed data-adaptive covariance modification can reduce both bias
and variance, and how a broader class of DA leads to variance reduction that outweighs bias increase
for classification and underparameterized regression tasks. In Section 7.2 we unpack the suboptimalities
of the “isotropizing” effect of DA, particularly in overparameterized regression where bias is especially
harmful [53, 33]. Finally, in Section 7.3, we identify strikingly divergent impacts of DA depending
on whether the task is regression or classification, the model is under or overparameterized, and the
augmentation is pre-computed or applied to SGD. Our findings here corroborate the empirically observed
benefits of DA being primarily applied “on-the-fly”, on moderate-dimensional data and classification
tasks [75, 22].
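To make the augmentations referenced above concrete, the following NumPy sketch gives minimal single-sample implementations of three of them. The exact parameterizations (e.g., whether random masking rescales by the keep probability, or how cutout chooses its block) are our illustrative assumptions, not the definitions analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_injection(x, sigma=0.1):
    """Add isotropic Gaussian noise to a sample (classical augmentation [9])."""
    return x + sigma * rng.standard_normal(x.shape)

def random_mask(x, keep_prob=0.8):
    """Zero out each coordinate independently; rescaling by keep_prob keeps the
    augmented sample unbiased (an inverted-dropout-style convention we assume)."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

def cutout(x, width=3):
    """Zero out a contiguous block of `width` coordinates at a random offset."""
    x = x.copy()
    start = rng.integers(0, len(x) - width + 1)
    x[start:start + width] = 0.0
    return x
```

Applied on-the-fly, a fresh draw of one of these maps is composed with each training sample at every optimization step.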
1.2 Notation
We use n to denote the number of training examples and p to denote the data dimension. Given a training data matrix X ∈ ℝ^{n×p} whose rows (each representing a training example) are independent and identically distributed (i.i.d.) with covariance Σ := E[xx^⊤], we denote by P^Σ_{1:k-1} and P^Σ_{k:} the projection matrices onto the top k−1 and the bottom p−k+1 eigen-subspaces of Σ, respectively. For convenience, we denote the residual Gram matrix by A_k(X; λ) = λI_n + X P^Σ_{k:} X^⊤, where λ is a regularization constant. Subscripts denote subsets of column vectors when applied to a matrix; e.g., for a matrix V we have V_{a:b} := [v_a, v_{a+1}, …, v_b]. A similar definition applies to vectors; e.g., for a vector x we have x_{a:b} = [x_a, x_{a+1}, …, x_b]. The Mahalanobis norm of a vector is defined by ‖x‖_H = √(x^⊤Hx). For a matrix A, diag(A) denotes the diagonal matrix whose diagonal equals that of A, Tr(A) denotes its trace, and μ_i(A) its i-th largest eigenvalue. The symbols ≲ and ≳ denote inequality relations that hold up to universal constants that may depend only on σ_x or σ_ε, and not on n or p. All asymptotic convergence results are stated in probability.
More specific notation corresponding to our signal model is given in Section 4.1, and some additional
notation that is convenient to define for our analysis is postponed to Section 4.2.
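As a concrete illustration of these definitions, they can be computed in a few lines of NumPy. The setup below (a diagonal covariance with Gaussian rows, and the particular values of n, p, k, λ) is a hypothetical example of ours, not the paper's signal model:

```python
import numpy as np

# Illustrative computation of the notation above.
n, p, k = 20, 8, 3
rng = np.random.default_rng(0)
Sigma = np.diag(np.arange(p, 0, -1.0))       # covariance with spectrum p, p-1, ..., 1
X = rng.standard_normal((n, p)) * np.sqrt(np.diag(Sigma))  # i.i.d. rows with covariance Sigma

eigvals, V = np.linalg.eigh(Sigma)
V = V[:, ::-1]                               # reorder eigenvectors: largest eigenvalue first
P_top = V[:, :k - 1] @ V[:, :k - 1].T        # projection onto top k-1 eigen-subspace of Sigma
P_bot = V[:, k - 1:] @ V[:, k - 1:].T        # projection onto bottom p-k+1 eigen-subspace

lam = 0.5
A_k = lam * np.eye(n) + X @ P_bot @ X.T      # residual Gram matrix A_k(X; lam)

def mahalanobis_norm(x, H):
    """The norm ||x||_H = sqrt(x^T H x)."""
    return float(np.sqrt(x @ H @ x))
```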
2 Related Work
We organize our discussion of related work into two verticals: a) historical and recent perspectives on the role
of data augmentation, and b) recent analyses of minimum-norm and ridge estimators in the over-parameterized
regime.
2.1 Data augmentation
Classical links between DA and regularization: Early analysis of DA showed that adding random
Gaussian noise to data points is equivalent to Tikhonov regularization [9] and vicinal risk minimization [78,
16]; in the latter, a local distribution is defined in the neighborhood of each training sample, and new
samples are drawn from these local distributions to be used during training. These results established an
early link between augmentation and explicit regularization. However, the impact of such approaches on
generalization has been mostly studied in the underparameterized regime of ML, where the primary concern
is reducing variance and avoiding overfitting of noise. Modern ML practices, by contrast, have achieved great
empirical success in overparameterized settings and with a broader range of augmentation strategies [65, 37,
46]. The type of regularization that is induced by these more general augmentation strategies is not well
understood. Our work provides a systematic point of view to study this general connection without assuming
any additional explicit regularization, or specific operating regime.
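The classical equivalence cited above [9] can be checked numerically: in expectation over the injected noise, the squared loss on Gaussian-noise-augmented inputs equals the unaugmented loss plus a ridge penalty nσ²‖θ‖². The Monte Carlo sketch below is our own construction (arbitrary sizes and noise scale), not an experiment from the paper:

```python
import numpy as np

# Check that Gaussian noise injection acts as Tikhonov (ridge) regularization
# in expectation:
#   E_N ||y - (X + N) theta||^2 = ||y - X theta||^2 + n sigma^2 ||theta||^2,
# where N has i.i.d. N(0, sigma^2) entries.
rng = np.random.default_rng(1)
n, p, sigma = 50, 5, 0.3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
theta = rng.standard_normal(p)

draws = 5000
N = sigma * rng.standard_normal((draws, n, p))               # noise draws, batched
aug_loss = np.mean(np.sum((y - (X + N) @ theta) ** 2, axis=1))  # Monte Carlo average
ridge_loss = np.sum((y - X @ theta) ** 2) + n * sigma ** 2 * np.sum(theta ** 2)
print(abs(aug_loss - ridge_loss) / ridge_loss)               # small relative gap
```

The cross term between the residual and the noise vanishes in expectation, which is why only the penalty term survives.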
In-distribution versus out-of-distribution augmentations: Intuitively, if we could design an augmen-
tation that would produce more virtual but identically distributed samples of our data, we would expect an
improvement in generalization. Based on this insight and the inherent structure of many augmentations used
in vision (that have symmetries), another set of works explores the common intuition that data augmentation
helps insert beneficial group-invariances into the learning process [20, 58, 51, 11, 74]. These studies generally
consider cases in which the group structure is explicitly present in the model design via convolutional
architectures [20, 11] or feature maps approximating group-invariant kernels [58, 51]. The authors of [18]
propose a general group-theoretic framework for DA and explain that an averaging effect helps the model
generalize through variance reduction. However, they only consider augmentations that do not alter (or
alter by minimal amounts) the original data distribution; consequently, they identify variance reduction as
a sole positive effect of DA. Moreover, their analysis applies primarily to underparameterized or explicitly
regularized models1.
1More recent studies of invariant kernel methods, trained to interpolation, suggest that invariance could either improve [48] or worsen [27] generalization depending on the precise setting. Our results for the overparameterized linear model (in particular, Corollary 9) also support this message.
Recent empirical studies have highlighted the importance of diverse stochastic augmentations [30]. They
argue that in many cases, it is important to introduce samples which are out-of-distribution (OOD) [66,
57] (in the sense that they do not resemble the original data). In our framework, we allow for cases in
which augmentation leads to significant changes in distribution and provide a path to analysis for such OOD
augmentations that encompass empirically popular approaches for DA [35, 26]. We also consider the modern
overparameterized regime [7, 24]. We show that the effects of OOD augmentations go far beyond variance
reduction, and the spectral manipulation effect introduces interesting biases that can either improve or worsen
generalization for overparameterized models.
Analysis of specific types of DA in linear and kernel methods: [23] propose a Markov process-based
framework to model compositional DA and demonstrate an asymptotic connection between a Bayes-optimal
classifier and a kernel classifier dependent on DA. Furthermore, they study the augmented empirical risk
minimization procedure and show that some types of DA, implemented in this way, induce approximate data-
dependent regularization. However, unlike our work, they do not quantitatively study the generalization of
these classifiers. [44] also propose a kernel classifier based on a notion of invariance to local translations, which
produces competitive empirical performance. In another recent analysis, [73] study the generalization of linear
models with DA that constitutes linear transformations on the data for regression in the overparameterized
regime (but still considering additional explicit regularization). They find that data augmentation can
enlarge the span of training data and induce regularization. There are several key differences between their
framework and ours. First, they analyze deterministic DA, while we analyze stochastic augmentations used in
practice [31, 18]. Second, they assume that the augmentations would not change the labels generated by the
ground-truth model, thereby only identifying beneficial scenarios for DA (while we identify scenarios that are
both helpful and harmful). Third, they study empirical risk minimization with pre-computed augmentations,
in contrast to our study of augmentations applied on-the-fly during the optimization process [23, 18], which
are arguably more commonly used in practice. Our experiments in Section 6.4 identify sizably different
impacts of these methods of application of DA even in simple linear models. Finally, the role of DA in linear
model optimization, rather than generalization, has also been recently studied; in particular, [32] characterize
how DA affects the convergence rate of optimization.
The impact of DA on nonlinear models: Recent works aim to understand the role of DA in
nonlinear models such as neural networks. [43] show that certain local augmentations induce regularization
in deep networks via a “rugosity”, or “roughness” complexity measure. While they show empirically that DA
reduces rugosity, they leave open the question of whether this alone is an appropriate measure of a model’s
generalization capability. Very recently, [64] showed that training a two-layer convolutional neural network
with a specific permutation-style augmentation can have a novel feature manipulation effect. Assuming the
recently posited “multi-view” signal model [1], they show that this permutation-style DA enables the model
to better learn the essential feature for a classification task. They also observe that the benefit becomes more
pronounced for nonlinear models. Our work provides a similar message, as we also identify the DA-induced
data manipulation effect as key to generalization. However, we provide a comprehensive general-purpose
framework for DA by which we can compare and contrast different augmentations that can either help or
hurt generalization, while [64] only analyze a permutation-style augmentation. We believe that combining
our general-purpose framework for DA with a more complex nonlinear model is a promising future direction,
and we discuss possible analysis paths for this in Section 8.
2.2 Interpolation and regularization in overparameterized models
Minimum-norm-interpolation analysis: Our technical approach leverages recent results in overparame-
terized linear regression, where models are allowed to interpolate the training data. Following the definition
of [24], we characterize such works by their explicit focus on models that achieve close to zero training
loss and which have a high complexity relative to the number of training samples. Specifically, many of
these works provide finite-sample analysis of the risk of the least-squares estimator (LSE) and the ridge
estimator [6, 68, 33, 8, 53]. This line of research (most notably, [6, 68]) finds that the mean squared error
(MSE), comprising the bias and variance, can be characterized in terms of the effective ranks of the spectrum
of the data distribution. The main insight is that, contrary to traditional wisdom, perfect interpolation of the
data may not have a harmful effect on the generalization error in highly overparameterized models. In the
context of these advances, we identify the principal impact of DA as spectral manipulation which directly
modifies the effective ranks, thus either improving or worsening generalization. We build in particular on the
work of [68], who provide non-asymptotic characterizations of generalization error for general sub-Gaussian
design, with some additional technical assumptions that also carry over to our framework2.
Subsequently, this type of “harmless interpolation” was shown to occur for classification tasks [52, 14, 70,
17, 63, 25, 50]. In particular, [52, 63] showed that classification can be significantly easier than regression
due to the relative benignness of the 0-1 test loss. Our analysis also compares classification and regression
and shows that the potentially harmful biases generated by DA are frequently nullified with the 0-1 metric.
As a result, we identify several beneficial scenarios for DA in classification tasks. At a technical level, we
generalize the analysis of [52] to sub-Gaussian design. We also believe that our framework can be combined
with the alternative mixture model (where covariates are generated from discrete labels [17, 70, 14]), but we
do not formally explore this path in this paper.
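One intuition behind this regression-classification gap can be illustrated directly (this toy example is ours, not the paper's analysis): the 0-1 loss depends only on the sign of the prediction, so an estimator whose error is a positive rescaling of the signal, e.g., the shrinkage induced by ridge-like implicit regularization, can classify perfectly while incurring large regression MSE.

```python
import numpy as np

# A heavily shrunken estimator has large regression MSE but identical 0-1
# classification error, since sign(c * <theta, x>) = sign(<theta, x>) for c > 0.
rng = np.random.default_rng(3)
p = 20
theta_star = rng.standard_normal(p)             # ground-truth parameter
theta_shrunk = 0.1 * theta_star                 # biased toward zero by a factor of 10

X_test = rng.standard_normal((1000, p))
y_clean = X_test @ theta_star                   # noiseless regression targets

mse = np.mean((X_test @ theta_shrunk - y_clean) ** 2)                 # large
err01 = np.mean(np.sign(X_test @ theta_shrunk) != np.sign(y_clean))  # zero
print(mse, err01)
```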
Generalized ℓ2 regularizer analysis: Our framework extends the analyses of least squares and ridge regression to estimators with general Tikhonov regularization, i.e., a penalty of the form θ^⊤Mθ for arbitrary
positive definite matrix M. A closely related work is [72], which analyzes the regression generalization error
of general Tikhonov regularization. However, our work differs from theirs in three key respects. First, the
analysis of [72] is based on the proportional asymptotic limit (where the sample size n and data dimension p increase proportionally with a fixed ratio) and provides sharp asymptotic formulas for regression error that are exact, but not closed-form and not easily interpretable. On the other hand, our framework is non-asymptotic, and we generally consider p ≪ n or p ≫ n; our expressions are closed-form, match up to universal constants
and are easily interpretable. Second, our analysis allows for a more general class of random regularizers that
themselves depend on the training data; a key technical innovation involves showing that the additional effect
of this randomness is, in fact, minimal. Third, we do not explicitly consider the problem of determining an
optimal regularizer; instead, we compare and contrast the generalization characteristics of various types of
practical augmentations and discuss which characteristics lead to favorable performance.
In addition to explicitly regularized estimators, [72] also analyze the ridgeless limit for these regularizers,
which can be interpreted as the minimum-Mahalanobis-norm interpolator. In Section 6.1 we show that such
estimators can also be realized in the limit of minimal DA.
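For intuition, the minimum-Mahalanobis-norm interpolator admits the standard closed form θ̂ = M⁻¹X^⊤(XM⁻¹X^⊤)⁻¹y (a textbook Lagrangian calculation, not specific to [72]). The sketch below, with arbitrary illustrative dimensions and a diagonal M of our choosing, verifies that it interpolates exactly:

```python
import numpy as np

# Minimum-Mahalanobis-norm interpolator:
#   argmin_theta theta^T M theta   subject to   X theta = y,
# with closed form theta_hat = M^{-1} X^T (X M^{-1} X^T)^{-1} y.
rng = np.random.default_rng(2)
n, p = 10, 30                                   # overparameterized: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
M = np.diag(rng.uniform(0.5, 2.0, size=p))      # positive definite regularizer

M_inv = np.linalg.inv(M)
theta_hat = M_inv @ X.T @ np.linalg.solve(X @ M_inv @ X.T, y)
print(np.max(np.abs(X @ theta_hat - y)))        # near machine precision
```

With M = I this reduces to the minimum-ℓ2-norm interpolator X⁺y, i.e., the ridgeless least-squares limit.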
The role of explicit regularization and hyperparameter tuning: Research on harmless interpola-
tion and double descent [7] has challenged conventional thinking about regularization and overfitting for
overparameterized models; in particular, good performance can be achieved with weak (or even negative)
explicit regularization [40, 68], and gradient descent trained to interpolation can sometimes beat ridge
regression [60]. These results show that the scale of the ridge regularization significantly affects model
generalization; consequently, recent work strives to estimate the optimal scale of ridge regularization using
cross-validation techniques [56, 55].
As shown in classical work [9], ridge regularization is equivalent to augmentation with (isotropic) Gaussian
noise, and the scale of regularization naturally maps to the variance of Gaussian noise augmentation. Our
work links DA to a much more flexible class of regularizers and shows that some types of DA induce an
implicit regularization that yields much more robust performance across the hyperparameter(s) dictating the
“strength” of the augmentation. In particular, our experiments in Section 6.2 show that random mask [35],
cutout [26] and our new random rotation augmentation yield comparable generalization error for a wide
range of hyperparameters (masking probability, cutout width and rotation angle respectively); the random
rotation is a new augmentation proposed in this work and frequently beats ridge regularization as well as
interpolation. Thus, our flexible framework enables the discovery of DA with appealing robustness properties
not present in the more basic methodology of ridge regularization.
2As remarked at various points throughout the paper, we believe that the subsequent and very recent work of [47], which weakens these assumptions further, can also be plugged into our analysis framework; we will explore this in the sequel.