The good, the bad and the ugly sides of data augmentation:
An implicit spectral regularization perspective
Chi-Heng Lin1,4, Chiraag Kaushik1, Eva L Dyer∗,1,2, Vidya Muthukumar∗,1,3
Abstract
Data augmentation (DA) is a powerful workhorse for bolstering performance in modern machine learning.
Specific augmentations like translations and scaling in computer vision are traditionally believed to
improve generalization by generating new (artificial) data from the same distribution. However, this
traditional viewpoint does not explain the success of prevalent augmentations in modern machine
learning (e.g., randomized masking, cutout, mixup) that greatly alter the training data distribution.
In this work, we develop a new theoretical framework to characterize the impact of a general class
of DA on underparameterized and overparameterized linear model generalization. Our framework
reveals that DA induces implicit spectral regularization through a combination of two distinct effects:
a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-
dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through
ridge regression. These effects, when applied to popular augmentations, give rise to a wide variety of
phenomena, including discrepancies in generalization between over-parameterized and under-parameterized
regimes and differences between regression and classification tasks. Our framework highlights the nuanced
and sometimes surprising impacts of DA on generalization, and serves as a testbed for novel augmentation
design.
1 Introduction
Data augmentation (DA), or the transformation of data samples before or during learning, is a workhorse
of both supervised [65, 37, 46] and self-supervised approaches [29, 19, 31, 3, 76] for machine learning (ML).
It is critical to the success of modern ML in multiple domains, e.g., computer vision [65], natural language
processing [28], time series data [71], and neuroscience [42, 3, 45]. This is especially true in settings where
data and/or labels are scarce or in other cases where algorithms are prone to overfitting [77]. While DA is
perhaps one of the most widely used tools for regularization, most augmentations are applied in an ad hoc
manner, and it is often unclear exactly how, why, and when a DA strategy will work for a given dataset [21,
59, 4].
Recent theoretical studies have provided insights into the effect of DA on learning and generalization when
augmented samples lie close to the original data distribution [23, 18]. However, state-of-the-art augmentations
that are used in practice (e.g. data masking [35], cutout [26], mixup [78]) are stochastic and can significantly
alter the distribution of the data [30, 35, 75]. Despite many efforts to explain the success of DA in the
literature [9, 16, 18, 23, 73], there is still a lack of a comprehensive platform to compare different types of
augmentations at a quantitative level.
In this paper, we address this challenge by proposing a simple yet flexible theoretical framework that
precisely characterizes the impact of DA on generalization. Our framework enables generalization analysis
for: 1. general stochastic augmentations, 2. the classical underparameterized regime [34] and the modern
∗Both senior authors contributed equally.
1School of Electrical and Computer Engineering, Georgia Institute of Technology, GA, USA
2Department of Biomedical Engineering, Georgia Institute of Technology, GA, USA
3School of Industrial and Systems Engineering, Georgia Institute of Technology, GA, USA
4Samsung Research America
arXiv:2210.05021v3 [cs.LG] 27 Feb 2024
overparameterized regime, 3. regression and classification tasks, and 4. strong and weak distributional-shift
augmentations. To do this, we borrow and build on finite-sample analysis techniques that simultaneously
operate in the underparameterized and overparameterized regime for linear and kernel models [6, 68, 52, 53].
We find that DA induces two types of implicit, training-data-dependent regularization: manipulation of the spectrum (i.e., eigenvalues) of the data covariance matrix, and the addition of explicit ℓ2-type regularization to avoid noise overfitting. The first effect of spectral manipulation can either make or break generalization by introducing helpful or harmful biases. In contrast, the explicit ℓ2 regularization effect always improves generalization by preventing possibly harmful overfitting of noise.
Our theory reveals good, bad, and ugly sides to DA depending on the setting, nature of task and type
of augmentation. We find that on one hand, DA improves generalization when it is designed in a targeted
manner to reduce variance while preserving bias (for any setting/task) or if the reduction in variance
outweighs increase in bias (for classification or underparameterized regression). On the other hand, DA
is more unforgiving for overparameterized regression; here, we find that popular augmentations frequently
induce a large increase in both bias and distribution shift between training and test data. We also identify
several ugly (i.e. subtle/nuanced) features to DA depending on whether the task is regression or classification,
the model is underparameterized or overparameterized, and the augmentations are pre-computed or applied
on-the-fly.
1.1 Main contributions
Below, we outline and provide a roadmap of the main contributions of this work.
We propose a new framework for studying non-asymptotic generalization with data augmentation for
linear models by building on the recent literature on the theory of overparameterized learning [6, 68, 52].
We provide natural definitions of the augmentation mean and covariance operators that capture the
impact of change in data distribution on model generalization in Section 3.1, and sharply characterize
the ensuing performance for both regression and classification tasks in Sections 4.3 and 4.4, respectively.
In Section 5, we apply our theory to provide new interpretations of a broad class of randomized DA
strategies used in practice; e.g., random-masking [35], cutout [26], noise injection [9], and group-invariant
augmentations [18]. An example is as follows: while the classical noise injection augmentation [9] causes
only a constant shift in the spectrum, data masking [35, 2], cutout [26] and distribution-preserving
augmentations [18] tend to isotropize the equivalent data spectrum. We also use our framework as
a testbed for new approaches by designing a new augmentation method, inspired by isometries in
random feature rotation (Section 7.1). We show that this augmentation achieves smaller bias than the
least-squares estimator and variance reduction on the order of the ridge estimator.
In Section 6 we empirically examine the influence of DA in conjunction with data and model family on
generalization. We compare our closed-form expression with augmented stochastic gradient descent
(SGD) [23, 18, 19] and pre-computed augmentations [73, 64]. In addition to verifying our theoretical
insights, our experiments reveal phenomena of independent interest, including surprising distinctions
between pre-computed DA and augmented SGD and varying robustness to augmentation hyperparameter
tuning between regression and classification tasks.
We conclude in Section 7 with an extended discussion of the “good, bad and ugly” sides of DA. In
Section 7.1 we discuss how cleverly designed data-adaptive covariance modification can reduce both bias
and variance, and how a broader class of DA leads to variance reduction that outweighs bias increase
for classification and underparameterized regression tasks. In Section 7.2 we unpack the suboptimalities
of the “isotropizing” effect of DA, particularly in overparameterized regression where bias is especially
harmful [53, 33]. Finally, in Section 7.3, we identify strikingly divergent impacts of DA depending
on whether the task is regression or classification, the model is under or overparameterized, and the
augmentation is pre-computed or applied to SGD. Our findings here corroborate the empirically observed
benefits of DA being primarily applied “on-the-fly”, on moderate-dimensional data and classification
tasks [75, 22].
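To make the augmentations referenced above concrete, the following NumPy sketch gives minimal single-sample implementations of three of them. The exact parameterizations (e.g., whether random masking rescales by the keep probability, or how cutout chooses its block) are our illustrative assumptions, not the definitions analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_injection(x, sigma=0.1):
    """Add isotropic Gaussian noise to a sample (classical augmentation [9])."""
    return x + sigma * rng.standard_normal(x.shape)

def random_mask(x, keep_prob=0.8):
    """Zero out each coordinate independently; rescaling by keep_prob keeps the
    augmented sample unbiased (an inverted-dropout-style convention we assume)."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

def cutout(x, width=3):
    """Zero out a contiguous block of `width` coordinates at a random offset."""
    x = x.copy()
    start = rng.integers(0, len(x) - width + 1)
    x[start:start + width] = 0.0
    return x
```

Applied on-the-fly, a fresh draw of one of these maps is composed with each training sample at every optimization step.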
1.2 Notation
We use n to denote the number of training examples and p to denote the data dimension. Given a training data matrix X ∈ ℝ^{n×p} whose rows (each representing a training example) are independent and identically distributed (i.i.d.) with covariance Σ := E[xx^⊤], we denote by P^Σ_{1:k-1} and P^Σ_{k:} the projection matrices onto the top k−1 and the bottom p−k+1 eigen-subspaces of Σ, respectively. For convenience, we denote the residual Gram matrix by A_k(X; λ) = λI_n + X P^Σ_{k:} X^⊤, where λ is a regularization constant. Subscripts denote subsets of column vectors when applied to a matrix; e.g., for a matrix V we have V_{a:b} := [v_a, v_{a+1}, …, v_b]. A similar definition applies to vectors; e.g., for a vector x we have x_{a:b} = [x_a, x_{a+1}, …, x_b]. The Mahalanobis norm of a vector is defined by ‖x‖_H = √(x^⊤Hx). For a matrix A, diag(A) denotes the diagonal matrix whose diagonal equals that of A, Tr(A) denotes its trace, and μ_i(A) its i-th largest eigenvalue. The symbols ≲ and ≳ denote inequality relations that hold up to universal constants that may depend only on σ_x or σ_ε, and not on n or p. All asymptotic convergence results are stated in probability.
More specific notation corresponding to our signal model is given in Section 4.1, and some additional
notation that is convenient to define for our analysis is postponed to Section 4.2.
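As a concrete illustration of these definitions, they can be computed in a few lines of NumPy. The setup below (a diagonal covariance with Gaussian rows, and the particular values of n, p, k, λ) is a hypothetical example of ours, not the paper's signal model:

```python
import numpy as np

# Illustrative computation of the notation above.
n, p, k = 20, 8, 3
rng = np.random.default_rng(0)
Sigma = np.diag(np.arange(p, 0, -1.0))       # covariance with spectrum p, p-1, ..., 1
X = rng.standard_normal((n, p)) * np.sqrt(np.diag(Sigma))  # i.i.d. rows with covariance Sigma

eigvals, V = np.linalg.eigh(Sigma)
V = V[:, ::-1]                               # reorder eigenvectors: largest eigenvalue first
P_top = V[:, :k - 1] @ V[:, :k - 1].T        # projection onto top k-1 eigen-subspace of Sigma
P_bot = V[:, k - 1:] @ V[:, k - 1:].T        # projection onto bottom p-k+1 eigen-subspace

lam = 0.5
A_k = lam * np.eye(n) + X @ P_bot @ X.T      # residual Gram matrix A_k(X; lam)

def mahalanobis_norm(x, H):
    """The norm ||x||_H = sqrt(x^T H x)."""
    return float(np.sqrt(x @ H @ x))
```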
2 Related Work
We organize our discussion of related work into two verticals: a) historical and recent perspectives on the role
of data augmentation, and b) recent analyses of minimum-norm and ridge estimators in the over-parameterized
regime.
2.1 Data augmentation
Classical links between DA and regularization: Early analysis of DA showed that adding random
Gaussian noise to data points is equivalent to Tikhonov regularization [9] and vicinal risk minimization [78,
16]; in the latter, a local distribution is defined in the neighborhood of each training sample, and new
samples are drawn from these local distributions to be used during training. These results established an
early link between augmentation and explicit regularization. However, the impact of such approaches on
generalization has been mostly studied in the underparameterized regime of ML, where the primary concern
is reducing variance and avoiding overfitting of noise. Modern ML practices, by contrast, have achieved great
empirical success in overparameterized settings and with a broader range of augmentation strategies [65, 37,
46]. The type of regularization that is induced by these more general augmentation strategies is not well
understood. Our work provides a systematic point of view to study this general connection without assuming
any additional explicit regularization, or specific operating regime.
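The classical equivalence cited above [9] can be checked numerically: in expectation over the injected noise, the squared loss on Gaussian-noise-augmented inputs equals the unaugmented loss plus a ridge penalty nσ²‖θ‖². The Monte Carlo sketch below is our own construction (arbitrary sizes and noise scale), not an experiment from the paper:

```python
import numpy as np

# Check that Gaussian noise injection acts as Tikhonov (ridge) regularization
# in expectation:
#   E_N ||y - (X + N) theta||^2 = ||y - X theta||^2 + n sigma^2 ||theta||^2,
# where N has i.i.d. N(0, sigma^2) entries.
rng = np.random.default_rng(1)
n, p, sigma = 50, 5, 0.3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
theta = rng.standard_normal(p)

draws = 5000
N = sigma * rng.standard_normal((draws, n, p))               # noise draws, batched
aug_loss = np.mean(np.sum((y - (X + N) @ theta) ** 2, axis=1))  # Monte Carlo average
ridge_loss = np.sum((y - X @ theta) ** 2) + n * sigma ** 2 * np.sum(theta ** 2)
print(abs(aug_loss - ridge_loss) / ridge_loss)               # small relative gap
```

The cross term between the residual and the noise vanishes in expectation, which is why only the penalty term survives.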
In-distribution versus out-of-distribution augmentations: Intuitively, if we could design an augmen-
tation that would produce more virtual but identically distributed samples of our data, we would expect an
improvement in generalization. Based on this insight and the inherent structure of many augmentations used
in vision (that have symmetries), another set of works explores the common intuition that data augmentation
helps insert beneficial group-invariances into the learning process [20, 58, 51, 11, 74]. These studies generally
consider cases in which the group structure is explicitly present in the model design via convolutional
architectures [20, 11] or feature maps approximating group-invariant kernels [58, 51]. The authors of [18]
propose a general group-theoretic framework for DA and explain that an averaging effect helps the model
generalize through variance reduction. However, they only consider augmentations that do not alter (or
alter by minimal amounts) the original data distribution; consequently, they identify variance reduction as
a sole positive effect of DA. Moreover, their analysis applies primarily to underparameterized or explicitly
regularized models1.
1More recent studies of invariant kernel methods, trained to interpolation, suggest that invariance could either improve [48] or worsen [27] generalization depending on the precise setting. Our results for the overparameterized linear model (in particular, Corollary 9) also support this message.
Recent empirical studies have highlighted the importance of diverse stochastic augmentations [30]. They
argue that in many cases, it is important to introduce samples which are out-of-distribution (OOD) [66,
57] (in the sense that they do not resemble the original data). In our framework, we allow for cases in
which augmentation leads to significant changes in distribution and provide a path to analysis for such OOD
augmentations that encompass empirically popular approaches for DA [35, 26]. We also consider the modern
overparameterized regime [7, 24]. We show that the effects of OOD augmentations go far beyond variance
reduction, and the spectral manipulation effect introduces interesting biases that can either improve or worsen
generalization for overparameterized models.
Analysis of specific types of DA in linear and kernel methods: [23] propose a Markov process-based
framework to model compositional DA and demonstrate an asymptotic connection between a Bayes-optimal
classifier and a kernel classifier dependent on DA. Furthermore, they study the augmented empirical risk
minimization procedure and show that some types of DA, implemented in this way, induce approximate data-
dependent regularization. However, unlike our work, they do not quantitatively study the generalization of
these classifiers. [44] also propose a kernel classifier based on a notion of invariance to local translations, which
produces competitive empirical performance. In another recent analysis, [73] study the generalization of linear
models with DA that constitutes linear transformations on the data for regression in the overparameterized
regime (but still considering additional explicit regularization). They find that data augmentation can
enlarge the span of training data and induce regularization. There are several key differences between their
framework and ours. First, they analyze deterministic DA, while we analyze stochastic augmentations used in
practice [31, 18]. Second, they assume that the augmentations would not change the labels generated by the
ground-truth model, thereby only identifying beneficial scenarios for DA (while we identify scenarios that are
both helpful and harmful). Third, they study empirical risk minimization with pre-computed augmentations,
in contrast to our study of augmentations applied on-the-fly during the optimization process [23, 18], which
are arguably more commonly used in practice. Our experiments in Section 6.4 identify sizably different
impacts of these methods of application of DA even in simple linear models. Finally, the role of DA in linear
model optimization, rather than generalization, has also been recently studied; in particular, [32] characterize
how DA affects the convergence rate of optimization.
The impact of DA on nonlinear models: Recent works aim to understand the role of DA in
nonlinear models such as neural networks. [43] show that certain local augmentations induce regularization
in deep networks via a “rugosity”, or “roughness” complexity measure. While they show empirically that DA
reduces rugosity, they leave open the question of whether this alone is an appropriate measure of a model’s
generalization capability. Very recently, [64] showed that training a two-layer convolutional neural network
with a specific permutation-style augmentation can have a novel feature manipulation effect. Assuming the
recently posited “multi-view” signal model [1], they show that this permutation-style DA enables the model
to better learn the essential feature for a classification task. They also observe that the benefit becomes more
pronounced for nonlinear models. Our work provides a similar message, as we also identify the DA-induced
data manipulation effect as key to generalization. However, we provide a comprehensive general-purpose
framework for DA by which we can compare and contrast different augmentations that can either help or
hurt generalization, while [64] only analyze a permutation-style augmentation. We believe that combining
our general-purpose framework for DA with a more complex nonlinear model is a promising future direction,
and we discuss possible analysis paths for this in Section 8.
2.2 Interpolation and regularization in overparameterized models
Minimum-norm-interpolation analysis: Our technical approach leverages recent results in overparame-
terized linear regression, where models are allowed to interpolate the training data. Following the definition
of [24], we characterize such works by their explicit focus on models that achieve close to zero training
loss and which have a high complexity relative to the number of training samples. Specifically, many of
these works provide finite-sample analysis of the risk of the least-squares estimator (LSE) and the ridge
estimator [6, 68, 33, 8, 53]. This line of research (most notably, [6, 68]) finds that the mean squared error
(MSE), comprising the bias and variance, can be characterized in terms of the effective ranks of the spectrum
of the data distribution. The main insight is that, contrary to traditional wisdom, perfect interpolation of the
data may not have a harmful effect on the generalization error in highly overparameterized models. In the
context of these advances, we identify the principal impact of DA as spectral manipulation which directly
modifies the effective ranks, thus either improving or worsening generalization. We build in particular on the
work of [68], who provide non-asymptotic characterizations of generalization error for general sub-Gaussian
design, with some additional technical assumptions that also carry over to our framework2.
Subsequently, this type of “harmless interpolation” was shown to occur for classification tasks [52, 14, 70,
17, 63, 25, 50]. In particular, [52, 63] showed that classification can be significantly easier than regression
due to the relative benignness of the 0-1 test loss. Our analysis also compares classification and regression
and shows that the potentially harmful biases generated by DA are frequently nullified with the 0-1 metric.
As a result, we identify several beneficial scenarios for DA in classification tasks. At a technical level, we
generalize the analysis of [52] to sub-Gaussian design. We also believe that our framework can be combined
with the alternative mixture model (where covariates are generated from discrete labels [17, 70, 14]), but we
do not formally explore this path in this paper.
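One intuition behind this regression-classification gap can be illustrated directly (this toy example is ours, not the paper's analysis): the 0-1 loss depends only on the sign of the prediction, so an estimator whose error is a positive rescaling of the signal, e.g., the shrinkage induced by ridge-like implicit regularization, can classify perfectly while incurring large regression MSE.

```python
import numpy as np

# A heavily shrunken estimator has large regression MSE but identical 0-1
# classification error, since sign(c * <theta, x>) = sign(<theta, x>) for c > 0.
rng = np.random.default_rng(3)
p = 20
theta_star = rng.standard_normal(p)             # ground-truth parameter
theta_shrunk = 0.1 * theta_star                 # biased toward zero by a factor of 10

X_test = rng.standard_normal((1000, p))
y_clean = X_test @ theta_star                   # noiseless regression targets

mse = np.mean((X_test @ theta_shrunk - y_clean) ** 2)                 # large
err01 = np.mean(np.sign(X_test @ theta_shrunk) != np.sign(y_clean))  # zero
print(mse, err01)
```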
Generalized ℓ2 regularizer analysis: Our framework extends the analyses of least squares and ridge regression to estimators with general Tikhonov regularization, i.e., a penalty of the form θ^⊤Mθ for arbitrary
positive definite matrix M. A closely related work is [72], which analyzes the regression generalization error
of general Tikhonov regularization. However, our work differs from theirs in three key respects. First, the
analysis of [72] is based on the proportional asymptotic limit (where the sample size n and data dimension p increase proportionally with a fixed ratio) and provides sharp asymptotic formulas for regression error that are exact, but not closed-form and not easily interpretable. On the other hand, our framework is non-asymptotic, and we generally consider p ≪ n or p ≫ n; our expressions are closed-form, match up to universal constants
and are easily interpretable. Second, our analysis allows for a more general class of random regularizers that
themselves depend on the training data; a key technical innovation involves showing that the additional effect
of this randomness is, in fact, minimal. Third, we do not explicitly consider the problem of determining an
optimal regularizer; instead, we compare and contrast the generalization characteristics of various types of
practical augmentations and discuss which characteristics lead to favorable performance.
In addition to explicitly regularized estimators, [72] also analyze the ridgeless limit for these regularizers,
which can be interpreted as the minimum-Mahalanobis-norm interpolator. In Section 6.1 we show that such
estimators can also be realized in the limit of minimal DA.
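For intuition, the minimum-Mahalanobis-norm interpolator admits the standard closed form θ̂ = M⁻¹X^⊤(XM⁻¹X^⊤)⁻¹y (a textbook Lagrangian calculation, not specific to [72]). The sketch below, with arbitrary illustrative dimensions and a diagonal M of our choosing, verifies that it interpolates exactly:

```python
import numpy as np

# Minimum-Mahalanobis-norm interpolator:
#   argmin_theta theta^T M theta   subject to   X theta = y,
# with closed form theta_hat = M^{-1} X^T (X M^{-1} X^T)^{-1} y.
rng = np.random.default_rng(2)
n, p = 10, 30                                   # overparameterized: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
M = np.diag(rng.uniform(0.5, 2.0, size=p))      # positive definite regularizer

M_inv = np.linalg.inv(M)
theta_hat = M_inv @ X.T @ np.linalg.solve(X @ M_inv @ X.T, y)
print(np.max(np.abs(X @ theta_hat - y)))        # near machine precision
```

With M = I this reduces to the minimum-ℓ2-norm interpolator X⁺y, i.e., the ridgeless least-squares limit.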
The role of explicit regularization and hyperparameter tuning: Research on harmless interpola-
tion and double descent [7] has challenged conventional thinking about regularization and overfitting for
overparameterized models; in particular, good performance can be achieved with weak (or even negative)
explicit regularization [40, 68], and gradient descent trained to interpolation can sometimes beat ridge
regression [60]. These results show that the scale of the ridge regularization significantly affects model
generalization; consequently, recent work strives to estimate the optimal scale of ridge regularization using
cross-validation techniques [56, 55].
As shown in classical work [9], ridge regularization is equivalent to augmentation with (isotropic) Gaussian
noise, and the scale of regularization naturally maps to the variance of Gaussian noise augmentation. Our
work links DA to a much more flexible class of regularizers and shows that some types of DA induce an
implicit regularization that yields much more robust performance across the hyperparameter(s) dictating the
“strength” of the augmentation. In particular, our experiments in Section 6.2 show that random mask [35],
cutout [26] and our new random rotation augmentation yield comparable generalization error for a wide
range of hyperparameters (masking probability, cutout width and rotation angle respectively); the random
rotation is a new augmentation proposed in this work and frequently beats ridge regularization as well as
interpolation. Thus, our flexible framework enables the discovery of DA with appealing robustness properties
not present in the more basic methodology of ridge regularization.
2As remarked at various points throughout the paper, we believe that the subsequent and very recent work of [47], which weakens these assumptions further, can also be plugged into our analysis framework; we will explore this in the sequel.