
distribution on $x$ with a kernel function, such that when the sample size $N$ is sufficiently large, with probability $1-o(1)$,
$$\mathrm{MSE}(\theta^*_{\text{C-Mixup}}) < \min\big(\mathrm{MSE}(\theta^*_{\text{feat}}),\ \mathrm{MSE}(\theta^*_{\text{mixup}})\big). \tag{8}$$
The high-level intuition for why C-Mixup helps is that vanilla mixup imposes a linearity regularization on the relationship between features and responses. When this relationship is strongly nonlinear and one-to-one, such regularization hurts generalization; C-Mixup mitigates this effect because it only mixes examples with similar labels, so the interpolation stays local, where the relationship is approximately linear.
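To make this sampling mechanism concrete, the following minimal sketch (our own illustration, not the authors' released code; the bandwidth and data shapes are arbitrary assumptions) contrasts vanilla mixup with C-Mixup-style partner sampling, where the partner is drawn with probability proportional to a Gaussian kernel on the label distance:

```python
import numpy as np

rng = np.random.default_rng(0)

def cmixup_pair_probs(y, i, bandwidth=1.0):
    """C-Mixup sampling distribution over mixing partners for anchor i:
    P(j) ∝ exp(-(y_i - y_j)^2 / (2 * bandwidth^2)), a Gaussian kernel on labels."""
    w = np.exp(-((y - y[i]) ** 2) / (2.0 * bandwidth ** 2))
    w[i] = 0.0                      # never mix an example with itself
    return w / w.sum()

x = rng.normal(size=(256, 8))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=256)   # strongly nonlinear response

i = 0
j_cmix = rng.choice(len(y), p=cmixup_pair_probs(y, i, bandwidth=0.2))  # close-label partner
j_vanilla = rng.choice(len(y))                                         # uniform partner (mixup)

lam = rng.beta(2.0, 2.0)            # mixing coefficient from Beta(alpha, alpha)
x_mix = lam * x[i] + (1 - lam) * x[j_cmix]
y_mix = lam * y[i] + (1 - lam) * y[j_cmix]
print(abs(y[i] - y[j_cmix]), abs(y[i] - y[j_vanilla]))  # C-Mixup label gap is typically smaller
```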
4.2 C-Mixup for Improving Task Generalization
The second benefit of C-Mixup is improving task generalization in meta-learning when the data from
each task follows the model discussed in the last section. Concretely, we apply C-Mixup to MetaMix
[74]. For each query example, the support example with a more similar label has a higher
probability of being mixed. The algorithm of C-Mixup on MetaMix is summarized in Appendix A.2.
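A minimal sketch of this pairing step is given below; it is our own illustration of the idea (the full algorithm is in Appendix A.2), and the function and argument names (`metamix_cmixup_batch`, `bandwidth`, etc.) are placeholders rather than the paper's implementation. Each query example is mixed with a support example sampled in proportion to a Gaussian kernel on the label distance:

```python
import numpy as np

def metamix_cmixup_batch(support_x, support_y, query_x, query_y,
                         bandwidth=1.0, alpha=2.0, rng=None):
    """For every query example, sample a support partner with probability
    proportional to exp(-(y_q - y_s)^2 / (2*bandwidth^2)), then apply mixup."""
    rng = rng or np.random.default_rng()
    mixed_x, mixed_y = [], []
    for xq, yq in zip(query_x, query_y):
        w = np.exp(-(support_y - yq) ** 2 / (2.0 * bandwidth ** 2))
        p = w / w.sum()                          # label-similarity sampling distribution
        j = rng.choice(len(support_y), p=p)      # support example with a close label
        lam = rng.beta(alpha, alpha)
        mixed_x.append(lam * xq + (1 - lam) * support_x[j])
        mixed_y.append(lam * yq + (1 - lam) * support_y[j])
    return np.stack(mixed_x), np.array(mixed_y)
```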
Similar to the in-distribution generalization analysis, we consider the following data generative model: for the $m$-th task ($m\in[M]$), we have $(x^{(m)}, y^{(m)})\sim \mathcal{T}_m$ with
$$y^{(m)}=g_m(\theta^\top z^{(m)}) + \epsilon \quad\text{and}\quad x^{(m)}=z^{(m)}+\xi^{(m)}. \tag{9}$$
Here, $\theta$ denotes the globally-shared representation, and the $g_m$'s are the task-specific transformations.
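For concreteness, a toy generator following Eq. (9) might look as follows; the dimensions, noise scales, and the softplus-style choices of $g_m$ are illustrative assumptions on our part, not values from the paper:

```python
import numpy as np

def sample_task(theta, g_m, n, sigma_xi=0.1, sigma_eps=0.1, rng=None):
    """Draw n examples from one task of the model in Eq. (9):
    y = g_m(theta^T z) + eps,  x = z + xi."""
    rng = rng or np.random.default_rng()
    d = theta.shape[0]
    z = rng.normal(size=(n, d))                    # latent features
    x = z + sigma_xi * rng.normal(size=(n, d))     # observed features (noisy z)
    y = g_m(z @ theta) + sigma_eps * rng.normal(size=n)
    return x, y

rng = np.random.default_rng(0)
theta = rng.normal(size=16)                        # globally-shared representation
# task-specific transformations g_m (softplus-like: increasing and convex)
g_list = [lambda t, a=a: np.log1p(np.exp(a * t)) for a in (0.5, 1.0, 1.5)]
tasks = [sample_task(theta, g_m, n=200, rng=rng) for g_m in g_list]
```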
Note that this formulation is close to [54] and is widely used in theoretical analyses of meta-learning [74, 77]. Following a similar spirit to [62] and the last section, we obtain the estimate of $\theta$ as
$$\theta^* = \arg\min_{\theta}\ \frac{1}{M}\sum_{m=1}^{M} \mathbb{E}_{(\tilde{x}^{(m)},\tilde{y}^{(m)})\in \hat{\mathcal{D}}_m}\big[(\tilde{y}^{(m)}-\theta^\top \tilde{x}^{(m)})^2\big].$$
Here, $\hat{\mathcal{D}}_m$ denotes the generic dataset augmented by different approaches, including vanilla MetaMix, MetaMix with input feature similarity, and C-Mixup. We denote the corresponding estimators by $\theta^*_{\text{MetaMix}}$, $\theta^*_{\text{Meta-feat}}$, and $\theta^*_{\text{Meta-C-Mixup}}$, respectively. For a new task $\mathcal{T}_t$, we again use the standard nonparametric kernel estimator to estimate $g_t$ via the augmented target data. We then consider the following error metric:
$$\mathrm{MSE}_{\text{Target}}(\theta^*) = \mathbb{E}_{(x,y)\sim \mathcal{T}_t}\big[(y-\hat{g}_t(\theta^{*\top} x))^2\big].$$
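The sketch below illustrates this evaluation pipeline as we read it: $\theta^*$ is fit by pooled least squares over the augmented task data, $\hat{g}_t$ is a Nadaraya-Watson kernel estimator on the augmented target data, and $\mathrm{MSE}_{\text{Target}}$ is the held-out squared error. It is an illustrative reconstruction rather than the authors' code, and the bandwidth and data layout are assumptions:

```python
import numpy as np

def fit_theta(augmented_tasks):
    """Pooled least-squares estimate of theta over all augmented tasks,
    i.e., the minimizer of the averaged squared loss in the display above."""
    X = np.vstack([x for x, _ in augmented_tasks])
    Y = np.concatenate([y for _, y in augmented_tasks])
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def kernel_regressor(t_train, y_train, bandwidth=0.3):
    """Nadaraya-Watson estimator of g_t on the 1-D index t = theta^T x."""
    def g_hat(t):
        w = np.exp(-((t_train - t) ** 2) / (2.0 * bandwidth ** 2))
        return (w @ y_train) / (w.sum() + 1e-12)
    return g_hat

def mse_target(theta_star, x_aug, y_aug, x_test, y_test):
    """MSE_Target: fit g_hat on augmented target data, evaluate on fresh target draws."""
    g_hat = kernel_regressor(x_aug @ theta_star, y_aug)
    preds = np.array([g_hat(t) for t in x_test @ theta_star])
    return np.mean((y_test - preds) ** 2)
```

Running this pipeline with $\hat{\mathcal{D}}_m$ produced by MetaMix, MetaMix with feature similarity, and C-Mixup yields the three quantities compared in the theorem below.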
Based on this metric, we present the following theorem, which shows the promise of C-Mixup in improving task generalization (see Appendix B.2 for the detailed proof): C-Mixup achieves a smaller $\mathrm{MSE}_{\text{Target}}$ than both vanilla MetaMix and MetaMix with input feature similarity.
Theorem 2. Let $N=\sum_{m=1}^{M} N_m$, where $N_m$ is the number of examples of $\mathcal{T}_m$. Suppose $\theta$ is sparse with sparsity $s=o(\min\{d, \sigma_\xi^2\})$, $p=o(N)$, and the $g_m$'s are smooth with $0< g'_m< c_1$, $c_2< g''_m< c_3$ for some universal constants $c_1, c_2, c_3>0$ and $m\in[M]\cup\{t\}$. There exists a distribution on $x$ with a kernel function, such that when the sample size $N$ is sufficiently large, with probability $1-o(1)$,
$$\mathrm{MSE}_{\text{Target}}(\theta^*_{\text{Meta-C-Mixup}}) < \min\big(\mathrm{MSE}_{\text{Target}}(\theta^*_{\text{Meta-feat}}),\ \mathrm{MSE}_{\text{Target}}(\theta^*_{\text{MetaMix}})\big). \tag{10}$$
4.3 C-Mixup for Improving Out-of-distribution Robustness
Finally, we show that C-Mixup improves OOD robustness in the covariate shift setting, where some unrelated features vary across different domains. In this setting, we regard the entire data distribution as consisting of $\mathcal{E}=\{1, \ldots, E\}$ domains, where each domain $e\in\mathcal{E}$ is associated with a data distribution $P_e$. Given a set of training domains $\mathcal{E}_{tr} \subseteq \mathcal{E}$, we aim to make the trained model generalize well to an unseen test domain $\mathcal{E}_{ts}$ that is not necessarily in $\mathcal{E}_{tr}$. Here, we focus on covariate shift, i.e., the change of $P_e$ across domains is caused only by the change of the marginal distribution $P_e(X)$, while the conditional distribution $P_e(Y|X)$ is fixed across domains.
To overcome covariate shift, C-Mixup mixes examples with close labels without considering domain information, which effectively averages out domain-changeable correlations and makes the predictions rely on the invariant causal features. To further understand how C-Mixup improves robustness to covariate shift, we provide the following theoretical analysis.
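Before the formal setup, a toy numerical sketch of this averaging-out intuition may help; it is our own illustration, with arbitrary dimensions and noise levels, using the paired-domain data model and ridge estimator defined in the remainder of this section. Mixing each example with its closest-label partner from the opposite domain shrinks the domain-changeable coordinates toward zero while preserving the invariant ones, so the ridge fit cannot rely on them:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, sigma = 500, 5, 5, 0.05
theta = np.concatenate([rng.normal(size=p1), np.zeros(p2)])  # last p2 coordinates are 0

# Domain 1: x_i = (z_i; a_i).  Domain 2: near-identical z, sign-flipped a.
z = rng.normal(size=(n, p1)); a = rng.normal(size=(n, p2))
x1 = np.hstack([z, a])
x2 = np.hstack([z + sigma * rng.normal(size=(n, p1)),
                -a + sigma * rng.normal(size=(n, p2))])
y1 = x1 @ theta + sigma * rng.normal(size=n)
y2 = x2 @ theta + sigma * rng.normal(size=n)

# C-Mixup-style pairing: y1[i] and y2[i] are nearly equal (same z), so the
# closest-label partner across domains is (x2[i], y2[i]); mixing them averages
# the a-part toward zero while preserving the z-part.
lam = rng.beta(2.0, 2.0, size=(n, 1))
x_mix = lam * x1 + (1 - lam) * x2
y_mix = lam[:, 0] * y1 + (1 - lam[:, 0]) * y2
print(np.abs(x1[:, p1:]).mean(), np.abs(x_mix[:, p1:]).mean())   # a-part shrinks

def ridge(X, Y, k=1.0):
    """Ridge estimator theta*(k) = argmin_theta sum_i ||y_i - theta^T x_i||^2 + k ||theta||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(d), X.T @ Y)

theta_mix = ridge(x_mix, y_mix)
print(np.abs(theta_mix[p1:]).max())   # weight on domain-changeable coordinates stays small
```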
We assume the training data $(x_i, y_i)_{i=1}^n$ follows $x_i= (z_i;a_i)\in\mathbb{R}^{p_1+p_2}$ and $y_i=\theta^\top x_i+\epsilon_i$, where $z_i\in\mathbb{R}^{p_1}$ and $a_i\in\mathbb{R}^{p_2}$ are regarded as invariant and domain-changeable unrelated features, respectively, and the last $p_2$ coordinates of $\theta\in\mathbb{R}^{p_1+p_2}$ are 0. Now we consider the case where the training data consist of a pair of domains with almost identical invariant features and opposite domain-changeable features, i.e., $x_i= (z_i, a_i)$, $x'_i= (z'_i, a'_i)$, where $z_i\sim \mathcal{N}_{p_1}(0, \sigma_x^2 I_{p_1})$, $z'_i=z_i+\epsilon'_i$, $a_i\sim \mathcal{N}_{p_2}(0, \sigma_a^2 I_{p_2})$, $a'_i=-a_i+\epsilon''_i$. Here, $\epsilon_i, \epsilon'_i, \epsilon''_i$ are noise terms with mean 0 and sub-Gaussian norm bounded by $\sigma$. We use the ridge estimator $\theta^*(k) = \arg\min_\theta \big(\sum_i \|y_i-\theta^\top x_i\|^2+k\|\theta\|^2\big)$ to reflect the