C-Mixup: Improving Generalization in Regression
Huaxiu Yao1, Yiping Wang2, Linjun Zhang3, James Zou1, Chelsea Finn1
1Stanford University, 2Zhejiang University, 3Rutgers University
1{huaxiu,cbfinn}@cs.stanford.edu, jamesz@stanford.edu
2yipingwang6161@gmail.com, 3linjun.zhang@rutgers.edu
Abstract
Improving the generalization of deep networks is an important open challenge,
particularly in domains without plentiful data. The mixup algorithm improves
generalization by linearly interpolating a pair of examples and their corresponding
labels. These interpolated examples augment the original training set. Mixup has
shown promising results in various classification tasks, but systematic analysis of
mixup in regression remains underexplored. Using mixup directly on regression
labels can result in arbitrarily incorrect labels. In this paper, we propose a simple
yet powerful algorithm, C-Mixup, to improve generalization on regression tasks. In
contrast with vanilla mixup, which picks training examples for mixing with uniform
probability, C-Mixup adjusts the sampling probability based on the similarity of
the labels. Our theoretical analysis confirms that C-Mixup with label similarity obtains a smaller
mean square error in supervised regression and meta-regression than both vanilla mixup and mixup
based on feature similarity. Another benefit of C-Mixup is
that it can improve out-of-distribution robustness, where the test distribution is
different from the training distribution. By selectively interpolating examples
with similar labels, it mitigates the effects of domain-associated information and
yields domain-invariant representations. We evaluate C-Mixup on eleven datasets,
ranging from tabular to video data. Compared to the best prior approach, C-Mixup
achieves 6.56%, 4.76%, 5.82% improvements in in-distribution generalization, task
generalization, and out-of-distribution robustness, respectively. Code is released at
https://github.com/huaxiuyao/C-Mixup.
1 Introduction
Deep learning practitioners commonly face the challenge of overfitting. To improve generalization,
prior works have proposed a number of techniques, including data augmentation [3, 10, 12, 81, 82]
and explicit regularization [15, 38, 60]. Representatively, mixup [82, 83] densifies the data distribution
and implicitly regularizes the model by linearly interpolating the features of randomly sampled pairs
of examples and applying the same interpolation on the corresponding labels. Despite mixup having
demonstrated promising results in improving generalization in classification problems, it has rarely
been studied in the context of regression with continuous labels, on which we focus in this paper.
In contrast to classification, which formalizes the label as a one-hot vector, the goal of regression is
to predict a continuous label from each input. Directly applying mixup to input features and labels
in regression tasks may yield arbitrarily incorrect labels. For example, as shown in Figure 1(a),
ShapeNet1D pose prediction [18] aims to predict the current orientation of the object relative to
its canonical orientation. We randomly select three mixing pairs and show the mixed images and
labels in Figure 1(b), where only pair 1 exhibits reasonable mixing results. We thus see that sampling
mixing pairs uniformly from the dataset introduces a number of noisy pairs.
Equal contribution. This work was done when Yiping Wang was remotely co-mentored by Huaxiu Yao
and Linjun Zhang.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.05775v1 [cs.LG] 11 Oct 2022
Figure 1: Illustration of C-Mixup on ShapeNet1D pose prediction; λ represents the interpolation ratio.
(a) The ShapeNet1D pose prediction task, which aims to predict the current orientation of the object
relative to its canonical orientation. (b) Three mixing pairs are randomly picked (λ = 0.5); the interpolated
images are visualized and ỹ represents the interpolated labels (pair 1: y = 171 and 150, ỹ = 160.5;
pair 2: y = 150 and 346, ỹ = 248; pair 3: y = 346 and 171, ỹ = 258.5). (c) A rough comparison of the
sampling probabilities of the three mixing pairs in (b): vanilla mixup gives Prob(P1) = Prob(P2) = Prob(P3);
feature-based local mixup gives Prob(P1) ≈ Prob(P3) >> Prob(P2); C-Mixup (ours) gives
Prob(P1) >> Prob(P2) > Prob(P3). The Euclidean distance measures input feature distance, and the
corresponding distances between the examples in pairs 1, 2, 3 are 1.51 × 10^5, 1.82 × 10^5, 1.50 × 10^5,
respectively. Hence, pairs 1 and 3 have similar feature distances, leading to similar sampling probabilities
under feature-based mixing. C-Mixup is able to assign higher sampling probability to more reasonable
mixing pairs.
In this paper, we aim to adjust the sampling probability of mixing pairs according to the similarity
of examples, resulting in a simple training technique named C-Mixup. Specifically, we employ a
Gaussian kernel to calculate the sampling probability of drawing another example for mixing, where
closer examples are more likely to be sampled. Here, the core question is: how to measure the
similarity between two examples? The most straightforward solution is to compute input feature
similarity. Yet, using input similarity has two major downsides when dealing with high-dimensional
data such as images or time-series: substantial computational costs and lack of good distance metrics.
Specifically, it takes considerable time to compute pairwise similarities across all samples, and directly
applying classical distance metrics (e.g., Euclidean distance, cosine distance) does not reflect the high-
level relation between input features. In the ShapeNet1D rotation prediction example (Figure 1(a)),
pairs 1 and 3 have close input similarities, while only pair 1 can be reasonably interpolated.
To overcome these drawbacks, C-Mixup instead uses label similarity, which is typically much faster
to compute since the label space is usually low dimensional. In addition to the computational
advantages, C-Mixup benefits three kinds of regression problems. First, it empirically improves
in-distribution generalization in supervised regression compared to using vanilla mixup or using
feature similarity. Second, we extend C-Mixup to gradient-based meta-learning by incorporating
it into MetaMix, a mixup-based task augmentation method [74]. Compared to vanilla MetaMix,
C-Mixup empirically improves task generalization. Third, C-Mixup is well-suited for improving
out-of-distribution robustness without domain information, particularly to covariate shift (see the
corresponding example in Appendix A.1). By performing mixup on examples with close continuous
labels, examples from different domains are mixed. In this way, C-Mixup encourages the model to
rely on domain-invariant features to make predictions and ignore unrelated or spurious correlations,
making the model more robust to covariate shift.
The primary contribution of this paper is C-Mixup, a simple and scalable algorithm for improving
generalization in regression problems. In linear or monotonic non-linear models, our theoretical
analysis shows that C-Mixup improves generalization in multiple settings compared to vanilla mixup
or compared to using feature similarities. Moreover, our experiments thoroughly evaluate C-Mixup
on eleven datasets, including many large-scale real-world applications like drug-target interaction
prediction [28], ejection fraction estimation with echocardiogram videos [50], and poverty estimation with
satellite imagery [78]. Compared to the best prior method, the results demonstrate the promise of C-
Mixup with 6.56%, 4.76%, 5.82% improvements in in-distribution generalization, task generalization,
and out-of-distribution robustness, respectively.
2 Preliminaries
In this section, we define notation and describe the background of ERM and mixup in the supervised
learning setting, and MetaMix in the meta-learning setting for task generalization.
ERM. Assume a machine learning model $f$ with parameter space $\Theta$. In this paper, we consider the
setting where one predicts the continuous label $y \in \mathcal{Y}$ according to the input feature $x \in \mathcal{X}$. Given a
loss function $\ell$, we train a model $f_\theta$ under the empirical training distribution $P_{tr}$ with the following
objective, and obtain the optimized parameter $\theta^* \in \Theta$:
$$\theta^* \in \arg\min_{\theta \in \Theta} \mathbb{E}_{(x,y) \sim P_{tr}}\big[\ell(f_\theta(x), y)\big]. \quad (1)$$
Typically, we expect the model to perform well on unseen examples drawn from the test distribution
$P_{ts}$. We are interested in both in-distribution ($P_{tr} = P_{ts}$) and out-of-distribution ($P_{tr} \neq P_{ts}$) settings.
Mixup. The mixup algorithm samples a pair of instances $(x_i, y_i)$ and $(x_j, y_j)$ uniformly at random
from the training dataset and generates new examples by performing linear interpolation on the input
features and the corresponding labels:
$$\tilde{x} = \lambda \cdot x_i + (1 - \lambda) \cdot x_j, \qquad \tilde{y} = \lambda \cdot y_i + (1 - \lambda) \cdot y_j, \quad (2)$$
where the interpolation ratio $\lambda \in [0, 1]$ is drawn from a Beta distribution, i.e., $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. The
interpolated examples are then used to optimize the model as follows:
$$\theta^* \in \arg\min_{\theta \in \Theta} \mathbb{E}_{(x_i, y_i), (x_j, y_j) \sim P_{tr}}\big[\ell(f_\theta(\tilde{x}), \tilde{y})\big]. \quad (3)$$
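For concreteness, the interpolation of Eqn. (2) and the training objective of Eqn. (3) can be sketched in a few
lines of PyTorch. This is a minimal illustration rather than the authors' released implementation; the model,
loss, and hyperparameters below are placeholders.

import torch

def mixup_batch(x, y, alpha=2.0):
    # Vanilla mixup for regression (Eqn. (2)): pair each example in a batch with a
    # uniformly random partner by permuting the batch, then interpolate.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_tilde = lam * x + (1.0 - lam) * x[idx]
    y_tilde = lam * y + (1.0 - lam) * y[idx]
    return x_tilde, y_tilde

# Inside a standard training step (Eqn. (3)):
#   x_tilde, y_tilde = mixup_batch(x_batch, y_batch)
#   loss = torch.nn.functional.mse_loss(model(x_tilde), y_tilde)
#   loss.backward(); optimizer.step()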
Task Generalization and MetaMix. In this paper, we also investigate few-shot task generalization
under the gradient-based meta-regression setting. Given a task distribution $p(\mathcal{T})$, we assume each task
$\mathcal{T}_m$ is sampled from $p(\mathcal{T})$ and is associated with a dataset $\mathcal{D}_m$. A support set
$\mathcal{D}^s_m = \{(X^s_m, Y^s_m)\} = \{(x^s_{m,i}, y^s_{m,i})\}_{i=1}^{N_s}$ and a query set
$\mathcal{D}^q_m = \{(X^q_m, Y^q_m)\} = \{(x^q_{m,j}, y^q_{m,j})\}_{j=1}^{N_q}$ are sampled from $\mathcal{D}_m$.
Representatively, in model-agnostic meta-learning (MAML) [14], given a predictive model $f$ with
parameter $\theta$, it aims to learn an initialization $\theta$ from meta-training tasks $\{\mathcal{T}_m\}_{m=1}^{|M|}$. Specifically,
at the meta-training phase, MAML obtains the task-specific parameters $\phi_m$ for each task $\mathcal{T}_m$ by
performing a few gradient steps starting from $\theta$. Then, the corresponding query set $\mathcal{D}^q_m$ is used to
evaluate the performance of the task-specific model and optimize the model initialization as:
$$\theta := \arg\min_{\theta} \frac{1}{|M|} \sum_{m=1}^{|M|} \mathcal{L}(f_{\phi_m}; \mathcal{D}^q_m), \quad \text{where } \phi_m = \theta - \alpha \nabla_\theta \mathcal{L}(f_\theta; \mathcal{D}^s_m). \quad (4)$$
At the meta-testing phase, for each meta-testing task $\mathcal{T}_t$, MAML fine-tunes the learned initialization
$\theta$ on the support set $\mathcal{D}^s_t$ and evaluates the performance on the corresponding query set $\mathcal{D}^q_t$.
To improve task generalization, MetaMix [74] adapts mixup (Eqn. (3)) to meta-learning: it linearly
interpolates the support set and the query set and uses the interpolated set to replace the original
query set $\mathcal{D}^q_m$ in Eqn. (4). Specifically, the interpolated query set is formulated as:
$$\tilde{X}^q_m = \lambda X^s_m + (1 - \lambda) X^q_m, \qquad \tilde{Y}^q_m = \lambda Y^s_m + (1 - \lambda) Y^q_m, \quad (5)$$
where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$.
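As a rough sketch (not the MetaMix reference implementation), the interpolation of Eqn. (5) can be written
as follows in PyTorch, assuming the support and query sets of a task have been sampled to the same size; the
MAML inner/outer loop of Eqn. (4) is omitted. In the C-Mixup variant introduced later, the support partner of
each query example would instead be sampled according to label similarity (Eqn. (6)).

import torch

def metamix_query(x_support, y_support, x_query, y_query, alpha=2.0):
    # MetaMix interpolation (Eqn. (5)): mix the support set into the query set.
    # The result replaces the original query set D^q_m in the outer loss of Eqn. (4).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_query_tilde = lam * x_support + (1.0 - lam) * x_query
    y_query_tilde = lam * y_support + (1.0 - lam) * y_query
    return x_query_tilde, y_query_tilde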
3 Mixup for Regression (C-Mixup)
For continuous labels, the example in Figure 1(b) illustrates that applying vanilla mixup to the entire
distribution is likely to produce arbitrary labels. To resolve this issue, C-Mixup proposes to sample
closer pairs of examples with higher probability. Specifically, given an example $(x_i, y_i)$, C-Mixup
introduces a symmetric Gaussian kernel to calculate the sampling probability $P((x_j, y_j) \mid (x_i, y_i))$ of
drawing another example $(x_j, y_j)$ for mixing, as follows:
$$P((x_j, y_j) \mid (x_i, y_i)) \propto \exp\left(-\frac{d(i, j)}{2\sigma^2}\right), \quad (6)$$
where $d(i, j)$ represents the distance between the examples $(x_i, y_i)$ and $(x_j, y_j)$, and $\sigma$ is the kernel
bandwidth. For the example $(x_i, y_i)$, the set $\{P((x_j, y_j) \mid (x_i, y_i)) \mid \forall j\}$ is then normalized to a
probability mass function that sums to one.
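To make Eqn. (6) concrete, the sampling distribution can be precomputed once from the labels alone. Below is
a minimal NumPy sketch that uses the squared label distance d(i, j) = ||y_i − y_j||_2^2 adopted by C-Mixup
below; the bandwidth σ and the exclusion of self-pairs are choices of this sketch rather than prescriptions from
the paper.

import numpy as np

def cmixup_sampling_probs(y, sigma=1.0):
    # Pairwise sampling probabilities P((x_j, y_j) | (x_i, y_i)) from Eqn. (6),
    # computed from squared label distances. y has shape (N,) or (N, label_dim);
    # the result is an (N, N) matrix whose rows sum to one.
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    d = np.sum((y[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    p = np.exp(-d / (2.0 * sigma ** 2))       # symmetric Gaussian kernel
    np.fill_diagonal(p, 0.0)                  # optional: avoid pairing an example with itself
    return p / p.sum(axis=1, keepdims=True)   # normalize each row to a probability mass function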
Algorithm 1 Training with C-Mixup
Require: Learning rate $\eta$; shape parameter $\alpha$
Require: Training data $\mathcal{D} := \{(x_i, y_i)\}_{i=1}^{N}$
1: Randomly initialize model parameters $\theta$
2: Calculate the pairwise sampling probabilities $P$ via Eqn. (6)
3: while not converged do
4:   Sample a batch of examples $\mathcal{B} \sim \mathcal{D}$
5:   for each example $(x_i, y_i) \in \mathcal{B}$ do
6:     Sample $(x_j, y_j)$ from $P(\cdot \mid (x_i, y_i))$ and $\lambda$ from $\mathrm{Beta}(\alpha, \alpha)$
7:     Interpolate $(x_i, y_i)$ and $(x_j, y_j)$ to get $(\tilde{x}, \tilde{y})$ according to Eqn. (2)
8:   Use the interpolated examples to update the model via Eqn. (3)
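Putting Eqns. (2), (3), and (6) together, Alg. 1 corresponds to a training loop of the following form. This is a
schematic PyTorch re-implementation rather than the released code; cmixup_sampling_probs is the helper
sketched above, and the optimizer, number of epochs, and batch size are placeholder choices.

import numpy as np
import torch

def train_cmixup(model, x, y, probs, epochs=100, batch_size=64, lr=1e-3, alpha=2.0):
    # Alg. 1: for each example in a batch, sample a mixing partner by label
    # similarity (rows of `probs` follow Eqn. (6)), interpolate with Eqn. (2),
    # and minimize the mixup regression loss of Eqn. (3).
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    beta = torch.distributions.Beta(alpha, alpha)
    n = len(x)
    for _ in range(epochs):
        perm = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = list(perm[start:start + batch_size])
            partners = [np.random.choice(n, p=probs[i]) for i in batch]
            lam = beta.sample().item()
            x_tilde = lam * x[batch] + (1 - lam) * x[partners]               # Eqn. (2)
            y_tilde = lam * y[batch] + (1 - lam) * y[partners]
            loss = torch.nn.functional.mse_loss(model(x_tilde), y_tilde)     # Eqn. (3)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model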
One natural way to compute the distance is to use the input feature $x$, i.e., $d(i, j) = d(x_i, x_j)$.
However, when dealing with high-dimensional data such as images or videos, we lack good distance
metrics to capture structured feature information, and the distances can be easily influenced by feature
noise. Additionally, computing feature distances for high-dimensional data is time-consuming. Instead,
C-Mixup leverages the labels with $d(i, j) = d(y_i, y_j) = \|y_i - y_j\|_2^2$, where $y_i$ and $y_j$ are vectors with
continuous values. The dimension of the label is typically much smaller than that of the input feature,
therefore reducing computational costs (see more discussion about computational efficiency in
Appendix A.3). The overall algorithm of C-Mixup is described in Alg. 1, and we detail the difference
between C-Mixup and mixup in Appendix A.4. According to Alg. 1, C-Mixup assigns higher probabilities
to example pairs with closer continuous labels. In addition to its computational benefits, C-Mixup
improves generalization on three distinct kinds of regression problems – in-distribution generalization,
task generalization, and out-of-distribution robustness, which is theoretically and empirically justified
in the following sections.
4 Theoretical Analysis
In this section, we theoretically explain how C-Mixup benefits in-distribution generalization, task
generalization, and out-of-distribution robustness.
4.1 C-Mixup for Improving In-Distribution Generalization
In this section, we show that C-Mixup provably improves in-distribution generalization when the
features are observed with noise, and the response depends on a small fraction of the features in a
monotonic way. Specifically, we consider the following single index model with measurement error,
$$y = g(\theta^\top z) + \epsilon, \quad (7)$$
where $\theta \in \mathbb{R}^p$, $\epsilon$ is a sub-Gaussian random variable, and $g$ is a monotonic transformation. Since
images are often inaccurately observed in practice, we assume the feature $z$ is observed or measured
with noise, and denote the observed value by $x$: $x = z + \xi$, with $\xi$ being a random vector with
mean 0 and covariance matrix $\sigma_\xi^2 I$. We assume $g$ to be monotonic to model the nearly one-to-one
correspondence between causal features (e.g., the car pose in Figure 1(a)) and labels (rotation) in the
in-distribution setting. The out-of-distribution setting will be discussed in Section 4.3. We would like
to also comment that the single index model has been commonly used in econometrics, statistics, and
deep learning theory [19, 27, 47, 53, 73].
Suppose we have $\{(x_i, y_i)\}_{i=1}^{N}$ drawn i.i.d. from the above model. We first follow the single index
model literature (e.g., [73]) and estimate $\theta$ by minimizing the squared error $\sum_{i=1}^{N} (\tilde{y}_i - \tilde{x}_i^\top \theta)^2$, where
the $(\tilde{x}_i, \tilde{y}_i)$'s are the data augmented by vanilla mixup, by mixup with input feature similarity, or by
C-Mixup. We denote the corresponding solutions by $\hat{\theta}_{\mathrm{mixup}}$, $\hat{\theta}_{\mathrm{feat}}$, and $\hat{\theta}_{\mathrm{CMixup}}$, respectively. Given an
estimate $\hat{\theta}$, we estimate $g$ by $\hat{g}$ via the standard nonparametric kernel estimator [64] (we specify this in
detail in Appendix B.1 for completeness) using the augmented data. We consider the mean square error
metric $\mathrm{MSE}(\hat{\theta}) = \mathbb{E}[(y - \hat{g}(\hat{\theta}^\top x))^2]$, and then have the following theorem (proof: Appendix B.1):
Theorem 1. Suppose $\theta \in \mathbb{R}^p$ is sparse with sparsity $s = o(\min\{p, \sigma_\xi^2\})$, $p = o(N)$, and $g$ is smooth
with $c_0 < g' < c_1$, $c_2 < g'' < c_3$ for some universal constants $c_0, c_1, c_2, c_3 > 0$. There exists a
distribution on $x$ with a kernel function such that, when the sample size $N$ is sufficiently large, with
probability $1 - o(1)$,
$$\mathrm{MSE}(\hat{\theta}_{\mathrm{CMixup}}) < \min\big(\mathrm{MSE}(\hat{\theta}_{\mathrm{feat}}), \mathrm{MSE}(\hat{\theta}_{\mathrm{mixup}})\big). \quad (8)$$
The high-level intuition for why C-Mixup helps is that vanilla mixup imposes a linearity regularization
on the relationship between the feature and the response. When the relationship is strongly nonlinear
and one-to-one, such regularization hurts generalization; C-Mixup mitigates this effect.
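This intuition can be checked numerically on a toy instance of the model in Eqn. (7); the cubic link and the
noiseless labels below are illustrative choices of this sketch, not part of the paper's analysis. The label of an
interpolated input and the interpolation of the two labels only agree when the pair's labels are close, which is
exactly what the label-similarity kernel of Eqn. (6) favors.

import numpy as np

rng = np.random.default_rng(0)
p, n = 10, 1000
theta = np.zeros(p)
theta[:2] = 1.0                    # sparse index vector

def g(t):
    return t ** 3                  # monotonic, strongly nonlinear link (illustrative)

z = rng.normal(size=(n, p))
y = g(z @ theta)                   # Eqn. (7) with the noise term omitted for clarity

lam = 0.5
i = rng.integers(n, size=5000)
j = rng.integers(n, size=5000)
# Gap between the label of the interpolated input and the interpolated label
# (the target that vanilla mixup trains on).
mismatch = np.abs(g(lam * (z[i] @ theta) + (1 - lam) * (z[j] @ theta))
                  - (lam * y[i] + (1 - lam) * y[j]))
label_gap = np.abs(y[i] - y[j])
close = label_gap < np.quantile(label_gap, 0.2)   # pairs C-Mixup would tend to sample
print("mean mismatch, close-label pairs:", mismatch[close].mean())
print("mean mismatch, uniformly sampled pairs:", mismatch.mean())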
4.2 C-Mixup for Improving Task Generalization
The second benefit of C-Mixup is improving task generalization in meta-learning when the data from
each task follows the model discussed in the last section. Concretely, we apply C-Mixup to MetaMix
[74]. For each query example, the support example with a more similar label will have a higher
probability of being mixed. The algorithm of C-Mixup on MetaMix is summarized in Appendix A.2.
Similar to the in-distribution generalization analysis, we consider the following data generative model:
for the $m$-th task ($m \in [M]$), we have $(x^{(m)}, y^{(m)}) \sim \mathcal{T}_m$ with
$$y^{(m)} = g_m(\theta^\top z^{(m)}) + \epsilon \quad \text{and} \quad x^{(m)} = z^{(m)} + \xi^{(m)}. \quad (9)$$
Here, $\theta$ denotes the globally-shared representation, and the $g_m$'s are the task-specific transformations.
Note that this formulation is close to [54] and widely applied in theoretical analyses of meta-learning
[74, 77]. Following a similar spirit to [62] and the last section, we obtain the estimation of $\theta$ by
$$\hat{\theta} = \frac{1}{M} \sum_{m=1}^{M} \Big( \arg\min_{\theta} \mathbb{E}_{(\tilde{x}^{(m)}, \tilde{y}^{(m)}) \sim \hat{\mathcal{D}}_m} \big[ (\tilde{y}^{(m)} - \theta^\top \tilde{x}^{(m)})^2 \big] \Big).$$
Here, $\hat{\mathcal{D}}_m$ denotes the generic dataset augmented by different approaches, including vanilla MetaMix,
MetaMix with input feature similarity, and C-Mixup. We denote the corresponding estimates by
$\hat{\theta}_{\mathrm{MetaMix}}$, $\hat{\theta}_{\mathrm{Metafeat}}$, and $\hat{\theta}_{\mathrm{MetaCMixup}}$, respectively. For a new task $\mathcal{T}_t$, we again use the standard
nonparametric kernel estimator to estimate $g_t$ via the augmented target data. We then consider the error
metric $\mathrm{MSE}_{\mathrm{Target}}(\hat{\theta}) = \mathbb{E}_{(x,y) \sim \mathcal{T}_t}[(y - \hat{g}_t(\hat{\theta}^\top x))^2]$. Based on this metric, we get the following theorem
showing the promise of C-Mixup in improving task generalization (see Appendix B.2 for the detailed
proof): C-Mixup achieves a smaller $\mathrm{MSE}_{\mathrm{Target}}$ than vanilla MetaMix and MetaMix with input feature
similarity.
Theorem 2. Let $N = \sum_{m=1}^{M} N_m$, where $N_m$ is the number of examples of $\mathcal{T}_m$. Suppose $\theta$ is sparse
with sparsity $s = o(\min\{p, \sigma_\xi^2\})$, $p = o(N)$, and the $g_m$'s are smooth with $0 < g'_m < c_1$, $c_2 < g''_m < c_3$
for some universal constants $c_1, c_2, c_3 > 0$ and all $m \in [M] \cup \{t\}$. There exists a distribution on $x$ with
a kernel function such that, when the sample size $N$ is sufficiently large, with probability $1 - o(1)$,
$$\mathrm{MSE}_{\mathrm{Target}}(\hat{\theta}_{\mathrm{MetaCMixup}}) < \min\big(\mathrm{MSE}_{\mathrm{Target}}(\hat{\theta}_{\mathrm{Metafeat}}), \mathrm{MSE}_{\mathrm{Target}}(\hat{\theta}_{\mathrm{MetaMix}})\big). \quad (10)$$
4.3 C-Mixup for Improving Out-of-distribution Robustness
Finally, we show that C-Mixup improves OOD robustness in the covariate shift setting where some
unrelated features vary across different domains. In this setting, we regard the entire data distribution as
consisting of a set of domains $\mathcal{E} = \{1, \ldots, E\}$, where each domain $e \in \mathcal{E}$ is associated with a data
distribution $P_e$. Given a set of training domains $\mathcal{E}_{tr} \subseteq \mathcal{E}$, we aim to make the trained model generalize
well to an unseen test domain $\mathcal{E}_{ts}$ that is not necessarily in $\mathcal{E}_{tr}$. Here, we focus on covariate shift, i.e.,
the change of $P_e$ among domains is caused only by the change of the marginal distribution $P_e(X)$, while
the conditional distribution $P_e(Y \mid X)$ is fixed across different domains.
To overcome covariate shift, mixing examples with close labels without considering domain infor-
mation can effectively average out domain-changeable correlations and make the predicted values
rely on the invariant causal features. To further understand how C-Mixup improves the robustness to
covariate shift, we provide the following theoretical analysis.
We assume the training data $\{(x_i, y_i)\}_{i=1}^{n}$ follows $x_i = (z_i; a_i) \in \mathbb{R}^{p_1 + p_2}$ and $y_i = \theta^\top x_i + \epsilon_i$,
where $z_i \in \mathbb{R}^{p_1}$ and $a_i \in \mathbb{R}^{p_2}$ are regarded as invariant and domain-changeable unrelated features,
respectively, and the last $p_2$ coordinates of $\theta \in \mathbb{R}^{p_1 + p_2}$ are 0. Now we consider the case where
the training data consists of a pair of domains with almost identical invariant features and opposite
domain-changeable features, i.e., $x_i = (z_i, a_i)$ and $x'_i = (z'_i, a'_i)$, where $z_i \sim \mathcal{N}_{p_1}(0, \sigma_x^2 I_{p_1})$,
$z'_i = z_i + \epsilon'_i$, $a_i \sim \mathcal{N}_{p_2}(0, \sigma_a^2 I_{p_2})$, and $a'_i = -a_i + \epsilon''_i$. Here $\epsilon_i, \epsilon'_i, \epsilon''_i$ are noise terms with mean 0
and sub-Gaussian norm bounded by $\sigma$. We use the ridge estimator
$\hat{\theta}(k) = \arg\min_\theta \big( \sum_i \|y_i - \theta^\top x_i\|^2 + k \|\theta\|^2 \big)$ to reflect the
to reflect the
5
摘要:

C-Mixup:ImprovingGeneralizationinRegressionHuaxiuYao1,YipingWang2,LinjunZhang3,JamesZou1,ChelseaFinn11StanfordUniversity,2ZhejiangUniversity,3RutgersUniversity1{huaxiu,cbnn}@cs.stanford.edu,jamesz@stanford.edu2yipingwang6161@gmail.com,3linjun.zhang@rutgers.eduAbstractImprovingthegeneralizationofd...

展开>> 收起<<
C-Mixup Improving Generalization in Regression Huaxiu Yao1 Yiping Wang2 Linjun Zhang3 James Zou1 Chelsea Finn1 1Stanford University2Zhejiang University3Rutgers University.pdf

共32页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:32 页 大小:1.38MB 格式:PDF 时间:2025-04-27

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 32
客服
关注