
distribution on $x$ with a kernel function, such that when the sample size $N$ is sufficiently large, with probability $1-o(1)$,
$$\mathrm{MSE}(\theta^*_{\text{C-Mixup}}) < \min\big(\mathrm{MSE}(\theta^*_{\text{feat}}),\ \mathrm{MSE}(\theta^*_{\text{mixup}})\big). \tag{8}$$
The high-level intuition for why C-Mixup helps is that vanilla mixup imposes a linearity regularization on the relationship between features and responses. When this relationship is strongly nonlinear and one-to-one, such regularization hurts generalization; C-Mixup mitigates this effect because it only mixes examples with similar labels, so the interpolation stays local, where the relationship is approximately linear.
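To make this sampling mechanism concrete, the following minimal sketch (our own illustration, not the authors' released code; the bandwidth and data shapes are arbitrary assumptions) contrasts vanilla mixup with C-Mixup-style partner sampling, where the partner is drawn with probability proportional to a Gaussian kernel on the label distance:

```python
import numpy as np

rng = np.random.default_rng(0)

def cmixup_pair_probs(y, i, bandwidth=1.0):
    """C-Mixup sampling distribution over mixing partners for anchor i:
    P(j) ∝ exp(-(y_i - y_j)^2 / (2 * bandwidth^2)), a Gaussian kernel on labels."""
    w = np.exp(-((y - y[i]) ** 2) / (2.0 * bandwidth ** 2))
    w[i] = 0.0                      # never mix an example with itself
    return w / w.sum()

x = rng.normal(size=(256, 8))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=256)   # strongly nonlinear response

i = 0
j_cmix = rng.choice(len(y), p=cmixup_pair_probs(y, i, bandwidth=0.2))  # close-label partner
j_vanilla = rng.choice(len(y))                                         # uniform partner (mixup)

lam = rng.beta(2.0, 2.0)            # mixing coefficient from Beta(alpha, alpha)
x_mix = lam * x[i] + (1 - lam) * x[j_cmix]
y_mix = lam * y[i] + (1 - lam) * y[j_cmix]
print(abs(y[i] - y[j_cmix]), abs(y[i] - y[j_vanilla]))  # C-Mixup label gap is typically smaller
```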
4.2 C-Mixup for Improving Task Generalization
The second benefit of C-Mixup is improving task generalization in meta-learning when the data from
each task follows the model discussed in the last section. Concretely, we apply C-Mixup to MetaMix
[74]. For each query example, the support example with a more similar label has a higher
probability of being mixed. The algorithm of C-Mixup on MetaMix is summarized in Appendix A.2.
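A minimal sketch of this pairing step is given below; it is our own illustration of the idea (the full algorithm is in Appendix A.2), and the function and argument names (`metamix_cmixup_batch`, `bandwidth`, etc.) are placeholders rather than the paper's implementation. Each query example is mixed with a support example sampled in proportion to a Gaussian kernel on the label distance:

```python
import numpy as np

def metamix_cmixup_batch(support_x, support_y, query_x, query_y,
                         bandwidth=1.0, alpha=2.0, rng=None):
    """For every query example, sample a support partner with probability
    proportional to exp(-(y_q - y_s)^2 / (2*bandwidth^2)), then apply mixup."""
    rng = rng or np.random.default_rng()
    mixed_x, mixed_y = [], []
    for xq, yq in zip(query_x, query_y):
        w = np.exp(-(support_y - yq) ** 2 / (2.0 * bandwidth ** 2))
        p = w / w.sum()                          # label-similarity sampling distribution
        j = rng.choice(len(support_y), p=p)      # support example with a close label
        lam = rng.beta(alpha, alpha)
        mixed_x.append(lam * xq + (1 - lam) * support_x[j])
        mixed_y.append(lam * yq + (1 - lam) * support_y[j])
    return np.stack(mixed_x), np.array(mixed_y)
```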
Similar to the in-distribution generalization analysis, we consider the following data generative model: for the $m$-th task ($m\in[M]$), we have $(x^{(m)}, y^{(m)})\sim \mathcal{T}_m$ with
$$y^{(m)}=g_m(\theta^\top z^{(m)}) + \epsilon \quad\text{and}\quad x^{(m)}=z^{(m)}+\xi^{(m)}. \tag{9}$$
Here, $\theta$ denotes the globally-shared representation, and the $g_m$'s are the task-specific transformations.
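For concreteness, a toy generator following Eq. (9) might look as follows; the dimensions, noise scales, and the softplus-style choices of $g_m$ are illustrative assumptions on our part, not values from the paper:

```python
import numpy as np

def sample_task(theta, g_m, n, sigma_xi=0.1, sigma_eps=0.1, rng=None):
    """Draw n examples from one task of the model in Eq. (9):
    y = g_m(theta^T z) + eps,  x = z + xi."""
    rng = rng or np.random.default_rng()
    d = theta.shape[0]
    z = rng.normal(size=(n, d))                    # latent features
    x = z + sigma_xi * rng.normal(size=(n, d))     # observed features (noisy z)
    y = g_m(z @ theta) + sigma_eps * rng.normal(size=n)
    return x, y

rng = np.random.default_rng(0)
theta = rng.normal(size=16)                        # globally-shared representation
# task-specific transformations g_m (softplus-like: increasing and convex)
g_list = [lambda t, a=a: np.log1p(np.exp(a * t)) for a in (0.5, 1.0, 1.5)]
tasks = [sample_task(theta, g_m, n=200, rng=rng) for g_m in g_list]
```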
Note that this formulation is close to [54] and is widely used in theoretical analyses of meta-learning [74, 77]. Following a similar spirit to [62] and the last section, we obtain the estimate of $\theta$ as
$$\theta^* = \arg\min_{\theta}\ \frac{1}{M}\sum_{m=1}^{M} \mathbb{E}_{(\tilde{x}^{(m)},\tilde{y}^{(m)})\in \hat{\mathcal{D}}_m}\big[(\tilde{y}^{(m)}-\theta^\top \tilde{x}^{(m)})^2\big].$$
Here, $\hat{\mathcal{D}}_m$ denotes the generic dataset augmented by different approaches, including vanilla MetaMix, MetaMix with input feature similarity, and C-Mixup. We denote the corresponding estimators by $\theta^*_{\text{MetaMix}}$, $\theta^*_{\text{Meta-feat}}$, and $\theta^*_{\text{Meta-C-Mixup}}$, respectively. For a new task $\mathcal{T}_t$, we again use the standard nonparametric kernel estimator to estimate $g_t$ via the augmented target data. We then consider the following error metric:
$$\mathrm{MSE}_{\text{Target}}(\theta^*) = \mathbb{E}_{(x,y)\sim \mathcal{T}_t}\big[(y-\hat{g}_t(\theta^{*\top} x))^2\big].$$
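The sketch below illustrates this evaluation pipeline as we read it: $\theta^*$ is fit by pooled least squares over the augmented task data, $\hat{g}_t$ is a Nadaraya-Watson kernel estimator on the augmented target data, and $\mathrm{MSE}_{\text{Target}}$ is the held-out squared error. It is an illustrative reconstruction rather than the authors' code, and the bandwidth and data layout are assumptions:

```python
import numpy as np

def fit_theta(augmented_tasks):
    """Pooled least-squares estimate of theta over all augmented tasks,
    i.e., the minimizer of the averaged squared loss in the display above."""
    X = np.vstack([x for x, _ in augmented_tasks])
    Y = np.concatenate([y for _, y in augmented_tasks])
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def kernel_regressor(t_train, y_train, bandwidth=0.3):
    """Nadaraya-Watson estimator of g_t on the 1-D index t = theta^T x."""
    def g_hat(t):
        w = np.exp(-((t_train - t) ** 2) / (2.0 * bandwidth ** 2))
        return (w @ y_train) / (w.sum() + 1e-12)
    return g_hat

def mse_target(theta_star, x_aug, y_aug, x_test, y_test):
    """MSE_Target: fit g_hat on augmented target data, evaluate on fresh target draws."""
    g_hat = kernel_regressor(x_aug @ theta_star, y_aug)
    preds = np.array([g_hat(t) for t in x_test @ theta_star])
    return np.mean((y_test - preds) ** 2)
```

Running this pipeline with $\hat{\mathcal{D}}_m$ produced by MetaMix, MetaMix with feature similarity, and C-Mixup yields the three quantities compared in the theorem below.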
Based on this metric, we present the following theorem, which shows the promise of C-Mixup in improving task generalization (see Appendix B.2 for the detailed proof): C-Mixup achieves a smaller $\mathrm{MSE}_{\text{Target}}$ than both vanilla MetaMix and MetaMix with input feature similarity.
Theorem 2. Let $N=\sum_{m=1}^{M} N_m$, where $N_m$ is the number of examples of $\mathcal{T}_m$. Suppose $\theta$ is sparse with sparsity $s=o(\min\{d, \sigma_\xi^2\})$, $p=o(N)$, and the $g_m$'s are smooth with $0< g'_m< c_1$, $c_2< g''_m< c_3$ for some universal constants $c_1, c_2, c_3>0$ and $m\in[M]\cup\{t\}$. There exists a distribution on $x$ with a kernel function, such that when the sample size $N$ is sufficiently large, with probability $1-o(1)$,
$$\mathrm{MSE}_{\text{Target}}(\theta^*_{\text{Meta-C-Mixup}}) < \min\big(\mathrm{MSE}_{\text{Target}}(\theta^*_{\text{Meta-feat}}),\ \mathrm{MSE}_{\text{Target}}(\theta^*_{\text{MetaMix}})\big). \tag{10}$$
4.3 C-Mixup for Improving Out-of-distribution Robustness
Finally, we show that C-Mixup improves OOD robustness in the covariate shift setting, where some unrelated features vary across different domains. In this setting, we regard the entire data distribution as consisting of $\mathcal{E}=\{1, \ldots, E\}$ domains, where each domain $e\in\mathcal{E}$ is associated with a data distribution $P_e$. Given a set of training domains $\mathcal{E}_{tr} \subseteq \mathcal{E}$, we aim to make the trained model generalize well to an unseen test domain $\mathcal{E}_{ts}$ that is not necessarily in $\mathcal{E}_{tr}$. Here, we focus on covariate shift, i.e., the change of $P_e$ across domains is caused only by the change of the marginal distribution $P_e(X)$, while the conditional distribution $P_e(Y|X)$ is fixed across domains.
To overcome covariate shift, C-Mixup mixes examples with close labels without considering domain information, which effectively averages out domain-changeable correlations and makes the predictions rely on the invariant causal features. To further understand how C-Mixup improves robustness to covariate shift, we provide the following theoretical analysis.
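Before the formal setup, a toy numerical sketch of this averaging-out intuition may help; it is our own illustration, with arbitrary dimensions and noise levels, using the paired-domain data model and ridge estimator defined in the remainder of this section. Mixing each example with its closest-label partner from the opposite domain shrinks the domain-changeable coordinates toward zero while preserving the invariant ones, so the ridge fit cannot rely on them:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, sigma = 500, 5, 5, 0.05
theta = np.concatenate([rng.normal(size=p1), np.zeros(p2)])  # last p2 coordinates are 0

# Domain 1: x_i = (z_i; a_i).  Domain 2: near-identical z, sign-flipped a.
z = rng.normal(size=(n, p1)); a = rng.normal(size=(n, p2))
x1 = np.hstack([z, a])
x2 = np.hstack([z + sigma * rng.normal(size=(n, p1)),
                -a + sigma * rng.normal(size=(n, p2))])
y1 = x1 @ theta + sigma * rng.normal(size=n)
y2 = x2 @ theta + sigma * rng.normal(size=n)

# C-Mixup-style pairing: y1[i] and y2[i] are nearly equal (same z), so the
# closest-label partner across domains is (x2[i], y2[i]); mixing them averages
# the a-part toward zero while preserving the z-part.
lam = rng.beta(2.0, 2.0, size=(n, 1))
x_mix = lam * x1 + (1 - lam) * x2
y_mix = lam[:, 0] * y1 + (1 - lam[:, 0]) * y2
print(np.abs(x1[:, p1:]).mean(), np.abs(x_mix[:, p1:]).mean())   # a-part shrinks

def ridge(X, Y, k=1.0):
    """Ridge estimator theta*(k) = argmin_theta sum_i ||y_i - theta^T x_i||^2 + k ||theta||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(d), X.T @ Y)

theta_mix = ridge(x_mix, y_mix)
print(np.abs(theta_mix[p1:]).max())   # weight on domain-changeable coordinates stays small
```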
We assume the training data $(x_i, y_i)_{i=1}^n$ follows $x_i= (z_i;a_i)\in\mathbb{R}^{p_1+p_2}$ and $y_i=\theta^\top x_i+\epsilon_i$, where $z_i\in\mathbb{R}^{p_1}$ and $a_i\in\mathbb{R}^{p_2}$ are regarded as invariant and domain-changeable unrelated features, respectively, and the last $p_2$ coordinates of $\theta\in\mathbb{R}^{p_1+p_2}$ are 0. Now we consider the case where the training data consist of a pair of domains with almost identical invariant features and opposite domain-changeable features, i.e., $x_i= (z_i, a_i)$, $x'_i= (z'_i, a'_i)$, where $z_i\sim \mathcal{N}_{p_1}(0, \sigma_x^2 I_{p_1})$, $z'_i=z_i+\epsilon'_i$, $a_i\sim \mathcal{N}_{p_2}(0, \sigma_a^2 I_{p_2})$, $a'_i=-a_i+\epsilon''_i$. Here, $\epsilon_i, \epsilon'_i, \epsilon''_i$ are noise terms with mean 0 and sub-Gaussian norm bounded by $\sigma$. We use the ridge estimator $\theta^*(k) = \arg\min_\theta \big(\sum_i \|y_i-\theta^\top x_i\|^2+k\|\theta\|^2\big)$ to reflect the