
per class, ERM learns only one feature with high probability,
while Mixup training can guarantee the learning of both
features with high probability. Our analysis focuses on
Midpoint Mixup, in which training is done on the midpoints
of data points and their labels. While this seems extreme,
we motivate this choice by proving several nice properties
of Midpoint Mixup and giving intuitions on why it favors
learning all features in the data in Section 3.1.
In particular, Section 3.1 highlights the main ideas behind
why this multi-view learning is possible in the relatively
simple-to-understand setting of linearly separable data. We
prove in this section that the Midpoint Mixup gradient de-
scent dynamics can push towards learning all features in the
data (for our notion of multi-view data) so long as there are
dependencies between the features. Furthermore, we show
that models that have learned all features can achieve arbi-
trarily small pointwise loss on Midpoint-Mixup-augmented
data points, and that this property is unique to Midpoint
Mixup.
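To make the Midpoint Mixup operation concrete, the following minimal Python sketch (with hypothetical inputs; not code from the paper) shows how a single mixed training example is formed by averaging a pair of inputs and their one-hot labels:

import numpy as np

# Hypothetical labeled pair: x_i, x_j are input vectors, y_i, y_j one-hot labels.
x_i, x_j = np.array([1.0, 0.0, 2.0]), np.array([0.0, 3.0, 1.0])
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Midpoint Mixup trains on the midpoint of the inputs and of the labels,
# i.e. standard Mixup with the mixing weight fixed at 1/2.
z_mid = 0.5 * (x_i + x_j)      # midpoint input
y_mid = 0.5 * (y_i + y_j)      # midpoint (soft) label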
In Section 4.2, we show that the ideas developed for the lin-
early separable case can be extended to a noisy, non-linearly-
separable class of data distributions with two features per
class. We prove in our main results that for such distribu-
tions, minimizing the empirical cross-entropy using gradient
descent can lead to learning only one of the features in the
data (Theorem 4.6) while minimizing the Midpoint Mixup
cross-entropy succeeds in learning both features (Theorem
4.7). While our theory in this section focuses on the case of
two features/views per class to be consistent with Allen-Zhu
& Li (2021), our techniques can readily be extended to more
general multi-view data distributions.
Last but not least, we show in Section 5 that our theory ex-
tends to practice by training models on image classification
benchmarks that are modified to have additional spurious
features correlated with the true class labels. We find in our
experiments that Midpoint Mixup outperforms ERM, and
performs comparably to the previously used Mixup settings
in Zhang et al. (2018). A primary goal of this section is to
illustrate that Midpoint Mixup is not just a toy theoretical
setting, but rather one that can be of practical interest.
1.2. Related Work
Mixup. The idea of training on midpoints (or approximate
midpoints) is not new; both Guo (2021) and Chidambaram
et al. (2021) empirically study settings resembling what
we consider in this paper, but they do not develop theory
for this kind of training (beyond an information-theoretic
result in the latter case). As mentioned earlier, there are also
several theoretical works analyzing the Mixup formulation
and its variants (Carratino et al., 2020; Zhang et al., 2020;
2021; Chidambaram et al., 2021; Park et al., 2022), but
none of these works contain optimization results (which are
the focus of this work). Additionally, we note that there are
many Mixup-like data augmentation techniques and training
formulations that are not (immediately) within the scope of
the theory developed in this paper. For example, CutMix
(Yun et al., 2019), Manifold Mixup (Verma et al., 2019),
Puzzle Mix (Kim et al., 2020), SaliencyMix (Uddin et al.,
2020), Co-Mixup (Kim et al., 2021), AutoMix (Liu et al.,
2021), and Noisy Feature Mixup (Lim et al., 2021) are all
such variations.
Data Augmentation. Our work is also influenced by the
existing large body of work theoretically analyzing the ben-
efits of data augmentation (Bishop, 1995; Dao et al., 2019;
Wu et al., 2020; Hanin & Sun, 2021; Rajput et al., 2019;
Yang et al., 2022; Wang et al., 2022; Chen et al., 2020; Mei
et al., 2021). The most relevant to ours is the
recent work of Shen et al. (2022), which also studies the
impact of data augmentation on the learning dynamics of a
2-layer network in a setting motivated by that of Allen-Zhu
& Li (2021). However, Midpoint Mixup differs significantly
from the data augmentation scheme considered in Shen et al.
(2022), and consequently our results and setting are also
of a different nature (we stick much more closely to the
setting of Allen-Zhu & Li (2021)). As such, our work can
be viewed as a parallel thread to that of Shen et al. (2022).
2. Background on Mixup
We will introduce Mixup in the context of $k$-class classification, although the definitions below easily extend to regression. As a notational convenience, we will use $[k]$ to indicate $\{1, 2, \ldots, k\}$.
Recall that, given a finite dataset $\mathcal{X} \subset \mathbb{R}^d \times [k]$ with $|\mathcal{X}| = N$, we can define the empirical cross-entropy loss $J(g, \mathcal{X})$ of a model $g : \mathbb{R}^d \to \mathbb{R}^k$ as:
$$J(g, \mathcal{X}) = -\frac{1}{N} \sum_{i \in [N]} \log \phi_{y_i}(g(x_i)), \quad \text{where} \quad \phi_y(g(x)) = \frac{\exp(g_y(x))}{\sum_{s \in [k]} \exp(g_s(x))}, \tag{2.1}$$
with $\phi$ being the standard softmax function and $g_y, \phi_y$ indicating the $y$-th coordinate functions of $g$ and $\phi$ respectively.
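For concreteness, here is a minimal Python/NumPy sketch of (2.1) under assumed names (g is any callable mapping an input vector to a logit vector, X is an N-by-d array of inputs, and Y the corresponding integer labels); this is only an illustration, not code from the paper:

import numpy as np

def softmax(v):
    # The softmax function phi; subtracting the max is only for numerical stability.
    e = np.exp(v - v.max())
    return e / e.sum()

def cross_entropy(g, X, Y):
    # Empirical cross-entropy J(g, X) of (2.1): average negative log-probability
    # that the model assigns to the true label of each training point.
    N = len(X)
    return -sum(np.log(softmax(g(X[i]))[Y[i]]) for i in range(N)) / N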
Now let us fix a distribution $\mathcal{D}_\lambda$ whose support is contained in $[0, 1]$ and introduce the notation $z_{i,j}(\lambda) = \lambda x_i + (1 - \lambda) x_j$ (using $z_{i,j}$ when $\lambda$ is clear from context), where $(x_i, y_i), (x_j, y_j) \in \mathcal{X}$. Then we may define the Mixup cross-entropy $J_M(g, \mathcal{X}, \mathcal{D}_\lambda)$ as:
$$\ell(\lambda, i, j) = \lambda \log \phi_{y_i}(g(z_{i,j})) + (1 - \lambda) \log \phi_{y_j}(g(z_{i,j})),$$
$$J_M(g, \mathcal{X}, \mathcal{D}_\lambda) = -\frac{1}{N^2} \sum_{i \in [N]} \sum_{j \in [N]} \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}\left[\ell(\lambda, i, j)\right]. \tag{2.2}$$
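Continuing the sketch above (reusing numpy and softmax), the Mixup cross-entropy of (2.2) averages the mixed loss over all pairs, with the expectation over $\lambda$ approximated here by an empirical average over drawn samples; Midpoint Mixup corresponds to $\mathcal{D}_\lambda$ being a point mass at $1/2$, i.e. lam_samples = (0.5,) below. The names and the Monte Carlo approximation are our own illustration, not the paper's implementation:

def mixup_cross_entropy(g, X, Y, lam_samples=(0.5,)):
    # Mixup cross-entropy J_M(g, X, D_lambda) of (2.2): the expectation over
    # lambda ~ D_lambda is approximated by an average over lam_samples.
    # Midpoint Mixup fixes lambda = 1/2, i.e. lam_samples = (0.5,).
    N = len(X)
    total = 0.0
    for i in range(N):
        for j in range(N):
            for lam in lam_samples:
                z = lam * X[i] + (1 - lam) * X[j]   # mixed point z_{i,j}(lambda)
                p = softmax(g(z))
                total += lam * np.log(p[Y[i]]) + (1 - lam) * np.log(p[Y[j]])
    return -total / (N ** 2 * len(lam_samples))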