
per class, ERM learns only one feature with high probability,
while Mixup training can guarantee the learning of both
features with high probability. Our analysis focuses on
Midpoint Mixup, in which training is done on the midpoints
of data points and their labels. While this seems extreme,
we motivate this choice by proving several nice properties
of Midpoint Mixup and giving intuitions on why it favors
learning all features in the data in Section 3.1.
In particular, Section 3.1 highlights the main ideas behind
why this multi-view learning is possible in the relatively
simple-to-understand setting of linearly separable data. We
prove in this section that the Midpoint Mixup gradient de-
scent dynamics can push towards learning all features in the
data (for our notion of multi-view data) so long as there are
dependencies between the features. Furthermore, we show
that models that have learned all features can achieve arbi-
trarily small pointwise loss on Midpoint-Mixup-augmented
data points, and that this property is unique to Midpoint
Mixup.
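To make the Midpoint Mixup operation concrete, the following minimal Python sketch (with hypothetical inputs; not code from the paper) shows how a single mixed training example is formed by averaging a pair of inputs and their one-hot labels:

import numpy as np

# Hypothetical labeled pair: x_i, x_j are input vectors, y_i, y_j one-hot labels.
x_i, x_j = np.array([1.0, 0.0, 2.0]), np.array([0.0, 3.0, 1.0])
y_i, y_j = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Midpoint Mixup trains on the midpoint of the inputs and of the labels,
# i.e. standard Mixup with the mixing weight fixed at 1/2.
z_mid = 0.5 * (x_i + x_j)      # midpoint input
y_mid = 0.5 * (y_i + y_j)      # midpoint (soft) label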
In Section 4.2, we show that the ideas developed for the lin-
early separable case can be extended to a noisy, non-linearly-
separable class of data distributions with two features per
class. We prove in our main results that for such distribu-
tions, minimizing the empirical cross-entropy using gradient
descent can lead to learning only one of the features in the
data (Theorem 4.6) while minimizing the Midpoint Mixup
cross-entropy succeeds in learning both features (Theorem
4.7). While our theory in this section focuses on the case of
two features/views per class to be consistent with Allen-Zhu
& Li (2021), our techniques can readily be extended to more
general multi-view data distributions.
Last but not least, we show in Section 5 that our theory ex-
tends to practice by training models on image classification
benchmarks that are modified to have additional spurious
features correlated with the true class labels. We find in our
experiments that Midpoint Mixup outperforms ERM, and
performs comparably to the previously used Mixup settings
in Zhang et al. (2018). A primary goal of this section is to
illustrate that Midpoint Mixup is not just a toy theoretical
setting, but rather one that can be of practical interest.
1.2. Related Work
Mixup. The idea of training on midpoints (or approximate
midpoints) is not new; both Guo (2021) and Chidambaram
et al. (2021) empirically study settings resembling what
we consider in this paper, but they do not develop theory
for this kind of training (beyond an information-theoretic
result in the latter case). As mentioned earlier, there are also
several theoretical works analyzing the Mixup formulation
and its variants (Carratino et al., 2020; Zhang et al., 2020;
2021; Chidambaram et al., 2021; Park et al., 2022), but
none of these works contain optimization results (which are
the focus of this work). Additionally, we note that there are
many Mixup-like data augmentation techniques and training
formulations that are not (immediately) within the scope of
the theory developed in this paper. For example, CutMix
(Yun et al., 2019), Manifold Mixup (Verma et al., 2019),
Puzzle Mix (Kim et al., 2020), SaliencyMix (Uddin et al.,
2020), Co-Mixup (Kim et al., 2021), AutoMix (Liu et al.,
2021), and Noisy Feature Mixup (Lim et al., 2021) are all
such variations.
Data Augmentation. Our work is also influenced by the
existing large body of work theoretically analyzing the ben-
efits of data augmentation (Bishop, 1995; Dao et al., 2019;
Wu et al., 2020; Hanin & Sun, 2021; Rajput et al., 2019;
Yang et al., 2022; Wang et al., 2022; Chen et al., 2020; Mei
et al., 2021). The most relevant to ours is the
recent work of Shen et al. (2022), which also studies the
impact of data augmentation on the learning dynamics of a
2-layer network in a setting motivated by that of Allen-Zhu
& Li (2021). However, Midpoint Mixup differs significantly
from the data augmentation scheme considered in Shen et al.
(2022), and consequently our results and setting are also
of a different nature (we stick much more closely to the
setting of Allen-Zhu & Li (2021)). As such, our work can
be viewed as a parallel thread to that of Shen et al. (2022).
2. Background on Mixup
We will introduce Mixup in the context of $k$-class classification, although the definitions below easily extend to regression. As a notational convenience, we will use $[k]$ to indicate $\{1, 2, \ldots, k\}$.
Recall that, given a finite dataset $\mathcal{X} \subset \mathbb{R}^d \times [k]$ with $|\mathcal{X}| = N$, we can define the empirical cross-entropy loss $J(g, \mathcal{X})$ of a model $g : \mathbb{R}^d \to \mathbb{R}^k$ as:
$$J(g, \mathcal{X}) = -\frac{1}{N} \sum_{i \in [N]} \log \phi_{y_i}(g(x_i)), \quad \text{where} \quad \phi_y(g(x)) = \frac{\exp(g_y(x))}{\sum_{s \in [k]} \exp(g_s(x))}, \tag{2.1}$$
with $\phi$ being the standard softmax function and $g_y, \phi_y$ indicating the $y$-th coordinate functions of $g$ and $\phi$ respectively.
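For concreteness, here is a minimal Python/NumPy sketch of (2.1) under assumed names (g is any callable mapping an input vector to a logit vector, X is an N-by-d array of inputs, and Y the corresponding integer labels); this is only an illustration, not code from the paper:

import numpy as np

def softmax(v):
    # The softmax function phi; subtracting the max is only for numerical stability.
    e = np.exp(v - v.max())
    return e / e.sum()

def cross_entropy(g, X, Y):
    # Empirical cross-entropy J(g, X) of (2.1): average negative log-probability
    # that the model assigns to the true label of each training point.
    N = len(X)
    return -sum(np.log(softmax(g(X[i]))[Y[i]]) for i in range(N)) / N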
Now let us fix a distribution $\mathcal{D}_\lambda$ whose support is contained in $[0, 1]$ and introduce the notation $z_{i,j}(\lambda) = \lambda x_i + (1 - \lambda) x_j$ (using $z_{i,j}$ when $\lambda$ is clear from context), where $(x_i, y_i), (x_j, y_j) \in \mathcal{X}$. Then we may define the Mixup cross-entropy $J_M(g, \mathcal{X}, \mathcal{D}_\lambda)$ as:
$$\ell(\lambda, i, j) = \lambda \log \phi_{y_i}(g(z_{i,j})) + (1 - \lambda) \log \phi_{y_j}(g(z_{i,j})),$$
$$J_M(g, \mathcal{X}, \mathcal{D}_\lambda) = -\frac{1}{N^2} \sum_{i \in [N]} \sum_{j \in [N]} \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}\left[\ell(\lambda, i, j)\right]. \tag{2.2}$$
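Continuing the sketch above (reusing numpy and softmax), the Mixup cross-entropy of (2.2) averages the mixed loss over all pairs, with the expectation over $\lambda$ approximated here by an empirical average over drawn samples; Midpoint Mixup corresponds to $\mathcal{D}_\lambda$ being a point mass at $1/2$, i.e. lam_samples = (0.5,) below. The names and the Monte Carlo approximation are our own illustration, not the paper's implementation:

def mixup_cross_entropy(g, X, Y, lam_samples=(0.5,)):
    # Mixup cross-entropy J_M(g, X, D_lambda) of (2.2): the expectation over
    # lambda ~ D_lambda is approximated by an average over lam_samples.
    # Midpoint Mixup fixes lambda = 1/2, i.e. lam_samples = (0.5,).
    N = len(X)
    total = 0.0
    for i in range(N):
        for j in range(N):
            for lam in lam_samples:
                z = lam * X[i] + (1 - lam) * X[j]   # mixed point z_{i,j}(lambda)
                p = softmax(g(z))
                total += lam * np.log(p[Y[i]]) + (1 - lam) * np.log(p[Y[j]])
    return -total / (N ** 2 * len(lam_samples))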