Provably Learning Diverse Features in Multi-View Data with Midpoint Mixup
Muthu Chidambaram¹, Xiang Wang¹, Chenwei Wu¹, Rong Ge¹

¹Department of Computer Science, Duke University. Correspondence to: Muthu Chidambaram <muthu@cs.duke.edu>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). arXiv:2210.13512v4 [cs.LG].
Abstract
Mixup is a data augmentation technique that relies on training using random convex combinations of data points and their labels. In recent years, Mixup has become a standard primitive used in the training of state-of-the-art image classification models due to its demonstrated benefits over empirical risk minimization with regards to generalization and robustness. In this work, we try to explain some of this success from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class. We also show empirically that these theoretical insights extend to the practical settings of image benchmarks modified to have multiple features.
1. Introduction
Data augmentation techniques have been a mainstay in the
training of state-of-the-art models for a wide array of tasks
- particularly in the field of computer vision - due to their
ability to artificially inflate dataset size and encourage model
robustness to various transformations of the data.
One such technique that has achieved widespread use is Mixup (Zhang et al., 2018), which constructs new data points as convex combinations of pairs of data points and their labels from the original dataset. Mixup has been shown to empirically improve generalization and robustness when compared to standard training over different model architectures, tasks, and domains (Liang et al., 2018; He et al., 2019; Thulasidasan et al., 2019; Lamb et al., 2019; Arazo et al., 2019; Guo, 2020; Verma et al., 2021b; Wang et al., 2021). It has also found applications to distributed private learning (Huang et al., 2021), learning fair models (Chuang & Mroueh, 2021), semi-supervised learning (Berthelot et al., 2019b; Sohn et al., 2020; Berthelot et al., 2019a), self-supervised (specifically contrastive) learning (Verma et al., 2021a; Lee et al., 2020; Kalantidis et al., 2020), and multi-modal learning (So et al., 2022).
The success of Mixup has instigated several works attempting to theoretically characterize its potential benefits and drawbacks (Guo et al., 2019; Carratino et al., 2020; Zhang et al., 2020; 2021; Chidambaram et al., 2021). These works have focused mainly on analyzing, at a high level, the beneficial (or detrimental) behaviors encouraged by the Mixup version of the original empirical loss for a given task.
As such, none of these previous works (to the best of our knowledge) have provided an algorithmic analysis of Mixup training in the context of non-linear models (i.e. neural networks), which is the main use case of Mixup. In this paper, we begin this line of work by theoretically separating the full training dynamics of Mixup (with a specific set of hyperparameters) from empirical risk minimization (ERM) for a 2-layer convolutional network (CNN) architecture on a class of data distributions exhibiting a multi-view nature. This multi-view property essentially requires (assuming classification data) that each class in the data is well-correlated with multiple features present in the data.
Our analysis is heavily motivated by the recent work of Allen-Zhu & Li (2021), which showed that this kind of multi-view data can provide a fruitful setting for theoretically understanding the benefits of ensembles and knowledge distillation in the training of deep learning models. We show that Mixup can, perhaps surprisingly, capture some of the key benefits of ensembles explained by Allen-Zhu & Li (2021) despite only being used to train a single model.
1.1. Main Contributions
Our main results (Theorem 4.6 and Theorem 4.7) give a clear separation between Mixup training and regular training. They show that for data with two different features (or views) per class, ERM learns only one feature with high probability, while Mixup training can guarantee the learning of both features with high probability. Our analysis focuses on Midpoint Mixup, in which training is done on the midpoints of data points and their labels. While this seems extreme, we motivate this choice by proving several nice properties of Midpoint Mixup and giving intuitions on why it favors learning all features in the data in Section 3.1.
In particular, Section 3.1 highlights the main ideas behind why this multi-view learning is possible in the relatively simple to understand setting of linearly separable data. We prove in this section that the Midpoint Mixup gradient descent dynamics can push towards learning all features in the data (for our notion of multi-view data) so long as there are dependencies between the features. Furthermore, we show that models that have learned all features can achieve arbitrarily small pointwise loss on Midpoint-Mixup-augmented data points, and that this property is unique to Midpoint Mixup.
In Section 4.2, we show that the ideas developed for the linearly separable case can be extended to a noisy, non-linearly-separable class of data distributions with two features per class. We prove in our main results that for such distributions, minimizing the empirical cross-entropy using gradient descent can lead to learning only one of the features in the data (Theorem 4.6) while minimizing the Midpoint Mixup cross-entropy succeeds in learning both features (Theorem 4.7). While our theory in this section focuses on the case of two features/views per class to be consistent with Allen-Zhu & Li (2021), our techniques can readily be extended to more general multi-view data distributions.
Last but not least, we show in Section 5 that our theory extends to practice by training models on image classification benchmarks that are modified to have additional spurious features correlated with the true class labels. We find in our experiments that Midpoint Mixup outperforms ERM, and performs comparably to the previously used Mixup settings in Zhang et al. (2018). A primary goal of this section is to illustrate that Midpoint Mixup is not just a toy theoretical setting, but rather one that can be of practical interest.
1.2. Related Work
Mixup. The idea of training on midpoints (or approximate midpoints) is not new; both Guo (2021) and Chidambaram et al. (2021) empirically study settings resembling what we consider in this paper, but they do not develop theory for this kind of training (beyond an information-theoretic result in the latter case). As mentioned earlier, there are also several theoretical works analyzing the Mixup formulation and its variants (Carratino et al., 2020; Zhang et al., 2020; 2021; Chidambaram et al., 2021; Park et al., 2022), but none of these works contain optimization results (which are the focus of this work). Additionally, we note that there are many Mixup-like data augmentation techniques and training formulations that are not (immediately) within the scope of the theory developed in this paper. For example, CutMix (Yun et al., 2019), Manifold Mixup (Verma et al., 2019), Puzzle Mix (Kim et al., 2020), SaliencyMix (Uddin et al., 2020), Co-Mixup (Kim et al., 2021), AutoMix (Liu et al., 2021), and Noisy Feature Mixup (Lim et al., 2021) are all such variations.
Data Augmentation. Our work is also influenced by the existing large body of work theoretically analyzing the benefits of data augmentation (Bishop, 1995; Dao et al., 2019; Wu et al., 2020; Hanin & Sun, 2021; Rajput et al., 2019; Yang et al., 2022; Wang et al., 2022; Chen et al., 2020; Mei et al., 2021). The most relevant such work to ours is the recent work of Shen et al. (2022), which also studies the impact of data augmentation on the learning dynamics of a 2-layer network in a setting motivated by that of Allen-Zhu & Li (2021). However, Midpoint Mixup differs significantly from the data augmentation scheme considered in Shen et al. (2022), and consequently our results and setting are also of a different nature (we stick much more closely to the setting of Allen-Zhu & Li (2021)). As such, our work can be viewed as a parallel thread to that of Shen et al. (2022).
2. Background on Mixup
We will introduce Mixup in the context of $k$-class classification, although the definitions below easily extend to regression. As a notational convenience, we will use $[k]$ to indicate $\{1, 2, \ldots, k\}$.
Recall that, given a finite dataset $\mathcal{X} \subset \mathbb{R}^d \times [k]$ with $|\mathcal{X}| = N$, we can define the empirical cross-entropy loss $J(g, \mathcal{X})$ of a model $g: \mathbb{R}^d \to \mathbb{R}^k$ as:

$$J(g, \mathcal{X}) = -\frac{1}{N} \sum_{i \in [N]} \log \phi_{y_i}(g(x_i)), \quad \text{where} \quad \phi_y(g(x)) = \frac{\exp(g_y(x))}{\sum_{s \in [k]} \exp(g_s(x))}. \tag{2.1}$$
Here $\phi$ is the standard softmax function, and the notation $g_y, \phi_y$ indicates the $y$-th coordinate functions of $g$ and $\phi$ respectively. Now let us fix a distribution $\mathcal{D}_\lambda$ whose support is contained in $[0, 1]$ and introduce the notation $z_{i,j}(\lambda) = \lambda x_i + (1 - \lambda) x_j$ (using $z_{i,j}$ when $\lambda$ is clear from context) where $(x_i, y_i), (x_j, y_j) \in \mathcal{X}$. Then we may define the Mixup cross-entropy $J_M(g, \mathcal{X}, \mathcal{D}_\lambda)$ as:

$$\ell(\lambda, i, j) = -\lambda \log \phi_{y_i}(g(z_{i,j})) - (1 - \lambda) \log \phi_{y_j}(g(z_{i,j})),$$
$$J_M(g, \mathcal{X}, \mathcal{D}_\lambda) = \frac{1}{N^2} \sum_{i \in [N]} \sum_{j \in [N]} \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}\left[\ell(\lambda, i, j)\right]. \tag{2.2}$$
We mention a minor difference between Equation 2.2 and the original formulation of Zhang et al. (2018). Zhang et al. (2018) consider the expectation term in Equation 2.2 over $N$ randomly sampled pairs of points from the original dataset $\mathcal{X}$, whereas we explicitly consider mixing all $N^2$ possible pairs of points. This is, however, just to make various parts of our analysis easier to follow - one could also sample $N$ mixed points uniformly, and the analysis would still carry through with an additional high probability qualifier (the important aspect is the proportions with which different mixed points show up; i.e. mixing across classes versus mixing within a class).
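To make the objective in Equation 2.2 concrete, the following sketch computes the all-pairs Mixup cross-entropy for a generic model in NumPy. The helper names (`mixup_cross_entropy`, `logits_fn`) are ours rather than the paper's, and the expectation over $\mathcal{D}_\lambda$ is approximated by averaging over a supplied list of sampled mixing coefficients.

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax, shifted for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def mixup_cross_entropy(logits_fn, X, y, lambdas):
    """All-pairs Mixup cross-entropy (Equation 2.2); the expectation over the
    mixing distribution is approximated by the samples in `lambdas`."""
    N = X.shape[0]
    rows, cols = np.arange(N)[:, None], np.arange(N)[None, :]
    total = 0.0
    for lam in lambdas:
        # Mixed points z_{i,j} = lam * x_i + (1 - lam) * x_j for all pairs (i, j).
        Z = lam * X[:, None, :] + (1 - lam) * X[None, :, :]          # (N, N, d)
        probs = softmax(logits_fn(Z.reshape(N * N, -1))).reshape(N, N, -1)
        log_p_i = np.log(probs[rows, cols, y[:, None]])              # log phi_{y_i}(g(z_{i,j}))
        log_p_j = np.log(probs[rows, cols, y[None, :]])              # log phi_{y_j}(g(z_{i,j}))
        total += np.mean(-lam * log_p_i - (1 - lam) * log_p_j)
    return total / len(lambdas)
```

For example, drawing the entries of `lambdas` from a Beta(α, α) distribution yields a Monte Carlo estimate of the standard Mixup objective of Zhang et al. (2018).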
3. Motivating Midpoint Mixup: The Linear Regime
As can be seen from Equation 2.2, the Mixup cross-entropy $J_M(g, \mathcal{X}, \mathcal{D}_\lambda)$ depends heavily on the choice of mixing distribution $\mathcal{D}_\lambda$. Zhang et al. (2018) took $\mathcal{D}_\lambda$ to be $\mathrm{Beta}(\alpha, \alpha)$ with $\alpha$ being a hyperparameter. In this work, we will specifically be interested in the case of $\alpha \to \infty$, for which the distribution $\mathcal{D}_\lambda$ takes the value $1/2$ with probability 1. We refer to this special case as Midpoint Mixup, and note that it can also be viewed as a case of the Pairwise Label Smoothing strategy introduced by Guo (2021). We will write the Midpoint Mixup loss as $J_{MM}(g, \mathcal{X})$ (here $z_{i,j} = (x_i + x_j)/2$ and there is no $\mathcal{D}_\lambda$ dependence as the mixing is deterministic):

$$\ell(i, j) = -\log \phi_{y_i}(g(z_{i,j})) - \log \phi_{y_j}(g(z_{i,j})),$$
$$J_{MM}(g, \mathcal{X}) = \frac{1}{2N^2} \sum_{i \in [N]} \sum_{j \in [N]} \ell(i, j). \tag{3.1}$$
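As a sanity check on Equation 3.1, the sketch below specializes the all-pairs computation above to the deterministic midpoint case; the function name is again ours. Evaluating the earlier `mixup_cross_entropy` at the single sample λ = 1/2 returns the same value.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def midpoint_mixup_loss(logits_fn, X, y):
    """All-pairs Midpoint Mixup loss (Equation 3.1): every pair is mixed at lambda = 1/2."""
    N = X.shape[0]
    Z = 0.5 * (X[:, None, :] + X[None, :, :])                        # z_{i,j} = (x_i + x_j) / 2
    probs = softmax(logits_fn(Z.reshape(N * N, -1))).reshape(N, N, -1)
    rows, cols = np.arange(N)[:, None], np.arange(N)[None, :]
    loss_ij = -np.log(probs[rows, cols, y[:, None]]) - np.log(probs[rows, cols, y[None, :]])
    return loss_ij.sum() / (2 * N * N)
```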
We focus on this version of Mixup for the following key
reasons.
Equal Feature Learning. Firstly, we will show that $J_{MM}(g, \mathcal{X})$ exhibits the nice property that its global minimizers correspond to models in which all of the features in the data are learned equally (in a sense to be made precise in Section 3.1).

Pointwise Optimality. We show that for Midpoint Mixup, it is possible to learn a classifier (with equal feature learning) that achieves arbitrarily small loss for every Midpoint-Mixup-augmented point. We will also show that this is not possible for $J_M(g, \mathcal{X}, \mathcal{D}_\lambda)$ when $\mathcal{D}_\lambda$ is any other non-trivial distribution (i.e. non-point-mass distribution).

Cleaner Optimization Analysis. Additionally, from a technical perspective, the Midpoint Mixup loss lends itself to a simpler optimization analysis due to the fact that the structure of its gradients is not changing with each optimization iteration (we do not need to sample new mixing proportions at each optimization step). Indeed, we see that Equation 3.1 circumvents the expectation with respect to $\lambda$ that arose in $J_M(g, \mathcal{X}, \mathcal{D}_\lambda)$.

Empirically Viable. While we are not trying to claim that Midpoint Mixup is a superior alternative to standard Mixup settings considered in the literature, we will show in Section 5 that it can still significantly outperform empirical risk minimization in practice, and in fact performs quite closely to known good settings of Mixup.
3.1. Midpoint Mixup with Linear Models on Linearly Separable Data
To make clear what we mean by feature learning, we first turn our attention to the simple setting of learning linear models $g_y(x) = \langle w_y, x \rangle$ (i.e. one weight vector associated per class) on linearly separable data, as this setting will serve as a foundation for our main results. Namely, we consider $k$-class classification with a dataset $\mathcal{X}$ of $N$ labeled data points generated according to the following data distribution (with $N$ sufficiently large).
Definition 3.1. [Simple Multi-View Setting] For each class $y \in [k]$, let $v_{y,1}, v_{y,2} \in \mathbb{R}^d$ be orthonormal unit vectors also satisfying $v_{y,\ell} \perp v_{s,\ell'}$ when $y \neq s$ for any $\ell, \ell' \in [2]$. Each point $(x, y) \sim \mathcal{D}$ is then generated by sampling $y \in [k]$ uniformly and constructing $x$ as:

$$\beta_y \sim \mathrm{Uni}([0.1, 0.9]), \qquad x = \beta_y v_{y,1} + (1 - \beta_y) v_{y,2}. \tag{3.2}$$
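The sketch below draws a dataset from Definition 3.1, taking the feature vectors to be standard basis vectors so that the orthonormality requirements hold by construction; the helper name and the choice $d \geq 2k$ are ours, made only for illustration.

```python
import numpy as np

def sample_simple_multiview(N, k, d, rng=None):
    """Sample N points from the simple multi-view distribution of Definition 3.1.

    Class y uses the standard basis vectors e_{2y} and e_{2y+1} as its two
    features, which satisfies the orthonormality conditions; requires d >= 2k."""
    assert d >= 2 * k
    rng = np.random.default_rng(rng)
    V = np.eye(d)[: 2 * k].reshape(k, 2, d)       # V[y, l] = v_{y, l+1}
    y = rng.integers(0, k, size=N)
    beta = rng.uniform(0.1, 0.9, size=N)          # beta_y ~ Uni([0.1, 0.9])
    X = beta[:, None] * V[y, 0] + (1 - beta)[:, None] * V[y, 1]
    return X, y, V

# Example: 1000 points, 5 classes, ambient dimension 16.
X, y, V = sample_simple_multiview(N=1000, k=5, d=16, rng=0)
```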
Definition 3.1 is multi-view in the following sense: for any class $y$, it suffices (from an accuracy perspective) to learn a model $g$ that has a significant correlation with either the feature vector $v_{y,1}$ or $v_{y,2}$. In this context, one can think of feature learning as corresponding to how positively correlated the weight $w_y$ is with each of the same-class feature vectors $v_{y,1}$ and $v_{y,2}$ (we provide a more rigorous definition in our main results).
If one now considers the empirical cross-entropy loss $J(g, \mathcal{X})$, it is straightforward to see that it is possible to achieve the global minimum of $J(g, \mathcal{X})$ by just considering models $g$ in which we take $\langle w_y, v_{y,1} \rangle \to \infty$ for every class $y$. This means we can minimize the usual cross-entropy loss without learning both features for each class in $\mathcal{X}$.
However, this is not the case for Midpoint Mixup. Indeed, we show below that a necessary (with extremely high probability) and sufficient condition for a linear model $g$ to minimize $J_{MM}$ (when taking its scaling to $\infty$) is that it has equal correlation with both features for every class (sufficiency relies also on having weaker correlations with other class features). In what follows, we use $\inf J_{MM}(h, \mathcal{X})$ to indicate the global minimum of $J_{MM}$ over all functions $h: \mathbb{R}^d \to \mathbb{R}^k$ (i.e. this is the smallest achievable loss). Full proofs of all of the following results can be found in Section C of the Appendix.
Lemma 3.2. [Midpoint Mixup Optimal Direction] A linear model $g$ satisfies the following

$$\lim_{\gamma \to \infty} J_{MM}(\gamma g, \mathcal{X}) = \inf J_{MM}(h, \mathcal{X}), \tag{3.3}$$

if $g$ has the property that for every class $y$ we have $\langle w_y, v_{y,\ell_1} \rangle = \langle w_s, v_{s,\ell_2} \rangle > 0$ and $\langle w_y, v_{s,\ell_2} \rangle = 0$ for every $s \neq y$ and $\ell_1, \ell_2 \in [2]$. Furthermore, with probability $1 - \exp(-\Theta(N))$ (over the randomness of $\mathcal{X}$), the condition $\langle w_y, v_{y,\ell_1} \rangle = \langle w_s, v_{s,\ell_2} \rangle$ is necessary for $g$ to satisfy Equation 3.3.
Proof Sketch. The idea is that if $g$ has equal correlation with both features for every class, its predictions will be constant on the original data points due to the fact that the coefficients for both features in each data point are mirrored as per Equation 3.2. With the condition $\langle w_y, v_{s,\ell} \rangle = 0$ (this can be weakened significantly), this implies the softmax output of $g$ on the Midpoint Mixup points will be exactly $1/2$ for each of the classes being mixed in the scaling limit (and 0 for all other classes), which is optimal.
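As a quick numerical illustration of this proof sketch (not part of the paper's formal argument), the snippet below instantiates a linear model with equal within-class correlations and zero cross-class correlations on two points generated as in Equation 3.2, and checks that as the scaling γ grows, the softmax output on their midpoint approaches 1/2 for each of the two mixed classes.

```python
import numpy as np

k, d = 5, 16
V = np.eye(d)[: 2 * k].reshape(k, 2, d)           # V[y, l] = v_{y, l+1}, standard basis features
W = V[:, 0] + V[:, 1]                             # w_y proportional to v_{y,1} + v_{y,2}

rng = np.random.default_rng(0)
beta0, beta1 = rng.uniform(0.1, 0.9, size=2)
x0 = beta0 * V[0, 0] + (1 - beta0) * V[0, 1]      # a class-0 point (Equation 3.2)
x1 = beta1 * V[1, 0] + (1 - beta1) * V[1, 1]      # a class-1 point
z = 0.5 * (x0 + x1)                               # Midpoint Mixup point

for gamma in [1.0, 10.0, 100.0]:
    logits = gamma * (W @ z)                      # g_y(z) = gamma * <w_y, z>
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    print(gamma, probs[0], probs[1])              # both tend to 1/2 as gamma grows
```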
Note that Lemma 3.2 implies two properties mentioned earlier for Midpoint Mixup: Equal Feature Learning and Pointwise Optimality. Furthermore, we can also show that if we consider $J_M(g, \mathcal{X}, \mathcal{D}_\lambda)$ for any other non-point-mass distribution, the analogue of Lemma 3.2 does not hold true (because Pointwise Optimality would be impossible).
Proposition 3.3. For any distribution $\mathcal{D}_\lambda$ that is not a point mass on $0$, $1$, or $1/2$, and any linear model $g$ satisfying the conditions of Lemma 3.2, we have that with probability $1 - \exp(-\Theta(N))$ (over the randomness of $\mathcal{X}$) there exists an $\epsilon_0 > 0$ depending only on $\mathcal{D}_\lambda$ such that:

$$J_M(g, \mathcal{X}, \mathcal{D}_\lambda) \geq \inf J_M(h, \mathcal{X}, \mathcal{D}_\lambda) + \epsilon_0. \tag{3.4}$$
Proof Sketch. In the case of general mixing distributions, we cannot achieve the Mixup-optimal behavior of $\phi_{y_i}(g(z_{i,j}(\lambda))) = \lambda$ for every $\lambda$ if the outputs $g_y$ are constant on the original data points.
Lemma 3.2 outlines the key theoretical benefit of Midpoint Mixup - namely that its global minimizers exist within the class of models that we consider, and such minimizers learn all features in the data equally. And although Lemma 3.2 is stated in the context of linear models, the result naturally carries through to when we consider two-layer neural networks of the type we define in the next section. That being said, the interpretation of Proposition 3.3 is not intended to disqualify the possibility that the minimizer of $J_M(g, \mathcal{X}, \mathcal{D}_\lambda)$ when restricted to a specific model class is a model in which all features are learned near-equally (in fact, we expect this to be the case for any reasonable $\mathcal{D}_\lambda$). Proposition 3.3 is instead intended to motivate the study of Midpoint Mixup as a particularly interesting choice of the mixing distribution $\mathcal{D}_\lambda$.
We now proceed one step further from the above results and show that the feature learning benefit of Midpoint Mixup manifests itself even in the optimization process (when using gradient-based methods). We show that, if significant separation between feature correlations exists, the Midpoint Mixup gradients correct the separation. For simplicity, we suppose WLOG that $\langle w_y, v_{y,1} \rangle > \langle w_y, v_{y,2} \rangle$. Now letting $\Delta_y = \langle w_y, v_{y,1} - v_{y,2} \rangle$ and using the notation $\nabla_{w_y}$ for the gradient with respect to $w_y$, we can prove:
Proposition 3.4. [Mixup Gradient Lower Bound] Let $y$ be any class such that $\Delta_y \geq \log k$, and suppose that both $\langle w_y, v_{y,\ell} \rangle \geq 0$ and the cross-class orthogonality condition $\langle w_s, v_{u,\ell} \rangle = 0$ hold for all $s \neq u$ and $\ell \in [2]$. Then we have with high probability that:

$$\left\langle -\nabla_{w_y} J_{MM}(g, \mathcal{X}),\, v_{y,2} \right\rangle \geq \Theta\!\left(\frac{1}{k^2}\right). \tag{3.5}$$
Proof Sketch. The key idea is to analyze the gradient correlation with the direction $v_{y,1} - v_{y,2}$ via a concentration of measure argument. We show that either this correlation is significantly negative under the stated conditions (which will imply Equation 3.5), or that the gradient correlation with $v_{y,2}$ is already large.
Proposition 3.4 shows that, assuming nonnegativity of within-class correlations and an orthogonality condition across classes (which we will show to be approximately true in our main results), the feature correlation that is lagging behind for any class $y$ will receive a significant gradient when optimizing the Midpoint Mixup loss. On the other hand, we can also prove that this need not be the case for empirical risk minimization:
Proposition 3.5. [ERM Gradient Upper Bound] For every $y \in [k]$, assuming the same conditions as in Proposition 3.4, if $\Delta_y \geq C \log k$ for any $C > 0$ then with high probability we have that:

$$\left\langle -\nabla_{w_y} J(g, \mathcal{X}),\, v_{y,2} \right\rangle \leq O\!\left(\frac{1}{k^{0.1C - 1}}\right). \tag{3.6}$$
Proof Sketch. This follows directly from the form of the gradient for $J(g, \mathcal{X})$ and the fact that there is a constant lower bound on the weight associated with each feature in every data point, as per Definition 3.1.
While Proposition 3.5 demonstrates that training using ERM can possibly fail to learn both features associated with a class due to increasingly small gradients, one can verify that this does not naturally occur in the optimization dynamics of linear models on linearly separable data of the type in Definition 3.1 (see, for example, the related result in Chidambaram et al. (2021)). On the other hand, if we move away from linearly separable data and linear models to more realistic settings, the situation described above does indeed show up, which motivates our main results.
4. Analyzing Midpoint Mixup Training Dynamics on General Multi-View Data
For our main results, we now consider a data distribution
and class of models that are meant to more closely mimic
practical situations.
4.1. General Multi-View Data Setup
We adopt a slightly simplified version of the setting of Allen-Zhu & Li (2021). We still consider the problem of $k$-class classification on a dataset $\mathcal{X}$ of $N$ labeled data points, but our data points are now represented as ordered tuples $x = (x^{(1)}, \ldots, x^{(P)})$ of $P$ input patches $x^{(i)}$ with each $x^{(i)} \in \mathbb{R}^d$ (so $\mathcal{X} \subset \mathbb{R}^{Pd} \times [k]$).
As was the case in Definition 3.1 and in Allen-Zhu & Li (2021), we assume that the data is multi-view in that each class $y$ is associated with 2 orthonormal feature vectors $v_{y,1}$ and $v_{y,2}$, and we once again consider $N$ and $k$ to be sufficiently large. As mentioned in Allen-Zhu & Li (2021), we could alternatively consider the number of classes $k$ to be fixed (i.e. binary classification) and the number of associated features to be large, and our theory would still translate. We now precisely define the data generating distribution $\mathcal{D}$ that we will focus on for the remainder of the paper.
Definition 4.1. [General Multi-View Data Distribution] Identically to Definition 3.1, each class $y$ is associated with two orthonormal feature vectors, after which each point $(x, y) \sim \mathcal{D}$ is generated as:

1. Sample a label $y$ uniformly from $[k]$.

2. Designate via any method two disjoint subsets $P_{y,1}(x), P_{y,2}(x) \subset [P]$ with $|P_{y,1}(x)| = |P_{y,2}(x)| = C_P$ for a universal constant $C_P$, and additionally choose via any method a bijection $\varphi: P_{y,1}(x) \to P_{y,2}(x)$. We then generate the signal patches of $x$ in corresponding pairs $x^{(p)} = \beta_{y,p} v_{y,1}$ and $x^{(\varphi(p))} = (\delta_2 - \beta_{y,p}) v_{y,2} = \beta_{y,\varphi(p)} v_{y,2}$ for every $p \in P_{y,1}(x)$, with the $\beta_{y,p}$ chosen according to a symmetric distribution (allowed to vary per class $y$) supported on $[\delta_1, \delta_2 - \delta_1]$ satisfying the anti-concentration property that $\beta_{y,p}$ takes values in a subset of its support whose Lebesgue measure is $O(1/\log k)$ with probability $o(1)$.¹

3. Fix, via any method, $Q$ distinct classes $s_1, s_2, \ldots, s_Q \in [k] \setminus \{y\}$ with $Q = \Theta(1)$. The remaining $[P] \setminus (P_{y,1}(x) \cup P_{y,2}(x))$ patches not considered above are the feature noise patches of $x$, and are defined to be $x^{(p)} = \sum_{j \in [Q]} \sum_{\ell \in [2]} \gamma_{j,\ell} v_{s_j,\ell}$, where the $\gamma_{j,\ell} \in [\delta_3, \delta_4]$ can be arbitrary.

¹This assumption is true for any distribution with reasonable variance; for example, the uniform distribution.
Note that there are parts of the data-generating process that we leave underspecified, as our results will work for any choice. Henceforth, we use $\mathcal{X}$ to refer to a dataset consisting of $N$ i.i.d. draws from the distribution $\mathcal{D}$. Our data distribution represents a very low signal-to-noise ratio (SNR) setting in which the true signal for a class exists only in a constant ($2C_P$) number of patches while the rest of the patches contain low-magnitude noise in the form of other class features.
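To make Definition 4.1 concrete, here is one possible instantiation in NumPy. The specific choices below (standard-basis features, uniform β, randomly chosen signal-patch locations, and the default values of δ₁ through δ₄, $C_P$, and $Q$) are our own illustrative assumptions wherever the definition leaves the choice open.

```python
import numpy as np

def sample_general_multiview(N, k, d, P, C_P=2, Q=2,
                             delta=(0.2, 1.0, 0.01, 0.05), rng=None):
    """Sample N points (each a tuple of P patches in R^d) following Definition 4.1.

    delta = (delta_1, delta_2, delta_3, delta_4); features are standard basis
    vectors so all orthonormality requirements hold. Requires d >= 2k, P >= 2 C_P, k > Q."""
    assert d >= 2 * k and P >= 2 * C_P and k > Q
    d1, d2, d3, d4 = delta
    rng = np.random.default_rng(rng)
    V = np.eye(d)[: 2 * k].reshape(k, 2, d)                     # V[y, l] = v_{y, l+1}
    X = np.zeros((N, P, d))
    y = rng.integers(0, k, size=N)
    for n in range(N):
        patches = rng.permutation(P)
        P1, P2 = patches[:C_P], patches[C_P: 2 * C_P]           # signal patches, paired by position
        noise = patches[2 * C_P:]                               # feature noise patches
        beta = rng.uniform(d1, d2 - d1, size=C_P)               # symmetric distribution on [d1, d2 - d1]
        X[n, P1] = beta[:, None] * V[y[n], 0]                   # x^(p) = beta_{y,p} v_{y,1}
        X[n, P2] = (d2 - beta)[:, None] * V[y[n], 1]            # x^(phi(p)) = (d2 - beta_{y,p}) v_{y,2}
        others = rng.choice(np.delete(np.arange(k), y[n]), size=Q, replace=False)
        gamma = rng.uniform(d3, d4, size=(Q, 2))
        X[n, noise] = np.einsum("ql,qld->d", gamma, V[others])  # sum_j sum_l gamma_{j,l} v_{s_j,l}
    return X, y, V
```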
We focus on the case of learning the data distribution $\mathcal{D}$ with the same two-layer CNN-like architecture used in Allen-Zhu & Li (2021). We recall that this architecture relies on the following polynomially-smoothed ReLU activation, which we refer to as $\widehat{\mathrm{ReLU}}$:

$$\widehat{\mathrm{ReLU}}(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ \frac{x^\alpha}{\alpha \rho^{\alpha - 1}} & \text{if } x \in [0, \rho] \\ x - \left(1 - \frac{1}{\alpha}\right)\rho & \text{if } x \geq \rho \end{cases}.$$
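A direct transcription of this activation, with illustrative (assumed) values of α and ρ, is given below; the two non-zero pieces agree at x = ρ, where both equal ρ/α.

```python
import numpy as np

def smoothed_relu(x, alpha=3.0, rho=0.1):
    """Polynomially-smoothed ReLU: 0 for x <= 0, x^alpha / (alpha * rho^(alpha - 1))
    on [0, rho], and x - (1 - 1/alpha) * rho for x >= rho."""
    x = np.asarray(x, dtype=float)
    poly = np.clip(x, 0.0, rho) ** alpha / (alpha * rho ** (alpha - 1))
    linear = x - (1.0 - 1.0 / alpha) * rho
    return np.where(x >= rho, linear, np.where(x > 0.0, poly, 0.0))

# Continuity check at the boundary x = rho: both pieces equal rho / alpha.
print(smoothed_relu([-0.05, 0.05, 0.1, 0.5]), 0.1 / 3.0)
```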
The polynomial part of this activation function will be very useful for us in suppressing the feature noise in $\mathcal{D}$. Our full network architecture, which consists of $m$ hidden neurons, can then be specified as follows.
Definition 4.2. [2-Layer Network] We denote our network by $g: \mathbb{R}^{Pd} \to \mathbb{R}^k$. For each $y \in [k]$, we define $g_y$ as follows:

$$g_y(x) = \sum_{r \in [m]} \sum_{p \in [P]} \widehat{\mathrm{ReLU}}\left(\left\langle w_{y,r}, x^{(p)} \right\rangle\right). \tag{4.1}$$
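A PyTorch sketch of this architecture is given below; the class name, the tensor layout, and the default α and ρ are our own choices, and the initialization follows the $\mathcal{N}(0, \frac{1}{d} I_d)$ scheme described in the next paragraph.

```python
import torch
from torch import nn

class TwoLayerPatchNet(nn.Module):
    """Two-layer patch-wise network of Equation 4.1: g_y(x) sums the smoothed
    ReLU of <w_{y,r}, x^{(p)}> over hidden neurons r in [m] and patches p in [P]."""

    def __init__(self, k, m, d, alpha=3.0, rho=0.1):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, m, d) / d ** 0.5)   # w_{y,r} ~ N(0, I_d / d)
        self.alpha, self.rho = alpha, rho

    def smoothed_relu(self, x):
        # Polynomially-smoothed ReLU, as defined above.
        poly = torch.clamp(x, 0.0, self.rho) ** self.alpha / (self.alpha * self.rho ** (self.alpha - 1))
        linear = x - (1.0 - 1.0 / self.alpha) * self.rho
        return torch.where(x >= self.rho, linear, torch.where(x > 0.0, poly, torch.zeros_like(x)))

    def forward(self, x):
        # x has shape (batch, P, d); returns logits of shape (batch, k).
        inner = torch.einsum("bpd,kmd->bkmp", x, self.W)         # <w_{y,r}, x^{(p)}>
        return self.smoothed_relu(inner).sum(dim=(2, 3))
```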
We will use $w^{(0)}_{y,r}$ to refer to the weights of the network $g$ at initialization (and $w^{(t)}_{y,r}$ after $t$ steps of gradient descent), and similarly $g_t$ to refer to the model after $t$ iterations of gradient descent. We consider the standard choice of Xavier initialization, which, in our setting, corresponds to $w^{(0)}_{y,r} \sim \mathcal{N}(0, \frac{1}{d} I_d)$.
For model training, we focus on full-batch gradient descent with a fixed learning rate of $\eta$ applied to $J(g, \mathcal{X})$ and $J_{MM}(g, \mathcal{X})$. Once again using the notation $\nabla_{w^{(t)}_{y,r}}$ for the gradient with respect to $w^{(t)}_{y,r}$, the updates to the weights of the network $g$ are thus of the form:

$$w^{(t+1)}_{y,r} = w^{(t)}_{y,r} - \eta \nabla_{w^{(t)}_{y,r}} J_{MM}(g, \mathcal{X}). \tag{4.2}$$
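A sketch of one full-batch gradient descent step on the Midpoint Mixup loss, matching Equation 4.2, is given below in PyTorch; it assumes a model such as the `TwoLayerPatchNet` sketched earlier, uses autograd in place of the paper's explicit gradient computations, and mixes all N² pairs, which is intended for small illustrative datasets only.

```python
import torch
import torch.nn.functional as F

def midpoint_mixup_loss(model, X, y):
    """Equation 3.1 for patch data X of shape (N, P, d) and integer labels y (LongTensor)."""
    N = X.shape[0]
    Z = 0.5 * (X[:, None] + X[None, :])                          # (N, N, P, d) midpoints
    logp = F.log_softmax(model(Z.reshape(N * N, *X.shape[1:])), dim=-1).reshape(N, N, -1)
    rows, cols = torch.arange(N)[:, None], torch.arange(N)[None, :]
    return -(logp[rows, cols, y[:, None]] + logp[rows, cols, y[None, :]]).sum() / (2 * N * N)

def gradient_descent_step(model, X, y, eta=0.1):
    """One step of w^{(t+1)} = w^{(t)} - eta * grad J_MM (Equation 4.2)."""
    loss = midpoint_mixup_loss(model, X, y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for w in model.parameters():
            w -= eta * w.grad
    return loss.item()
```

Replacing `midpoint_mixup_loss` with the empirical cross-entropy of Equation 2.1 gives the corresponding ERM baseline used in the comparison.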