Nonlinear Sufficient Dimension Reduction with a
Stochastic Neural Network
Siqi Liang
Purdue University
West Lafayette, IN 47906
liang257@purdue.edu
Yan Sun
Purdue University
West Lafayette, IN 47907
sun748@purdue.edu
Faming Liang
Purdue University
West Lafayette, IN 47907
fmliang@purdue.edu
Abstract
Sufficient dimension reduction is a powerful tool to extract core information hidden in the high-dimensional data and has potentially many important applications in machine learning tasks. However, the existing nonlinear sufficient dimension reduction methods often lack the scalability necessary for dealing with large-scale data. We propose a new type of stochastic neural network under a rigorous probabilistic framework and show that it can be used for sufficient dimension reduction for large-scale data. The proposed stochastic neural network is trained using an adaptive stochastic gradient Markov chain Monte Carlo algorithm, whose convergence is rigorously studied in the paper as well. Through extensive experiments on real-world classification and regression problems, we show that the proposed method compares favorably with the existing state-of-the-art sufficient dimension reduction methods and is computationally more efficient for large-scale data.
1 Introduction
As a supervised method, sufficient dimension reduction (SDR) aims to project the data onto a lower
dimensional space so that the output is conditionally independent of the input features given the
projected features. Mathematically, the problem of SDR can be described as follows. Let
$Y \in \mathbb{R}^d$ be the response variables, and let $X = (X_1, \ldots, X_p)^T \in \mathbb{R}^p$ be the explanatory variables of dimension $p$. The goal of SDR is to find a lower-dimensional representation $Z \in \mathbb{R}^q$, as a function of $X$ for some $q < p$, such that
$$P(Y|X) = P(Y|Z), \quad \mbox{or equivalently} \quad Y \perp\!\!\!\perp X \mid Z, \qquad (1)$$
where $\perp\!\!\!\perp$ denotes conditional independence. Intuitively, the definition (1) implies that $Z$ has extracted all the information contained in $X$ for predicting $Y$. In the literature, SDR has been developed under both linear and nonlinear settings.
Under the linear setting, SDR is to find a few linear combinations of $X$ that are sufficient to describe the conditional distribution of $Y$ given $X$, i.e., finding a projection matrix $B \in \mathbb{R}^{p \times q}$ such that
$$Y \perp\!\!\!\perp X \mid B^T X. \qquad (2)$$
A more general definition for linear SDR based on $\sigma$-fields can be found in [9]. Towards this goal, a variety of inverse regression methods have been proposed, see e.g., sliced inverse regression (SIR)
[30], sliced average variance estimation (SAVE) [8, 10], parametric inverse regression [6], contour regression [29], and directional regression [28]. These methods require strict assumptions on the joint distribution of $(X, Y)$ or the conditional distribution of $X|Y$, which limit their use in practice. To address this issue, some forward regression methods have been developed in the literature, see e.g., principal Hessian directions [31], minimum average variance estimation [51], conditional variance estimation [14], among others. These methods require minimal assumptions on the smoothness of the joint distribution $(X, Y)$, but they do not scale well for big data problems. They can become infeasible quickly as both $p$ and $n$ increase; see [24] for more discussions on this issue.
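To make the inverse regression idea concrete, below is a minimal NumPy sketch of sliced inverse regression (SIR) in the spirit of [30]; the function name, the slicing scheme, and the small ridge term are our own illustrative choices rather than code from any of the cited works.

```python
import numpy as np

def sir_directions(X, y, q, n_slices=10):
    """Minimal sliced inverse regression: estimate q linear SDR directions."""
    n, p = X.shape
    # Standardize X so that the SIR eigenproblem is posed in the identity metric.
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(p))
    Z = np.linalg.solve(L, (X - mu).T).T            # whitened predictors
    # Slice the response and compute within-slice means of Z.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)        # weighted covariance of slice means
    # Leading eigenvectors of M give the directions in the whitened scale.
    eigval, eigvec = np.linalg.eigh(M)
    V = eigvec[:, ::-1][:, :q]
    # Map back to the original scale: the columns of B span the estimated subspace.
    B = np.linalg.solve(L.T, V)
    return B                                        # reduced features: (X - mu) @ B
```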
Under the nonlinear setting, SDR is to find a nonlinear function $f(\cdot)$ such that
$$Y \perp\!\!\!\perp X \mid f(X). \qquad (3)$$
A general theory for nonlinear SDR has been developed in [26]. A common strategy to achieve nonlinear SDR is to apply the kernel trick to the existing linear SDR methods, where the variable $X$ is first mapped to a high-dimensional feature space via kernels and then inverse or forward regression methods are performed. This strategy has led to a variety of methods such as kernel sliced inverse regression (KSIR) [49], kernel dimension reduction (KDR) [15, 16], manifold kernel dimension reduction (MKDR) [39], generalized sliced inverse regression (GSIR) [26], generalized sliced average variance estimator (GSAVE) [26], and least squares mutual information estimation (LSMIE) [47]. A drawback shared by these methods is that they require computing the eigenvectors or the inverse of an $n \times n$ matrix; therefore, they lack the scalability necessary for big data problems. Another strategy to achieve nonlinear SDR is to consider the problem under the multi-index model setting. Under this setting, forward regression methods such as those based on the outer product of the gradient [50, 23] have been developed, which often involve the eigen-decomposition of a $p \times p$ matrix and are thus unscalable for high-dimensional problems.
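The $n \times n$ bottleneck can be seen directly: kernel-based methods operate on a centered Gram matrix whose eigendecomposition or inversion costs $O(n^3)$ time and $O(n^2)$ memory. The snippet below is a generic illustration of that cost, not an implementation of KSIR, GSIR, or any other specific method cited above.

```python
import numpy as np

def centered_rbf_gram(X, gamma=1.0):
    """Build and center an n x n RBF Gram matrix -- the object that kernel SDR
    methods must eigendecompose or invert, which limits them to moderate n."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
    return H @ K @ H                                # O(n^2) memory, O(n^3) to factor

# For example, n = 50,000 already needs about 20 GB just to store K in float64,
# before any eigendecomposition is attempted.
```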
Quite recently, some deep learning-based nonlinear SDR methods have been proposed in the literature, see e.g. [24, 3, 33], which are scalable for big data by training the deep neural network (DNN) with a mini-batch strategy. In [24], the authors assume that the dependence of the response variable $Y$ on the predictors $X$ is fully captured by the regression
$$Y = g(B^T X) + \epsilon, \qquad (4)$$
for an unknown function $g(\cdot)$ and a low-rank parameter matrix $B$, and they propose a two-stage approach to estimate $g(\cdot)$ and $B$. They first estimate $g(\cdot)$ by $\tilde{g}(\cdot)$ by fitting the regression $Y = \tilde{g}(X) + \epsilon$ with a DNN and initialize the estimator of $B$ using the outer product gradient (OPG) approach [51], and then refine the estimators of $g(\cdot)$ and $B$ by optimizing them jointly. However, as pointed out by the authors, this method might not be valid unless the estimate of $g(\cdot)$ is consistent, but consistency does not generally hold for fully connected neural networks trained without constraints. Specifically, the universal approximation ability of the DNN can make the latent variable $Z := B^T X$ unidentifiable from the DNN approximator of $g(\cdot)$; or, said differently, $Z$ can be an arbitrary vector when the size of the DNN is tuned to be sufficiently large. A similar issue arises in [3], where the authors propose to learn the latent variable $Z$ by optimizing three DNNs to approximate the distributions $p(Z|X)$, $p(X|Z)$ and $p(Y|Z)$, respectively, under the framework of the variational autoencoder. Again, $Z$ suffers from the identifiability issue due to the universal approximation ability of the DNN. In [33], the authors employ a regular DNN for sufficient dimension reduction, which works only for the case where the distribution of the response variable falls into the exponential family. How to conduct SDR with DNNs for general large-scale data remains an unresolved issue.
We address the above issue by developing a new type of stochastic neural network. The idea can be loosely described as follows. Suppose that we are able to learn a stochastic neural network, which maps $X$ to $Y$ via some stochastic hidden layers and possesses a layer-wise Markovian structure. Let $h$ denote the number of hidden layers, and let $Y_1, Y_2, \ldots, Y_h$ denote the outputs of the respective stochastic hidden layers. By the layer-wise Markovian structure of the stochastic neural network, we can decompose the joint distribution of $(Y, Y_h, Y_{h-1}, \ldots, Y_1)$ conditioned on $X$ as follows:
$$\pi(Y, Y_h, Y_{h-1}, \ldots, Y_1 \mid X) = \pi(Y|Y_h)\,\pi(Y_h|Y_{h-1}) \cdots \pi(Y_1|X), \qquad (5)$$
where each conditional distribution is modeled by a linear or logistic regression (on transformed outputs of the previous layer), while the stochastic neural network still provides a good approximation to the underlying DNN under appropriate conditions on the random noise added to each stochastic layer. The layer-wise Markovian structure implies $Y \perp\!\!\!\perp X \mid Y_h$, and the simple regression structure of $\pi(Y|Y_h)$ successfully gets around the identifiability issue of the latent variable $Z := Y_h$ suffered by some other deep learning-based methods [3, 24]. How to define and learn such a stochastic neural network will be detailed in the paper.
Our contribution in this paper is three-fold: (i) We propose a new type of stochastic neural network (abbreviated as "StoNet" hereafter) for sufficient dimension reduction, for which a layer-wise Markovian structure (5) is imposed on the network in training and the size of the noise added to each hidden layer is calibrated to ensure that the StoNet provides a good approximation to the underlying DNN. (ii) We develop an adaptive stochastic gradient MCMC algorithm for training the StoNet and provide a rigorous study of its convergence under mild conditions. The training algorithm is scalable with respect to big data, and it is of independent interest to statistical computing for problems involving latent variables or missing data. (iii) We formulate the StoNet as a composition of many simple linear/logistic regressions, making its structure more designable and interpretable. The backward imputation and forward parameter updating mechanism embedded in the proposed training algorithm enables the regression subtasks to communicate globally and update locally. As discussed later, these two features enable the StoNet to solve many important scientific problems, beyond sufficient dimension reduction, in a more convenient way than the conventional DNN. The StoNet bridges us from linear models to deep learning.
Other related works. Stochastic neural networks have a long history in machine learning. Famous examples include multilayer generative models [21], restricted Boltzmann machines [22] and deep Boltzmann machines [43]. Recently, some researchers have proposed adding noise to the DNN to improve its fitting and generalization. For example, [44] proposed the dropout method to prevent the DNN from over-fitting by randomly dropping some hidden and visible units during training; [36] proposed adding gradient noise to improve training; [19, 40, 53, 45] proposed using stochastic activation functions, constructed by adding noise, to improve generalization and adversarial robustness; and [54] proposed learning the uncertainty parameters of the stochastic activation functions along with the training of the neural network.
However, none of the existing stochastic neural networks can be used for sufficient dimension reduction. It is known that the multilayer generative models [21], restricted Boltzmann machine [22] and deep Boltzmann machine [43] can be used for dimension reduction, but only in the unsupervised mode. As explained in [44], the dropout method is essentially a stochastic regularization method, where the likelihood function is penalized in network training and thus the hidden layer output of the resulting neural network does not satisfy (3). In [19], the size of the noise added to the activation function is not well calibrated, and it is unclear whether the true log-likelihood function is maximized or not. The same issue applies to [36]: it is unclear whether the true log-likelihood function is maximized by the proposed training procedure. In [40], the neural network was trained by maximizing a lower bound of the log-likelihood function instead of the true log-likelihood function; therefore, its hidden layer output does not satisfy (3). In [53], the random noise added to the output of each hidden unit depends on its gradient; the mutual dependence between the gradients destroys the layer-wise Markovian structure of the neural network and thus the hidden layer output does not satisfy (3). Similarly, in [54], independent noise was added to the output of each hidden unit and, therefore, the hidden layer output satisfies neither (5) nor (3). In [45], the inclusion of a support vector regression (SVR) layer in the stochastic neural network makes the hidden layer outputs mutually dependent, although the observations are mutually independent.
2 StoNet for Sufficient Dimension Reduction
In this section, we first define the StoNet, then justify its validity as a universal learner for the map from $X$ to $Y$ by showing that the StoNet has asymptotically the same loss function as a DNN under appropriate conditions, and further justify its use for sufficient dimension reduction.
2.1 The StoNet
Consider a DNN model with $h$ hidden layers. For the sake of simplicity, we assume that the same activation function $\psi$ is used for each hidden unit. By separating the feeding and activation operators of each hidden unit, we can rewrite the DNN in the following form:
$$\begin{split}
\tilde{Y}_1 &= b_1 + w_1 X, \\
\tilde{Y}_i &= b_i + w_i \Psi(\tilde{Y}_{i-1}), \quad i = 2, 3, \ldots, h, \\
Y &= b_{h+1} + w_{h+1} \Psi(\tilde{Y}_h) + e_{h+1},
\end{split} \qquad (6)$$
where $e_{h+1} \sim N(0, \sigma_{h+1}^2 I_{d_{h+1}})$ is Gaussian random error; $\tilde{Y}_i, b_i \in \mathbb{R}^{d_i}$ for $i = 1, 2, \ldots, h$; $Y, b_{h+1} \in \mathbb{R}^{d_{h+1}}$; $\Psi(\tilde{Y}_{i-1}) = (\psi(\tilde{Y}_{i-1,1}), \psi(\tilde{Y}_{i-1,2}), \ldots, \psi(\tilde{Y}_{i-1,d_{i-1}}))^T$ for $i = 2, 3, \ldots, h+1$, $\psi(\cdot)$ is the activation function, and $\tilde{Y}_{i-1,j}$ is the $j$th element of $\tilde{Y}_{i-1}$; $w_i \in \mathbb{R}^{d_i \times d_{i-1}}$ for $i = 1, 2, \ldots, h+1$, and $d_0 = p$ denotes the dimension of $X$. For simplicity, we consider only the regression problems in (6). By replacing the third equation in (6) with a logit model, the DNN can be trivially extended to the classification problems.
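For concreteness, here is a minimal sketch of the forward pass written in the layer-separated form (6), with the feeding step $b_i + w_i\Psi(\cdot)$ kept distinct from the activation $\Psi$; the function name and the toy layer widths are illustrative choices, not part of the paper.

```python
import numpy as np

def dnn_forward(x, weights, biases, psi=np.tanh):
    """Forward pass of the DNN written as in (6): feeding, then activation.
    weights[i] has shape (d_{i+1}, d_i) and biases[i] has shape (d_{i+1},)."""
    y_tilde = biases[0] + weights[0] @ x            # \tilde{Y}_1 = b_1 + w_1 X
    for b, w in zip(biases[1:], weights[1:]):
        y_tilde = b + w @ psi(y_tilde)              # \tilde{Y}_i = b_i + w_i Psi(\tilde{Y}_{i-1})
    return y_tilde                                  # mean of Y; adding e_{h+1} gives the regression model

# Toy usage with widths p=5 -> 8 -> 4 -> 1 (h = 2 hidden layers):
rng = np.random.default_rng(0)
dims = [5, 8, 4, 1]
weights = [rng.normal(size=(dims[i + 1], dims[i])) * 0.1 for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
x = rng.normal(size=5)
print(dnn_forward(x, weights, biases))
```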
Figure 1: An illustrative plot for the structure of a StoNet with two hidden layers.
The StoNet, as a probabilistic deep learning model, can be constructed by adding auxiliary noise to the $\tilde{Y}_i$'s, $i = 1, 2, \ldots, h$, in (6). Mathematically, the StoNet is given by
$$\begin{split}
Y_1 &= b_1 + w_1 X + e_1, \\
Y_i &= b_i + w_i \Psi(Y_{i-1}) + e_i, \quad i = 2, 3, \ldots, h, \\
Y &= b_{h+1} + w_{h+1} \Psi(Y_h) + e_{h+1},
\end{split} \qquad (7)$$
where $Y_1, Y_2, \ldots, Y_h$ can be viewed as latent variables. Further, we assume that $e_i \sim N(0, \sigma_i^2 I_{d_i})$ for $i = 1, 2, \ldots, h, h+1$. For classification networks, the parameter $\sigma_{h+1}^2$ plays the role of temperature for the binomial or multinomial distribution formed at the output layer, which works together with $\{\sigma_1^2, \ldots, \sigma_h^2\}$ to control the variation of the latent variables $\{Y_1, \ldots, Y_h\}$. Figure 1 depicts the architecture of the StoNet. In words, the StoNet has been formulated as a composition of many simple linear/logistic regressions, which makes its structure more designable and interpretable. Refer to Section 5 for more discussions on this issue.
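As a companion to the deterministic pass above, the following sketch draws one sample from the StoNet (7): the same affine maps, but with Gaussian noise injected at every layer. The latent outputs $Y_1, \ldots, Y_h$ are returned explicitly because $Y_h$ will serve as the SDR representation; the names and noise levels are illustrative choices of our own.

```python
import numpy as np

def stonet_sample(x, weights, biases, sigmas, psi=np.tanh, rng=None):
    """Draw (Y_1, ..., Y_h, Y) from the StoNet (7) for a single input x.
    sigmas[i] is the noise standard deviation sigma_{i+1} of layer i+1."""
    if rng is None:
        rng = np.random.default_rng()
    latents = []
    y = biases[0] + weights[0] @ x + sigmas[0] * rng.standard_normal(biases[0].shape)
    latents.append(y)                                    # Y_1
    for b, w, s in zip(biases[1:-1], weights[1:-1], sigmas[1:-1]):
        y = b + w @ psi(y) + s * rng.standard_normal(b.shape)
        latents.append(y)                                # Y_2, ..., Y_h
    out = biases[-1] + weights[-1] @ psi(y) + sigmas[-1] * rng.standard_normal(biases[-1].shape)
    return latents, out                                  # latents[-1] is the SDR feature Y_h
```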
2.2 The StoNet as an Approximator to a DNN
To show that the StoNet is a valid approximator to a DNN, i.e., asymptotically they have the same loss function, the following conditions are imposed on the model. To indicate their dependence on the training sample size $n$, we rewrite $\sigma_i$ as $\sigma_{n,i}$ for $i = 1, 2, \ldots, h+1$. Let $\theta_i = (w_i, b_i)$, let $\theta = (\theta_1, \theta_2, \cdots, \theta_{h+1})$ denote the parameter vector of the StoNet, let $d_\theta$ denote the dimension of $\theta$, and let $\Theta$ denote the space of $\theta$.
Assumption A1 (i) $\Theta$ is compact, i.e., $\Theta$ is contained in a $d_\theta$-ball centered at 0 with radius $r$; (ii) $E(\log \pi(Y|X, \theta))^2 < \infty$ for any $\theta \in \Theta$; (iii) the activation function $\psi(\cdot)$ is $c_0$-Lipschitz continuous for some constant $c_0$; (iv) the network's depth $h$ and widths $d_i$'s are both allowed to increase with $n$; (v) $\sigma_{n,1} \leq \sigma_{n,2} \leq \cdots \leq \sigma_{n,h+1}$, $\sigma_{n,h+1} = O(1)$, and $d_{h+1}\big(\prod_{i=k+1}^{h} d_i^2\big) d_k \sigma_{n,k}^2 \prec \frac{1}{h}$ for any $k \in \{1, 2, \ldots, h\}$.
Condition (i) is more or less a technical condition. As shown in Lemma S1 (in the supplementary material), the proposed training algorithm for the StoNet ensures the estimates of $\theta$ to be $L_2$-upper bounded. Condition (ii) is the regularity condition for the distribution of $Y$. Condition (iii) can be satisfied by many activation functions such as tanh, sigmoid and ReLU. Condition (v) constrains the size of the noise added to each hidden layer such that the StoNet has asymptotically the same loss function as the DNN when the training sample size becomes large, where the factor $d_{h+1}\big(\prod_{i=k+1}^{h} d_i^2\big) d_k$ is derived in the proof of Theorem 2.1 and can be understood as the amplification factor of the noise $e_k$ at the output layer.
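Condition (v) can be read as a recipe for setting the per-layer noise levels: make $d_{h+1}\big(\prod_{i=k+1}^{h} d_i^2\big) d_k\,\sigma_{n,k}^2$ a small fraction of $1/h$ for every hidden layer $k$. The sketch below computes the amplification factor and one admissible (by no means unique) choice of $\sigma_{n,k}$; the margin constant 0.01 is an arbitrary illustrative value, not one prescribed by the paper.

```python
import numpy as np

def calibrate_sigmas(dims, margin=0.01):
    """dims = [d_1, ..., d_h, d_{h+1}] are the layer widths (excluding the input).
    Returns noise standard deviations sigma_{n,k}, k = 1..h, satisfying
    d_{h+1} * (prod_{i=k+1}^{h} d_i^2) * d_k * sigma_{n,k}^2 <= margin / h."""
    h = len(dims) - 1                        # number of hidden layers
    d_out = dims[-1]
    sigmas = []
    for k in range(1, h + 1):                # hidden layers 1..h
        amp = d_out * np.prod([dims[i - 1] ** 2 for i in range(k + 1, h + 1)]) * dims[k - 1]
        sigmas.append(np.sqrt(margin / (h * amp)))
    return sigmas

# Example: widths d_1=100, d_2=50, d_3(output)=1 give larger noise for deeper layers,
# consistent with the monotonicity sigma_{n,1} <= ... <= sigma_{n,h}.
print(calibrate_sigmas([100, 50, 1]))
```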
Let $L: \Theta \to \mathbb{R}$ denote the loss function of the DNN as defined in (6), which is given by
$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \pi(Y^{(i)} | X^{(i)}, \theta), \qquad (8)$$
where $n$ denotes the training sample size, and $i$ indexes the training samples. Theorem 2.1 shows that the StoNet and the DNN have asymptotically the same training loss function.
Theorem 2.1 Suppose Assumption A1 holds. Then the StoNet (7) and the neural network (6) have asymptotically the same loss function, i.e.,
$$\sup_{\theta \in \Theta} \left| \frac{1}{n} \sum_{i=1}^{n} \log \pi(Y^{(i)}, Y_{\mathrm{mis}}^{(i)} | X^{(i)}, \theta) - \frac{1}{n} \sum_{i=1}^{n} \log \pi(Y^{(i)} | X^{(i)}, \theta) \right| \stackrel{p}{\to} 0, \quad \mbox{as } n \to \infty, \qquad (9)$$
where $Y_{\mathrm{mis}} = (Y_1, Y_2, \ldots, Y_h)$ denotes the collection of all latent variables in the StoNet (7).
Let $Q(\theta) = E(\log \pi(Y|X, \theta))$, where the expectation is taken with respect to the joint distribution $\pi(X, Y)$. By Assumption A1-(i)&(ii) and the law of large numbers,
$$\frac{1}{n} \sum_{i=1}^{n} \log \pi(Y^{(i)} | X^{(i)}, \theta) - Q(\theta) \stackrel{p}{\to} 0 \qquad (10)$$
holds uniformly over $\Theta$. Further, we assume the following condition holds for $Q(\theta)$:
Assumption A2 (i) $Q(\theta)$ is continuous in $\theta$ and uniquely maximized at $\theta^*$; (ii) for any $\epsilon > 0$, $\sup_{\theta \in \Theta \setminus B(\epsilon)} Q(\theta)$ exists, where $B(\epsilon) = \{\theta : \|\theta - \theta^*\| < \epsilon\}$, and $\delta = Q(\theta^*) - \sup_{\theta \in \Theta \setminus B(\epsilon)} Q(\theta) > 0$.
Assumption A2 is more or less a technical assumption. As shown in [38] (see also [18]), for a fully connected DNN, almost all local energy minima are globally optimal if the width of one hidden layer of the DNN is no smaller than the training sample size and the network structure from this layer on is pyramidal. Similarly, [1], [13], [56], and [55] proved that gradient-based algorithms with random initialization can converge to the global optimum provided that the width of the DNN is polynomial in the training sample size. All the existing theory implies that this assumption should not be a practical concern for the StoNet as long as its structure is large enough, possibly over-parameterized, such that the data can be well fitted. Further, we assume that each $\theta^*$ for the DNN is unique up to loss-invariant transformations, such as reordering some hidden units and simultaneously changing the signs of some weights and biases. Such an implicit assumption has often been used in theoretical studies of neural networks; see e.g. [32] and [46] for details.
Theorem 2.2 Suppose Assumptions A1 and A2 hold, and $\pi(Y, Y_{\mathrm{mis}}|X, \theta)$ is continuous in $\theta$. Let $\hat{\theta}_n = \arg\max_{\theta \in \Theta} \big\{\frac{1}{n} \sum_{i=1}^{n} \log \pi(Y^{(i)}, Y_{\mathrm{mis}}^{(i)} | X^{(i)}, \theta)\big\}$. Then $\|\hat{\theta}_n - \theta^*\| \stackrel{p}{\to} 0$ as $n \to \infty$.

This theorem implies that the DNN (6) can be trained by training the StoNet (7), which are asymptotically equivalent as the sample size $n$ becomes large. Refer to the supplement for its proof.
2.3 Nonlinear Sufficient Dimension Reduction via StoNet
The joint distribution $\pi(Y, Y_{\mathrm{mis}}|X, \theta)$ for the StoNet can be factored as
$$\pi(Y, Y_{\mathrm{mis}}|X, \theta) = \pi(Y_1|X, \theta_1) \Big[\prod_{i=2}^{h} \pi(Y_i|Y_{i-1}, \theta_i)\Big] \pi(Y|Y_h, \theta_{h+1}), \qquad (11)$$
based on the Markovian structure between layers of the StoNet. Therefore,
$$\pi(Y|Y_{\mathrm{mis}}, X, \theta) = \pi(Y|Y_h, \theta_{h+1}). \qquad (12)$$
By Proposition 2.1 of [27], Equation (12) is equivalent to $Y \perp\!\!\!\perp X \mid Y_h$, which coincides with the definition of nonlinear sufficient dimension reduction in (3). In summary, we have the following proposition:

Proposition 2.1 For a well-trained StoNet for the mapping $X \to Y$, the output of the last hidden layer, $Y_h$, satisfies the SDR condition in (3).

The proof simply follows the above arguments and the properties of the StoNet. Proposition 2.1 implies that the StoNet can be a useful and flexible tool for nonlinear SDR. However, conventional optimization algorithms such as stochastic gradient descent (SGD) are no longer applicable for training the StoNet. In the next section, we propose to train the StoNet using an adaptive stochastic gradient MCMC algorithm. At the end of the paper, we discuss how to determine the dimension of $Y_h$ via regularization at the output layer of the StoNet.
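To make the factorization (11) concrete, the sketch below evaluates the complete-data log-likelihood $\log\pi(Y, Y_{\mathrm{mis}}|X,\theta)$ as a sum of per-layer Gaussian-regression terms and notes that, given $Y_h$, predicting $Y$ involves only the output-layer regression $\pi(Y|Y_h,\theta_{h+1})$. This illustrates the factorization only; imputing $Y_{\mathrm{mis}}$ and updating $\theta$ are handled by the adaptive stochastic gradient MCMC algorithm described in the next section, and the helper names here are our own.

```python
import numpy as np

def gaussian_loglik(y, mean, sigma):
    """log N(y; mean, sigma^2 I), summed over the components of y."""
    return -0.5 * np.sum(((y - mean) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))

def complete_data_loglik(x, latents, y, weights, biases, sigmas, psi=np.tanh):
    """log pi(Y, Y_mis | X, theta) as the sum of per-layer regression terms in (11).
    latents = [Y_1, ..., Y_h]; weights, biases, sigmas each have h+1 entries."""
    ll = gaussian_loglik(latents[0], biases[0] + weights[0] @ x, sigmas[0])              # pi(Y_1 | X)
    for i in range(1, len(latents)):
        ll += gaussian_loglik(latents[i],
                              biases[i] + weights[i] @ psi(latents[i - 1]), sigmas[i])   # pi(Y_i | Y_{i-1})
    ll += gaussian_loglik(y, biases[-1] + weights[-1] @ psi(latents[-1]), sigmas[-1])    # pi(Y | Y_h)
    return ll

# The SDR representation of an input x is simply Y_h; given Y_h, predicting Y
# reduces to the single output-layer regression pi(Y | Y_h, theta_{h+1}).
```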