[30], sliced average variance estimation (SAVE) [8, 10], parametric inverse regression [6], contour regression [29], and directional regression [28]. These methods require strict assumptions on the joint distribution of (X, Y) or the conditional distribution of X|Y, which limit their use in practice. To address this issue, some forward regression methods have been developed in the literature, see, e.g., principal Hessian directions [31], minimum average variance estimation [51], and conditional variance estimation [14], among others. These methods require minimal assumptions on the smoothness of the joint distribution of (X, Y), but they do not scale well for big data problems. They can quickly become infeasible as both p and n increase; see [24] for more discussion of this issue.
Under the nonlinear setting, SDR aims to find a nonlinear function f(·) such that
$$Y \perp\!\!\!\perp X \mid f(X). \qquad (3)$$
A general theory for nonlinear SDR has been developed in [26]. A common strategy for achieving nonlinear SDR is to apply the kernel trick to existing linear SDR methods: the variable X is first mapped to a high-dimensional feature space via kernels, and then inverse or forward regression methods are performed in that space. This strategy has led to a variety of methods such as kernel sliced inverse regression (KSIR) [49], kernel dimension reduction (KDR) [15, 16], manifold kernel dimension reduction (MKDR) [39], generalized sliced inverse regression (GSIR) [26], generalized sliced average variance estimation (GSAVE) [26], and least squares mutual information estimation (LSMIE) [47]. A drawback shared by these methods is that they require computing the eigenvectors or the inverse of an n × n matrix; therefore, they lack the scalability necessary for big data problems. Another strategy for achieving nonlinear SDR is to consider the problem under the multi-index model setting. Under this setting, forward regression methods such as those based on the outer product of the gradient [50, 23] have been developed; they often involve the eigen-decomposition of a p × p matrix and are thus unscalable for high-dimensional problems.
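To make this computational point concrete, a rough sketch of the outer-product-of-gradients idea is given below: the gradient of the regression function is estimated at each sample point by a weighted local linear fit, and the estimated directions are read off from the eigen-decomposition of the resulting p × p matrix. The Gaussian kernel weights, the bandwidth, the ridge term, and the function name opg_directions are illustrative assumptions, not the exact estimators of [50, 23].

```python
# A minimal sketch of the outer-product-of-gradients (OPG) idea; the Gaussian
# kernel weights, bandwidth, and ridge term below are illustrative choices.
import torch

def opg_directions(X, Y, d, bandwidth=1.0):
    n, p = X.shape
    M = torch.zeros(p, p)
    for j in range(n):
        diff = X - X[j]                                          # n x p local differences
        w = torch.exp(-(diff ** 2).sum(dim=1) / (2 * bandwidth ** 2))
        Z = torch.cat([torch.ones(n, 1), diff], dim=1)           # local linear design matrix
        A = Z.t() @ (Z * w.unsqueeze(1)) + 1e-6 * torch.eye(p + 1)
        beta = torch.linalg.solve(A, Z.t() @ (w * Y))            # weighted least-squares fit
        M += torch.outer(beta[1:], beta[1:]) / n                 # average outer product of gradients
    # the estimated directions are the top-d eigenvectors of the p x p matrix M
    return torch.linalg.eigh(M).eigenvectors[:, -d:]
```

In this sketch, the n-fold loop of local fits and the final p × p eigen-decomposition make the cost grow quickly with both n and p, which is the scalability bottleneck noted above.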
Quite recently, some deep learning-based nonlinear SDR methods have been proposed in the literature, see, e.g., [24, 3, 33], which are scalable to big data problems because the deep neural network (DNN) can be trained with a mini-batch strategy. In [24], the authors assume that the dependence of the response variable Y on the predictors X is fully captured by the regression
$$Y = g(B^\top X) + \epsilon, \qquad (4)$$
for an unknown function g(·) and a low-rank parameter matrix B, and they propose a two-stage approach to estimate g(·) and B. They first estimate g(·) by an estimator g̃(·) obtained by fitting the regression Y = g̃(X) + ε with a DNN, and initialize the estimator of B using the outer product of gradients (OPG) approach [51]; they then refine the estimators of g(·) and B by optimizing them jointly.
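To make the two-stage recipe concrete, the sketch below shows one way it could be implemented in PyTorch; the function and class names (fit_unconstrained, opg_init, JointModel), the network widths, and the training hyperparameters are illustrative assumptions rather than the implementation of [24].

```python
# A minimal PyTorch sketch of the two-stage idea described above; all names and
# hyperparameters here are illustrative assumptions, not the implementation of [24].
import torch
import torch.nn as nn

def fit_unconstrained(X, Y, hidden=64, epochs=200, lr=1e-3):
    """Stage 1a: fit Y = g_tilde(X) + eps with an unconstrained DNN."""
    g_tilde = nn.Sequential(nn.Linear(X.shape[1], hidden), nn.ReLU(), nn.Linear(hidden, 1))
    opt = torch.optim.Adam(g_tilde.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((g_tilde(X) - Y.reshape(-1, 1)) ** 2).mean()
        loss.backward()
        opt.step()
    return g_tilde

def opg_init(g_tilde, X, d):
    """Stage 1b: initialize B from the outer product of gradients of g_tilde."""
    X = X.clone().requires_grad_(True)
    grads = torch.autograd.grad(g_tilde(X).sum(), X)[0]   # n x p matrix of gradients
    M = grads.t() @ grads / X.shape[0]                    # average outer product, p x p
    return torch.linalg.eigh(M).eigenvectors[:, -d:].detach()

class JointModel(nn.Module):
    """Stage 2: jointly refine B and g in Y = g(B^T X) + eps."""
    def __init__(self, B_init, hidden=64):
        super().__init__()
        self.B = nn.Parameter(B_init.clone())
        self.g = nn.Sequential(nn.Linear(B_init.shape[1], hidden), nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, X):
        return self.g(X @ self.B)
```

A typical usage would fit g̃ with fit_unconstrained, call opg_init to obtain an initial B, and then train JointModel(B) by minimizing the squared loss over B and g jointly; the consistency caveat discussed next applies regardless of these particular choices.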
However, as pointed out by the authors, this method might not be valid unless the estimate of g(·) is consistent, and such consistency does not generally hold for fully connected neural networks trained without constraints. Specifically, the universal approximation ability of the DNN can make the latent variable Z := BᵀX unidentifiable from the DNN approximator of g(·); said differently, Z can be made essentially arbitrary by taking the size of the DNN to be sufficiently large. A similar issue arises in [3], where the authors propose to learn the latent variable Z by optimizing three DNNs to approximate the distributions p(Z|X), p(X|Z) and p(Y|Z), respectively, under the framework of the variational autoencoder. Again, Z suffers from the identifiability issue due to the universal approximation ability of the DNN. In [33], the authors employ a regular DNN for sufficient dimension reduction, but their method works only when the distribution of the response variable belongs to the exponential family. How to conduct SDR with DNNs for general large-scale data remains an unresolved issue.
We address the above issue by developing a new type of stochastic neural network. The idea can be loosely described as follows. Suppose that we are able to learn a stochastic neural network that maps X to Y via some stochastic hidden layers and possesses a layer-wise Markovian structure. Let h denote the number of hidden layers, and let Y1, Y2, ..., Yh denote the outputs of the respective stochastic hidden layers. By the layer-wise Markovian structure of the stochastic neural network, we can decompose the joint distribution of (Y, Yh, Yh−1, ..., Y1) conditioned on X as
$$\pi(Y, Y_h, Y_{h-1}, \ldots, Y_1 \mid X) = \pi(Y \mid Y_h)\,\pi(Y_h \mid Y_{h-1}) \cdots \pi(Y_1 \mid X), \qquad (5)$$
where each conditional distribution is modeled by a linear or logistic regression (on transformed outputs of the previous layer), while the stochastic neural network still provides a good approximation to the underlying DNN under appropriate conditions on the random noise added to each stochastic layer.
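As a concrete illustration of the factorization in (5), the sketch below samples (Y1, ..., Yh, Y) layer by layer from a stochastic network in which each conditional distribution is a Gaussian linear regression on the (transformed) output of the previous layer; the class name StochasticNet, the tanh activation, and the common noise scale sigma are illustrative assumptions rather than the exact specification developed later in the paper.

```python
# A minimal sketch of a layer-wise Markovian stochastic network realizing (5);
# the tanh activation and the common Gaussian noise scale are illustrative choices.
import torch
import torch.nn as nn

class StochasticNet(nn.Module):
    def __init__(self, p, widths, sigma=0.1):
        super().__init__()
        dims = [p] + list(widths)
        self.hidden = nn.ModuleList([nn.Linear(dims[i], dims[i + 1]) for i in range(len(widths))])
        self.out = nn.Linear(dims[-1], 1)
        self.sigma = sigma

    def forward(self, X):
        """Sample (Y_1, ..., Y_h, Y) one layer at a time, following the factorization in (5)."""
        Ys, cur = [], X
        for i, layer in enumerate(self.hidden):
            # pi(Y_i | Y_{i-1}): a linear regression on the transformed previous output
            mean = layer(cur) if i == 0 else layer(torch.tanh(cur))
            cur = mean + self.sigma * torch.randn_like(mean)
            Ys.append(cur)
        # pi(Y | Y_h): a simple (linear) regression on the transformed last hidden output
        Y = self.out(torch.tanh(cur)) + self.sigma * torch.randn(cur.shape[0], 1)
        return Ys, Y
```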
The layer-wise Markovian structure implies Y ⊥⊥ X | Yh, and the simple regression structure of π(Y|Yh) successfully gets around the identifiability issue of the latent variable Z := Yh that has