[30], sliced average variance estimation (SAVE) [8, 10], parametric inverse regression [6], contour regression [29], and directional regression [28]. These methods require strict assumptions on the joint distribution of (X, Y) or the conditional distribution of X|Y, which limit their use in practice. To address this issue, some forward regression methods have been developed in the literature, see, e.g., principal Hessian directions [31], minimum average variance estimation [51], and conditional variance estimation [14], among others. These methods require minimal assumptions on the smoothness of the joint distribution of (X, Y), but they do not scale well for big data problems. They can quickly become infeasible as both p and n increase; see [24] for more discussion of this issue.
Under the nonlinear setting, SDR aims to find a nonlinear function f(·) such that
$$Y \perp\!\!\!\perp X \mid f(X). \qquad (3)$$
A general theory for nonlinear SDR has been developed in [26]. A common strategy for achieving nonlinear SDR is to apply the kernel trick to existing linear SDR methods: the variable X is first mapped to a high-dimensional feature space via kernels, and then inverse or forward regression methods are performed in that space. This strategy has led to a variety of methods such as kernel sliced inverse regression (KSIR) [49], kernel dimension reduction (KDR) [15, 16], manifold kernel dimension reduction (MKDR) [39], generalized sliced inverse regression (GSIR) [26], generalized sliced average variance estimation (GSAVE) [26], and least squares mutual information estimation (LSMIE) [47]. A drawback shared by these methods is that they require computing the eigenvectors or the inverse of an n × n matrix; therefore, they lack the scalability necessary for big data problems. Another strategy for achieving nonlinear SDR is to consider the problem under the multi-index model setting. Under this setting, forward regression methods such as those based on the outer product of the gradient [50, 23] have been developed; they often involve the eigen-decomposition of a p × p matrix and are thus unscalable for high-dimensional problems.
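To make this computational point concrete, a rough sketch of the outer-product-of-gradients idea is given below: the gradient of the regression function is estimated at each sample point by a weighted local linear fit, and the estimated directions are read off from the eigen-decomposition of the resulting p × p matrix. The Gaussian kernel weights, the bandwidth, the ridge term, and the function name opg_directions are illustrative assumptions, not the exact estimators of [50, 23].

```python
# A minimal sketch of the outer-product-of-gradients (OPG) idea; the Gaussian
# kernel weights, bandwidth, and ridge term below are illustrative choices.
import torch

def opg_directions(X, Y, d, bandwidth=1.0):
    n, p = X.shape
    M = torch.zeros(p, p)
    for j in range(n):
        diff = X - X[j]                                          # n x p local differences
        w = torch.exp(-(diff ** 2).sum(dim=1) / (2 * bandwidth ** 2))
        Z = torch.cat([torch.ones(n, 1), diff], dim=1)           # local linear design matrix
        A = Z.t() @ (Z * w.unsqueeze(1)) + 1e-6 * torch.eye(p + 1)
        beta = torch.linalg.solve(A, Z.t() @ (w * Y))            # weighted least-squares fit
        M += torch.outer(beta[1:], beta[1:]) / n                 # average outer product of gradients
    # the estimated directions are the top-d eigenvectors of the p x p matrix M
    return torch.linalg.eigh(M).eigenvectors[:, -d:]
```

In this sketch, the n-fold loop of local fits and the final p × p eigen-decomposition make the cost grow quickly with both n and p, which is the scalability bottleneck noted above.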
Quite recently, some deep learning-based nonlinear SDR methods have been proposed in the literature, see, e.g., [24, 3, 33], which are scalable to big data problems because the deep neural network (DNN) can be trained with a mini-batch strategy. In [24], the authors assume that the dependence of the response variable Y on the predictors X is fully captured by the regression
$$Y = g(B^\top X) + \epsilon, \qquad (4)$$
for an unknown function g(·) and a low-rank parameter matrix B, and they propose a two-stage approach to estimate g(·) and B. They first estimate g(·) by an estimator g̃(·) obtained by fitting the regression Y = g̃(X) + ε with a DNN, and initialize the estimator of B using the outer product of gradients (OPG) approach [51]; they then refine the estimators of g(·) and B by optimizing them jointly.
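To make the two-stage recipe concrete, the sketch below shows one way it could be implemented in PyTorch; the function and class names (fit_unconstrained, opg_init, JointModel), the network widths, and the training hyperparameters are illustrative assumptions rather than the implementation of [24].

```python
# A minimal PyTorch sketch of the two-stage idea described above; all names and
# hyperparameters here are illustrative assumptions, not the implementation of [24].
import torch
import torch.nn as nn

def fit_unconstrained(X, Y, hidden=64, epochs=200, lr=1e-3):
    """Stage 1a: fit Y = g_tilde(X) + eps with an unconstrained DNN."""
    g_tilde = nn.Sequential(nn.Linear(X.shape[1], hidden), nn.ReLU(), nn.Linear(hidden, 1))
    opt = torch.optim.Adam(g_tilde.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((g_tilde(X) - Y.reshape(-1, 1)) ** 2).mean()
        loss.backward()
        opt.step()
    return g_tilde

def opg_init(g_tilde, X, d):
    """Stage 1b: initialize B from the outer product of gradients of g_tilde."""
    X = X.clone().requires_grad_(True)
    grads = torch.autograd.grad(g_tilde(X).sum(), X)[0]   # n x p matrix of gradients
    M = grads.t() @ grads / X.shape[0]                    # average outer product, p x p
    return torch.linalg.eigh(M).eigenvectors[:, -d:].detach()

class JointModel(nn.Module):
    """Stage 2: jointly refine B and g in Y = g(B^T X) + eps."""
    def __init__(self, B_init, hidden=64):
        super().__init__()
        self.B = nn.Parameter(B_init.clone())
        self.g = nn.Sequential(nn.Linear(B_init.shape[1], hidden), nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, X):
        return self.g(X @ self.B)
```

A typical usage would fit g̃ with fit_unconstrained, call opg_init to obtain an initial B, and then train JointModel(B) by minimizing the squared loss over B and g jointly; the consistency caveat discussed next applies regardless of these particular choices.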
However, as pointed out by the authors, this method might not be valid unless the estimate of g(·) is consistent, and such consistency does not generally hold for fully connected neural networks trained without constraints. Specifically, the universal approximation ability of the DNN can make the latent variable Z := BᵀX unidentifiable from the DNN approximator of g(·); said differently, Z can be made essentially arbitrary by taking the size of the DNN to be sufficiently large. A similar issue arises in [3], where the authors propose to learn the latent variable Z by optimizing three DNNs to approximate the distributions p(Z|X), p(X|Z) and p(Y|Z), respectively, under the framework of the variational autoencoder. Again, Z suffers from the identifiability issue due to the universal approximation ability of the DNN. In [33], the authors employ a regular DNN for sufficient dimension reduction, but their method works only when the distribution of the response variable belongs to the exponential family. How to conduct SDR with DNNs for general large-scale data remains an unresolved issue.
We address the above issue by developing a new type of stochastic neural network. The idea can be loosely described as follows. Suppose that we are able to learn a stochastic neural network that maps X to Y via some stochastic hidden layers and possesses a layer-wise Markovian structure. Let h denote the number of hidden layers, and let Y1, Y2, ..., Yh denote the outputs of the respective stochastic hidden layers. By the layer-wise Markovian structure of the stochastic neural network, we can decompose the joint distribution of (Y, Yh, Yh−1, ..., Y1) conditioned on X as
$$\pi(Y, Y_h, Y_{h-1}, \ldots, Y_1 \mid X) = \pi(Y \mid Y_h)\,\pi(Y_h \mid Y_{h-1}) \cdots \pi(Y_1 \mid X), \qquad (5)$$
where each conditional distribution is modeled by a linear or logistic regression (on transformed outputs of the previous layer), while the stochastic neural network still provides a good approximation to the underlying DNN under appropriate conditions on the random noise added to each stochastic layer.
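As a concrete illustration of the factorization in (5), the sketch below samples (Y1, ..., Yh, Y) layer by layer from a stochastic network in which each conditional distribution is a Gaussian linear regression on the (transformed) output of the previous layer; the class name StochasticNet, the tanh activation, and the common noise scale sigma are illustrative assumptions rather than the exact specification developed later in the paper.

```python
# A minimal sketch of a layer-wise Markovian stochastic network realizing (5);
# the tanh activation and the common Gaussian noise scale are illustrative choices.
import torch
import torch.nn as nn

class StochasticNet(nn.Module):
    def __init__(self, p, widths, sigma=0.1):
        super().__init__()
        dims = [p] + list(widths)
        self.hidden = nn.ModuleList([nn.Linear(dims[i], dims[i + 1]) for i in range(len(widths))])
        self.out = nn.Linear(dims[-1], 1)
        self.sigma = sigma

    def forward(self, X):
        """Sample (Y_1, ..., Y_h, Y) one layer at a time, following the factorization in (5)."""
        Ys, cur = [], X
        for i, layer in enumerate(self.hidden):
            # pi(Y_i | Y_{i-1}): a linear regression on the transformed previous output
            mean = layer(cur) if i == 0 else layer(torch.tanh(cur))
            cur = mean + self.sigma * torch.randn_like(mean)
            Ys.append(cur)
        # pi(Y | Y_h): a simple (linear) regression on the transformed last hidden output
        Y = self.out(torch.tanh(cur)) + self.sigma * torch.randn(cur.shape[0], 1)
        return Ys, Y
```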
The layer-wise Markovian structure implies Y ⊥⊥ X | Yh, and the simple regression structure of π(Y|Yh) successfully gets around the identifiability issue of the latent variable Z := Yh that has