Sparsity in Continuous-Depth Neural Networks
Hananeh Aliee
Helmholtz Munich
Till Richter
Helmholtz Munich
Mikhail Solonin
Technical University of Munich
Ignacio Ibarra
Helmholtz Munich
Fabian Theis
Technical University of Munich
Helmholtz Munich
Niki Kilbertus
Technical University of Munich
Helmholtz AI, Munich
{hananeh.aliee,till.richter,ignacio.ibarra,fabian.theis,niki.kilbertus}@helmholtz-muenchen.de
Abstract
Neural Ordinary Differential Equations (NODEs) have proven successful in learning dynamical systems in terms of accurately recovering the observed trajectories. While different types of sparsity have been proposed to improve robustness, the generalization properties of NODEs for dynamical systems beyond the observed data are underexplored. We systematically study the influence of weight and feature sparsity on forecasting as well as on identifying the underlying dynamical laws. Besides assessing existing methods, we propose a regularization technique to sparsify "input-output connections" and extract relevant features during training. Moreover, we curate real-world datasets consisting of human motion capture and human hematopoiesis single-cell RNA-seq data to realistically analyze different levels of out-of-distribution (OOD) generalization in forecasting and dynamics identification, respectively. Our extensive empirical evaluation on these challenging benchmarks suggests that weight sparsity improves generalization in the presence of noise or irregular sampling. However, it does not prevent learning spurious feature dependencies in the inferred dynamics, rendering them impractical for predictions under interventions, or for inferring the true underlying dynamics. Instead, feature sparsity can indeed help with recovering sparse ground-truth dynamics compared to unregularized NODEs.
1 Introduction
Extreme over-parameterization has been shown to be part of the success story of deep neural networks [52, 5, 4]. This has been linked to evidence that over-parameterized models are easier to train, perhaps because they develop "convexity-like" properties that help convergence of gradient descent [10, 14]. However, over-parameterization comes at the cost of an additional computational footprint during training and inference [13]. Therefore, the motivations for sparse neural networks are manifold and include: (i) imitation of human learning, where neuron activity is typically sparse [2], (ii) computational efficiency (speed up, memory reduction, scalability) [20], (iii) interpretability [51], as well as (iv) avoiding overfitting or improving robustness [6].
Enforcing sparsity in model weights, also called model pruning, and its effects on standard predictive modelling tasks have been extensively studied in the literature [20]. However, sparsity in continuous-depth neural nets for modeling dynamical systems and its effects on generalization properties are still underexplored.

Work done while at TUM. MS is currently employed by J.P. Morgan Chase & Co.; mikhail.solonin@jpmorgan.com

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.14672v1 [cs.LG] 26 Oct 2022
Neural Ordinary Differential Equations (NODEs) [11] have been introduced as the limit of taking the number of layers in residual neural networks to infinity, resulting in continuous-depth (or continuous-time) neural nets. NODEs were originally predominantly used for predictive tasks such as classification or regression, where they learn some dynamics serving the predictive task. These networks have also shown great promise for modeling noisy and irregular time-series. More recently, it has been argued that NODEs can also be thought of as learning the underlying dynamics (i.e., the ODE) that actually governs the evolution of the observed trajectory. In the former scenario (predictive performance), a large body of literature has focused on regularizing the number of model evaluations required for a single step of the ODE solver to improve efficiency [16, 22, 17, 39, 18, 38]. These regularization techniques rely on learning one out of many possible equivalent dynamical systems that give rise to the same ultimate performance but are easier and faster to solve. Crucially, these regularizers do not explicitly target sparsity in the model weights or in the features used by the model.
Figure 1: Model vs. feature sparsity. Top (weight sparsity): individual weights of an arbitrary architecture mapping inputs X0–X3 to outputs X0–X3 are pruned. Bottom (feature sparsity): entire inputs among X0–X3 that a given output (here X0) relies on are pruned.
In the latter scenario (inferring dynamical laws), we assume the existence of a ground-truth ODE that governs the dynamics of the observed trajectories. Here, two types of sparsity may be of interest. First, motivated by standard network pruning, sparsity in model weights (weight sparsity or model sparsity) can reduce computational requirements for inference [28], see Figure 1 (top). Second, it has been argued that feature sparsity, i.e., reducing the number of inputs a given output of the NODE relies on, can improve identifiability of the underlying dynamical law [3, 7, 30], see Figure 1 (bottom). The motivation for feature sparsity often relies on an argument from causality, where it is common to assume sparse and modular causal dependencies between variables [45, 54].
In this paper, we empirically assess the impact of various sparsity-enforcing methods on the performance of NODEs under different types of OOD generalization, both for prediction and for inferring dynamical laws on real-world datasets. While most existing methods target weight sparsity, we highlight that feature sparsity can improve interpretability and dynamical system inference by proposing a new regularization technique for continuous-time models. Specifically, our main contributions are the following:
• We propose PathReg², a differentiable L0-based regularizer that enforces sparsity of "input-output paths" and leads to both feature and weight sparsity.
• We extend L0 regularization [31] and LassoNet [26] to NODEs and perform extensive comparisons with existing models including vanilla NODE [11], C-NODE [3], GroupLasso for NODEs [7], and our PathReg.
• We define a useful metric for evaluating feature sparsity, demonstrate the differences between weight and feature sparsity, and explore whether and when one implies the other.
• To cover at least part of the immense diversity of time-series problems, each of which implies different types of OOD challenges, we curate large, real-world datasets consisting of human motion capture (mocap.cs.cmu.edu) as well as human hematopoiesis single-cell RNA-seq [32] data for our empirical evaluations.

² The Python implementation is available at: https://github.com/theislab/PathReg
2 Background
2.1 Continuous-depth neural nets
Among the plethora of deep learning based methods to estimate dynamical systems from data [29, 44, 27, 38], we focus on Neural Ordinary Differential Equations (NODEs). In NODEs, a neural network with parameters $\theta$ is used to learn a function $f_\theta \approx f$ from data, where $f$ defines a (first order) ODE in its explicit representation $\dot{X} = f(X, t)$ [11]. Starting from the initial observation $X(a)$ at some time $t = a$, an explicit iterative ODE solver is deployed to predict $X(t)$ for $t \in (a, b]$ using the current derivative estimates from $f_\theta$. The parameters $\theta$ are then updated via backpropagation on the mean squared error (MSE) between predictions and observations. As discussed extensively in the literature, NODEs can outperform traditional ODE parameter inference techniques in terms of reconstruction error, especially for non-linear dynamics $f$ [11, 15]. In particular, one advantage of NODEs over previous methods for inferring non-linear dynamics such as SINDy [9] is that no dictionary of non-linear basis functions has to be pre-specified.
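To make this training loop concrete, the following is a minimal sketch of NODE fitting in PyTorch with the torchdiffeq package (which the public NODE implementation builds on); the architecture, optimizer, and hyperparameters are illustrative choices of ours, not the configuration used in our experiments.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # differentiable ODE solvers


class ODEFunc(nn.Module):
    """Neural network f_theta estimating the time derivative dX/dt."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim)
        )

    def forward(self, t, x):
        # f_theta(X, t); here the dynamics are modeled as autonomous in t
        return self.net(x)


def train_node(x_obs, t_obs, dim, epochs=500, lr=1e-3):
    """Fit f_theta so that the solved trajectory matches observations via MSE.

    x_obs: tensor of shape (len(t_obs), dim), observed trajectory
    t_obs: 1-D tensor of observation times, with t_obs[0] = a
    """
    func = ODEFunc(dim)
    opt = torch.optim.Adam(func.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # Predict X(t) for all t, starting from the initial observation X(a)
        x_pred = odeint(func, x_obs[0], t_obs)
        loss = torch.mean((x_pred - x_obs) ** 2)  # MSE between predictions and observations
        loss.backward()                           # backpropagate through the solver
        opt.step()
    return func
```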
A variant of NODEs for second order systems called SONODE exploits the common reduction of higher-order ODEs to first order systems [37]. The second required initial condition, the initial velocity $\dot{X}(a)$, is simply learned from $X(a)$ via another neural network in an end-to-end fashion. Our experiments build on the public NODE and SONODE implementations. However, our framework is readily applicable to most other continuous-depth neural nets including Augmented NODE [15], latent NODEs [44], and neural stochastic DEs [27, 48].
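A sketch of the underlying reduction, with the same caveat that module names and architectures are illustrative assumptions of ours: the state is augmented with a velocity component, and the initial velocity is produced from $X(a)$ by an auxiliary network.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint


class SecondOrderFunc(nn.Module):
    """Reduce a second-order ODE  d^2X/dt^2 = a_theta(X, V)  to a first-order
    system on the augmented state (X, V) with V = dX/dt."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.accel = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, dim)
        )

    def forward(self, t, state):
        x, v = state.chunk(2, dim=-1)
        dx = v                                     # dX/dt = V
        dv = self.accel(torch.cat([x, v], dim=-1)) # dV/dt = a_theta(X, V)
        return torch.cat([dx, dv], dim=-1)


class InitialVelocity(nn.Module):
    """Learn the second initial condition V(a) from X(a) end-to-end."""

    def __init__(self, dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, x0):
        return torch.cat([x0, self.net(x0)], dim=-1)  # augmented initial state (X(a), V(a))


# usage: traj = odeint(SecondOrderFunc(d), InitialVelocity(d)(x0), t)
#        x_pred, v_pred = traj.chunk(2, dim=-1)
```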
2.2 Sparsity and generalization
Two prominent motivations for enforcing sparsity in over-parameterized deep learning models are (a) pruning large networks for efficiency (speed, memory), and (b) regularization that can improve interpretability or prevent overfitting. In both cases, one aims at preserving (as much as possible) the i.i.d. generalization performance, i.e., the performance on unseen data from the same distribution as the training data, compared to a non-sparse model [20].
Another line of research has explored the link between the generalizability of neural nets and causal learning [45], where generalization outside the i.i.d. setting is conjectured to require an underlying causal model. Deducing true laws of nature purely from observational data could be considered an instance of inferring a causal model of the world. Learning the correct causal model enables accurate predictions not only on the observed data (next observation), but also under distribution shifts, for example under interventions [45]. A common assumption in causal modeling is that each variable depends on (or is a function of) few other variables [45, 54, 36, 8]. In the ODE context, we interpret the variables (or their derivatives) that enter a specific component $f_i$ as causal parents, such that we can write $\dot{X}_i = f_i(\mathrm{pa}(X_i), t)$ [35]. Thus, feature sparsity translates to each variable $X_i$ having only few parents $\mathrm{pa}(X_i)$, which can also be interpreted as asking for "simple" dynamics. Since weight sparsity as well as regularizing the number of function evaluations in NODEs can also be considered to bias them towards "simpler" dynamics, these notions are not strictly disjoint, raising the question whether one implies the other.
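For a concrete toy illustration (our own example, not one of the systems studied in this paper), consider the three-variable system
$$\dot{X}_1 = -X_1 + X_2 X_3, \qquad \dot{X}_2 = -X_2, \qquad \dot{X}_3 = X_1 - X_3,$$
so that $\mathrm{pa}(X_1) = \{X_1, X_2, X_3\}$, $\mathrm{pa}(X_2) = \{X_2\}$, and $\mathrm{pa}(X_3) = \{X_1, X_3\}$. A feature-sparse NODE should learn a dynamics function whose second output is constant in $X_1$ and $X_3$, whereas weight sparsity alone does not guarantee this.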
In terms of feature sparsity, Aliee et al. [3] and Bellot et al. [7] study system identification and causal structure inference from time-series data using NODEs. They suggest that enforcing sparsity in the number of causal interactions improves parameter estimation as well as predictions under interventions. Let us write out $f_\theta$ as a fully connected net with $L$ hidden layers parameterized by $\theta := (W^l, b^l)_{l=1}^{L+1}$ as
$$f_\theta(X) = W^{L+1}\sigma(\ldots\sigma(W^2\sigma(W^1 X + b^1) + b^2)\ldots) \qquad (1)$$
with element-wise activation function $\sigma$, $l$-th layer weights $W^l$, and biases $b^l$. Aliee et al. [3] then seek to reduce the overall number of parents of all variables by attempting to cancel all contributions of a given input to a given output node through the neural net. In a linear setting, where $\sigma(x) = x$, the regularization term is defined by³
$$\|A\|_{1,1} = \|W^{L+1} \cdots W^1\|_{1,1} \qquad (2)$$
where $A_{ij} = 0$ if and only if the $i$-th output is constant in the $j$-th input. In the non-linear setting, for certain $\sigma(x) \neq x$, the regularizer $\|A\|_{1,1} = \big\| |W^{L+1}| \cdots |W^1| \big\|_{1,1}$ (with entry-wise absolute values on all $W^l$) is an upper bound on the number of input-output dependencies, i.e., for each output $i$, it sums up all the inputs $j$ it is not constant in. Regularizing input gradients [42, 41, 43] is another alternative to train neural networks that depend on fewer inputs; however, it is not scalable to high-dimensional regression tasks.
³ The 1,1-norm $\|W\|_{1,1}$ is the sum of the absolute values of all entries of $W$.
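A minimal sketch of this upper-bound regularizer for an MLP-parameterized $f_\theta$; the assumption that $f_\theta$ is a plain stack of `nn.Linear` layers with element-wise activations is ours.

```python
import torch
import torch.nn as nn


def input_output_penalty(mlp: nn.Sequential) -> torch.Tensor:
    """||A||_{1,1} with A = |W^{L+1}| ... |W^1|: an upper bound on the number
    of input-output dependencies of the network (cf. Eq. 2 and its non-linear variant)."""
    A = None
    for layer in mlp:
        if isinstance(layer, nn.Linear):
            W = layer.weight.abs()          # entry-wise absolute value |W^l|
            A = W if A is None else W @ A   # accumulate |W^l| ... |W^1|
        # activations and biases do not enter the bound
    return A.sum()                          # entries are non-negative, so this is ||A||_{1,1}


# usage: loss = mse + lam * input_output_penalty(f_theta.net)
```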
Bellot et al. [7] instead train a separate neural net $f_{\theta_i}: \mathbb{R}^n \to \mathbb{R}$ for each variable $X_i$ and penalize NODEs using a GroupLasso on the inputs via
$$\sum_{k,i=1}^{n} \big\| [W^1_i]_{\cdot,k} \big\|_2 \qquad (3)$$
where $W^1_i$ is the weight matrix in the input layer of $f_i$ and $[W^1_i]_{\cdot,k}$ refers to the $k$-th column of $W^1_i$, which should simultaneously (as a group) be set to zero or not. While this enforces strict feature sparsity (instead of regularizing an upper bound), parallel training of multiple NODE networks can be computationally expensive (Figure 1, bottom). While this work suggests that sparsity of causal interactions helps system identification, its empirical evaluation predominantly focuses on synthetic data settings, leaving performance on real data underexplored.
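A sketch of the group penalty in Eq. 3, assuming each per-variable network $f_{\theta_i}$ exposes its input layer as an `nn.Linear`; the list structure and names are our own.

```python
import torch
import torch.nn as nn


def group_lasso_penalty(first_layers):
    """Sum over variables i and inputs k of ||[W^1_i]_{.,k}||_2 (Eq. 3).

    first_layers: list of nn.Linear modules, the input layer W^1_i of each
    per-variable network f_{theta_i}: R^n -> R.
    """
    penalty = 0.0
    for lin in first_layers:
        # lin.weight has shape (hidden, n); column k collects all weights
        # attached to input feature X_k and is penalized as one group.
        penalty = penalty + lin.weight.norm(p=2, dim=0).sum()
    return penalty
```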
Another recent work suggests that standard weight or neuron pruning improves generalization for NODEs [28]. The authors show that pruning lowers the empirical risk in density estimation tasks and decreases the Hessian's eigenvalues, thus obtaining better generalization (flat minima). However, the effect of sparsity on identifying the underlying dynamics, as well as the generalization properties of NODEs for forecasting future values, are not assessed.
3 Sparsification of neural ODEs
In an attempt to combine the strengths of both weight and feature sparsity, we propose a new regularization technique, called PathReg, which compares favorably to both C-NODE [3] and GroupLasso [7]. Before introducing PathReg, we describe how to extend existing methods to NODEs for an exhaustive empirical evaluation.
3.1 Methods
L0 regularization. Inspired by [31], we use a differentiable L0 norm regularization method that can be incorporated in the objective function and optimized via stochastic gradient descent. The L0 regularizer prunes the network during training by encouraging weights to become exactly zero using a set of non-negative stochastic gates $z$. For an efficient gradient-based optimization, Louizos et al. [31] propose to use a continuous random variable $s$ with distribution $q(s)$ and parameters $\phi$, where $z$ is then given by
$$s \sim q(s \mid \phi), \qquad z = \min(1, \max(0, s)). \qquad (4)$$
Gate $z$ is a hard-sigmoid rectification of $s$ that allows the gate to be exactly zero. While we have the freedom to choose any smoothing distribution $q(s)$, we use the binary concrete distribution [33, 21] as suggested by the original work [31]. The regularization term is then defined as the probability of $s$ being positive,
$$q(z \neq 0 \mid \phi) = 1 - Q(s \leq 0 \mid \phi), \qquad (5)$$
where $Q$ is the cumulative distribution function of $s$. Minimizing this regularizer pushes many gates to zero, which implies weight sparsity as these gates are multiplied with the model weights. L0 regularization can be added directly to the NODE network $f_\theta$.
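A condensed sketch of such a gate using the hard-concrete parameterization of Louizos et al. [31]; the temperature, stretch limits, and initialization below are common defaults and not necessarily those used in our experiments.

```python
import math
import torch
import torch.nn as nn


class HardConcreteGate(nn.Module):
    """Stochastic gate z = min(1, max(0, s)) with s drawn from a stretched
    binary-concrete distribution (Louizos et al. [31])."""

    def __init__(self, shape, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(shape))  # distribution parameters phi
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        # Sample s ~ q(s | phi) via the reparameterization trick, then rectify (Eq. 4).
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        s = s * (self.zeta - self.gamma) + self.gamma   # stretch support to (gamma, zeta)
        return s.clamp(0.0, 1.0)                        # hard-sigmoid: gates can be exactly 0

    def prob_nonzero(self):
        # q(z != 0 | phi) = 1 - Q(s <= 0 | phi) in closed form (Eq. 5).
        return torch.sigmoid(self.log_alpha - self.beta * math.log(-self.gamma / self.zeta))


# L0 penalty for a gated layer:  gate.prob_nonzero().sum()
# gated forward pass:            y = x @ (weight * gate()).t() + bias
```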
LassoNet is a feature selection method [26] that uses an input-to-output skip (residual) connection which allows a feature to participate in a hidden unit only if its skip connection is active. LassoNet can be thought of as a residual feed-forward neural net $Y = S^T X + h_\theta(X)$, where $h_\theta$ denotes another feed-forward network with parameters $\theta$, $S \in \mathbb{R}^{n \times n}$ refers to the weights in the residual layer, and $Y$ are the responses. To enforce feature sparsity, L1 regularization is applied to the weights of the skip connection, defined as $\|S\|_{1,1}$ (where $\|\cdot\|_{1,1}$ denotes the element-wise L1 norm). A constraint term with factor $\rho$ is then added to the objective to budget the non-linearity involved for feature $k$ relative to the importance of $X_k$:
$$\min_{\theta, S}\; L_E(\theta, S) + \lambda \|S\|_1 \quad \text{subject to} \quad \big\|[W^1]_{\cdot,k}\big\|_\infty \leq \rho\, \big\|[S]_{\cdot,k}\big\|_2 \;\; \text{for } k \in \{1, \ldots, n\}, \qquad (6)$$
where $[W^1]_{\cdot,k}$ denotes the $k$-th column of the first layer weights of $h_\theta$, and $[S]_{\cdot,k}$ represents the $k$-th column of $S$. When $\rho = 0$, only the skip connection remains (standard linear Lasso), while $\rho \to \infty$ corresponds to unregularized network training.
Figure 2: The distributions of network weights and path weights (over the entries of the matrix in Eq. 13) using PathReg applied to the single-cell data in Section 4.3. PathReg increases both model and feature sparsity.
LassoNet regularization can be extended to NODEs by adding the skip connection either before or after the integration (the ODE solver). If added before the integration, which we call Inner LassoNet, a linear function of $X$ is added to its time derivative:
$$\dot{X} = S^T X + f_\theta(X, t), \qquad X(0) = x_0. \qquad (7)$$
Adding the skip connection after the integration (and the predictor $o$), called Outer LassoNet, yields
$$X_t = S^T X_t + o\big(\mathrm{ODESolver}(f_\theta, x_0, t_0, t)\big). \qquad (8)$$
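A sketch of the Inner LassoNet variant (Eq. 7); the hierarchical constraint of Eq. 6 would additionally be enforced by LassoNet's proximal step, which is omitted here, and the module layout is an illustrative assumption.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint


class InnerLassoNetODE(nn.Module):
    """dX/dt = S^T X + f_theta(X, t)  (Eq. 7), with an L1 penalty on the skip weights."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.skip = nn.Linear(dim, dim, bias=False)  # skip.weight plays the role of S^T
        self.f = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, x):
        return self.skip(x) + self.f(x)

    def skip_penalty(self):
        return self.skip.weight.abs().sum()  # ||S||_{1,1}


# usage: x_pred = odeint(model, x0, t)
#        loss = torch.mean((x_pred - x_obs) ** 2) + lam * model.skip_penalty()
```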
PathReg (ours). While L0 regularization minimizes the number of non-zero weights in a network, it does not necessarily lead to feature sparsity, meaning that the response variables can still depend on all features, including spurious ones. To constrain the number of input-output paths, i.e., to enforce feature sparsity, we regularize the probability of any path throughout the entire network contributing to an output. To this end, we use non-negative stochastic gates $z = g(s)$ similar to Eq. 4, where the probability of an input-output path $\mathcal{P}$ being non-zero is given by
$$q(\mathcal{P} \neq 0) = \prod_{z \in \mathcal{P}} q(z \neq 0 \mid \phi) \qquad (9)$$
and we constrain
$$\sum_{i=1}^{\#\text{paths}} \; \prod_{z \in \mathcal{P}_i} q(z \neq 0 \mid \phi) \qquad (10)$$
to minimize the number of paths that yield non-zero contributions from inputs to outputs. This is equivalent to regularizing the gate adjacency matrix $A_z = G^{L+1} \cdots G^1$, where $G^l$ is a probability matrix corresponding to the probability of the $l$-th layer gates being positive. Then, $[A_z]_{ij}$ represents the sum of the probabilities of all paths between the $i$-th input and the $j$-th output. Ultimately, we thus obtain our PathReg regularization term
$$\|A_z\|_{1,1} = \|G^{L+1} \cdots G^1\|_{1,1} \quad \text{with} \quad G^l_{ij} = q_l(z_{ij} \neq 0 \mid \phi_{i,j}), \qquad (11)$$
where $q_l(z_{ij} \neq 0)$ with parameters $\phi_{i,j}$ is the probability of the $l$-th layer gate $z_{ij}$ being nonzero. Regularizing $\|A_z\|_{1,1}$ minimizes the number of paths between inputs and outputs and induces no shrinkage on the actual values of the weights. Therefore, we can utilize other regularizers on $\theta$, such as $\|A\|_{1,1}$ from Eq. 2, in conjunction with PathReg. In this work, we consider the following overall loss function
$$R(\theta, \phi) = \mathbb{E}_{q(s \mid \phi)}\!\left[\frac{1}{N}\sum_{i=1}^{N} L\big(o(f(x_i;\, \theta \odot g(s))),\, y_i\big)\right] + \lambda_0 \|A_z\|_{1,1} + \lambda_1 \|A\|_{1,1} = L_E(\theta, \phi) + \lambda_0 \|A_z\|_{1,1} + \lambda_1 \|A\|_{1,1} \qquad (12)$$
where $L$ corresponds to the loss function of the original predictive task, $L_E$ is the overall predictive loss, $o$ is the NODE-based model with $f$ modeling the time derivatives, $g(\cdot) := \min(1, \max(0, \cdot))$, $\lambda_0, \lambda_1 \geq 0$ are regularization parameters, and $\odot$ is entry-wise multiplication. $L_E$ measures how well the model fits the observed data.
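Putting these pieces together, the PathReg penalty can be computed from per-layer gate probabilities as sketched below; gate ordering and shapes follow our reading of Eqs. 11–12 (the `HardConcreteGate` module is the sketch given earlier), and the reference implementation is available in the linked repository.

```python
import torch
import torch.nn as nn


def pathreg_penalty(gates):
    """||A_z||_{1,1} with A_z = G^{L+1} ... G^1 (Eq. 11).

    gates: list of HardConcreteGate modules, ordered from the input layer to
    the output layer, where gates[l].prob_nonzero() has the same (out, in)
    shape as layer l's weight matrix.
    """
    A_z = None
    for gate in gates:
        G = gate.prob_nonzero()              # G^l_ij = q_l(z_ij != 0 | phi_ij)
        A_z = G if A_z is None else G @ A_z  # accumulate G^l ... G^1
    return A_z.sum()                         # entries are non-negative


def total_loss(pred_loss, gates, weights, lam0, lam1):
    """Eq. 12: predictive loss + lambda_0 ||A_z||_{1,1} + lambda_1 ||A||_{1,1}."""
    A = None
    for W in weights:                        # |W^{L+1}| ... |W^1| as in Eq. 2
        A = W.abs() if A is None else W.abs() @ A
    return pred_loss + lam0 * pathreg_penalty(gates) + lam1 * A.sum()
```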