
order) ODE in its explicit representation $\dot{X} = f(X, t)$ [11]. Starting from the initial observation $X(a)$ at some time $t = a$, an explicit iterative ODE solver is deployed to predict $X(t)$ for $t \in (a, b]$ using the current derivative estimates from $f_\theta$. The parameters $\theta$ are then updated via backpropagation on the mean squared error (MSE) between predictions and observations. As discussed extensively in the literature, NODEs can outperform traditional ODE parameter inference techniques in terms of reconstruction error, especially for non-linear dynamics $f$ [11, 15]. In particular, one advantage of NODEs over previous methods for inferring non-linear dynamics such as SINDy [9] is that no dictionary of non-linear basis functions has to be pre-specified.
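To make this training loop concrete, the following is a minimal sketch assuming a fixed-step explicit Euler solver and a small PyTorch MLP for $f_\theta$; the public implementations use adaptive solvers, and all names below (ODEFunc, euler_odeint, etc.) are illustrative rather than taken from those code bases.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Neural network f_theta approximating the right-hand side f(X, t)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, x):
        return self.net(x)

def euler_odeint(f, x0, ts):
    """Integrate dX/dt = f(t, X) from x0 over the time grid ts with explicit Euler steps."""
    xs = [x0]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        xs.append(xs[-1] + (t1 - t0) * f(t0, xs[-1]))
    return torch.stack(xs)

# Toy loop: fit f_theta to (placeholder) observations X_obs on the time grid ts.
dim = 2
ts = torch.linspace(0.0, 1.0, 20)
X_obs = torch.randn(20, dim)
f_theta = ODEFunc(dim)
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    X_pred = euler_odeint(f_theta, X_obs[0], ts)   # start from the initial observation X(a)
    loss = ((X_pred - X_obs) ** 2).mean()          # MSE between predictions and observations
    loss.backward()                                # backpropagate through the solver steps
    opt.step()
```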
A variant of NODEs for second order systems called SONODE exploits the common reduction of higher-order ODEs to first order systems [37]. The second required initial condition, the initial velocity $\dot{X}(a)$, is simply learned from $X(a)$ via another neural network in an end-to-end fashion. Our experiments build on the public NODE and SONODE implementations. However, our framework is readily applicable to most other continuous-depth neural nets including Augmented NODE [15], latent NODEs [44], and neural stochastic DEs [27, 48].
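As a hedged sketch of this reduction (module and variable names are our own, not those of the SONODE code base), the state is augmented to $z = (X, \dot{X})$, one network parameterizes the acceleration, and a second network maps $X(a)$ to the initial velocity:

```python
import torch
import torch.nn as nn

class Acceleration(nn.Module):
    """Network for the second derivative, acting on the augmented state z = (X, dX/dt)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, z):
        x, v = z.chunk(2, dim=-1)
        dx = v                                    # d/dt X = velocity
        dv = self.net(torch.cat([x, v], dim=-1))  # d/dt velocity = learned acceleration
        return torch.cat([dx, dv], dim=-1)

class InitialVelocity(nn.Module):
    """Learns the missing initial condition dX/dt(a) from X(a), trained end-to-end."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, x0):
        return self.net(x0)

# Usage: build the augmented initial state and integrate the first-order system
# with any explicit solver (e.g., the euler_odeint sketch above).
dim = 2
x0 = torch.randn(dim)
z0 = torch.cat([x0, InitialVelocity(dim)(x0)])    # z(a) = (X(a), dX/dt(a))
f = Acceleration(dim)
# z_traj = euler_odeint(f, z0, torch.linspace(0.0, 1.0, 20)); X_traj = z_traj[..., :dim]
```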
2.2 Sparsity and generalization
Two prominent motivations for enforcing sparsity in over-parameterized deep learning models are (a) pruning large networks for efficiency (speed, memory) and (b) regularization, which can improve interpretability or prevent overfitting. In both cases, one aims to preserve (as much as possible) i.i.d. generalization performance, i.e., performance on unseen data from the same distribution as the training data, compared to a non-sparse model [20].
Another line of research has explored the link between generalizability of neural nets and causal learning [45], where generalization outside the i.i.d. setting is conjectured to require an underlying causal model. Deducing true laws of nature purely from observational data could be considered an instance of inferring a causal model of the world. Learning the correct causal model enables accurate predictions not only on the observed data (next observation), but also under distribution shifts, for example under interventions [45]. A common assumption in causal modeling is that each variable depends on (or is a function of) few other variables [45, 54, 36, 8]. In the ODE context, we interpret the variables (or their derivatives) that enter a specific component $f_i$ as causal parents, such that we can write $\dot{X}_i = f_i(\mathrm{pa}(X_i), t)$ [35]. Thus, feature sparsity translates to each variable $X_i$ having only few parents $\mathrm{pa}(X_i)$, which can also be interpreted as asking for “simple” dynamics. Since weight sparsity as well as regularizing the number of function evaluations in NODEs can also be considered to bias them towards “simpler” dynamics, these notions are not strictly disjoint, raising the question whether one implies the other.
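For illustration (this example is ours, not taken from the cited works), consider the two-dimensional Lotka–Volterra dynamics
$$\dot{X}_1 = \alpha X_1 - \beta X_1 X_2, \qquad \dot{X}_2 = \delta X_1 X_2 - \gamma X_2,$$
where $\mathrm{pa}(X_1) = \mathrm{pa}(X_2) = \{X_1, X_2\}$: each variable has at most two parents, and feature sparsity asks that such parent sets remain small even when the system is embedded among many observed variables.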
In terms of feature sparsity, Aliee et al. [3] and Bellot et al. [7] study system identification and causal structure inference from time-series data using NODEs. They suggest that enforcing sparsity in the number of causal interactions improves parameter estimation as well as predictions under interventions. Let us write out $f_\theta$ as a fully connected net with $L$ hidden layers, parameterized by $\theta := (W_l, b_l)_{l=1}^{L+1}$, as
$$f_\theta(X) = W_{L+1}\,\sigma(\ldots\,\sigma(W_2\,\sigma(W_1 X + b_1) + b_2)\,\ldots) \qquad (1)$$
with element-wise activation function $\sigma$, $l$-th layer weights $W_l$, and biases $b_l$. Aliee et al. [3] then seek to reduce the overall number of parents of all variables by attempting to cancel all contributions of a given input on a given output node through the neural net. In a linear setting, where $\sigma(x) = x$, the regularization term is defined by³
$$\|A\|_{1,1} = \|W_{L+1} \cdots W_1\|_{1,1} \qquad (2)$$
where $A_{ij} = 0$ if and only if the $i$-th output is constant in the $j$-th input. In the non-linear setting, for certain $\sigma(x) \neq x$, the regularizer $\|A\|_{1,1} = \|\,|W_{L+1}| \cdots |W_1|\,\|_{1,1}$ (with entry-wise absolute values on all $W_l$) is an upper bound on the number of input-output dependencies, i.e., on the count, over all outputs $i$, of inputs $j$ in which that output is not constant. Regularizing input gradients [42, 41, 43] is another alternative for training neural networks that depend on fewer inputs; however, it does not scale to high-dimensional regression tasks.
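A minimal sketch of this regularizer, assuming $f_\theta$ is available as a list of nn.Linear layers ordered from input to output (the function name parent_regularizer is ours):

```python
import torch.nn as nn

def parent_regularizer(linear_layers):
    """Return || |W_{L+1}| ... |W_1| ||_{1,1} for the nn.Linear layers of f_theta."""
    A = None
    for layer in linear_layers:
        W = layer.weight.abs()         # entry-wise |W_l|; nn.Linear(in, out).weight has shape (out, in)
        A = W if A is None else W @ A  # accumulate the product |W_l| ... |W_1|
    return A.sum()                     # (1,1)-norm: sum of the (already non-negative) entries

# Usage on a two-hidden-layer f_theta; add lam * parent_regularizer(...) to the MSE training loss.
dim, hidden = 3, 16
layers = [nn.Linear(dim, hidden), nn.Linear(hidden, hidden), nn.Linear(hidden, dim)]
reg = parent_regularizer(layers)
```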
³The $(1,1)$-norm $\|W\|_{1,1}$ is the sum of absolute values of all entries of $W$.