
order) ODE in its explicit representation $\dot{X} = f(X, t)$ [11]. Starting from the initial observation $X(a)$ at some time $t = a$, an explicit iterative ODE solver is deployed to predict $X(t)$ for $t \in (a, b]$ using the current derivative estimates from $f_\theta$. The parameters $\theta$ are then updated via backpropagation on the mean squared error (MSE) between predictions and observations. As discussed extensively in the literature, NODEs can outperform traditional ODE parameter inference techniques in terms of reconstruction error, especially for non-linear dynamics $f$ [11, 15]. In particular, one advantage of NODEs over previous methods for inferring non-linear dynamics such as SINDy [9] is that no dictionary of non-linear basis functions has to be pre-specified.
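To make this training loop concrete, the following is a minimal sketch assuming a fixed-step explicit Euler solver and a small PyTorch MLP for $f_\theta$; the public implementations use adaptive solvers, and all names below (ODEFunc, euler_odeint, etc.) are illustrative rather than taken from those code bases.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Neural network f_theta approximating the right-hand side f(X, t)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, x):
        return self.net(x)

def euler_odeint(f, x0, ts):
    """Integrate dX/dt = f(t, X) from x0 over the time grid ts with explicit Euler steps."""
    xs = [x0]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        xs.append(xs[-1] + (t1 - t0) * f(t0, xs[-1]))
    return torch.stack(xs)

# Toy loop: fit f_theta to (placeholder) observations X_obs on the time grid ts.
dim = 2
ts = torch.linspace(0.0, 1.0, 20)
X_obs = torch.randn(20, dim)
f_theta = ODEFunc(dim)
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    X_pred = euler_odeint(f_theta, X_obs[0], ts)   # start from the initial observation X(a)
    loss = ((X_pred - X_obs) ** 2).mean()          # MSE between predictions and observations
    loss.backward()                                # backpropagate through the solver steps
    opt.step()
```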
A variant of NODEs for second order systems called SONODE exploits the common reduction of higher-order ODEs to first order systems [37]. The second required initial condition, the initial velocity $\dot{X}(a)$, is simply learned from $X(a)$ via another neural network in an end-to-end fashion. Our experiments build on the public NODE and SONODE implementations. However, our framework is readily applicable to most other continuous-depth neural nets including Augmented NODE [15], latent NODEs [44], and neural stochastic DEs [27, 48].
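As a hedged sketch of this reduction (module and variable names are our own, not those of the SONODE code base), the state is augmented to $z = (X, \dot{X})$, one network parameterizes the acceleration, and a second network maps $X(a)$ to the initial velocity:

```python
import torch
import torch.nn as nn

class Acceleration(nn.Module):
    """Network for the second derivative, acting on the augmented state z = (X, dX/dt)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, z):
        x, v = z.chunk(2, dim=-1)
        dx = v                                    # d/dt X = velocity
        dv = self.net(torch.cat([x, v], dim=-1))  # d/dt velocity = learned acceleration
        return torch.cat([dx, dv], dim=-1)

class InitialVelocity(nn.Module):
    """Learns the missing initial condition dX/dt(a) from X(a), trained end-to-end."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, x0):
        return self.net(x0)

# Usage: build the augmented initial state and integrate the first-order system
# with any explicit solver (e.g., the euler_odeint sketch above).
dim = 2
x0 = torch.randn(dim)
z0 = torch.cat([x0, InitialVelocity(dim)(x0)])    # z(a) = (X(a), dX/dt(a))
f = Acceleration(dim)
# z_traj = euler_odeint(f, z0, torch.linspace(0.0, 1.0, 20)); X_traj = z_traj[..., :dim]
```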
2.2 Sparsity and generalization
Two prominent motivations for enforcing sparsity in over-parameterized deep learning models are (a) pruning large networks for efficiency (speed, memory) and (b) regularization, which can improve interpretability or prevent overfitting. In both cases, one aims to preserve (as much as possible) i.i.d. generalization performance, i.e., performance on unseen data from the same distribution as the training data, compared to a non-sparse model [20].
Another line of research has explored the link between generalizability of neural nets and causal learning [45], where generalization outside the i.i.d. setting is conjectured to require an underlying causal model. Deducing true laws of nature purely from observational data could be considered an instance of inferring a causal model of the world. Learning the correct causal model enables accurate predictions not only on the observed data (next observation), but also under distribution shifts, for example under interventions [45]. A common assumption in causal modeling is that each variable depends on (or is a function of) few other variables [45, 54, 36, 8]. In the ODE context, we interpret the variables (or their derivatives) that enter a specific component $f_i$ as causal parents, such that we can write $\dot{X}_i = f_i(\mathrm{pa}(X_i), t)$ [35]. Thus, feature sparsity translates to each variable $X_i$ having only few parents $\mathrm{pa}(X_i)$, which can also be interpreted as asking for “simple” dynamics. Since weight sparsity as well as regularizing the number of function evaluations in NODEs can also be considered to bias them towards “simpler” dynamics, these notions are not strictly disjoint, raising the question whether one implies the other.
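For illustration (this example is ours, not taken from the cited works), consider the two-dimensional Lotka–Volterra dynamics
$$\dot{X}_1 = \alpha X_1 - \beta X_1 X_2, \qquad \dot{X}_2 = \delta X_1 X_2 - \gamma X_2,$$
where $\mathrm{pa}(X_1) = \mathrm{pa}(X_2) = \{X_1, X_2\}$: each variable has at most two parents, and feature sparsity asks that such parent sets remain small even when the system is embedded among many observed variables.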
In terms of feature sparsity, Aliee et al. [3] and Bellot et al. [7] study system identification and causal structure inference from time-series data using NODEs. They suggest that enforcing sparsity in the number of causal interactions improves parameter estimation as well as predictions under interventions. Let us write out $f_\theta$ as a fully connected net with $L$ hidden layers, parameterized by $\theta := (W_l, b_l)_{l=1}^{L+1}$, as
$$f_\theta(X) = W_{L+1}\,\sigma(\ldots\,\sigma(W_2\,\sigma(W_1 X + b_1) + b_2)\,\ldots) \qquad (1)$$
with element-wise activation function $\sigma$, $l$-th layer weights $W_l$, and biases $b_l$. Aliee et al. [3] then seek to reduce the overall number of parents of all variables by attempting to cancel all contributions of a given input on a given output node through the neural net. In a linear setting, where $\sigma(x) = x$, the regularization term is defined by³
$$\|A\|_{1,1} = \|W_{L+1} \cdots W_1\|_{1,1} \qquad (2)$$
where $A_{ij} = 0$ if and only if the $i$-th output is constant in the $j$-th input. In the non-linear setting, for certain $\sigma(x) \neq x$, the regularizer $\|A\|_{1,1} = \|\,|W_{L+1}| \cdots |W_1|\,\|_{1,1}$ (with entry-wise absolute values on all $W_l$) is an upper bound on the number of input-output dependencies, i.e., on the count, over all outputs $i$, of inputs $j$ in which that output is not constant. Regularizing input gradients [42, 41, 43] is another alternative for training neural networks that depend on fewer inputs; however, it does not scale to high-dimensional regression tasks.
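A minimal sketch of this regularizer, assuming $f_\theta$ is available as a list of nn.Linear layers ordered from input to output (the function name parent_regularizer is ours):

```python
import torch.nn as nn

def parent_regularizer(linear_layers):
    """Return || |W_{L+1}| ... |W_1| ||_{1,1} for the nn.Linear layers of f_theta."""
    A = None
    for layer in linear_layers:
        W = layer.weight.abs()         # entry-wise |W_l|; nn.Linear(in, out).weight has shape (out, in)
        A = W if A is None else W @ A  # accumulate the product |W_l| ... |W_1|
    return A.sum()                     # (1,1)-norm: sum of the (already non-negative) entries

# Usage on a two-hidden-layer f_theta; add lam * parent_regularizer(...) to the MSE training loss.
dim, hidden = 3, 16
layers = [nn.Linear(dim, hidden), nn.Linear(hidden, hidden), nn.Linear(hidden, dim)]
reg = parent_regularizer(layers)
```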
³The $(1,1)$-norm $\|W\|_{1,1}$ is the sum of absolute values of all entries of $W$.