
(MDNs) [33], which can be used with NLE as a density estimator: instead of re-estimating the
surrogate likelihood from scratch, we marginalize it analytically with respect to a given (set of)
features before applying Bayes rule to obtain the posterior estimate. This way, we can efficiently
compare the posterior uncertainty with and without including a certain feature.
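In the notation introduced below (Sec. 2.1 and 2.2), where $\hat{p}(x \mid \theta)$ denotes the surrogate likelihood and $x_{\setminus i}$ the feature vector with feature $x_i$ removed, this amounts to
$$\hat{p}(x_{\setminus i} \mid \theta) = \int \hat{p}(x \mid \theta)\, \mathrm{d}x_i, \qquad \hat{p}(\theta \mid x_{o, \setminus i}) \propto \hat{p}(x_{o, \setminus i} \mid \theta)\, p(\theta),$$
where the integral is available in closed form when the surrogate is a Gaussian mixture.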
For a simple linear Gaussian model and non-linear HH models, we show that the obtained posterior
estimates are as accurate and robust as those obtained by repeatedly retraining density estimators from scratch, while being much faster to compute. We then apply our algorithm to study which features are most useful for constraining posteriors over parameters of the HH model. Our tool also allows us to study which model parameters are affected by which summary features. Finally, we suggest a greedy
feature selection strategy to select useful features from a predefined set.
2 Methods
2.1 Neural Likelihood Estimation (NLE)
Neural Likelihood Estimation [32] is an SBI method which approximates the likelihood $p(x \mid \theta)$ of the data $x$ given the model parameters $\theta$ by training a conditional neural density estimator on data
generated from simulations of a mechanistic model. Using Bayes rule, the approximate likelihood
can then be used to obtain an estimate of the posterior (Fig. 1). Unlike NPE [15, 16], NLE requires
an additional sampling or inference step [34] to obtain the posterior distribution. However, it allows
access to the intermediate likelihood approximation, a property we will later exploit to develop our
efficient feature selection algorithm (see Sec. 2.3).
First, a set of $N$ parameters is sampled from the prior distribution, $\theta_n \sim p(\theta)$, $n \in \{1, \dots, N\}$. With these parameters, the simulator is run to implicitly sample from the model's likelihood function according to $x_n \sim p(x \mid \theta_n)$. Here, $x_n = (x_1, \dots, x_{N_f})$ are feature vectors that are usually taken to be a function of the simulator output $s$, with the individual features $x_i = f_i(s)$, where $i \in \{1, \dots, N_f\}$, rather than the output directly. For mechanistic models whose output is a time series, $f$ also reduces the dimensionality of the data to a set of lower-dimensional data features. The resulting training data $\{x_n, \theta_n\}_{1:N} \sim p(x, \theta)$ can then be used to train a conditional density estimator $q_\phi(x \mid \theta)$, parameterized by $\phi$, to approximate the likelihood function. Here, we use a Mixture Density Network for this task [33], whose output density is parameterized by a mixture of Gaussians, which can be marginalized analytically. Thus,
$$\hat{p}(x \mid \theta) = \sum_k \pi_k \, \mathcal{N}(x; \mu_k, \Sigma_k),$$
with the parameters $(\mu_k, \Sigma_k, \pi)$ being non-linear functions of the inputs $\theta$ and the network parameters $\phi$.
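For a fixed $\theta$, the MDN thus returns explicit mixture parameters, and marginalizing out a subset of features only requires dropping the corresponding entries of each $\mu_k$ and the corresponding rows and columns of each $\Sigma_k$, while the mixture weights stay unchanged. A minimal numpy sketch of this bookkeeping (function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_logpdf(x, weights, means, covs):
    """log p(x) of a Gaussian mixture  sum_k pi_k N(x; mu_k, Sigma_k)."""
    densities = [w * multivariate_normal(mean=m, cov=C).pdf(x)
                 for w, m, C in zip(weights, means, covs)]
    return np.log(np.sum(densities))

def marginal_mixture_logpdf(x_keep, keep, weights, means, covs):
    """Analytic marginal over the retained feature indices `keep`:
    slice each component mean and covariance; the weights are unchanged."""
    means_m = [np.asarray(m)[keep] for m in means]
    covs_m = [np.asarray(C)[np.ix_(keep, keep)] for C in covs]
    return mixture_logpdf(x_keep, weights, means_m, covs_m)
```

Evaluating the marginal surrogate likelihood of a reduced feature vector for many values of $\theta$, e.g. along an MCMC chain, then only requires this slicing step rather than training a new density estimator.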
The parameters $\phi$ are optimized by maximizing the average log-likelihood $\mathcal{L}(\phi) = \frac{1}{N} \sum_n \log q_\phi(x_n \mid \theta_n)$ over the training data with respect to $\phi$. As the number of training samples goes to infinity, this is equivalent to maximizing the negative Kullback-Leibler (KL) divergence between the true and approximate likelihood for every $\theta$ in the support of the prior [32]:
$$\mathbb{E}_{p(\theta, x)}\left[\log q_\phi(x \mid \theta)\right] = -\mathbb{E}_{p(\theta)}\left[D_{\mathrm{KL}}\big(p(x \mid \theta) \,\|\, q_\phi(x \mid \theta)\big)\right] + \mathrm{const.} \quad (1)$$
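As a concrete illustration of this objective, the sketch below defines an MDN with diagonal-covariance Gaussian components (a simplification for brevity; the covariance parameterization used in the paper is not specified here) and the corresponding negative average log-likelihood loss. Architecture sizes and names are assumptions for illustration only:

```python
import torch
from torch import nn
from torch.distributions import Categorical, MultivariateNormal, MixtureSameFamily

class MDN(nn.Module):
    """Conditional Mixture Density Network: maps theta to the parameters of a
    Gaussian mixture over the feature vector x (diagonal covariances for brevity)."""
    def __init__(self, dim_theta, dim_x, n_components=5, hidden=50):
        super().__init__()
        self.dim_x, self.K = dim_x, n_components
        self.body = nn.Sequential(nn.Linear(dim_theta, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.to_logits = nn.Linear(hidden, n_components)           # mixture weights pi_k
        self.to_means = nn.Linear(hidden, n_components * dim_x)    # component means mu_k
        self.to_log_std = nn.Linear(hidden, n_components * dim_x)  # component scales

    def forward(self, theta):
        h = self.body(theta)
        means = self.to_means(h).view(-1, self.K, self.dim_x)
        stds = self.to_log_std(h).view(-1, self.K, self.dim_x).exp()
        components = MultivariateNormal(means, scale_tril=torch.diag_embed(stds))
        return MixtureSameFamily(Categorical(logits=self.to_logits(h)), components)

def nll(mdn, theta_batch, x_batch):
    """Negative average log-likelihood  -1/N sum_n log q_phi(x_n | theta_n)."""
    return -mdn(theta_batch).log_prob(x_batch).mean()

# One optimization step on a training batch (shapes: theta [N, dim_theta], x [N, dim_x]):
# optimizer = torch.optim.Adam(mdn.parameters(), lr=1e-3)
# optimizer.zero_grad(); nll(mdn, theta_batch, x_batch).backward(); optimizer.step()
```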
After obtaining such a tractable likelihood surrogate, Bayes rule can be used to obtain an estimate of the posterior conditioned on an observation $x_o$ (see Eq. 2), for instance via Markov Chain Monte Carlo (MCMC):
$$\hat{p}(\theta \mid x_o) \propto q_\phi(x_o \mid \theta)\, p(\theta) \quad (2)$$
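The specific sampler is not prescribed here. As an illustrative stand-in, a random-walk Metropolis sampler targeting the unnormalized posterior of Eq. 2 could look as follows; the two callables are placeholders for the trained surrogate log-likelihood $\log q_\phi(x_o \mid \theta)$ and the log-prior:

```python
import numpy as np

def sample_posterior_mh(log_surrogate_lik, log_prior, x_o, theta0,
                        n_steps=5000, step=0.1, rng=None):
    """Random-walk Metropolis targeting  q_phi(x_o | theta) * p(theta)  (Eq. 2)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    logp = log_surrogate_lik(x_o, theta) + log_prior(theta)
    samples = []
    for _ in range(n_steps):
        proposal = theta + step * rng.standard_normal(theta.shape)
        logp_prop = log_surrogate_lik(x_o, proposal) + log_prior(proposal)
        if np.log(rng.uniform()) < logp_prop - logp:   # accept with prob min(1, ratio)
            theta, logp = proposal, logp_prop
        samples.append(theta.copy())
    return np.array(samples)
```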
2.2 A naive algorithm for quantifying feature importance
Given this posterior estimate, we would now like to answer the following question: for which feature $x_i$ from a vector of features $x$ does the uncertainty of the posterior estimate $\hat{p}(\theta \mid x)$ increase the most when it is ignored? For a single feature, a naive and costly algorithm does the following: iterate over $x$ to obtain $x_{\setminus i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_{N_f})$, train a total of $N_f + 1$ density estimators to obtain the likelihoods $\hat{p}(x_{\setminus i} \mid \theta)$ and $\hat{p}(x \mid \theta)$, sample their associated posteriors, and compare $\hat{p}(\theta \mid x_{\setminus i})$ to the reference posterior $\hat{p}(\theta \mid x)$ with every feature present. The same procedure can also be applied to
quantify the contribution of any arbitrary subset of features. As this procedure requires estimating the likelihood of the reduced feature set from scratch for each feature (set) (Fig. 1, grey arrows), it is computationally costly. To quantify the contribution of different features more efficiently, we thus need a way to avoid re-training the likelihood estimator each time a feature is left out.
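For concreteness, a minimal sketch of the naive baseline described above is given below. Here, `train_estimator` and `sample_posterior` are placeholders for an NLE training routine and a posterior sampler (e.g. the MCMC sketch above), and the ratio of marginal posterior standard deviations is just one possible way to quantify the increase in uncertainty:

```python
import numpy as np

def naive_feature_importance(theta_train, x_train, x_o, train_estimator, sample_posterior):
    """Naive baseline: retrain one density estimator per left-out feature (N_f + 1 in total)
    and compare posterior spread against the reference posterior that uses all features.

    train_estimator(x, theta)  -> surrogate likelihood trained on the given features
    sample_posterior(q, x_o)   -> posterior samples for observation x_o (e.g. via MCMC)
    """
    n_features = x_train.shape[1]

    # Reference posterior with every feature present.
    q_full = train_estimator(x_train, theta_train)
    ref_std = sample_posterior(q_full, x_o).std(axis=0)

    importance = {}
    for i in range(n_features):
        keep = [j for j in range(n_features) if j != i]        # x_{\i}: drop feature i
        q_i = train_estimator(x_train[:, keep], theta_train)   # retrain from scratch
        std_i = sample_posterior(q_i, np.asarray(x_o)[keep]).std(axis=0)
        importance[i] = std_i / ref_std                        # >1: posterior widens without feature i
    return importance
```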