
parameter estimation [24]. For parameter estimation, these methods have included variational inference [25, 26], likelihood ratio estimation [27], and posterior estimation with normalizing flows [10, 26, 28, 29]. Aside from directly estimating parameters, normalizing flows have also been used to accelerate classical samplers, with significant efficiency improvements [30].
Neural density estimation and importance sampling have previously been combined [31] under the guise of “neural importance sampling” [32], and similar approaches have been applied in several contexts [33–36]. Our contributions are to (1) extend this to amortized simulation-based inference, (2) use it to improve results generated with classical inference methods such as MCMC, and (3) highlight how the use of a forward Kullback-Leibler (KL) loss improves reliability. We also apply it to the challenging real-world problem of GW inference.² We demonstrate results that far outperform classical methods in terms of sample efficiency and parallelizability, while maintaining accuracy and including simple diagnostics. We therefore expect this work to accelerate the development and verification of probabilistic deep learning approaches across science.

² A similar approach using convolutional networks to parametrize Gaussian and von Mises proposals was used to estimate the sky position alone [37]. Using the normalizing flow proposal (as we do here) significantly improves the flexibility of the conditional density estimator and enables inference of all parameters.
Method.—Dingo trains a conditional density-estimation neural network $q(\theta|d)$ to approximate $p(\theta|d)$ based on simulated data sets $(\theta, d)$ with $\theta \sim p(\theta)$, $d \sim p(d|\theta)$—an approach called neural posterior estimation (NPE) [38]. Once trained, Dingo can rapidly produce (approximate) posterior samples for any measured data $d$. In practice, results may deviate from the true posterior due to insufficient training, lack of network expressivity, or out-of-distribution (OOD) data (i.e., data inconsistent with the training distribution). Although it was shown in [10] that these deviations are often negligible, verification of results requires comparing against expensive standard samplers.
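For illustration only, the following minimal Python sketch shows the structure of an NPE training loop on a toy one-dimensional problem. The simulator, the Gaussian density estimator, and all names are assumptions made for the example; Dingo itself uses a conditional normalizing flow and a GW-specific simulator.

    import torch
    import torch.nn as nn

    prior = torch.distributions.Uniform(-1.0, 1.0)

    def simulate(theta):
        # Toy likelihood p(d|theta): data are the parameter plus Gaussian noise.
        return theta + 0.1 * torch.randn_like(theta)

    # q(theta|d): a Gaussian whose mean and log-std are predicted from d.
    # (A conditional flow, as in Dingo, would replace this; a Gaussian keeps the sketch short.)
    net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    for step in range(5000):
        theta = prior.sample((256, 1))    # theta ~ p(theta)
        d = simulate(theta)               # d ~ p(d|theta)
        mean, log_std = net(d).chunk(2, dim=1)
        q = torch.distributions.Normal(mean, log_std.exp())
        loss = -q.log_prob(theta).mean()  # maximum likelihood = forward KL up to a constant
        opt.zero_grad()
        loss.backward()
        opt.step()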
Here, we describe an efficient method to verify and correct Dingo results using importance sampling (IS) [16]. Starting from a collection of $n$ samples $\theta_i \sim q(\theta|d)$ (the “proposal”) we assign to each one an importance weight $w_i = p(d|\theta_i)\,p(\theta_i)/q(\theta_i|d)$. For a perfect proposal, $w_i = \text{constant}$, but more generally the number of effective samples is related to the variance, $n_\text{eff} = \left(\sum_i w_i\right)^2 / \sum_i w_i^2$ [39]. The sample efficiency $\epsilon = n_\text{eff}/n \in (0, 1]$ arises naturally as a quality measure of the proposal.
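In code, the weights and sample efficiency amount to a few lines. This sketch assumes the log prior, log likelihood, and log proposal density have already been evaluated on the samples; the function names are illustrative.

    import numpy as np

    def importance_weights(log_prior, log_likelihood, log_q):
        # log w_i = log p(d|theta_i) + log p(theta_i) - log q(theta_i|d)
        return log_likelihood + log_prior - log_q

    def sample_efficiency(log_w):
        # n_eff = (sum_i w_i)^2 / sum_i w_i^2, computed stably in log space.
        w = np.exp(log_w - log_w.max())  # the overall scale cancels in the ratio
        n_eff = w.sum() ** 2 / (w ** 2).sum()
        return n_eff, n_eff / len(w)     # (n_eff, epsilon)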
Importance sampling requires evaluation of $p(d|\theta)\,p(\theta)$ rather than the normalized posterior. The Bayesian evidence can then be estimated from the normalization of the weights as $p(d) = \frac{1}{n}\sum_i w_i$. The standard deviation of the log evidence, $\sigma_{\log p(d)} = \sqrt{(1-\epsilon)/(n\,\epsilon)}$ (see Supplemental Material), scales with $1/\sqrt{n}$, enabling very precise estimates. The evidence is furthermore unbiased if the support of the posterior is fully covered by the proposal distribution [40]. The log evidence does have a bias, but this scales as $1/n$, and in all cases considered here is completely negligible (see Supplemental Material). If $q(\theta|d)$ fails to cover the entire posterior, the evidence itself would also be biased, toward lower values.
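A numerically stable evaluation of these two quantities, again as an illustrative sketch operating on the log weights from above:

    import numpy as np
    from scipy.special import logsumexp

    def log_evidence(log_w):
        # log p(d) = log((1/n) sum_i w_i), via logsumexp for numerical stability.
        return logsumexp(log_w) - np.log(len(log_w))

    def log_evidence_std(log_w):
        # sigma_{log p(d)} = sqrt((1 - eps) / (n * eps)), with eps the sample efficiency.
        n = len(log_w)
        w = np.exp(log_w - log_w.max())
        eps = w.sum() ** 2 / ((w ** 2).sum() * n)
        return np.sqrt((1.0 - eps) / (n * eps))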
NPE is particularly well-suited for IS because of two key properties. First, by construction the proposal has a tractable density, such that we can not only sample from $q(\theta|d)$, but also evaluate it. Second, the NPE proposal is expected to always cover the entire posterior support. This is because, during training, NPE minimizes the forward KL divergence $D_\text{KL}(p(\theta|d)\,\|\,q(\theta|d))$. This diverges unless $\text{supp}(p(\theta|d)) \subseteq \text{supp}(q(\theta|d))$, making the loss “probability-mass covering”. Probability mass coverage is not guaranteed for finite sets of samples generated with stochastic samplers like MCMC (which can miss distributional modes), or machine learning methods with other training objectives like variational inference [12, 41, 42].
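The mass-covering property can be seen numerically by Monte-Carlo-estimating the forward KL for proposals that do and do not cover a bimodal target. The distributions in this toy sketch are purely illustrative:

    import numpy as np
    from scipy.stats import norm

    # Bimodal target p(theta): equal-weight Gaussians at -5 and +5.
    rng = np.random.default_rng(0)
    theta = np.concatenate([rng.normal(-5, 1, 5000), rng.normal(5, 1, 5000)])
    log_p = np.log(0.5 * norm.pdf(theta, -5, 1) + 0.5 * norm.pdf(theta, 5, 1))

    for q in (norm(0, 6), norm(5, 1)):           # mass-covering vs. mode-missing proposal
        print(np.mean(log_p - q.logpdf(theta)))  # forward KL blows up for norm(5, 1)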
Neural importance sampling can in fact be used to improve posterior samples from any inference method, provided the likelihood is tractable. If the method provides only samples (without density), then one must first train an (unconditional) density estimator $q(\theta)$ (e.g., a normalizing flow [12, 13, 43]) to use as proposal. This is generally fast for an unconditional flow, and using the forward KL loss guarantees that the proposal will cover the samples. Success, however, relies on the quality of the initial samples: if they are light-tailed, sample efficiency will be poor, and if they are not mass-covering, the evidence will be biased. Nevertheless, for initial samples that represent the posterior well, this technique can provide quick verification and improvement.
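A hedged sketch of that recipe, assuming the open-source nflows package for the flow; the function name, hyperparameters, and training schedule are illustrative choices, not a prescription from this work:

    # Turn existing (e.g., MCMC) samples into an importance-sampling proposal.
    import torch
    from nflows.flows import MaskedAutoregressiveFlow

    def fit_proposal(mcmc_samples, steps=2000, batch=512):
        # mcmc_samples: (N, dim) float tensor of posterior draws from any sampler.
        flow = MaskedAutoregressiveFlow(
            features=mcmc_samples.shape[1],
            hidden_features=64,
            num_layers=5,
            num_blocks_per_layer=2,
        )
        opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
        for _ in range(steps):
            idx = torch.randint(len(mcmc_samples), (batch,))
            loss = -flow.log_prob(mcmc_samples[idx]).mean()  # forward KL: mass-covering
            opt.zero_grad()
            loss.backward()
            opt.step()
        return flow

    # theta, log_q = fit_proposal(samples).sample_and_log_prob(10_000)
    # then weight by w_i = p(d|theta_i) p(theta_i) / q(theta_i) as above.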
In the context of GWs, we refer to neural importance sampling with Dingo as Dingo-IS. Although this technique requires likelihood evaluations at inference time, in practice it is much faster than other likelihood-based methods because of its high sample efficiency and parallelizability. Indeed, Dingo samples are independent and identically distributed, trivially enabling full parallelization of likelihood evaluations. This is a crucial advantage compared to inherently sequential methods such as MCMC.
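Because the proposal samples are i.i.d., the likelihood loop parallelizes with nothing more than a process pool. A minimal sketch with a stand-in likelihood (the actual GW likelihood and its data inputs are not shown here):

    import numpy as np
    from multiprocessing import Pool

    def log_likelihood(theta):
        # Stand-in for the expensive GW likelihood p(d|theta); toy Gaussian here.
        return -0.5 * float(np.dot(theta, theta))

    if __name__ == "__main__":
        theta_samples = np.random.randn(10_000, 15)  # i.i.d. draws from q(theta|d)
        with Pool() as pool:
            # Evaluations are independent, so they map cleanly onto many workers,
            # unlike the inherently sequential proposals of an MCMC chain.
            log_l = pool.map(log_likelihood, theta_samples)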
Results.—For our experiments, we prepare Dingo networks as described in [10], with several modifications. First, we extend the priors over component masses to $m_1, m_2 \in [10, 120]\,M_\odot$ and dimensionless spin magnitudes to $a_1, a_2 \in [0, 0.99]$. We also use the waveform models IMRPhenomXPHM [22] and SEOBNRv4PHM [23], which include higher radiative multipoles and more realistic precession. Finally, in addition to networks for the first