
model on each summary statistic
$$x \sim q(x), \tag{5}$$
$$z_j \sim \mathrm{Bernoulli}(\rho), \quad j = 1, \ldots, D, \tag{6}$$
$$y_j \mid x_j, z_j \sim \begin{cases} \mathcal{N}(x_j, \sigma^2), & \text{if } z_j = 0, \\ \mathrm{Cauchy}(x_j, \tau), & \text{if } z_j = 1, \end{cases} \quad j = 1, \ldots, D, \tag{7}$$
where $x \in \mathbb{R}^D$. As the summary statistics have varying scales, we standardise them to have
mean zero and variance one. The Bernoulli variables indicate whether the $j$-th summary statistic is approximately well-specified; we use the probability $\rho = 0.5$ to express the belief that, a priori, a summary statistic is equally likely to be misspecified or well-specified. When a summary statistic is well-specified (i.e., $z_j = 0$), we take the error distribution to be a tight Gaussian (a spike) centred on $x_j$, choosing $\sigma = 0.01$ to enforce consistency between the simulator and the observed data for the $j$-th statistic. In contrast, in the misspecified case (i.e., $z_j = 1$), we take the error model to be the much wider, heavy-tailed Cauchy distribution (a slab). Generally, we might expect misspecification to be relatively subtle, but some model inadequacies can lead to catastrophically misspecified summary statistics. To reflect this, we chose the Cauchy scale $\tau = 0.25$, as it places half the mass within $\pm 0.25$ standard deviations of $x_j$, while the long tails accommodate summary statistics that are highly misspecified. The sparsity-inducing effect of this error model matches what we consider to be a common scenario, namely that a proportion of the summary statistics are jointly consistent with the simulator, whereas others may be incompatible. If we expect a priori that more or fewer of the summary statistics are misspecified, the prior misspecification probability can be adjusted accordingly. By marginalising over $z$, the spike and slab error model can also be written as
$$p(y \mid x) = \prod_{j=1}^{D} \left[ (1 - \rho) \, p(y_j \mid x_j, z_j = 0) + \rho \, p(y_j \mid x_j, z_j = 1) \right], \tag{8}$$
and as such can be viewed as an equally weighted mixture of the Gaussian spike, $p(y_j \mid x_j, z_j = 0)$, and the Cauchy slab, $p(y_j \mid x_j, z_j = 1)$, for each summary statistic. We note that error model hyperparameter tuning approaches could be considered; for example, a reasonable heuristic would be to choose an error model broad enough that the denoised data $\tilde{x}$ tend not to be outliers in $q(x)$ compared to simulations.² However, we chose to keep the hyperparameters consistent across tasks, both to avoid the risk of overfitting to the tasks and to demonstrate that neither strong prior knowledge of the error model nor careful hyperparameter tuning is necessary to yield substantial improvements in performance.
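To make the marginalised form in equation (8) concrete, the per-statistic mixture density can be evaluated directly. The following is a minimal sketch (our own illustration, not the authors' code) using NumPy and SciPy, with the hyperparameter values stated above ($\rho = 0.5$, $\sigma = 0.01$, $\tau = 0.25$):

```python
import numpy as np
from scipy.stats import cauchy, norm


def spike_slab_logpdf(y, x, rho=0.5, sigma=0.01, tau=0.25):
    """Log of the marginalised spike-and-slab error density p(y | x), eq. (8).

    y, x : arrays of length D (observed and simulated summary statistics,
           assumed already standardised to mean zero and variance one).
    """
    log_spike = norm.logpdf(y, loc=x, scale=sigma)  # z_j = 0: tight Gaussian
    log_slab = cauchy.logpdf(y, loc=x, scale=tau)   # z_j = 1: heavy-tailed Cauchy
    # Numerically stable log[(1 - rho) * spike + rho * slab], per statistic.
    per_stat = np.logaddexp(np.log1p(-rho) + log_spike, np.log(rho) + log_slab)
    return per_stat.sum()
```

Because the slab is heavy tailed, a summary statistic lying far from $x_j$ contributes through the Cauchy component rather than driving the joint density towards zero.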
A key advantage of the spike and slab error model is given by the latent variable $z$. Similar to posterior inclusion probabilities in Bayesian regression, the posterior frequency of being in the slab, $\Pr(z_j = 1 \mid y)$, can be used as an indicator of the posterior misspecification probability for the $j$-th summary statistic. By comparing to the prior misspecification probability $\rho$, we can say that if $\Pr(z_j = 1 \mid y) > \rho$, the model provides evidence of misspecification for the $j$-th summary statistic, and if $\Pr(z_j = 1 \mid y) < \rho$, it provides evidence that it is well-specified (Talbott, 2016). For generality, the latent variable $z$ was not included when describing RNPE thus far. However, we can jointly infer the posterior $\hat{p}(x, z \mid y)$ in step 5 of Algorithm 1, by using an MCMC algorithm that supports sampling both continuous and discrete variables. For this purpose, we used mixed Hamiltonian Monte Carlo (HMC) (Zhou, 2020, 2022) from the NumPyro Python package (Phan et al., 2019), which is an adaptation of HMC (Neal et al., 2011; Duane et al., 1987).
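As an illustration of this diagnostic, the posterior misspecification probabilities can be estimated from the MCMC output by averaging the sampled indicators. Below is a minimal sketch (the function name and toy sample array are our own, not from the paper), assuming `z_samples` holds 0/1 draws of $z$, one row per posterior sample and one column per summary statistic:

```python
import numpy as np


def misspecification_probs(z_samples, rho=0.5):
    """Estimate Pr(z_j = 1 | y) as the posterior frequency of the slab.

    z_samples : (n_samples, D) array of 0/1 draws of z from MCMC.
    Returns per-statistic posterior probabilities and a boolean flag that is
    True where the posterior probability exceeds the prior rho, i.e. where
    the model provides evidence of misspecification.
    """
    post_prob = np.asarray(z_samples).mean(axis=0)  # Monte Carlo estimate
    return post_prob, post_prob > rho


# Toy example: 4 posterior samples over D = 3 summary statistics.
z_samples = np.array([[1, 0, 1],
                      [1, 0, 0],
                      [1, 0, 1],
                      [1, 1, 1]])
probs, flagged = misspecification_probs(z_samples)
```

In this toy example the first and third statistics have posterior slab frequencies above $\rho = 0.5$ and would be flagged as showing evidence of misspecification.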
3 Related Work
3.1 Model Criticism in SBI
Posterior predictive checks have been widely used in SBI to compare predictive samples to the observed data (e.g. Durkan et al., 2020; Greenberg et al., 2019; Papamakarios et al., 2019b). Although this is a form of model criticism, a key limitation of this approach is that if a discrepancy is discovered, it may not be clear whether it is due to a failure of the inference procedure or to simulator misspecification. In general, it may be possible to identify the presence of misspecification
using anomaly/novelty detection. One such approach was suggested by Schmitt et al. (2021) in the
² This could, for example, be assessed by estimating highest density regions of $q(x)$, using the density quantile approach from Hyndman (1996).