
Figure 2: Schematic depicts how the loss is computed in NRE algorithms. $(\theta, x)$ pairs are sampled from distributions at the top of the figure, entering the loss functions as depicted. NRE-C controls the number of contrastive classes with $K$ and the weight of independent and dependent terms with $p_0$ and $p_K$. NRE-C generalizes other algorithms; hyperparameters recovering NRE-A and NRE-B are listed next to the name within the dashed areas. Notation details are defined in Section 2.1.
Density estimation [5, 55, 56] can fit the likelihood [2, 15, 43, 57] or posterior [6, 21, 42, 54] directly; however, an appealing alternative for practitioners is estimating a ratio between distributions [12, 16, 30, 34, 70]. Specifically, the likelihood-to-evidence ratio
$$\frac{p(\theta \mid x)}{p(\theta)} = \frac{p(x \mid \theta)}{p(x)} = \frac{p(\theta, x)}{p(\theta)\,p(x)}.$$
Unlike the other methods, ratio estimation enables easy aggregation of independent and identically drawn data $x$. Ratio and posterior estimation can compute bounds on the mutual information and an importance sampling diagnostic.
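To make the aggregation claim concrete (a standard identity following from Bayes' rule and conditional independence, not specific to any particular estimator): for $n$ conditionally independent observations $x_{1:n} = (x_1, \dots, x_n)$,
$$\frac{p(\theta \mid x_{1:n})}{p(\theta)} \propto \prod_{i=1}^{n} \frac{p(x_i \mid \theta)}{p(x_i)},$$
with proportionality in $\theta$, so a single amortized ratio estimator can be evaluated once per observation and the log-ratios summed to target the joint posterior.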
Estimating $\frac{p(x \mid \theta)}{p(x)}$ can be formulated as a binary classification task [30], where the classifier $\sigma \circ f_w(\theta, x)$ distinguishes between pairs $(\theta, x)$ sampled either from the joint distribution $p(\theta, x)$ or from the product of its marginals $p(\theta)\,p(x)$. We call this algorithm NRE-A. The optimal classifier satisfies
$$f_w(\theta, x) \approx \log \frac{p(\theta \mid x)}{p(\theta)}. \quad (1)$$
Here, $\sigma$ represents the sigmoid function, $\circ$ denotes function composition, and $f_w$ is a neural network with weights $w$.
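As a minimal sketch of this objective (our own illustration under assumed interfaces, not the authors' reference implementation; `f_w` stands for any network mapping a $(\theta, x)$ pair to a scalar logit), dependent pairs come straight from the simulator's joint samples while independent pairs are formed by shuffling $x$ within the batch:

```python
import torch
import torch.nn.functional as F

def nre_a_loss(f_w, theta, x):
    """Binary cross-entropy loss for NRE-A (sketch).

    theta, x: batches drawn jointly from p(theta, x).
    Shuffling x within the batch approximates draws from p(theta)p(x).
    """
    logits_joint = f_w(theta, x)                             # label 1: dependent pairs
    logits_marginal = f_w(theta, x[torch.randperm(len(x))])  # label 0: independent pairs
    loss_joint = F.binary_cross_entropy_with_logits(
        logits_joint, torch.ones_like(logits_joint))
    loss_marginal = F.binary_cross_entropy_with_logits(
        logits_marginal, torch.zeros_like(logits_marginal))
    return (loss_joint + loss_marginal) / 2
```

At optimum, `f_w` recovers the log likelihood-to-evidence ratio of Equation (1).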
As part of an effort to unify different SBI methods and to improve simulation efficiency, Durkan et al. [16] reformulated the classification task as identifying which of $K$ possible $\theta_k$ was responsible for simulating $x$. We refer to this algorithm as NRE-B. At optimum,
$$g_w(\theta, x) \approx \log \frac{p(\theta \mid x)}{p(\theta)} + c_w(x), \quad (2)$$
where an additional bias, $c_w(x)$, appears; here $g_w$ represents another neural network. The $c_w(x)$ term nullifies many of the advantages that ratio estimation offers: it can be arbitrarily pathological in $x$, meaning that the estimated normalizing constant can take on extreme values. This limits the applicability of verification tools like the importance sampling-based diagnostic in Section 2.2.
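To see where the freedom in $c_w(x)$ comes from, consider a sketch of a $K$-way contrastive loss in the style of NRE-B (our own illustration; the batch layout and the convention that the generating parameter sits at index 0 are assumptions, not the reference implementation):

```python
import torch
import torch.nn.functional as F

def nre_b_loss(g_w, thetas, x):
    """K-way softmax cross-entropy in the style of NRE-B (sketch).

    thetas: (B, K, ...) candidate parameters; the true one is at index 0.
    x:      (B, ...) observations simulated from thetas[:, 0].
    """
    logits = g_w(thetas, x)                          # (B, K): one logit per candidate
    labels = torch.zeros(len(x), dtype=torch.long)   # index of the generating theta
    return F.cross_entropy(logits, labels)

# The softmax is invariant to adding any function of x alone to all K
# logits of the same observation: softmax(logits + c(x)) == softmax(logits).
# The loss therefore identifies g_w only up to the x-dependent bias c_w(x)
# in Equation (2).
```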
The $c_w(x)$ term also arises in contrastive learning [23, 72], with Ma and Collins [45] attempting to estimate it in order to reduce its impact. We instead propose a method that discourages this bias. Further discussion appears in Appendix D.
There is a distinction in deep learning-based SBI between amortized and sequential algorithms, which produce surrogate models that estimate any posterior $p(\theta \mid x)$ or a specific posterior $p(\theta \mid x_o)$, respectively. Amortized algorithms sample parameters from the prior, while sequential algorithms use an alternative proposal distribution, increasing simulation efficiency at the expense of flexibility. Amortization is usually necessary to compute diagnostics that do not require samples from $p(\theta \mid x_o)$, and amortized estimators are empirically more reliable [31]. Our study therefore focuses on amortized algorithms.
Contribution We design a more general formulation of likelihood-to-evidence ratio estimation as a multiclass problem in which the bias inherent to NRE-B is discouraged by the loss function and does not appear at optimum. Figure 1 diagrams the interpolated performance as a function of hyperparameters; it shows which settings recover NRE-A and NRE-B, and indicates that the highest performance occurs at settings distant from both. Figure 2 shows the relationship between the loss functions. We call our framework NRE-C¹ and expound the details in Section 2.
An existing importance sampling diagnostic [30] tests whether a classifier can distinguish between samples from $p(x \mid \theta)$ and samples from $p(x)$ weighted by the estimated ratio. We demonstrate that, when estimating accurate posteriors, our proposed NRE-C passes this diagnostic while NRE-B does not.
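A minimal sketch of one way to set up such a check (our own instantiation under assumed interfaces: `ratio_fn` returns the estimated log-ratio $\log \hat{r}(x \mid \theta)$, and `classifier` is a fresh network trained from scratch): weight the marginal samples by the estimated ratio and train a classifier to separate them from likelihood samples; an accurate ratio leaves the classifier near chance level.

```python
import torch
import torch.nn.functional as F

def diagnostic_loss(classifier, ratio_fn, theta, x_likelihood, x_marginal):
    """One training step of the importance-sampling diagnostic (sketch).

    x_likelihood ~ p(x | theta); x_marginal ~ p(x). If exp(ratio_fn)
    estimates p(x|theta)/p(x) well, the reweighted marginal samples are
    indistinguishable from likelihood samples and the classifier cannot
    beat chance.
    """
    weights = torch.exp(ratio_fn(theta, x_marginal)).detach()  # importance weights
    logit_like = classifier(theta, x_likelihood)               # label 1
    logit_marg = classifier(theta, x_marginal)                 # label 0, reweighted
    loss_like = F.binary_cross_entropy_with_logits(
        logit_like, torch.ones_like(logit_like))
    loss_marg = F.binary_cross_entropy_with_logits(
        logit_marg, torch.zeros_like(logit_marg), weight=weights)
    return loss_like + loss_marg
```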
¹The code for our project can be found at https://github.com/bkmi/cnre under the Apache License 2.0.