Contrastive Neural Ratio Estimation
for Simulation-based Inference
Benjamin Kurt Miller
University of Amsterdam
b.k.miller@uva.nl
Christoph Weniger
University of Amsterdam
c.weniger@uva.nl
Patrick Forré
University of Amsterdam
p.d.forre@uva.nl
Abstract
Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For fair comparison, we benchmark the behavior of all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace fixed data and parameters setting. Our investigations reveal that the highest performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We make a recommendation for hyperparameters distinct from the previous models. We suggest two bounds on the mutual information as performance metrics for simulation-based inference methods, without the need for posterior samples, and provide experimental results. This version corrects a minor implementation error in $\gamma$, improving results.
1 Introduction
Figure 1: Conceptual, interpolated map from investigated hyperparameters of the proposed algorithm NRE-C to a measurement of posterior exactness using the Classifier Two-Sample Test. Best 0.5, worst 1.0. The red dot indicates NRE-A's hyperparameters, $\gamma = 1$ and $K = 1$ [30]. The purple line implies NRE-B [16] with $\gamma = \infty$ and $K > 1$. NRE-C covers the entire plane, generalizing the other methods. Best performance occurs with $K > 1$ and $\gamma \approx 1$, in contrast with the settings of existing algorithms.
We begin with a motivating example: Consider the task of inferring the mass of an exoplanet $\theta_o$ from the light curve observations $x_o$ of a distant star. We design a computer program that maps a hypothetical mass $\theta$ to a simulated light curve $x$ using relevant physical theory. Our simulator computes $x$ from $\theta$, but the inverse mapping is unspecified and likely intractable. Simulation-based inference (SBI) puts this problem in a probabilistic context [13, 65]. Although we cannot analytically evaluate it, we assume that the simulator is sampling from the conditional probability distribution $p(x \mid \theta)$. After specifying a prior $p(\theta)$, the inverse amounts to estimating the posterior $p(\theta \mid x_o)$. This problem setting occurs across scientific domains [1, 7, 10, 11, 29] where $\theta$ generally represents input parameters of the simulator and $x$ the simulated output observation. Our design goal is to produce a surrogate model $\hat{p}(\theta \mid x)$ approximating the posterior for any data $x$ while limiting excessive simulation.
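To ground the notation, the following toy sketch shows the only operations SBI assumes are available: sampling from the prior and running the simulator forward. The uniform prior and Gaussian simulator are illustrative stand-ins, not the exoplanet model above.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior(n):
    # Draw theta ~ p(theta); here a placeholder uniform prior.
    return rng.uniform(0.0, 10.0, size=(n, 1))

def simulator(theta):
    # Draw x ~ p(x | theta); we can sample from the simulator
    # but cannot evaluate its density, the defining SBI constraint.
    return theta + rng.normal(0.0, 1.0, size=theta.shape)

# Jointly drawn training pairs (theta, x) ~ p(theta, x) = p(theta) p(x | theta).
theta = prior(1000)
x = simulator(theta)
```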
Figure 2: Schematic depicting how the loss is computed in the NRE algorithms. $(\theta, x)$ pairs are sampled from the distributions at the top of the figure, entering the loss functions as depicted. NRE-C controls the number of contrastive classes with $K$ and the weight of the independent and dependent terms with $p_0$ and $p_K$. NRE-C generalizes the other algorithms. Hyperparameters recovering NRE-A and NRE-B are listed next to the name within the dashed areas. Notation details are defined in Section 2.1.
Density estimation [5, 55, 56] can fit the likelihood [2, 15, 43, 57] or posterior [6, 21, 42, 54] directly; however, an appealing alternative for practitioners is estimating a ratio between distributions [12, 16, 30, 34, 70]. Specifically, the likelihood-to-evidence ratio
$$\frac{p(\theta \mid x)}{p(\theta)} = \frac{p(x \mid \theta)}{p(x)} = \frac{p(\theta, x)}{p(\theta)\,p(x)}.$$
Unlike the other methods, ratio estimation enables easy aggregation of independent and identically drawn data $x$ (spelled out below). Ratio and posterior estimation can compute bounds on the mutual information and an importance sampling diagnostic.
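To spell out the aggregation claim (our addition, following directly from conditional independence): for observations $x_1, \ldots, x_n$ that are independent given $\theta$,
$$p(\theta \mid x_1, \ldots, x_n) = \frac{p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)}{p(x_1, \ldots, x_n)} \propto p(\theta) \prod_{i=1}^{n} \frac{p(x_i \mid \theta)}{p(x_i)} = p(\theta) \prod_{i=1}^{n} r(x_i \mid \theta),$$
so a single trained ratio estimator can be applied per observation and the log-ratios summed to obtain the aggregated (unnormalized) posterior.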
Estimating $\frac{p(x \mid \theta)}{p(x)}$ can be formulated as a binary classification task [30], where the classifier $\sigma \circ f_w(\theta, x)$ distinguishes between pairs $(\theta, x)$ sampled either from the joint distribution $p(\theta, x)$ or the product of its marginals $p(\theta)\,p(x)$. We call it NRE-A. The optimal classifier has
$$f_w(\theta, x) = \log \frac{p(\theta \mid x)}{p(\theta)}. \quad (1)$$
Here, $\sigma$ represents the sigmoid function, $\circ$ implies function composition, and $f_w$ is a neural network with weights $w$. As part of an effort to unify different SBI methods and to improve simulation-efficiency, Durkan et al. [16] reformulated the classification task to identify which of $K$ possible $\theta_k$ was responsible for simulating $x$. We refer to it as NRE-B. At optimum
$$g_w(\theta, x) = \log \frac{p(\theta \mid x)}{p(\theta)} + c_w(x), \quad (2)$$
where an additional bias, $c_w(x)$, appears. $g_w$ represents another neural network. The $c_w(x)$ term nullifies many of the advantages ratio estimation offers. $c_w(x)$ can be arbitrarily pathological in $x$, meaning that the normalizing constant can take on extreme values. This limits the applicability of verification tools like the importance sampling-based diagnostic in Section 2.2.
The $c_w(x)$ term also arises in contrastive learning [23, 72], with Ma and Collins [45] attempting to estimate it in order to reduce its impact. We will propose a method that discourages this bias instead. Further discussion is in Appendix D.
There is a distinction in deep learning-based SBI between amortized and sequential algorithms, which produce surrogate models that estimate any posterior $p(\theta \mid x)$ or a specific posterior $p(\theta \mid x_o)$, respectively. Amortized algorithms sample parameters from the prior, while sequential algorithms use an alternative proposal distribution, increasing efficiency at the expense of flexibility. Amortization is usually necessary to compute diagnostics that do not require samples from $p(\theta \mid x_o)$, and amortized estimators are empirically more reliable [31]. Our study therefore focuses on amortized algorithms.
Contribution  We design a more general formulation of likelihood-to-evidence ratio estimation as a multiclass problem in which the bias inherent to NRE-B is discouraged by the loss function and does not appear at optimum. Figure 1 diagrams the interpolated performance as a function of hyperparameters. It shows which settings recover NRE-A and NRE-B, also indicating that the highest performance occurs with settings distant from these. Figure 2 shows the relationship of the loss functions. We call our framework NRE-C¹ and expound the details in Section 2.
An existing importance sampling diagnostic [30] tests whether a classifier can distinguish $p(x \mid \theta)$ samples from samples from $p(x)$ weighted by the estimated ratio. We demonstrate that, when estimating accurate posteriors, our proposed NRE-C passes this diagnostic while NRE-B does not.

¹The code for our project can be found at https://github.com/bkmi/cnre under the Apache License 2.0.
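A rough sketch of how such a diagnostic could be set up is given below; it is our illustration of the idea, not the authors' exact procedure, and `simulator`, `prior`, and `log_ratio` are hypothetical stand-ins for a trained NRE model and its inference task.

```python
import numpy as np

def importance_resample(theta, simulator, prior, log_ratio, n=5000, seed=0):
    """Produce two sample sets that should match if the estimated ratio is accurate:
    x ~ p(x | theta), and x ~ p(x) resampled with weights proportional to r_hat(x | theta)."""
    rng = np.random.default_rng(seed)
    theta_rep = np.repeat(theta[None], n, axis=0)
    x_likelihood = simulator(theta_rep)            # samples from p(x | theta)
    x_marginal = simulator(prior(n))               # samples from p(x), via prior predictive
    log_w = log_ratio(theta_rep, x_marginal)       # estimated log r(x | theta)
    w = np.exp(log_w - log_w.max())                # stabilized importance weights
    w /= w.sum()
    x_weighted = x_marginal[rng.choice(n, size=n, p=w)]
    return x_likelihood, x_weighted
```

The two returned sets can then be compared with any two-sample classifier test, e.g. the C2ST sketched in Section 2.2; classifier accuracy near 0.5 indicates the ratio estimator passes the diagnostic.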
Taking inspiration from mutual information estimation [59], we propose applying a variational bound on the mutual information between $\theta$ and $x$ in a novel way: as an informative metric measuring a lower bound on the Kullback-Leibler divergence between the surrogate posterior estimate $p_w(\theta \mid x)$ and $p(\theta \mid x)$, averaged over $p(x)$. Unlike with two-sample testing methods commonly used in the machine learning literature [44], our metric samples only from $p(\theta, x)$, which is always available in SBI, and does not require samples from the intractable $p(\theta \mid x)$. Our metric is meaningful to scientists working on problems with intractable posteriors. The technique requires estimating the partition function, which can be expensive. We find the metric to be well correlated with results from two-sample tests.
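The specific bounds are developed later in the paper; purely to illustrate the flavor of the computation, the sketch below evaluates one standard lower-bound-style estimate built from a trained log-ratio network (here called `h_w`, an assumed name), using only joint samples, prior samples, and a Monte Carlo estimate of the partition function. It should not be read as the authors' exact metric.

```python
import math
import torch

def mi_lower_bound(h_w, theta_joint, x_joint, theta_prior):
    """Estimate E_{p(theta,x)}[h_w(theta, x)] - E_{p(x)}[log Z_w(x)], with
    Z_w(x) = E_{p(theta)}[exp h_w(theta, x)] approximated by M prior draws.
    For an exact ratio this equals the mutual information I(theta; x); for an
    imperfect surrogate it falls short of I(theta; x) by the KL divergence
    between true and surrogate posterior, averaged over p(x). The finite-M
    estimate of log Z_w is itself biased, so treat this as a comparison metric."""
    M, B = theta_prior.shape[0], x_joint.shape[0]
    joint_term = h_w(theta_joint, x_joint).mean()
    # Evaluate every observation against all M prior draws to estimate log Z_w(x).
    t = theta_prior.unsqueeze(1).expand(M, B, -1).reshape(M * B, -1)
    x = x_joint.unsqueeze(0).expand(M, B, -1).reshape(M * B, -1)
    log_Z = torch.logsumexp(h_w(t, x).view(M, B), dim=0) - math.log(M)
    return joint_term - log_Z.mean()
```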
We evaluate NRE-B and NRE-C in a fair comparison in several training regimes in Section 3. We perform a hyperparameter search on three simulators with tractable likelihood by benchmarking the behavior when (a) jointly drawn pairs $(\theta, x)$ are unlimited, or when jointly drawn pairs $(\theta, x)$ are fixed but we (b) can draw from the prior $p(\theta)$ without limit or (c) are restricted to the initial pairs. We also perform the SBI benchmark of Lueckmann et al. [44] with our recommended hyperparameters.
2 Methods
The ratio between probability distributions can be estimated using the "likelihood ratio trick" by training a classifier to distinguish samples [12, 19, 27, 30, 51, 67, 70]. We first summarize the loss functions of NRE-A and NRE-B, which approximate the intractable likelihood-to-evidence ratio $r(x \mid \theta) := \frac{p(x \mid \theta)}{p(x)}$. We then elaborate on our proposed generalization, NRE-C. Finally, we explain how to recover NRE-A and NRE-B within our framework and comment on the normalization properties.
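As a brief reminder (our addition), the trick rests on the form of the Bayes-optimal classifier between dependent and independent pairs when both classes are equally probable:
$$d^*(\theta, x) = \frac{p(\theta, x)}{p(\theta, x) + p(\theta)\, p(x)}, \qquad \frac{d^*(\theta, x)}{1 - d^*(\theta, x)} = \frac{p(\theta, x)}{p(\theta)\, p(x)} = r(x \mid \theta),$$
so the logit of the optimal classifier equals the log likelihood-to-evidence ratio.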
NRE-A  Hermans et al. [30] train a binary classifier to distinguish $(\theta, x)$ pairs drawn dependently from $p(\theta, x)$ from those drawn independently from $p(\theta)\,p(x)$. This classifier is parameterized by a neural network $f_w$ which approximates $\log r(x \mid \theta)$. We seek optimal network weights
$$w^* \leftarrow \arg\min_w -\frac{1}{2B} \left[ \sum_{b=1}^{B} \log\left(1 - \sigma\!\left(f_w(\theta^{(b)}, x^{(b)})\right)\right) + \sum_{b=1}^{B} \log \sigma\!\left(f_w(\theta^{(b)}, x^{(b)})\right) \right], \quad (3)$$
where in the first sum $(\theta^{(b)}, x^{(b)}) \sim p(\theta)\,p(x)$ and in the second $(\theta^{(b)}, x^{(b)}) \sim p(\theta, x)$, over $B$ samples. NRE-A's ratio estimate converges to $f_w = \log \frac{p(x \mid \theta)}{p(x)}$ given unlimited model flexibility and data. Details can be found in Appendix A.
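As a concrete reference, a minimal PyTorch sketch of the binary loss in (3) might look as follows; `f_w` is assumed to be any network mapping batched $(\theta, x)$ pairs to scalar logits, and the independent pairs are formed by shuffling $\theta$ within the mini-batch, a common implementation choice we assume here.

```python
import torch
import torch.nn.functional as F

def nre_a_loss(f_w, theta, x):
    """Binary NRE-A loss: joint pairs are class 1, shuffled (independent) pairs are class 0."""
    logits_joint = f_w(theta, x)                           # (theta, x) ~ p(theta, x)
    theta_shuffled = theta[torch.randperm(theta.shape[0])]
    logits_marginal = f_w(theta_shuffled, x)               # (theta, x) ~ p(theta) p(x)
    ones = torch.ones_like(logits_joint)
    zeros = torch.zeros_like(logits_marginal)
    return 0.5 * (
        F.binary_cross_entropy_with_logits(logits_joint, ones)
        + F.binary_cross_entropy_with_logits(logits_marginal, zeros)
    )
```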
NRE-B  Durkan et al. [16] train a classifier that selects from among $K$ parameters $(\theta_1, \ldots, \theta_K)$ which could have generated $x$, in contrast with NRE-A's binary possibilities. One of these parameters $\theta_k$ is always drawn jointly with $x$. The classifier is parameterized by a neural network $g_w$ which approximates $\log r(x \mid \theta)$. Training is done over $B$ samples by finding
$$w^* \leftarrow \arg\min_w -\frac{1}{B} \sum_{b=1}^{B} \log \frac{\exp g_w(\theta^{(b)}_k, x^{(b)})}{\sum_{i=1}^{K} \exp g_w(\theta^{(b)}_i, x^{(b)})}, \quad (4)$$
where $\theta^{(b)}_1, \ldots, \theta^{(b)}_K \sim p(\theta)$ and $x^{(b)} \sim p(x \mid \theta^{(b)}_k)$. Given unlimited model flexibility and data, NRE-B's ratio estimate converges to $g_w(\theta, x) = \log \frac{p(\theta \mid x)}{p(\theta)} + c_w(x)$. Details are in Appendix A.
2.1 Contrastive Neural Ratio Estimation
Our proposed algorithm NRE-C trains a classifier to identify which $\theta$ among $K$ candidates is responsible for generating a given $x$, inspired by NRE-B. We added another option that indicates $x$ was drawn independently, inspired by NRE-A. The introduction of the additional class yields a ratio without the specific $c_w(x)$ bias at optimum. Define $\Theta := (\theta_1, \ldots, \theta_K)$ and the conditional probability
$$p_{\text{NRE-C}}(\Theta, x \mid y = k) := \begin{cases} p(\theta_1) \cdots p(\theta_K)\, p(x) & k = 0 \\ p(\theta_1) \cdots p(\theta_K)\, p(x \mid \theta_k) & k = 1, \ldots, K. \end{cases} \quad (5)$$
We set marginal probabilities $p(y = k) := p_K$ for all $k \geq 1$ and $p(y = 0) := p_0$, yielding the relationship $p_0 = 1 - K p_K$. Let the odds of any pair being drawn dependently to completely independently be $\gamma := \frac{K p_K}{p_0}$. We now use Bayes' formula to compute the conditional probability
$$p(y = k \mid \Theta, x) = \frac{p(y = k)\, p(\Theta, x \mid y = k) / p(\Theta, x \mid y = 0)}{\sum_{i=0}^{K} p(y = i)\, p(\Theta, x \mid y = i) / p(\Theta, x \mid y = 0)} = \frac{p(y = k)\, p(\Theta, x \mid y = k) / p(\Theta, x \mid y = 0)}{p(y = 0) + \sum_{i=1}^{K} p(y = i)\, p(\Theta, x \mid y = i) / p(\Theta, x \mid y = 0)} = \begin{cases} \frac{K}{K + \gamma \sum_{i=1}^{K} r(x \mid \theta_i)} & k = 0 \\[4pt] \frac{\gamma\, r(x \mid \theta_k)}{K + \gamma \sum_{i=1}^{K} r(x \mid \theta_i)} & k = 1, \ldots, K. \end{cases} \quad (6)$$
We dropped the NRE-C subscript and substituted in $\gamma$ to replace the $p(y)$ class probabilities. We train a classifier, parameterized by a neural network $h_w(\theta, x)$ with weights $w$, to approximate (6) by
$$q_w(y = k \mid \Theta, x) = \begin{cases} \frac{K}{K + \gamma \sum_{i=1}^{K} \exp h_w(\theta_i, x)} & k = 0 \\[4pt] \frac{\gamma \exp h_w(\theta_k, x)}{K + \gamma \sum_{i=1}^{K} \exp h_w(\theta_i, x)} & k = 1, \ldots, K. \end{cases} \quad (7)$$
We note that (7) still satisfies $\sum_{k=0}^{K} q_w(y = k \mid \Theta, x) = 1$, no matter the parameterization.
Optimization  We design a loss function that encourages $h_w(\theta, x) = \log \frac{p(x \mid \theta)}{p(x)}$ at convergence, a property that holds exactly at optimum with unlimited flexibility and data. We introduce the cross entropy loss
$$\begin{aligned} \ell(w) &:= \mathbb{E}_{p(y, \Theta, x)}\left[ -\log q_w(y \mid \Theta, x) \right] \\ &= -p_0\, \mathbb{E}_{p(\Theta, x \mid y=0)}\left[ \log q_w(y = 0 \mid \Theta, x) \right] - p_K \sum_{k=1}^{K} \mathbb{E}_{p(\Theta, x \mid y=k)}\left[ \log q_w(y = k \mid \Theta, x) \right] \\ &= -p_0\, \mathbb{E}_{p(\Theta, x \mid y=0)}\left[ \log q_w(y = 0 \mid \Theta, x) \right] - K p_K\, \mathbb{E}_{p(\Theta, x \mid y=K)}\left[ \log q_w(y = K \mid \Theta, x) \right] \end{aligned} \quad (8)$$
and minimize it towards $w^* \leftarrow \arg\min_w \ell(w)$. We point out that the final term is symmetric up to permutation of $\Theta$, enabling the replacement of the sum by multiplication with $K$. When $\gamma$ and $K$ are known, $p_0 = \frac{1}{1 + \gamma}$ and $p_K = \frac{1}{K} \frac{\gamma}{1 + \gamma}$ under our constraints. Without loss of generality, we let $\theta_1, \ldots, \theta_K \sim p(\theta)$ and $x \sim p(x \mid \theta_K)$. An empirical estimate of the loss on $B$ samples is therefore
$$\hat{\ell}_{\gamma, K}(w) := -\frac{1}{B} \left[ \frac{1}{1 + \gamma} \sum_{b=1}^{B} \log q_w\!\left(y = 0 \mid \Theta^{(b)}, x^{(b)}\right) + \frac{\gamma}{1 + \gamma} \sum_{b=1}^{B} \log q_w\!\left(y = K \mid \Theta^{(b)}, x^{(b)}\right) \right]. \quad (9)$$
In the first term, the classifier sees a completely independently drawn sample of $x$ and $\Theta$, while $\theta_K$ is drawn jointly with $x$ in the second term. In both terms, the classifier considers $K$ choices. In practice, we bootstrap both $\theta^{(b)}_1, \ldots, \theta^{(b)}_K$ and $\theta^{(b)}_1, \ldots, \theta^{(b)}_{K-1}$ from the same mini-batch and compare them to the same $x$, similarly to NRE-A and NRE-B. Proof of the above is in Appendix B.
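To make (7) and (9) concrete, here is a minimal PyTorch sketch of how the NRE-C loss could be assembled; this is our illustrative reading of the equations rather than the authors' reference implementation (available at the repository cited above), and `h_w` is assumed to map batched $(\theta, x)$ pairs to scalar logits.

```python
import math
import torch

def nre_c_loss(h_w, theta, x, K, gamma):
    """Empirical NRE-C loss, Eq. (9): one term where all K candidates are independent
    of x (class y = 0) and one where the K-th candidate is the jointly drawn theta
    (class y = K). log q_w follows Eq. (7), computed with logsumexp for stability."""
    B = theta.shape[0]
    log_K, log_gamma = math.log(K), math.log(gamma)

    def log_q(candidates, k):
        # candidates: (B, K, d_theta); returns log q_w(y = k | Theta, x), shape (B,)
        x_rep = x.unsqueeze(1).expand(B, K, -1)
        h = h_w(candidates.reshape(B * K, -1), x_rep.reshape(B * K, -1)).view(B, K)
        log_denom = torch.logsumexp(
            torch.cat([torch.full((B, 1), log_K), log_gamma + h], dim=1), dim=1)
        return (log_K if k == 0 else log_gamma + h[:, k - 1]) - log_denom

    indep = theta[torch.randint(0, B, (B, K))]                 # all candidates independent of x
    joint = torch.cat([theta[torch.randint(0, B, (B, K - 1))],
                       theta.unsqueeze(1)], dim=1)             # jointly drawn theta in slot K
    return -(log_q(indep, 0).mean() + gamma * log_q(joint, K).mean()) / (1.0 + gamma)
```

The contrastive candidates are drawn by resampling the mini-batch, mirroring the bootstrapping described above; with $\gamma = 1$ and $K = 1$ this reduces to the binary loss in (10).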
Recovering NRE-A and NRE-B  NRE-C is general because specific hyperparameter settings recover NRE-A and NRE-B. To recover NRE-A one should set $\gamma = 1$ and $K = 1$ in (9), yielding
$$\begin{aligned} \hat{\ell}_{1,1}(w) &= -\frac{1}{2B} \left[ \sum_{b=1}^{B} \log \frac{1}{1 + \exp h_w(\theta^{(b)}, x^{(b)})} + \sum_{b=1}^{B} \log \frac{\exp h_w(\theta^{(b)}, x^{(b)})}{1 + \exp h_w(\theta^{(b)}, x^{(b)})} \right] \\ &= -\frac{1}{2B} \left[ \sum_{b=1}^{B} \log\left(1 - \sigma\!\left(h_w(\theta^{(b)}, x^{(b)})\right)\right) + \sum_{b=1}^{B} \log \sigma\!\left(h_w(\theta^{(b)}, x^{(b)})\right) \right] \end{aligned} \quad (10)$$
where we dropped the lower index. Recovering NRE-B requires taking the limit $\gamma \to \infty$ in the loss function. In that case, the first term goes to zero, and the second term converges to the softmax function. The resulting loss,
$$\hat{\ell}_{\infty, K}(w) = \lim_{\gamma \to \infty} \hat{\ell}_{\gamma, K}(w) = -\frac{1}{B} \sum_{b=1}^{B} \log \frac{\exp h_w(\theta^{(b)}_K, x^{(b)})}{\sum_{i=1}^{K} \exp h_w(\theta^{(b)}_i, x^{(b)})}, \quad (11)$$
is determined by substitution into (9). Both equations match their counterparts (3) and (4), up to the name of the network.
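To make the limit explicit (a short verification we add for clarity), the second-term class probability in (7) tends to the softmax as $\gamma \to \infty$:
$$\lim_{\gamma \to \infty} q_w(y = K \mid \Theta, x) = \lim_{\gamma \to \infty} \frac{\gamma \exp h_w(\theta_K, x)}{K + \gamma \sum_{i=1}^{K} \exp h_w(\theta_i, x)} = \frac{\exp h_w(\theta_K, x)}{\sum_{i=1}^{K} \exp h_w(\theta_i, x)},$$
while the weight $\frac{1}{1 + \gamma}$ on the first term of (9) vanishes, leaving exactly the NRE-B objective (4).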
Estimating a normalized posterior  In the limit of infinite data and infinite neural network capacity (width, depth), the optimal classifier trained using NRE-C (with $\gamma \in \mathbb{R}^{+}$) satisfies the equality
$$h_w(\theta, x) = \log \frac{p(\theta \mid x)}{p(\theta)}. \quad (12)$$
In particular, we have that the following normalizing constant is trivial:
$$Z(x) := \int \exp\left(h_w(\theta, x)\right) p(\theta)\, d\theta = \int p(\theta \mid x)\, d\theta = 1. \quad (13)$$
This is a result of Lemma 1 in Appendix B. However, practitioners never operate in this setting; rather, they use finite sample sizes and neural networks with limited capacity that are optimized locally. The non-optimal function $\exp(h_w(\theta, x))$ does not have a direct interpretation as a ratio of probability distributions, but rather as the function that weighs the prior $p(\theta)$ to approximate the unnormalized posterior. In other words, we find the following approximation for the posterior $p(\theta \mid x)$:
$$p_w(\theta \mid x) := \frac{\exp(h_w(\theta, x))}{Z_w(x)}\, p(\theta), \qquad Z_w(x) := \int \exp\left(h_w(\theta, x)\right) p(\theta)\, d\theta, \quad (14)$$
where in general the normalizing constant is not trivial, i.e. $Z_w(x) \neq 1$. As stated above, the NRE-C (and NRE-A) objective encourages $Z_w(x)$ to converge to $1$. This is in sharp contrast to NRE-B, where even at optimum with an unrestricted function class a non-trivial $x$-dependent bias term can appear.

There is no restriction on how pathological the NRE-B bias $c_w(x)$ can be. Consider a minimizer of (4), the NRE-B loss function, $h_w + c_w(x)$. Adding any function $d(x)$ cancels out in the fraction and also yields a minimizer of (4). This freedom complicates any numerical computation of the normalizing constant and renders the importance sampling diagnostic from Section 2.2 generally inapplicable. We report Monte Carlo estimates of $Z_w(x)$ on a test problem across hyperparameters in Figure 14.
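Since the section leans on Monte Carlo estimates of $Z_w(x)$, a brief sketch of how such an estimate could be computed from prior samples is shown below; the network name and batching interface are assumptions on our part.

```python
import math
import torch

def estimate_log_Z(h_w, x, prior_samples):
    """Monte Carlo estimate of log Z_w(x) = log E_{p(theta)}[exp h_w(theta, x)]
    for a single observation x, using M draws from the prior."""
    M = prior_samples.shape[0]
    x_rep = x.unsqueeze(0).expand(M, -1)       # reuse the same x for every theta
    log_weights = h_w(prior_samples, x_rep)    # shape (M,)
    # logsumexp keeps the estimate numerically stable; for a well-calibrated
    # NRE-A or NRE-C model this should be close to log 1 = 0.
    return torch.logsumexp(log_weights, dim=0) - math.log(M)
```

For NRE-B, the arbitrary $c_w(x)$ offset shifts this estimate by an unknown amount, which is precisely why the normalization check becomes uninformative there.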
2.2 Measuring performance & ratio estimator diagnostics
SBI is difficult to verify because, for many use cases, the practitioner cannot compare the surrogate $p_w(\theta \mid x)$ to the intractable ground truth $p(\theta \mid x)$. Incongruous with the practical use case for SBI, much of the literature has focused on measuring the similarity between surrogate and posterior using two-sample tests on tractable problems. For comparison with the literature, we first reference a two-sample exactness metric which requires a tractable posterior. We then discuss diagnostics which do not require samples from $p(\theta \mid x)$, commenting on the relevance for each NRE algorithm with empirical results. Further, we find that a known variational bound on the mutual information is tractable to estimate within SBI, that it bounds the average Kullback-Leibler divergence between surrogate and posterior, and we propose to use it for model comparison on intractable inference tasks.

Comparing to a tractable posterior with estimates of exactness  Assessments of approximate posterior quality are available when samples can be drawn from both the posterior $\theta \sim p(\theta \mid x)$ and the approximation $\theta \sim q(\theta \mid x)$. In the deep learning-based SBI literature, exactness is measured as a function of computational cost, usually simulator calls. We investigate this with NRE-C in Section 3.3. Based on the recommendations of Lueckmann et al. [44], our experimental results are measured using the Classifier Two-Sample Test (C2ST) [17, 40, 41]. A classifier is trained to distinguish samples from either the surrogate or the ground truth posterior. An average classification probability on holdout data of 1.0 implies that samples from each distribution are easily identified; 0.5 implies either the distributions are the same or the classifier does not have the capacity to distinguish them.
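A minimal sketch of a C2ST computation is given below for reference; the choice of classifier and split sizes are our own illustrative defaults, not those prescribed by [44].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def c2st(samples_p, samples_q, seed=0):
    """Classifier Two-Sample Test score in roughly [0.5, 1.0]:
    ~0.5 means the two sample sets are indistinguishable to the classifier,
    ~1.0 means they are easily told apart."""
    X = np.concatenate([samples_p, samples_q])
    y = np.concatenate([np.zeros(len(samples_p)), np.ones(len(samples_q))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=seed)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```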