Contrastive Neural Ratio Estimation
for Simulation-based Inference
Benjamin Kurt Miller
University of Amsterdam
b.k.miller@uva.nl
Christoph Weniger
University of Amsterdam
c.weniger@uva.nl
Patrick Forré
University of Amsterdam
p.d.forre@uva.nl
Abstract
Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For fair comparison, we benchmark the behavior of all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace fixed data and parameters setting. Our investigations reveal that the highest performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We make a recommendation for hyperparameters distinct from the previous models. We suggest two bounds on the mutual information as performance metrics for simulation-based inference methods, without the need for posterior samples, and provide experimental results. This version corrects a minor implementation error in $\gamma$, improving results.
1 Introduction
Figure 1: Conceptual, interpolated map from investigated hyperparameters of the proposed algorithm NRE-C to a measurement of posterior exactness using the Classifier Two-Sample Test. Best 0.5, worst 1.0. The red dot indicates NRE-A's hyperparameters, $\gamma = 1$ and $K = 1$ [30]. The purple line implies NRE-B [16] with $\gamma = \infty$ and $K > 1$. NRE-C covers the entire plane, generalizing the other methods. Best performance occurs with $K > 1$ and $\gamma \approx 1$, in contrast with the settings of existing algorithms.
We begin with a motivating example: Consider the task of inferring the mass of an exoplanet $\theta_o$ from the light curve observations $x_o$ of a distant star. We design a computer program that maps a hypothetical mass $\theta$ to a simulated light curve $x$ using relevant physical theory. Our simulator computes $x$ from $\theta$, but the inverse mapping is unspecified and likely intractable. Simulation-based inference (SBI) puts this problem in a probabilistic context [13, 65]. Although we cannot analytically evaluate it, we assume that the simulator is sampling from the conditional probability distribution $p(x \mid \theta)$. After specifying a prior $p(\theta)$, the inverse amounts to estimating the posterior $p(\theta \mid x_o)$. This problem setting occurs across scientific domains [1, 7, 10, 11, 29] where $\theta$ generally represents input parameters of the simulator and $x$ the simulated output observation. Our design goal is to produce a surrogate model $\hat{p}(\theta \mid x)$ approximating the posterior for any data $x$ while limiting excessive simulation.
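To ground the notation, the following toy sketch shows the only operations SBI assumes are available: sampling from the prior and running the simulator forward. The uniform prior and Gaussian simulator are illustrative stand-ins, not the exoplanet model above.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior(n):
    # Draw theta ~ p(theta); here a placeholder uniform prior.
    return rng.uniform(0.0, 10.0, size=(n, 1))

def simulator(theta):
    # Draw x ~ p(x | theta); we can sample from the simulator
    # but cannot evaluate its density, the defining SBI constraint.
    return theta + rng.normal(0.0, 1.0, size=theta.shape)

# Jointly drawn training pairs (theta, x) ~ p(theta, x) = p(theta) p(x | theta).
theta = prior(1000)
x = simulator(theta)
```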
Figure 2: Schematic depicting how the loss is computed in the NRE algorithms. $(\theta, x)$ pairs are sampled from the distributions at the top of the figure, entering the loss functions as depicted. NRE-C controls the number of contrastive classes with $K$ and the weight of the independent and dependent terms with $p_0$ and $p_K$. NRE-C generalizes the other algorithms. Hyperparameters recovering NRE-A and NRE-B are listed next to the name within the dashed areas. Notation details are defined in Section 2.1.
Density estimation [5, 55, 56] can fit the likelihood [2, 15, 43, 57] or posterior [6, 21, 42, 54] directly; however, an appealing alternative for practitioners is estimating a ratio between distributions [12, 16, 30, 34, 70]. Specifically, the likelihood-to-evidence ratio
$$\frac{p(\theta \mid x)}{p(\theta)} = \frac{p(x \mid \theta)}{p(x)} = \frac{p(\theta, x)}{p(\theta)\,p(x)}.$$
Unlike the other methods, ratio estimation enables easy aggregation of independent and identically drawn data $x$ (spelled out below). Ratio and posterior estimation can compute bounds on the mutual information and an importance sampling diagnostic.
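To spell out the aggregation claim (our addition, following directly from conditional independence): for observations $x_1, \ldots, x_n$ that are independent given $\theta$,
$$p(\theta \mid x_1, \ldots, x_n) = \frac{p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)}{p(x_1, \ldots, x_n)} \propto p(\theta) \prod_{i=1}^{n} \frac{p(x_i \mid \theta)}{p(x_i)} = p(\theta) \prod_{i=1}^{n} r(x_i \mid \theta),$$
so a single trained ratio estimator can be applied per observation and the log-ratios summed to obtain the aggregated (unnormalized) posterior.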
Estimating $\frac{p(x \mid \theta)}{p(x)}$ can be formulated as a binary classification task [30], where the classifier $\sigma \circ f_w(\theta, x)$ distinguishes between pairs $(\theta, x)$ sampled either from the joint distribution $p(\theta, x)$ or the product of its marginals $p(\theta)\,p(x)$. We call it NRE-A. The optimal classifier has
$$f_w(\theta, x) = \log \frac{p(\theta \mid x)}{p(\theta)}. \quad (1)$$
Here, $\sigma$ represents the sigmoid function, $\circ$ implies function composition, and $f_w$ is a neural network with weights $w$. As part of an effort to unify different SBI methods and to improve simulation-efficiency, Durkan et al. [16] reformulated the classification task to identify which of $K$ possible $\theta_k$ was responsible for simulating $x$. We refer to it as NRE-B. At optimum
$$g_w(\theta, x) = \log \frac{p(\theta \mid x)}{p(\theta)} + c_w(x), \quad (2)$$
where an additional bias, $c_w(x)$, appears. $g_w$ represents another neural network. The $c_w(x)$ term nullifies many of the advantages ratio estimation offers. $c_w(x)$ can be arbitrarily pathological in $x$, meaning that the normalizing constant can take on extreme values. This limits the applicability of verification tools like the importance sampling-based diagnostic in Section 2.2.
The $c_w(x)$ term also arises in contrastive learning [23, 72], with Ma and Collins [45] attempting to estimate it in order to reduce its impact. We will propose a method that discourages this bias instead. Further discussion is in Appendix D.
There is a distinction in deep learning-based SBI between amortized and sequential algorithms, which produce surrogate models that estimate any posterior $p(\theta \mid x)$ or a specific posterior $p(\theta \mid x_o)$, respectively. Amortized algorithms sample parameters from the prior, while sequential algorithms use an alternative proposal distribution, increasing efficiency at the expense of flexibility. Amortization is usually necessary to compute diagnostics that do not require samples from $p(\theta \mid x_o)$, and amortized estimators are empirically more reliable [31]. Our study therefore focuses on amortized algorithms.
Contribution  We design a more general formulation of likelihood-to-evidence ratio estimation as a multiclass problem in which the bias inherent to NRE-B is discouraged by the loss function and does not appear at optimum. Figure 1 diagrams the interpolated performance as a function of hyperparameters. It shows which settings recover NRE-A and NRE-B, also indicating that the highest performance occurs with settings distant from these. Figure 2 shows the relationship of the loss functions. We call our framework NRE-C¹ and expound the details in Section 2.
An existing importance sampling diagnostic [30] tests whether a classifier can distinguish $p(x \mid \theta)$ samples from samples from $p(x)$ weighted by the estimated ratio. We demonstrate that, when estimating accurate posteriors, our proposed NRE-C passes this diagnostic while NRE-B does not.

¹The code for our project can be found at https://github.com/bkmi/cnre under the Apache License 2.0.
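A rough sketch of how such a diagnostic could be set up is given below; it is our illustration of the idea, not the authors' exact procedure, and `simulator`, `prior`, and `log_ratio` are hypothetical stand-ins for a trained NRE model and its inference task.

```python
import numpy as np

def importance_resample(theta, simulator, prior, log_ratio, n=5000, seed=0):
    """Produce two sample sets that should match if the estimated ratio is accurate:
    x ~ p(x | theta), and x ~ p(x) resampled with weights proportional to r_hat(x | theta)."""
    rng = np.random.default_rng(seed)
    theta_rep = np.repeat(theta[None], n, axis=0)
    x_likelihood = simulator(theta_rep)            # samples from p(x | theta)
    x_marginal = simulator(prior(n))               # samples from p(x), via prior predictive
    log_w = log_ratio(theta_rep, x_marginal)       # estimated log r(x | theta)
    w = np.exp(log_w - log_w.max())                # stabilized importance weights
    w /= w.sum()
    x_weighted = x_marginal[rng.choice(n, size=n, p=w)]
    return x_likelihood, x_weighted
```

The two returned sets can then be compared with any two-sample classifier test, e.g. the C2ST sketched in Section 2.2; classifier accuracy near 0.5 indicates the ratio estimator passes the diagnostic.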
Taking inspiration from mutual information estimation [59], we propose applying a variational bound on the mutual information between $\theta$ and $x$ in a novel way: as an informative metric measuring a lower bound on the Kullback-Leibler divergence between the surrogate posterior estimate $p_w(\theta \mid x)$ and $p(\theta \mid x)$, averaged over $p(x)$. Unlike with two-sample testing methods commonly used in the machine learning literature [44], our metric samples only from $p(\theta, x)$, which is always available in SBI, and does not require samples from the intractable $p(\theta \mid x)$. Our metric is meaningful to scientists working on problems with intractable posteriors. The technique requires estimating the partition function, which can be expensive. We find the metric to be well correlated with results from two-sample tests.
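The specific bounds are developed later in the paper; purely to illustrate the flavor of the computation, the sketch below evaluates one standard lower-bound-style estimate built from a trained log-ratio network (here called `h_w`, an assumed name), using only joint samples, prior samples, and a Monte Carlo estimate of the partition function. It should not be read as the authors' exact metric.

```python
import math
import torch

def mi_lower_bound(h_w, theta_joint, x_joint, theta_prior):
    """Estimate E_{p(theta,x)}[h_w(theta, x)] - E_{p(x)}[log Z_w(x)], with
    Z_w(x) = E_{p(theta)}[exp h_w(theta, x)] approximated by M prior draws.
    For an exact ratio this equals the mutual information I(theta; x); for an
    imperfect surrogate it falls short of I(theta; x) by the KL divergence
    between true and surrogate posterior, averaged over p(x). The finite-M
    estimate of log Z_w is itself biased, so treat this as a comparison metric."""
    M, B = theta_prior.shape[0], x_joint.shape[0]
    joint_term = h_w(theta_joint, x_joint).mean()
    # Evaluate every observation against all M prior draws to estimate log Z_w(x).
    t = theta_prior.unsqueeze(1).expand(M, B, -1).reshape(M * B, -1)
    x = x_joint.unsqueeze(0).expand(M, B, -1).reshape(M * B, -1)
    log_Z = torch.logsumexp(h_w(t, x).view(M, B), dim=0) - math.log(M)
    return joint_term - log_Z.mean()
```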
We evaluate NRE-B and NRE-C in a fair comparison in several training regimes in Section 3. We perform a hyperparameter search on three simulators with tractable likelihood by benchmarking the behavior when (a) jointly drawn pairs $(\theta, x)$ are unlimited, or when jointly drawn pairs $(\theta, x)$ are fixed but we (b) can draw from the prior $p(\theta)$ without limit or (c) are restricted to the initial pairs. We also perform the SBI benchmark of Lueckmann et al. [44] with our recommended hyperparameters.
2 Methods
The ratio between probability distributions can be estimated using the "likelihood ratio trick" by training a classifier to distinguish samples [12, 19, 27, 30, 51, 67, 70]. We first summarize the loss functions of NRE-A and NRE-B, which approximate the intractable likelihood-to-evidence ratio $r(x \mid \theta) := \frac{p(x \mid \theta)}{p(x)}$. We then elaborate on our proposed generalization, NRE-C. Finally, we explain how to recover NRE-A and NRE-B within our framework and comment on the normalization properties.
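As a brief reminder (our addition), the trick rests on the form of the Bayes-optimal classifier between dependent and independent pairs when both classes are equally probable:
$$d^*(\theta, x) = \frac{p(\theta, x)}{p(\theta, x) + p(\theta)\, p(x)}, \qquad \frac{d^*(\theta, x)}{1 - d^*(\theta, x)} = \frac{p(\theta, x)}{p(\theta)\, p(x)} = r(x \mid \theta),$$
so the logit of the optimal classifier equals the log likelihood-to-evidence ratio.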
NRE-A  Hermans et al. [30] train a binary classifier to distinguish $(\theta, x)$ pairs drawn dependently from $p(\theta, x)$ from those drawn independently from $p(\theta)\,p(x)$. This classifier is parameterized by a neural network $f_w$ which approximates $\log r(x \mid \theta)$. We seek optimal network weights
$$w^* \leftarrow \arg\min_w -\frac{1}{2B} \left[ \sum_{b=1}^{B} \log\left(1 - \sigma\!\left(f_w(\theta^{(b)}, x^{(b)})\right)\right) + \sum_{b=1}^{B} \log \sigma\!\left(f_w(\theta^{(b)}, x^{(b)})\right) \right], \quad (3)$$
where in the first sum $(\theta^{(b)}, x^{(b)}) \sim p(\theta)\,p(x)$ and in the second $(\theta^{(b)}, x^{(b)}) \sim p(\theta, x)$, over $B$ samples. NRE-A's ratio estimate converges to $f_w = \log \frac{p(x \mid \theta)}{p(x)}$ given unlimited model flexibility and data. Details can be found in Appendix A.
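As a concrete reference, a minimal PyTorch sketch of the binary loss in (3) might look as follows; `f_w` is assumed to be any network mapping batched $(\theta, x)$ pairs to scalar logits, and the independent pairs are formed by shuffling $\theta$ within the mini-batch, a common implementation choice we assume here.

```python
import torch
import torch.nn.functional as F

def nre_a_loss(f_w, theta, x):
    """Binary NRE-A loss: joint pairs are class 1, shuffled (independent) pairs are class 0."""
    logits_joint = f_w(theta, x)                           # (theta, x) ~ p(theta, x)
    theta_shuffled = theta[torch.randperm(theta.shape[0])]
    logits_marginal = f_w(theta_shuffled, x)               # (theta, x) ~ p(theta) p(x)
    ones = torch.ones_like(logits_joint)
    zeros = torch.zeros_like(logits_marginal)
    return 0.5 * (
        F.binary_cross_entropy_with_logits(logits_joint, ones)
        + F.binary_cross_entropy_with_logits(logits_marginal, zeros)
    )
```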
NRE-B  Durkan et al. [16] train a classifier that selects from among $K$ parameters $(\theta_1, \ldots, \theta_K)$ which could have generated $x$, in contrast with NRE-A's binary possibilities. One of these parameters $\theta_k$ is always drawn jointly with $x$. The classifier is parameterized by a neural network $g_w$ which approximates $\log r(x \mid \theta)$. Training is done over $B$ samples by finding
$$w^* \leftarrow \arg\min_w -\frac{1}{B} \sum_{b=1}^{B} \log \frac{\exp g_w(\theta^{(b)}_k, x^{(b)})}{\sum_{i=1}^{K} \exp g_w(\theta^{(b)}_i, x^{(b)})}, \quad (4)$$
where $\theta^{(b)}_1, \ldots, \theta^{(b)}_K \sim p(\theta)$ and $x^{(b)} \sim p(x \mid \theta^{(b)}_k)$. Given unlimited model flexibility and data, NRE-B's ratio estimate converges to $g_w(\theta, x) = \log \frac{p(\theta \mid x)}{p(\theta)} + c_w(x)$. Details are in Appendix A.
2.1 Contrastive Neural Ratio Estimation
Our proposed algorithm NRE-C trains a classifier to identify which $\theta$ among $K$ candidates is responsible for generating a given $x$, inspired by NRE-B. We added another option that indicates $x$ was drawn independently, inspired by NRE-A. The introduction of the additional class yields a ratio without the specific $c_w(x)$ bias at optimum. Define $\Theta := (\theta_1, \ldots, \theta_K)$ and the conditional probability
$$p_{\text{NRE-C}}(\Theta, x \mid y = k) := \begin{cases} p(\theta_1) \cdots p(\theta_K)\, p(x) & k = 0 \\ p(\theta_1) \cdots p(\theta_K)\, p(x \mid \theta_k) & k = 1, \ldots, K. \end{cases} \quad (5)$$
We set marginal probabilities $p(y = k) := p_K$ for all $k \geq 1$ and $p(y = 0) := p_0$, yielding the relationship $p_0 = 1 - K p_K$. Let the odds of any pair being drawn dependently to completely independently be $\gamma := \frac{K p_K}{p_0}$. We now use Bayes' formula to compute the conditional probability
$$p(y = k \mid \Theta, x) = \frac{p(y = k)\, p(\Theta, x \mid y = k) / p(\Theta, x \mid y = 0)}{\sum_{i=0}^{K} p(y = i)\, p(\Theta, x \mid y = i) / p(\Theta, x \mid y = 0)} = \frac{p(y = k)\, p(\Theta, x \mid y = k) / p(\Theta, x \mid y = 0)}{p(y = 0) + \sum_{i=1}^{K} p(y = i)\, p(\Theta, x \mid y = i) / p(\Theta, x \mid y = 0)} = \begin{cases} \frac{K}{K + \gamma \sum_{i=1}^{K} r(x \mid \theta_i)} & k = 0 \\[4pt] \frac{\gamma\, r(x \mid \theta_k)}{K + \gamma \sum_{i=1}^{K} r(x \mid \theta_i)} & k = 1, \ldots, K. \end{cases} \quad (6)$$
We dropped the NRE-C subscript and substituted in $\gamma$ to replace the $p(y)$ class probabilities. We train a classifier, parameterized by a neural network $h_w(\theta, x)$ with weights $w$, to approximate (6) by
$$q_w(y = k \mid \Theta, x) = \begin{cases} \frac{K}{K + \gamma \sum_{i=1}^{K} \exp h_w(\theta_i, x)} & k = 0 \\[4pt] \frac{\gamma \exp h_w(\theta_k, x)}{K + \gamma \sum_{i=1}^{K} \exp h_w(\theta_i, x)} & k = 1, \ldots, K. \end{cases} \quad (7)$$
We note that (7) still satisfies $\sum_{k=0}^{K} q_w(y = k \mid \Theta, x) = 1$, no matter the parameterization.
Optimization  We design a loss function that encourages $h_w(\theta, x) = \log \frac{p(x \mid \theta)}{p(x)}$ at convergence, a property that holds exactly at optimum with unlimited flexibility and data. We introduce the cross entropy loss
$$\begin{aligned} \ell(w) &:= \mathbb{E}_{p(y, \Theta, x)}\left[ -\log q_w(y \mid \Theta, x) \right] \\ &= -p_0\, \mathbb{E}_{p(\Theta, x \mid y=0)}\left[ \log q_w(y = 0 \mid \Theta, x) \right] - p_K \sum_{k=1}^{K} \mathbb{E}_{p(\Theta, x \mid y=k)}\left[ \log q_w(y = k \mid \Theta, x) \right] \\ &= -p_0\, \mathbb{E}_{p(\Theta, x \mid y=0)}\left[ \log q_w(y = 0 \mid \Theta, x) \right] - K p_K\, \mathbb{E}_{p(\Theta, x \mid y=K)}\left[ \log q_w(y = K \mid \Theta, x) \right] \end{aligned} \quad (8)$$
and minimize it towards $w^* \leftarrow \arg\min_w \ell(w)$. We point out that the final term is symmetric up to permutation of $\Theta$, enabling the replacement of the sum by multiplication with $K$. When $\gamma$ and $K$ are known, $p_0 = \frac{1}{1 + \gamma}$ and $p_K = \frac{1}{K} \frac{\gamma}{1 + \gamma}$ under our constraints. Without loss of generality, we let $\theta_1, \ldots, \theta_K \sim p(\theta)$ and $x \sim p(x \mid \theta_K)$. An empirical estimate of the loss on $B$ samples is therefore
$$\hat{\ell}_{\gamma, K}(w) := -\frac{1}{B} \left[ \frac{1}{1 + \gamma} \sum_{b=1}^{B} \log q_w\!\left(y = 0 \mid \Theta^{(b)}, x^{(b)}\right) + \frac{\gamma}{1 + \gamma} \sum_{b=1}^{B} \log q_w\!\left(y = K \mid \Theta^{(b)}, x^{(b)}\right) \right]. \quad (9)$$
In the first term, the classifier sees a completely independently drawn sample of $x$ and $\Theta$, while $\theta_K$ is drawn jointly with $x$ in the second term. In both terms, the classifier considers $K$ choices. In practice, we bootstrap both $\theta^{(b)}_1, \ldots, \theta^{(b)}_K$ and $\theta^{(b)}_1, \ldots, \theta^{(b)}_{K-1}$ from the same mini-batch and compare them to the same $x$, similarly to NRE-A and NRE-B. Proof of the above is in Appendix B.
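To make (7) and (9) concrete, here is a minimal PyTorch sketch of how the NRE-C loss could be assembled; this is our illustrative reading of the equations rather than the authors' reference implementation (available at the repository cited above), and `h_w` is assumed to map batched $(\theta, x)$ pairs to scalar logits.

```python
import math
import torch

def nre_c_loss(h_w, theta, x, K, gamma):
    """Empirical NRE-C loss, Eq. (9): one term where all K candidates are independent
    of x (class y = 0) and one where the K-th candidate is the jointly drawn theta
    (class y = K). log q_w follows Eq. (7), computed with logsumexp for stability."""
    B = theta.shape[0]
    log_K, log_gamma = math.log(K), math.log(gamma)

    def log_q(candidates, k):
        # candidates: (B, K, d_theta); returns log q_w(y = k | Theta, x), shape (B,)
        x_rep = x.unsqueeze(1).expand(B, K, -1)
        h = h_w(candidates.reshape(B * K, -1), x_rep.reshape(B * K, -1)).view(B, K)
        log_denom = torch.logsumexp(
            torch.cat([torch.full((B, 1), log_K), log_gamma + h], dim=1), dim=1)
        return (log_K if k == 0 else log_gamma + h[:, k - 1]) - log_denom

    indep = theta[torch.randint(0, B, (B, K))]                 # all candidates independent of x
    joint = torch.cat([theta[torch.randint(0, B, (B, K - 1))],
                       theta.unsqueeze(1)], dim=1)             # jointly drawn theta in slot K
    return -(log_q(indep, 0).mean() + gamma * log_q(joint, K).mean()) / (1.0 + gamma)
```

The contrastive candidates are drawn by resampling the mini-batch, mirroring the bootstrapping described above; with $\gamma = 1$ and $K = 1$ this reduces to the binary loss in (10).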
Recovering NRE-A and NRE-B  NRE-C is general because specific hyperparameter settings recover NRE-A and NRE-B. To recover NRE-A one should set $\gamma = 1$ and $K = 1$ in (9), yielding
$$\begin{aligned} \hat{\ell}_{1,1}(w) &= -\frac{1}{2B} \left[ \sum_{b=1}^{B} \log \frac{1}{1 + \exp h_w(\theta^{(b)}, x^{(b)})} + \sum_{b=1}^{B} \log \frac{\exp h_w(\theta^{(b)}, x^{(b)})}{1 + \exp h_w(\theta^{(b)}, x^{(b)})} \right] \\ &= -\frac{1}{2B} \left[ \sum_{b=1}^{B} \log\left(1 - \sigma\!\left(h_w(\theta^{(b)}, x^{(b)})\right)\right) + \sum_{b=1}^{B} \log \sigma\!\left(h_w(\theta^{(b)}, x^{(b)})\right) \right] \end{aligned} \quad (10)$$
where we dropped the lower index. Recovering NRE-B requires taking the limit $\gamma \to \infty$ in the loss function. In that case, the first term goes to zero, and the second term converges to the softmax function. The resulting loss,
$$\hat{\ell}_{\infty, K}(w) = \lim_{\gamma \to \infty} \hat{\ell}_{\gamma, K}(w) = -\frac{1}{B} \sum_{b=1}^{B} \log \frac{\exp h_w(\theta^{(b)}_K, x^{(b)})}{\sum_{i=1}^{K} \exp h_w(\theta^{(b)}_i, x^{(b)})}, \quad (11)$$
is determined by substitution into (9). Both equations match their counterparts (3) and (4), up to the name of the network.
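To make the limit explicit (a short verification we add for clarity), the second-term class probability in (7) tends to the softmax as $\gamma \to \infty$:
$$\lim_{\gamma \to \infty} q_w(y = K \mid \Theta, x) = \lim_{\gamma \to \infty} \frac{\gamma \exp h_w(\theta_K, x)}{K + \gamma \sum_{i=1}^{K} \exp h_w(\theta_i, x)} = \frac{\exp h_w(\theta_K, x)}{\sum_{i=1}^{K} \exp h_w(\theta_i, x)},$$
while the weight $\frac{1}{1 + \gamma}$ on the first term of (9) vanishes, leaving exactly the NRE-B objective (4).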
Estimating a normalized posterior  In the limit of infinite data and infinite neural network capacity (width, depth), the optimal classifier trained using NRE-C (with $\gamma \in \mathbb{R}^{+}$) satisfies the equality
$$h_w(\theta, x) = \log \frac{p(\theta \mid x)}{p(\theta)}. \quad (12)$$
In particular, we have that the following normalizing constant is trivial:
$$Z(x) := \int \exp\left(h_w(\theta, x)\right) p(\theta)\, d\theta = \int p(\theta \mid x)\, d\theta = 1. \quad (13)$$
This is a result of Lemma 1 in Appendix B. However, practitioners never operate in this setting; rather, they use finite sample sizes and neural networks with limited capacity that are optimized locally. The non-optimal function $\exp(h_w(\theta, x))$ does not have a direct interpretation as a ratio of probability distributions, but rather as the function that weighs the prior $p(\theta)$ to approximate the unnormalized posterior. In other words, we find the following approximation for the posterior $p(\theta \mid x)$:
$$p_w(\theta \mid x) := \frac{\exp(h_w(\theta, x))}{Z_w(x)}\, p(\theta), \qquad Z_w(x) := \int \exp\left(h_w(\theta, x)\right) p(\theta)\, d\theta, \quad (14)$$
where in general the normalizing constant is not trivial, i.e. $Z_w(x) \neq 1$. As stated above, the NRE-C (and NRE-A) objective encourages $Z_w(x)$ to converge to $1$. This is in sharp contrast to NRE-B, where even at optimum with an unrestricted function class a non-trivial $x$-dependent bias term can appear.

There is no restriction on how pathological the NRE-B bias $c_w(x)$ can be. Consider a minimizer of (4), the NRE-B loss function, $h_w + c_w(x)$. Adding any function $d(x)$ cancels out in the fraction and also yields a minimizer of (4). This freedom complicates any numerical computation of the normalizing constant and renders the importance sampling diagnostic from Section 2.2 generally inapplicable. We report Monte Carlo estimates of $Z_w(x)$ on a test problem across hyperparameters in Figure 14.
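Since the section leans on Monte Carlo estimates of $Z_w(x)$, a brief sketch of how such an estimate could be computed from prior samples is shown below; the network name and batching interface are assumptions on our part.

```python
import math
import torch

def estimate_log_Z(h_w, x, prior_samples):
    """Monte Carlo estimate of log Z_w(x) = log E_{p(theta)}[exp h_w(theta, x)]
    for a single observation x, using M draws from the prior."""
    M = prior_samples.shape[0]
    x_rep = x.unsqueeze(0).expand(M, -1)       # reuse the same x for every theta
    log_weights = h_w(prior_samples, x_rep)    # shape (M,)
    # logsumexp keeps the estimate numerically stable; for a well-calibrated
    # NRE-A or NRE-C model this should be close to log 1 = 0.
    return torch.logsumexp(log_weights, dim=0) - math.log(M)
```

For NRE-B, the arbitrary $c_w(x)$ offset shifts this estimate by an unknown amount, which is precisely why the normalization check becomes uninformative there.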
2.2 Measuring performance & ratio estimator diagnostics
SBI is difficult to verify because, for many use cases, the practitioner cannot compare the surrogate $p_w(\theta \mid x)$ to the intractable ground truth $p(\theta \mid x)$. Incongruous with the practical use case for SBI, much of the literature has focused on measuring the similarity between surrogate and posterior using two-sample tests on tractable problems. For comparison with the literature, we first reference a two-sample exactness metric which requires a tractable posterior. We then discuss diagnostics which do not require samples from $p(\theta \mid x)$, commenting on the relevance for each NRE algorithm with empirical results. Further, we find that a known variational bound on the mutual information is tractable to estimate within SBI, that it bounds the average Kullback-Leibler divergence between surrogate and posterior, and we propose to use it for model comparison on intractable inference tasks.

Comparing to a tractable posterior with estimates of exactness  Assessments of approximate posterior quality are available when samples can be drawn from both the posterior $\theta \sim p(\theta \mid x)$ and the approximation $\theta \sim q(\theta \mid x)$. In the deep learning-based SBI literature, exactness is measured as a function of computational cost, usually simulator calls. We investigate this with NRE-C in Section 3.3. Based on the recommendations of Lueckmann et al. [44], our experimental results are measured using the Classifier Two-Sample Test (C2ST) [17, 40, 41]. A classifier is trained to distinguish samples from either the surrogate or the ground truth posterior. An average classification probability on holdout data of 1.0 implies that samples from each distribution are easily identified; 0.5 implies either the distributions are the same or the classifier does not have the capacity to distinguish them.
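A minimal sketch of a C2ST computation is given below for reference; the choice of classifier and split sizes are our own illustrative defaults, not those prescribed by [44].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def c2st(samples_p, samples_q, seed=0):
    """Classifier Two-Sample Test score in roughly [0.5, 1.0]:
    ~0.5 means the two sample sets are indistinguishable to the classifier,
    ~1.0 means they are easily told apart."""
    X = np.concatenate([samples_p, samples_q])
    y = np.concatenate([np.zeros(len(samples_p)), np.ones(len(samples_q))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=seed)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```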