
B. State of the art
Before proceeding, we provide a short review of the papers that motivated our study. Because the field is evolving rapidly, this review does not aim to be exhaustive, and despite our best efforts, we may have missed some relevant references.
Ref. [25] considered the general problem of whether a target probability distribution $P_t$ can be approximated by a simpler one $P_a$, in particular by considering the Kullback-Leibler (KL) divergence
$$ D_{\rm KL}(P_t \| P_a) = \left\langle \log \frac{P_t(\sigma)}{P_a(\sigma)} \right\rangle_{P_t} \, . \qquad (1) $$
If this quantity is proportional to $N$ for $N \to \infty$, then $P_t(\sigma)/P_a(\sigma)$ is typically exponential in $N$, and as a result samples proposed from $P_a$ are very unlikely to be accepted in $P_t$. A small $D_{\rm KL}(P_t \| P_a)/N$ (ideally vanishing for $N \to \infty$) therefore seems to be a necessary condition for a good auxiliary probability, which provides a quantitative measure of condition (ii) above. Ref. [25] suggested, by using small disordered systems ($N \sim 20$), that there might be a phase transition, for $N \to \infty$, separating a phase where $D_{\rm KL}(P_t \| P_a)/N$ vanishes identically and a phase where it is positive.
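The role of the intensive ratio $D_{\rm KL}(P_t \| P_a)/N$ in the acceptance argument above can be made slightly more explicit. Assuming, heuristically, that the log-likelihood ratio is self-averaging (an assumption we add here only for illustration), a typical configuration $\sigma$ drawn from $P_t$ satisfies
$$ \log \frac{P_t(\sigma)}{P_a(\sigma)} \simeq D_{\rm KL}(P_t \| P_a) \, , \qquad \text{hence} \qquad P_a(\sigma) \simeq P_t(\sigma) \, e^{-D_{\rm KL}(P_t \| P_a)} \, , $$
so typical target configurations are proposed by $P_a$ with a probability suppressed by a factor that is exponentially small in $N$ whenever $D_{\rm KL}(P_t \| P_a)/N$ remains finite.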
Ref. [13] proposed, more specifically, to use autoregressive models as tractable architectures for $P_a$. In these architectures, $P_a$ is represented using Bayes' rule,
$$ P_a(\sigma) = P_a^1(\sigma_1) \, P_a^2(\sigma_2|\sigma_1) \cdots P_a^N(\sigma_N|\sigma_{N-1},\cdots,\sigma_1) \, . \qquad (2) $$
Each term $P_a^i$ is then approximated by a neural network, which takes as input $\{\sigma_1,\cdots,\sigma_{i-1}\}$ and gives as output $P_a^i$, i.e. the probability of $\sigma_i$ conditioned on the input. Such a representation of $P_a$, also called Masked Autoencoder for Distribution Estimation (MADE) [26], allows for very efficient sampling, because one can first sample $\sigma_1$, then $\sigma_2$ given $\sigma_1$, and so on, in a time scaling as the sum of the computational complexities of evaluating each of the $P_a^i$, which is typically polynomial in $N$ for reasonable architectures. Hence, this scheme satisfies condition (i) above. The simplest choice for such a neural network is a linear layer followed by a softmax activation function.
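As an illustration, a minimal sketch of this ancestral sampling procedure for Ising spins $\sigma_i \in \{-1,+1\}$, using the simplest linear-plus-softmax conditionals (for two states the softmax reduces to a sigmoid), could look as follows; the variable names and the strictly lower-triangular weight matrix are our own illustrative choices, not the exact architecture of Refs. [13, 26]:

```python
import numpy as np

def sample_autoregressive(W, b, rng):
    """Draw one configuration sigma in {-1,+1}^N from an autoregressive model
    whose conditionals P_a^i(sigma_i | sigma_{<i}) are a linear layer acting on
    the previously drawn spins followed by a softmax (a sigmoid for two states).
    W (strictly lower-triangular, N x N) and b (length N) are illustrative
    parameter names, not those of Refs. [13, 26]."""
    N = len(b)
    sigma = np.zeros(N)
    log_pa = 0.0
    for i in range(N):
        h = W[i, :i] @ sigma[:i] + b[i]   # logit of P_a^i(sigma_i = +1 | sigma_{<i})
        p_up = 1.0 / (1.0 + np.exp(-h))
        sigma[i] = 1.0 if rng.random() < p_up else -1.0
        log_pa += np.log(p_up if sigma[i] > 0 else 1.0 - p_up)
    return sigma, log_pa                  # log P_a(sigma) comes for free

rng = np.random.default_rng(0)
N = 20
W = np.tril(rng.normal(size=(N, N)), k=-1)   # enforce the autoregressive mask
b = rng.normal(size=N)
sigma, log_pa = sample_autoregressive(W, b, rng)
```

Besides requiring only one pass over the $N$ conditionals, the sampler returns $\log P_a(\sigma)$ along with the configuration, which is precisely the quantity needed by the smart MCMC schemes discussed below.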
Ref. [13] showed that using such an architecture, several statistical models could be well approximated, and the Boltzmann distribution of a Sherrington-Kirkpatrick (SK) spin glass model (with $N = 20$) could be efficiently sampled. Note that the model in Ref. [13] was trained by a variational procedure, which minimizes $D_{\rm KL}(P_a \| P_t)$ instead of $D_{\rm KL}(P_t \| P_a)$. This method is computationally very efficient, as it only requires an average over $P_a$, which can be sampled easily, instead of $P_t$, but it is prone to mode-collapse (see Sec. II for details). Moreover, this work was limited to quite small $N$.
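The reason only averages over $P_a$ are needed is easy to see for a Boltzmann target $P_t(\sigma) = e^{-\beta E(\sigma)}/Z$ (we spell it out here as a reminder, not as part of the derivation of Ref. [13]):
$$ D_{\rm KL}(P_a \| P_t) = \big\langle \log P_a(\sigma) + \beta E(\sigma) \big\rangle_{P_a} + \log Z \, , $$
so that, up to the parameter-independent constant $\log Z$, the objective is a variational free energy that can be estimated (and differentiated) from samples of $P_a$ alone; the intractable normalization of $P_t$ never needs to be computed.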
Following up on Ref. [13], Ref. [14] considered as target probability the Boltzmann distribution of a two-dimensional (2d) Edwards-Anderson (EA) spin glass model at various temperatures $T$, and used a Neural Autoregressive Distribution Estimator (NADE) [27], which is a variation of the MADE meant to reduce the number of parameters. Furthermore, the model was trained using a different scheme from Ref. [13], called sequential tempering, which tries to minimize $D_{\rm KL}(P_t \| P_a)$, thus preventing mode-collapse. To this aim, at first, a sample from $P_t$ is generated at high temperature, which is easy, and used to learn $P_a$. Then, the temperature is slightly reduced and smart MCMC sampling is performed using the $P_a$ learned at the previous step, to generate a new sample from $P_t$, which is then used in the next step. If $P_a$ remains a good approximation to $P_t$ and MCMC sampling is efficient, this strategy ensures a correct minimization of $D_{\rm KL}(P_t \| P_a)$. This was shown to be the case in Ref. [14], down to low temperatures for a 2d EA model of up to $N = 225$ spins.
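A schematic version of such a sequential-tempering loop might look as follows; the helper functions (hot_sampler, train_on_sample, assisted_mcmc) are hypothetical placeholders for the model-specific ingredients of Ref. [14], not its actual code:

```python
def sequential_tempering(betas, energy, n_samples,
                         hot_sampler, train_on_sample, assisted_mcmc):
    """Schematic sequential tempering: learn P_a from equilibrium samples at one
    temperature, then use it to help MCMC equilibrate at the next (lower) one.
    All callables are hypothetical placeholders for the scheme of Ref. [14]."""
    # Step 0: at the highest temperature (smallest beta) equilibration is easy.
    sample = hot_sampler(betas[0], n_samples)
    model = train_on_sample(sample)   # fit P_a on data from P_t, i.e. minimize D_KL(P_t||P_a)
    for beta in betas[1:]:
        # Generate a new equilibrium sample at the lower temperature, using
        # P_a-assisted (smart) MCMC moves seeded by the previous sample.
        sample = assisted_mcmc(model, energy, beta, sample)
        # Re-train P_a on the new sample before the next temperature step.
        model = train_on_sample(sample)
    return model, sample
```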
Ref. [15] introduced a different scheme for learning $P_a$. This adaptive scheme combines local MCMC moves with smart $P_a$-assisted MCMC moves, together with an online training of $P_a$. It was successfully tested, using a different architecture for $P_a$ (called normalizing flows), on problems with two stable states separated by a high free-energy barrier. Note that normalizing flows can be equivalently interpreted as autoregressive models [28–30]. Ref. [16] also proved the effectiveness of smart assisted MCMC moves in a 2d Ising model and an Ising-like frustrated plaquette model.
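In their simplest form, the smart $P_a$-assisted moves mentioned here can be realized as Metropolis moves in which the whole configuration is proposed from $P_a$; the sketch below spells out the corresponding textbook acceptance rule, with callables and names that are our illustrative assumptions rather than the implementation of Refs. [15, 16]:

```python
import numpy as np

def pa_assisted_move(sigma, log_pt, sample_pa, log_pa, rng):
    """One Metropolis step in which a full configuration is proposed from P_a and
    accepted with probability min(1, [P_t(new) P_a(old)] / [P_t(old) P_a(new)]),
    which enforces detailed balance with respect to P_t.  `log_pt` and `log_pa`
    return log-probabilities (up to additive constants); names are illustrative."""
    proposal = sample_pa(rng)   # global move: a new configuration drawn from P_a
    log_acc = (log_pt(proposal) - log_pt(sigma)) + (log_pa(sigma) - log_pa(proposal))
    if np.log(rng.random()) < min(0.0, log_acc):
        return proposal, True
    return sigma, False
```

Because a single such move can jump between distant configurations, it can cross a high free-energy barrier in one step, provided $P_a$ assigns sufficient weight to both sides of it.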
Several other groups [17–20] investigated a problem related to sampling, namely that of simulated annealing [31] for finding ground states of optimization problems. This is, a priori, a slightly easier problem, because simulated annealing does not need to equilibrate at all temperatures to find a solution [32, 33]. In these works, simulated annealing moves were once again assisted by machine learning. Ref. [17] tested their procedure on the 2d EA and SK models, and Ref. [18] considered a 2d, 3d, and 4d EA model. However, while finding the exact ground state of the SK and EA (for $d \geq 3$) models is hard, in practice, for not too large random instances, the problem can be solved by a proper implementation of standard simulated annealing [34], and the scaling of these methods with system size remains poorly investigated. Ref. [19] considered the graph coloring problem, which is the zero-temperature version of the benchmark problem we propose to use in this work, and found that a Graph Neural Network (GNN) can propose moves that allow one to efficiently find a proper coloring with performance comparable to (but not outperforming) state-of-the-art local search algorithms. Additionally, GNNs have been shown to be successful at solving discrete combinatorial problems [35], but they do not provide much advantage over classical greedy algorithms, and sometimes they can even perform worse [36, 37]. Finally, Ref. [20] showed that the machine-learning-assisted simulated annealing scheme does not work on a glassy problem with a rough energy landscape.
These works provided a series of inspiring ideas to im-
prove sampling in disordered systems via machine learning