
B. State of the art
Before proceeding, we provide a short review of the papers that motivated our study. Because the field is evolving rapidly, this review does not aim to be exhaustive, and despite our best efforts, we may have missed some relevant references.
Ref. [25] considered the general problem of whether a target probability distribution $P_t$ can be approximated by a simpler one $P_a$, in particular by considering the Kullback-Leibler (KL) divergence
$$ D_{\rm KL}(P_t \| P_a) = \left\langle \log \frac{P_t(\sigma)}{P_a(\sigma)} \right\rangle_{P_t} \, . \qquad (1) $$
If this quantity is proportional to $N$ for $N \to \infty$, then $P_t(\sigma)/P_a(\sigma)$ is typically exponential in $N$, and as a result samples proposed from $P_a$ are very unlikely to be accepted in $P_t$. A small $D_{\rm KL}(P_t \| P_a)/N$ (ideally vanishing for $N \to \infty$) therefore seems to be a necessary condition for a good auxiliary probability, which provides a quantitative measure of condition (ii) above. Ref. [25] suggested, by using small disordered systems ($N \sim 20$), that there might be a phase transition, for $N \to \infty$, separating a phase where $D_{\rm KL}(P_t \| P_a)/N$ vanishes identically and a phase where it is positive.
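The role of the intensive ratio $D_{\rm KL}(P_t \| P_a)/N$ in the acceptance argument above can be made slightly more explicit. Assuming, heuristically, that the log-likelihood ratio is self-averaging (an assumption we add here only for illustration), a typical configuration $\sigma$ drawn from $P_t$ satisfies
$$ \log \frac{P_t(\sigma)}{P_a(\sigma)} \simeq D_{\rm KL}(P_t \| P_a) \, , \qquad \text{hence} \qquad P_a(\sigma) \simeq P_t(\sigma) \, e^{-D_{\rm KL}(P_t \| P_a)} \, , $$
so typical target configurations are proposed by $P_a$ with a probability suppressed by a factor that is exponentially small in $N$ whenever $D_{\rm KL}(P_t \| P_a)/N$ remains finite.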
Ref. [13] proposed, more specifically, to use autoregressive models as tractable architectures for $P_a$. In these architectures, $P_a$ is represented using Bayes' rule,
$$ P_a(\sigma) = P_a^1(\sigma_1) \, P_a^2(\sigma_2|\sigma_1) \cdots P_a^N(\sigma_N|\sigma_{N-1},\cdots,\sigma_1) \, . \qquad (2) $$
Each term $P_a^i$ is then approximated by a neural network, which takes as input $\{\sigma_1,\cdots,\sigma_{i-1}\}$ and gives as output $P_a^i$, i.e. the probability of $\sigma_i$ conditioned on the input. Such a representation of $P_a$, also called Masked Autoencoder for Distribution Estimation (MADE) [26], allows for very efficient sampling, because one can first sample $\sigma_1$, then $\sigma_2$ given $\sigma_1$, and so on, in a time scaling as the sum of the computational complexities of evaluating each of the $P_a^i$, which is typically polynomial in $N$ for reasonable architectures. Hence, this scheme satisfies condition (i) above. The simplest choice for such a neural network is a linear layer followed by a softmax activation function.
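As an illustration, a minimal sketch of this ancestral sampling procedure for Ising spins $\sigma_i \in \{-1,+1\}$, using the simplest linear-plus-softmax conditionals (for two states the softmax reduces to a sigmoid), could look as follows; the variable names and the strictly lower-triangular weight matrix are our own illustrative choices, not the exact architecture of Refs. [13, 26]:

```python
import numpy as np

def sample_autoregressive(W, b, rng):
    """Draw one configuration sigma in {-1,+1}^N from an autoregressive model
    whose conditionals P_a^i(sigma_i | sigma_{<i}) are a linear layer acting on
    the previously drawn spins followed by a softmax (a sigmoid for two states).
    W (strictly lower-triangular, N x N) and b (length N) are illustrative
    parameter names, not those of Refs. [13, 26]."""
    N = len(b)
    sigma = np.zeros(N)
    log_pa = 0.0
    for i in range(N):
        h = W[i, :i] @ sigma[:i] + b[i]   # logit of P_a^i(sigma_i = +1 | sigma_{<i})
        p_up = 1.0 / (1.0 + np.exp(-h))
        sigma[i] = 1.0 if rng.random() < p_up else -1.0
        log_pa += np.log(p_up if sigma[i] > 0 else 1.0 - p_up)
    return sigma, log_pa                  # log P_a(sigma) comes for free

rng = np.random.default_rng(0)
N = 20
W = np.tril(rng.normal(size=(N, N)), k=-1)   # enforce the autoregressive mask
b = rng.normal(size=N)
sigma, log_pa = sample_autoregressive(W, b, rng)
```

Besides requiring only one pass over the $N$ conditionals, the sampler returns $\log P_a(\sigma)$ along with the configuration, which is precisely the quantity needed by the smart MCMC schemes discussed below.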
Ref. [13] showed that using such an architecture, several statistical models could be well approximated, and the Boltzmann distribution of a Sherrington-Kirkpatrick (SK) spin glass model (with $N = 20$) could be efficiently sampled. Note that the model in Ref. [13] was trained by a variational procedure, which minimizes $D_{\rm KL}(P_a \| P_t)$ instead of $D_{\rm KL}(P_t \| P_a)$. This method is computationally very efficient, as it only requires an average over $P_a$, which can be sampled easily, instead of $P_t$, but it is prone to mode-collapse (see Sec. II for details). Moreover, this work was limited to quite small $N$.
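The reason only averages over $P_a$ are needed is easy to see for a Boltzmann target $P_t(\sigma) = e^{-\beta E(\sigma)}/Z$ (we spell it out here as a reminder, not as part of the derivation of Ref. [13]):
$$ D_{\rm KL}(P_a \| P_t) = \big\langle \log P_a(\sigma) + \beta E(\sigma) \big\rangle_{P_a} + \log Z \, , $$
so that, up to the parameter-independent constant $\log Z$, the objective is a variational free energy that can be estimated (and differentiated) from samples of $P_a$ alone; the intractable normalization of $P_t$ never needs to be computed.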
Following up on Ref. [13], Ref. [14] considered as target probability the Boltzmann distribution of a two-dimensional (2d) Edwards-Anderson (EA) spin glass model at various temperatures $T$, and used a Neural Autoregressive Distribution Estimator (NADE) [27], which is a variation of the MADE meant to reduce the number of parameters. Furthermore, the model was trained using a different scheme from Ref. [13], called sequential tempering, which tries to minimize $D_{\rm KL}(P_t \| P_a)$, thus preventing mode-collapse. To this aim, at first, a sample from $P_t$ is generated at high temperature, which is easy, and used to learn $P_a$. Then, the temperature is slightly reduced and smart MCMC sampling is performed using the $P_a$ learned at the previous step, to generate a new sample from $P_t$, which is then used in the next step. If $P_a$ remains a good approximation to $P_t$ and MCMC sampling is efficient, this strategy ensures a correct minimization of $D_{\rm KL}(P_t \| P_a)$. This was shown to be the case in Ref. [14], down to low temperatures for a 2d EA model of up to $N = 225$ spins.
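A schematic version of such a sequential-tempering loop might look as follows; the helper functions (hot_sampler, train_on_sample, assisted_mcmc) are hypothetical placeholders for the model-specific ingredients of Ref. [14], not its actual code:

```python
def sequential_tempering(betas, energy, n_samples,
                         hot_sampler, train_on_sample, assisted_mcmc):
    """Schematic sequential tempering: learn P_a from equilibrium samples at one
    temperature, then use it to help MCMC equilibrate at the next (lower) one.
    All callables are hypothetical placeholders for the scheme of Ref. [14]."""
    # Step 0: at the highest temperature (smallest beta) equilibration is easy.
    sample = hot_sampler(betas[0], n_samples)
    model = train_on_sample(sample)   # fit P_a on data from P_t, i.e. minimize D_KL(P_t||P_a)
    for beta in betas[1:]:
        # Generate a new equilibrium sample at the lower temperature, using
        # P_a-assisted (smart) MCMC moves seeded by the previous sample.
        sample = assisted_mcmc(model, energy, beta, sample)
        # Re-train P_a on the new sample before the next temperature step.
        model = train_on_sample(sample)
    return model, sample
```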
Ref. [15] introduced a different scheme for learning $P_a$. This adaptive scheme combines local MCMC moves with smart $P_a$-assisted MCMC moves, together with an online training of $P_a$. It was successfully tested, using a different architecture for $P_a$ (called normalizing flows), on problems with two stable states separated by a high free-energy barrier. Note that normalizing flows can be equivalently interpreted as autoregressive models [28–30]. Ref. [16] also proved the effectiveness of smart assisted MCMC moves in a 2d Ising model and an Ising-like frustrated plaquette model.
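In their simplest form, the smart $P_a$-assisted moves mentioned here can be realized as Metropolis moves in which the whole configuration is proposed from $P_a$; the sketch below spells out the corresponding textbook acceptance rule, with callables and names that are our illustrative assumptions rather than the implementation of Refs. [15, 16]:

```python
import numpy as np

def pa_assisted_move(sigma, log_pt, sample_pa, log_pa, rng):
    """One Metropolis step in which a full configuration is proposed from P_a and
    accepted with probability min(1, [P_t(new) P_a(old)] / [P_t(old) P_a(new)]),
    which enforces detailed balance with respect to P_t.  `log_pt` and `log_pa`
    return log-probabilities (up to additive constants); names are illustrative."""
    proposal = sample_pa(rng)   # global move: a new configuration drawn from P_a
    log_acc = (log_pt(proposal) - log_pt(sigma)) + (log_pa(sigma) - log_pa(proposal))
    if np.log(rng.random()) < min(0.0, log_acc):
        return proposal, True
    return sigma, False
```

Because a single such move can jump between distant configurations, it can cross a high free-energy barrier in one step, provided $P_a$ assigns sufficient weight to both sides of it.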
Several other groups [17–20] investigated a problem related to sampling, namely that of simulated annealing [31] for finding ground states of optimization problems. This is, a priori, a slightly easier problem, because simulated annealing does not need to equilibrate at all temperatures to find a solution [32, 33]. In these works, simulated annealing moves were once again assisted by machine learning. Ref. [17] tested their procedure on the 2d EA and SK models, and Ref. [18] considered a 2d, 3d, and 4d EA model. However, while finding the exact ground state of the SK and EA (for $d \geq 3$) models is hard, in practice, for not too large random instances, the problem can be solved by a proper implementation of standard simulated annealing [34], and the scaling of these methods with system size remains poorly investigated. Ref. [19] considered the graph coloring problem, which is the zero-temperature version of the benchmark problem we propose to use in this work, and found that a Graph Neural Network (GNN) can propose moves that allow one to efficiently find a proper coloring with performance comparable to (but not outperforming) state-of-the-art local search algorithms. Additionally, GNNs have been shown to be successful at solving discrete combinatorial problems [35], but they do not provide much advantage over classical greedy algorithms, and sometimes they can even perform worse [36, 37]. Finally, Ref. [20] showed that the machine-learning-assisted simulated annealing scheme does not work on a glassy problem with a rough energy landscape.
These works provided a series of inspiring ideas to im-
prove sampling in disordered systems via machine learning