Stochastic noise can be helpful for variational quantum algorithms
Junyu Liu,^{1,2,3,4} Frederik Wilde,^{5} Antonio Anna Mele,^{5} Liang Jiang,^{1,2} and Jens Eisert^{5,6,7}

^{1} Pritzker School of Molecular Engineering, The University of Chicago, Chicago, IL 60637, USA
^{2} Chicago Quantum Exchange, Chicago, IL 60637, USA
^{3} Kadanoff Center for Theoretical Physics, The University of Chicago, Chicago, IL 60637, USA
^{4} qBraid Co., Chicago, IL 60615, USA
^{5} Dahlem Center for Complex Quantum Systems, Freie Universität Berlin, Berlin, 14195, Germany
^{6} Helmholtz-Zentrum Berlin für Materialien und Energie, 14109 Berlin, Germany
^{7} Fraunhofer Heinrich Hertz Institute, 10587 Berlin, Germany
Saddle points constitute a crucial challenge for first-order gradient descent algorithms. In classical machine learning they are avoided, for example, by means of stochastic gradient descent methods. In this work, we provide evidence that the saddle point problem can be naturally avoided in variational quantum algorithms by exploiting the presence of stochasticity. We prove convergence guarantees and present practical examples in numerical simulations and on quantum hardware. We argue that the natural stochasticity of variational algorithms can be beneficial for avoiding strict saddle points, i.e., those saddle points with at least one negative Hessian eigenvalue. This insight that some levels of shot noise can help is expected to add a new perspective to near-term variational quantum algorithms.
Quantum computing has for many years been a hugely inspiring theoretical idea. Already in the 1980s it was suggested that quantum devices could possibly have superior computational capabilities over computers operating based on classical laws [1, 2]. It is a relatively recent development that devices have been devised that may indeed have computational capabilities beyond classical means [3–6]. These devices go substantially beyond what was possible not long ago. And still, they will remain unavoidably noisy and imperfect for many years to come. The quantum devices that are available today and presumably will be in the near future are often conceived as hybrid quantum devices running variational quantum algorithms [7], where a quantum circuit is addressed by a substantially larger surrounding classical circuit. This classical circuit takes measurements from the quantum device and appropriately varies the variational parameters of the quantum device in an update. Large classes of variational quantum eigensolvers (VQE), the quantum approximate optimization algorithm (QAOA), and models for quantum-assisted machine learning are thought to operate along those lines, based on suitable loss functions to be minimized [8–14]. In fact, many near-term quantum algorithms in the era of noisy intermediate-scale quantum (NISQ) computing [15] belong to the class of variational quantum algorithms. While this is an exciting development, it places a substantial burden on understanding how reasonable and practical classical control can be conceived.
Generally, when the optimization space is high-dimensional, updates of the variational parameters are done via gradient evaluations [16–19], while zeroth-order and second-order methods are, in principle, also applicable, but typically only up to a limited number of parameters. This makes a lot of sense, as one may think that going downhill in a variational quantum algorithm is a good idea. That said, the concomitant classical optimization problems are generally not convex, and the variational landscapes are marred by globally suboptimal local optima and saddle points. In fact, it is known that the problem of optimizing the variational parameters of quantum circuits is computationally hard in worst-case complexity [20]. While this is not of too much concern in practical considerations (since it is often sufficient to find a "good" local minimum instead of the global minimum) and resembles an analogous situation in classical machine learning, it does point to the fact that one should expect a rugged optimization landscape, featuring different local minima as well as saddle points. Such saddle points can indeed be a burden to feasible and practical classical optimization of variational quantum algorithms.
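To make the role of such saddle points concrete, the following minimal sketch (our own illustration on a classical toy landscape, not one of the examples studied in this work) defines a two-parameter loss with a strict saddle at the origin. Plain gradient descent started from a point whose gradient has no component along the unstable direction converges to the saddle rather than to one of the minima.

```python
import numpy as np

# Toy loss L(x, y) = x^4 - 2 x^2 + y^2 (purely illustrative, not a VQA loss).
# Stationary points: (0, 0) is a strict saddle, (+1, 0) and (-1, 0) are local minima.
def loss(theta):
    x, y = theta
    return x**4 - 2 * x**2 + y**2

def grad(theta):
    x, y = theta
    return np.array([4 * x**3 - 4 * x, 2 * y])

eta = 0.05                      # learning rate
theta = np.array([0.0, 1.0])    # gradient points only along y; the unstable x-direction is never explored

for _ in range(500):
    theta = theta - eta * grad(theta)

# The iterates converge to the saddle (0, 0) with loss 0,
# even though the minima at (+-1, 0) have loss -1.
print(theta, loss(theta))
```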
Figure 1. Stochasticity in variational quantum algorithms can help in
avoiding (strict) saddle points.
In this work, we establish the notion that in such situations, small noise levels can actually be of substantial help. More precisely, we show that some levels of statistical noise (specifically the kind of noise that naturally arises from a finite number of measurements to estimate quantum expectation values) can even be beneficial. We get inspiration from and build on a powerful mathematical theory in classical machine learning: there, theorems have been established that say that "noise" can help gradient-descent optimization not to get stuck at saddle points [21, 22]. Building on such ideas, we show that they can be adapted and developed to be applicable to the setting of variational quantum algorithms. We then argue that the "natural" statistical noise of a quantum experiment can play the role of the artificial noise that is inserted by hand in classical machine learning algorithms to avoid saddle points. We maintain the precise and rigorous mindset of Ref. [21], but show that the findings have practical importance and can be made concrete use of when running variational quantum algorithms on near-term quantum devices. In previous studies it has been observed anecdotally that small levels of noise can indeed be helpful for improving the optimization procedure [19, 23–28]. What is more, variational algorithms have been seen as being noise-robust in a sense [29]. That said, while in the past rigorous convergence guarantees have been formulated for convex loss functions of VQAs [19, 28], in this work we focus on the non-convex scenario, where saddle points and local minima are present. A systematic and rigorous analysis of the type conducted here, one that explains the origin of the phenomenon of noise facilitating optimization, has so far been lacking.
It is important to stress that the noise we refer to in our theorems is the type of noise that adds stochasticity to the gradient estimates, such as that arising from a finite number of measurements or the zero-average fluctuations that are involved in real experiments. Instances of global depolarizing noise are also covered, as discussed in Appendix 2. Thus, in this case, noise does not mean the generic quantum noise that results from the interaction with the environment, characterized by completely positive and trace-preserving (CPTP) maps, which can be substantially detrimental to the performance of the algorithm [30, 31]. In addition, it has been shown that noisy CPTP maps in the circuit may significantly worsen the problem of barren plateaus [32, 33], which is one of the main obstacles to the scalability of variational quantum algorithms (VQAs).
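As a small illustration of the kind of stochasticity we mean, the sketch below estimates the gradient of a single-qubit toy loss $\mathcal{L}(\theta) = \langle Z\rangle = \cos\theta$ (state $R_y(\theta)|0\rangle$) from a finite number of measurement shots via the parameter-shift rule. The toy circuit and the estimator choice are our own assumptions for illustration; they are not taken from the experiments reported in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def expval_Z(theta, shots=None):
    """<Z> for the state R_y(theta)|0>, i.e. cos(theta); with finite shots the +-1 outcomes are sampled."""
    exact = np.cos(theta)
    if shots is None:
        return exact
    p_plus = (1.0 + exact) / 2.0              # probability of measuring outcome +1
    n_plus = rng.binomial(shots, p_plus)
    return (2 * n_plus - shots) / shots       # empirical estimate of <Z>

def grad_estimate(theta, shots=None):
    """Parameter-shift rule: dL/dtheta = [L(theta + pi/2) - L(theta - pi/2)] / 2."""
    return 0.5 * (expval_Z(theta + np.pi / 2, shots) - expval_Z(theta - np.pi / 2, shots))

theta = 0.3
print(grad_estimate(theta))                                    # exact value: -sin(0.3)
print([grad_estimate(theta, shots=100) for _ in range(3)])     # zero-mean fluctuations around -sin(0.3)
```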
We perform numerical experiments and show examples where optimization with noiseless gradient descent gets stuck at saddle points, whereas adding some noise lets us escape this problem and reach the minimum, convincingly demonstrating the functioning of the approach. We verify the latter not only in numerical simulations, but also using data from a real IBM quantum machine.
Preliminaries
In our work, we will show how a class of saddle points, the so-called strict saddle points, can be avoided in noisy gradient descent. In developing our machinery, we build strongly on the rigorous results laid out in Ref. [21] and uplift them to the quantum setting at hand; this requires method development in its own right. First, we introduce some useful definitions and theorems (see Ref. [21] for a more in-depth discussion).
Throughout this work, we consider the problem of minimizing a function $\mathcal{L}: \mathbb{R}^p \to \mathbb{R}$. We denote its gradient at $\theta$ by $\nabla\mathcal{L}(\theta)$ and its Hessian matrix at $\theta$ by $\nabla^2\mathcal{L}(\theta)$. We denote by $\|\cdot\|_2$ the $\ell_2$ norm of a vector. $\|\cdot\|_{\mathrm{HS}}$ and $\|\cdot\|_\infty$ denote, respectively, the Hilbert-Schmidt norm and the largest-eigenvalue norm of a matrix. We denote by $\lambda_{\min}(\cdot)$ the minimum eigenvalue of a matrix.
Definition 1 ($L$-Lipschitz function). A function $g: \mathbb{R}^p \to \mathbb{R}^d$ is $L$-Lipschitz if and only if
$$\|g(\theta) - g(\phi)\|_2 \le L \, \|\theta - \phi\|_2 \qquad (1)$$
for every $\theta$ and $\phi$.
Definition 2 ($\beta$-strong smoothness). A differentiable function $\mathcal{L}: \mathbb{R}^p \to \mathbb{R}$ is called $\beta$-strongly smooth if and only if its gradient is a $\beta$-Lipschitz function, i.e.,
$$\|\nabla\mathcal{L}(\theta) - \nabla\mathcal{L}(\phi)\|_2 \le \beta \, \|\theta - \phi\|_2 \qquad (2)$$
for every $\theta$ and $\phi$.
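As a quick numerical illustration of this definition (our own toy example, not taken from this work): for a quadratic loss $\mathcal{L}(\theta) = \tfrac{1}{2}\theta^\top A\theta$ with symmetric $A$, the gradient is $A\theta$, so the smallest valid smoothness constant is $\beta = \|A\|_\infty$, the largest eigenvalue in magnitude. The snippet checks the defining inequality on random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
A = (A + A.T) / 2                          # symmetric Hessian of L(theta) = 0.5 * theta @ A @ theta

grad = lambda theta: A @ theta
beta = np.linalg.norm(A, ord=2)            # spectral norm = smallest valid beta for this quadratic

for _ in range(1000):
    theta, phi = rng.normal(size=4), rng.normal(size=4)
    lhs = np.linalg.norm(grad(theta) - grad(phi))           # ||grad L(theta) - grad L(phi)||_2
    assert lhs <= beta * np.linalg.norm(theta - phi) + 1e-12
```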
Definition 3 (Stationary point). If $\mathcal{L}$ is differentiable, $\theta$ is defined to be a stationary point if
$$\|\nabla\mathcal{L}(\theta)\|_2 = 0. \qquad (3)$$
Definition 4 ($\epsilon$-approximate stationary point). If $\mathcal{L}$ is differentiable, $\theta$ is defined to be an $\epsilon$-approximate stationary point if
$$\|\nabla\mathcal{L}(\theta)\|_2 \le \epsilon. \qquad (4)$$
Definition 5 (Local minimum, local maximum, and saddle point). If $\mathcal{L}$ is differentiable, a stationary point $\theta^\ast$ is a
- local minimum, if there exists $\delta > 0$ such that $\mathcal{L}(\theta^\ast) \le \mathcal{L}(\theta)$ for any $\theta$ with $\|\theta - \theta^\ast\|_2 \le \delta$;
- local maximum, if there exists $\delta > 0$ such that $\mathcal{L}(\theta^\ast) \ge \mathcal{L}(\theta)$ for any $\theta$ with $\|\theta - \theta^\ast\|_2 \le \delta$;
- saddle point, otherwise.
Definition 6 ($\rho$-Lipschitz Hessian). A twice differentiable function $\mathcal{L}$ has a $\rho$-Lipschitz Hessian matrix $\nabla^2\mathcal{L}$ if and only if
$$\|\nabla^2\mathcal{L}(\theta) - \nabla^2\mathcal{L}(\phi)\|_{\mathrm{HS}} \le \rho \, \|\theta - \phi\|_2 \qquad (5)$$
for every $\theta$ and $\phi$ (where $\|\cdot\|_{\mathrm{HS}}$ is the Hilbert-Schmidt norm).
Definition 7 (Gradient descent). Given a differentiable function $\mathcal{L}$, the gradient descent algorithm is defined by the update rule
$$\theta^{t+1}_i = \theta^t_i - \eta \, \nabla_i \mathcal{L}(\theta^t), \qquad (6)$$
where $\eta > 0$ is called the learning rate.
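Written out as code, the update rule of Definition 7 looks as follows (a minimal generic sketch; the gradient oracle `grad` would in practice be supplied by estimates from the quantum device):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, num_steps=1000):
    """Plain gradient descent, Eq. (6): theta_{t+1} = theta_t - eta * grad(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad(theta)
    return theta
```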
The convergence time of the gradient descent algorithm is
given by the following theorem [21].
Theorem 8 (Gradient descent complexity). Given a $\beta$-strongly smooth function $\mathcal{L}(\cdot)$, for any $\epsilon > 0$, if we set the learning rate to $\eta = 1/\beta$, then the number of iterations required by the gradient descent algorithm such that it will visit an $\epsilon$-approximate stationary point is
$$O\!\left(\frac{\beta\,(\mathcal{L}(\theta_0) - \mathcal{L}^\ast)}{\epsilon^2}\right),$$
where $\theta_0$ is the initial point and $\mathcal{L}^\ast$ is the value of $\mathcal{L}$ at the global minimum.
It is important to note that this result does not depend on the number of free parameters. Also, the stationary point at which the algorithm converges is not necessarily a local minimum, but can also be a saddle point. Note that a generic saddle point satisfies $\lambda_{\min}(\nabla^2\mathcal{L}(\theta_s)) \le 0$, where $\lambda_{\min}(\cdot)$ is the minimum eigenvalue. We now define a subclass of saddle points.
Definition 9 (Strict saddle point). $\theta_s$ is a strict saddle point of a twice differentiable function $\mathcal{L}$ if and only if $\theta_s$ is a stationary point and the minimum eigenvalue of the Hessian satisfies $\lambda_{\min}(\nabla^2\mathcal{L}(\theta_s)) < 0$.
By adding the strictness condition, we exclude the case in which a saddle point satisfies $\lambda_{\min}(\nabla^2\mathcal{L}(\theta_s)) = 0$. Moreover, note that a local maximum satisfies our definition of a strict saddle point. Analogously to Ref. [21], in this work we focus on avoiding strict saddle points. Hence, it is useful to introduce the following definition.
Definition 10 (Second-order stationary point). Given a twice differentiable function $\mathcal{L}(\cdot)$, $\theta$ is a second-order stationary point if and only if
$$\nabla\mathcal{L}(\theta) = 0 \quad \text{and} \quad \lambda_{\min}(\nabla^2\mathcal{L}(\theta)) \ge 0. \qquad (7)$$
Definition 11 ($\epsilon$-second-order stationary point). For a $\rho$-Lipschitz-Hessian function $\mathcal{L}(\cdot)$, $\theta$ is an $\epsilon$-second-order stationary point if
$$\|\nabla\mathcal{L}(\theta)\|_2 \le \epsilon \quad \text{and} \quad \lambda_{\min}(\nabla^2\mathcal{L}(\theta)) \ge -\sqrt{\rho\epsilon}. \qquad (8)$$
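To connect Definitions 9-11 to something directly computable (again using the illustrative toy loss from above, with placeholder values of $\epsilon$ and $\rho$ chosen by us), one can classify a candidate point by its gradient norm and the minimum eigenvalue of its Hessian:

```python
import numpy as np

# Toy loss from above: L(x, y) = x^4 - 2 x^2 + y^2.
grad = lambda th: np.array([4 * th[0]**3 - 4 * th[0], 2 * th[1]])
hess = lambda th: np.array([[12 * th[0]**2 - 4, 0.0], [0.0, 2.0]])

def classify(theta, eps=1e-3, rho=24.0):
    g = np.linalg.norm(grad(theta))                     # ||grad L(theta)||_2
    lam_min = np.linalg.eigvalsh(hess(theta)).min()     # lambda_min of the Hessian
    if g > eps:
        return "not an eps-approximate stationary point"
    if lam_min < -np.sqrt(rho * eps):
        return "strict saddle point"
    return "eps-second-order stationary point"

print(classify(np.array([0.0, 0.0])))   # strict saddle: lambda_min = -4
print(classify(np.array([1.0, 0.0])))   # eps-second-order stationary point: lambda_min = 8
```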
Gradient descent makes a non-zero step only when the gradient is non-zero, and thus in the non-convex setting it can get stuck at saddle points. A simple variant of gradient descent (GD) is the perturbed gradient descent (PGD) method [21], which adds randomness to the iterates at each step.
Definition 12 (Perturbed gradient descent). Given a differentiable function $\mathcal{L}: \mathbb{R}^p \to \mathbb{R}$, the perturbed gradient descent algorithm is defined by the update rule
$$\theta^{t+1}_i = \theta^t_i - \eta \, \nabla_i \mathcal{L}(\theta^t) + \zeta^t_i, \qquad (9)$$
where $\eta > 0$ is the learning rate and $\zeta^t$ is a normally distributed random variable with mean $\mu = 0$ and variance $\sigma^2 = r^2/p$, with $r \in \mathbb{R}$.
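A minimal sketch of this perturbed update, applied to the same toy loss as before: starting from the point at which plain gradient descent stalled at the strict saddle, the Gaussian perturbation (variance $r^2/p$ per coordinate) kicks the iterate off the unstable direction and it ends up near one of the minima, up to fluctuations of order $r$. This is our own classical illustration, not one of the quantum experiments of this work.

```python
import numpy as np

rng = np.random.default_rng(2)

def perturbed_gradient_descent(grad, theta0, eta=0.05, r=0.1, num_steps=500):
    """PGD, Eq. (9): theta_{t+1} = theta_t - eta * grad(theta_t) + zeta_t, with zeta_t ~ N(0, r^2/p * I)."""
    theta = np.asarray(theta0, dtype=float)
    p = theta.size
    for _ in range(num_steps):
        zeta = rng.normal(scale=r / np.sqrt(p), size=p)
        theta = theta - eta * grad(theta) + zeta
    return theta

# Toy loss from above: L(x, y) = x^4 - 2 x^2 + y^2, with a strict saddle at the origin.
grad = lambda th: np.array([4 * th[0]**3 - 4 * th[0], 2 * th[1]])
print(perturbed_gradient_descent(grad, [0.0, 1.0]))   # ends near (+1, 0) or (-1, 0), escaping the saddle
```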
In Ref. [21], the authors show that if we pick $r = \tilde{\Theta}(\epsilon)$, PGD will find an $\epsilon$-second-order stationary point in a number of iterations that has only a poly-logarithmic dependence on the number of free parameters, i.e., it has the same complexity as (standard) gradient descent up to poly-logarithmic factors.
Theorem 13 ([21]). Let the function $\mathcal{L}: \mathbb{R}^p \to \mathbb{R}$ be $\beta$-strongly smooth and have a $\rho$-Lipschitz Hessian. Then, for any $\epsilon, \delta > 0$, the PGD algorithm starting at the point $\theta_0$, with parameters $\eta = \tilde{\Theta}(1/\beta)$ and $r = \tilde{\Theta}(\epsilon)$, will visit an $\epsilon$-second-order stationary point at least once, with probability at least $1 - \delta$, within the following number of iterations:
$$\tilde{O}\!\left(\frac{\beta\,(\mathcal{L}(\theta_0) - \mathcal{L}^\ast)}{\epsilon^2}\right),$$
where $\tilde{O}$ and $\tilde{\Theta}$ hide poly-logarithmic factors in $p$, $\beta$, $\rho$, $1/\epsilon$, $1/\delta$, and $\Delta_{\mathcal{L}} := \mathcal{L}(\theta_0) - \mathcal{L}^\ast$. Here $\theta_0$ is the initial point and $\mathcal{L}^\ast$ is the value of $\mathcal{L}$ at the global minimum.
This theorem has been proven in Ref. [21] for Gaussian distributions, but the authors have pointed out that this is not strictly necessary and that it can be generalized to other types of probability distributions to which appropriate concentration inequalities apply (for a more in-depth discussion, see Ref. [21]).
In Ref. [34], it has been shown that although standard GD (without perturbations) almost always escapes saddle points asymptotically [35], there are (non-pathological) cases in which the optimization requires exponential time to escape. This highlights the importance of using gradient descent with perturbations.
Figure 2. Comparison of the loss evolution with and without noise. The noise levels correspond to manually added Gaussian perturbations, and we keep the same initial conditions. (a) Four different values of the standard deviation $r$. (b) Noiseless case and the noisy case with noise standard deviation $r = 0.1$.