Stochastic noise can be helpful for variational quantum algorithms
Junyu Liu,^{1,2,3,4} Frederik Wilde,^{5} Antonio Anna Mele,^{5} Liang Jiang,^{1,2} and Jens Eisert^{5,6,7}

^{1} Pritzker School of Molecular Engineering, The University of Chicago, Chicago, IL 60637, USA
^{2} Chicago Quantum Exchange, Chicago, IL 60637, USA
^{3} Kadanoff Center for Theoretical Physics, The University of Chicago, Chicago, IL 60637, USA
^{4} qBraid Co., Chicago, IL 60615, USA
^{5} Dahlem Center for Complex Quantum Systems, Freie Universität Berlin, Berlin, 14195, Germany
^{6} Helmholtz-Zentrum Berlin für Materialien und Energie, 14109 Berlin, Germany
^{7} Fraunhofer Heinrich Hertz Institute, 10587 Berlin, Germany
Saddle points constitute a crucial challenge for first-order gradient descent algorithms. In classical machine learning they are avoided, for example, by means of stochastic gradient descent methods. In this work, we provide evidence that the saddle point problem can be naturally avoided in variational quantum algorithms by exploiting the presence of stochasticity. We prove convergence guarantees and present practical examples in numerical simulations and on quantum hardware. We argue that the natural stochasticity of variational algorithms can be beneficial for avoiding strict saddle points, i.e., those saddle points with at least one negative Hessian eigenvalue. This insight that some levels of shot noise can help is expected to add a new perspective to near-term variational quantum algorithms.
Quantum computing has for many years been a hugely inspiring theoretical idea. Already in the 1980s it was suggested that quantum devices could possibly have superior computational capabilities over computers operating based on classical laws [1, 2]. It is a relatively recent development that devices have been devised that may indeed have computational capabilities beyond classical means [3–6]. These devices go substantially beyond what was possible not long ago. And still, they will remain unavoidably noisy and imperfect for many years to come. The quantum devices that are available today and presumably will be in the near future are often conceived as hybrid quantum devices running variational quantum algorithms [7], where a quantum circuit is addressed by a substantially larger surrounding classical circuit. This classical circuit takes measurements from the quantum device and appropriately varies the variational parameters of the quantum device in an update. Large classes of variational quantum eigensolvers (VQE), the quantum approximate optimization algorithm (QAOA), and models for quantum-assisted machine learning are thought to operate along those lines, based on suitable loss functions to be minimized [8–14]. In fact, many near-term quantum algorithms in the era of noisy intermediate-scale quantum (NISQ) computing [15] belong to the class of variational quantum algorithms. While this is an exciting development, it places a substantial burden on understanding how reasonable and practical classical control can be conceived.
Generally, when the optimization space is high-dimensional, updates of the variational parameters are done via gradient evaluations [16–19], while zeroth-order and second-order methods are, in principle, also applicable, but typically only up to a limited number of parameters. This makes a lot of sense, as one may think that going downhill in a variational quantum algorithm is a good idea. That said, the concomitant classical optimization problems are generally not convex, and the variational landscapes are marred by globally suboptimal local optima and saddle points. In fact, it is known that the problem of optimizing the variational parameters of quantum circuits is computationally hard in worst-case complexity [20]. While this is not of too much concern in practical considerations (since it is often sufficient to find a "good" local minimum instead of the global minimum) and resembles an analogous situation in classical machine learning, it does point to the fact that one should expect a rugged optimization landscape, featuring different local minima as well as saddle points. Such saddle points can indeed be a burden to feasible and practical classical optimization of variational quantum algorithms.
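To make the role of such saddle points concrete, the following minimal sketch (our own illustration on a classical toy landscape, not one of the examples studied in this work) defines a two-parameter loss with a strict saddle at the origin. Plain gradient descent started from a point whose gradient has no component along the unstable direction converges to the saddle rather than to one of the minima.

```python
import numpy as np

# Toy loss L(x, y) = x^4 - 2 x^2 + y^2 (purely illustrative, not a VQA loss).
# Stationary points: (0, 0) is a strict saddle, (+1, 0) and (-1, 0) are local minima.
def loss(theta):
    x, y = theta
    return x**4 - 2 * x**2 + y**2

def grad(theta):
    x, y = theta
    return np.array([4 * x**3 - 4 * x, 2 * y])

eta = 0.05                      # learning rate
theta = np.array([0.0, 1.0])    # gradient points only along y; the unstable x-direction is never explored

for _ in range(500):
    theta = theta - eta * grad(theta)

# The iterates converge to the saddle (0, 0) with loss 0,
# even though the minima at (+-1, 0) have loss -1.
print(theta, loss(theta))
```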
Figure 1. Stochasticity in variational quantum algorithms can help in
avoiding (strict) saddle points.
In this work, we establish the notion that in such situations, small noise levels can actually be of substantial help. More precisely, we show that some levels of statistical noise (specifically the kind of noise that naturally arises from a finite number of measurements to estimate quantum expectation values) can even be beneficial. We get inspiration from and build on a powerful mathematical theory in classical machine learning: there, theorems have been established that say that "noise" can help gradient-descent optimization not to get stuck at saddle points [21, 22]. Building on such ideas, we show that they can be adapted and developed to be applicable to the setting of variational quantum algorithms. We then argue that the "natural" statistical noise of a quantum experiment can play the role of the artificial noise that is inserted by hand in classical machine learning algorithms to avoid saddle points. We maintain the precise and rigorous mindset of Ref. [21], but show that the findings have practical importance and can be made concrete use of when running variational quantum algorithms on near-term quantum devices. In previous studies it has been observed anecdotally that small levels of noise can indeed be helpful for improving the optimization procedure [19, 23–28]. What is more, variational algorithms have been seen as being noise-robust in a sense [29]. That said, while in the past rigorous convergence guarantees have been formulated for convex loss functions of VQAs [19, 28], in this work we focus on the non-convex scenario, where saddle points and local minima are present. A systematic and rigorous analysis of the type conducted here, one that explains the origin of the phenomenon of noise facilitating optimization, has so far been lacking.
It is important to stress that the noise we refer to in our theorems is the type of noise that adds stochasticity to the gradient estimates, such as that arising from a finite number of measurements or the zero-average fluctuations that are involved in real experiments. Instances of global depolarizing noise are also covered, as discussed in Appendix 2. Thus, in this case, noise does not mean the generic quantum noise that results from the interaction with the environment, characterized by completely positive and trace-preserving (CPTP) maps, which can be substantially detrimental to the performance of the algorithm [30, 31]. In addition, it has been shown that noisy CPTP maps in the circuit may significantly worsen the problem of barren plateaus [32, 33], which is one of the main obstacles to the scalability of variational quantum algorithms (VQAs).
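As a small illustration of the kind of stochasticity we mean, the sketch below estimates the gradient of a single-qubit toy loss $\mathcal{L}(\theta) = \langle Z\rangle = \cos\theta$ (state $R_y(\theta)|0\rangle$) from a finite number of measurement shots via the parameter-shift rule. The toy circuit and the estimator choice are our own assumptions for illustration; they are not taken from the experiments reported in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def expval_Z(theta, shots=None):
    """<Z> for the state R_y(theta)|0>, i.e. cos(theta); with finite shots the +-1 outcomes are sampled."""
    exact = np.cos(theta)
    if shots is None:
        return exact
    p_plus = (1.0 + exact) / 2.0              # probability of measuring outcome +1
    n_plus = rng.binomial(shots, p_plus)
    return (2 * n_plus - shots) / shots       # empirical estimate of <Z>

def grad_estimate(theta, shots=None):
    """Parameter-shift rule: dL/dtheta = [L(theta + pi/2) - L(theta - pi/2)] / 2."""
    return 0.5 * (expval_Z(theta + np.pi / 2, shots) - expval_Z(theta - np.pi / 2, shots))

theta = 0.3
print(grad_estimate(theta))                                    # exact value: -sin(0.3)
print([grad_estimate(theta, shots=100) for _ in range(3)])     # zero-mean fluctuations around -sin(0.3)
```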
We perform numerical experiments and show examples where optimization with noiseless gradient descent gets stuck at saddle points, whereas adding some noise lets us escape this problem and reach the minimum, convincingly demonstrating the functioning of the approach. We verify the latter not only in numerical simulations, but also using data from a real IBM quantum machine.
Preliminaries
In our work, we will show how a class of saddle points, the so-called strict saddle points, can be avoided in noisy gradient descent. In developing our machinery, we build strongly on the rigorous results laid out in Ref. [21] and uplift them to the quantum setting at hand; this requires method development in its own right. First, we introduce some useful definitions and theorems (see Ref. [21] for a more in-depth discussion).
Throughout this work, we consider the problem of minimizing a function $\mathcal{L}: \mathbb{R}^p \to \mathbb{R}$. We denote its gradient at $\theta$ by $\nabla\mathcal{L}(\theta)$ and its Hessian matrix at $\theta$ by $\nabla^2\mathcal{L}(\theta)$. We denote by $\|\cdot\|_2$ the $\ell_2$ norm of a vector. $\|\cdot\|_{\mathrm{HS}}$ and $\|\cdot\|_\infty$ denote, respectively, the Hilbert-Schmidt norm and the largest-eigenvalue norm of a matrix. We denote by $\lambda_{\min}(\cdot)$ the minimum eigenvalue of a matrix.
Definition 1 ($L$-Lipschitz function). A function $g: \mathbb{R}^p \to \mathbb{R}^d$ is $L$-Lipschitz if and only if
$$\|g(\theta) - g(\phi)\|_2 \le L \, \|\theta - \phi\|_2 \qquad (1)$$
for every $\theta$ and $\phi$.
Definition 2 ($\beta$-strong smoothness). A differentiable function $\mathcal{L}: \mathbb{R}^p \to \mathbb{R}$ is called $\beta$-strongly smooth if and only if its gradient is a $\beta$-Lipschitz function, i.e.,
$$\|\nabla\mathcal{L}(\theta) - \nabla\mathcal{L}(\phi)\|_2 \le \beta \, \|\theta - \phi\|_2 \qquad (2)$$
for every $\theta$ and $\phi$.
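As a quick numerical illustration of this definition (our own toy example, not taken from this work): for a quadratic loss $\mathcal{L}(\theta) = \tfrac{1}{2}\theta^\top A\theta$ with symmetric $A$, the gradient is $A\theta$, so the smallest valid smoothness constant is $\beta = \|A\|_\infty$, the largest eigenvalue in magnitude. The snippet checks the defining inequality on random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
A = (A + A.T) / 2                          # symmetric Hessian of L(theta) = 0.5 * theta @ A @ theta

grad = lambda theta: A @ theta
beta = np.linalg.norm(A, ord=2)            # spectral norm = smallest valid beta for this quadratic

for _ in range(1000):
    theta, phi = rng.normal(size=4), rng.normal(size=4)
    lhs = np.linalg.norm(grad(theta) - grad(phi))           # ||grad L(theta) - grad L(phi)||_2
    assert lhs <= beta * np.linalg.norm(theta - phi) + 1e-12
```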
Definition 3 (Stationary point). If $\mathcal{L}$ is differentiable, $\theta$ is defined to be a stationary point if
$$\|\nabla\mathcal{L}(\theta)\|_2 = 0. \qquad (3)$$
Definition 4 ($\epsilon$-approximate stationary point). If $\mathcal{L}$ is differentiable, $\theta$ is defined to be an $\epsilon$-approximate stationary point if
$$\|\nabla\mathcal{L}(\theta)\|_2 \le \epsilon. \qquad (4)$$
Definition 5 (Local minimum, local maximum, and saddle point). If $\mathcal{L}$ is differentiable, a stationary point $\theta^\ast$ is a
- local minimum, if there exists $\delta > 0$ such that $\mathcal{L}(\theta^\ast) \le \mathcal{L}(\theta)$ for any $\theta$ with $\|\theta - \theta^\ast\|_2 \le \delta$;
- local maximum, if there exists $\delta > 0$ such that $\mathcal{L}(\theta^\ast) \ge \mathcal{L}(\theta)$ for any $\theta$ with $\|\theta - \theta^\ast\|_2 \le \delta$;
- saddle point, otherwise.
Definition 6 ($\rho$-Lipschitz Hessian). A twice differentiable function $\mathcal{L}$ has a $\rho$-Lipschitz Hessian matrix $\nabla^2\mathcal{L}$ if and only if
$$\|\nabla^2\mathcal{L}(\theta) - \nabla^2\mathcal{L}(\phi)\|_{\mathrm{HS}} \le \rho \, \|\theta - \phi\|_2 \qquad (5)$$
for every $\theta$ and $\phi$ (where $\|\cdot\|_{\mathrm{HS}}$ is the Hilbert-Schmidt norm).
Definition 7 (Gradient descent). Given a differentiable function $\mathcal{L}$, the gradient descent algorithm is defined by the update rule
$$\theta^{t+1}_i = \theta^t_i - \eta \, \nabla_i \mathcal{L}(\theta^t), \qquad (6)$$
where $\eta > 0$ is called the learning rate.
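Written out as code, the update rule of Definition 7 looks as follows (a minimal generic sketch; the gradient oracle `grad` would in practice be supplied by estimates from the quantum device):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, num_steps=1000):
    """Plain gradient descent, Eq. (6): theta_{t+1} = theta_t - eta * grad(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad(theta)
    return theta
```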
The convergence time of the gradient descent algorithm is
given by the following theorem [21].
Theorem 8 (Gradient descent complexity). Given a $\beta$-strongly smooth function $\mathcal{L}(\cdot)$, for any $\epsilon > 0$, if we set the learning rate to $\eta = 1/\beta$, then the number of iterations required by the gradient descent algorithm such that it will visit an $\epsilon$-approximate stationary point is
$$O\!\left(\frac{\beta\,(\mathcal{L}(\theta_0) - \mathcal{L}^\ast)}{\epsilon^2}\right),$$
where $\theta_0$ is the initial point and $\mathcal{L}^\ast$ is the value of $\mathcal{L}$ at the global minimum.
It is important to note that this result does not depend on the number of free parameters. Also, the stationary point at which the algorithm converges is not necessarily a local minimum, but can also be a saddle point. Note that a generic saddle point satisfies $\lambda_{\min}(\nabla^2\mathcal{L}(\theta_s)) \le 0$, where $\lambda_{\min}(\cdot)$ is the minimum eigenvalue. We now define a subclass of saddle points.
Definition 9 (Strict saddle point). $\theta_s$ is a strict saddle point of a twice differentiable function $\mathcal{L}$ if and only if $\theta_s$ is a stationary point and the minimum eigenvalue of the Hessian satisfies $\lambda_{\min}(\nabla^2\mathcal{L}(\theta_s)) < 0$.
By adding the strictness condition, we exclude the case in which a saddle point satisfies $\lambda_{\min}(\nabla^2\mathcal{L}(\theta_s)) = 0$. Moreover, note that a local maximum satisfies our definition of a strict saddle point. Analogously to Ref. [21], in this work we focus on avoiding strict saddle points. Hence, it is useful to introduce the following definition.
Definition 10 (Second-order stationary point). Given a twice differentiable function $\mathcal{L}(\cdot)$, $\theta$ is a second-order stationary point if and only if
$$\nabla\mathcal{L}(\theta) = 0 \quad \text{and} \quad \lambda_{\min}(\nabla^2\mathcal{L}(\theta)) \ge 0. \qquad (7)$$
Definition 11 ($\epsilon$-second-order stationary point). For a $\rho$-Lipschitz-Hessian function $\mathcal{L}(\cdot)$, $\theta$ is an $\epsilon$-second-order stationary point if
$$\|\nabla\mathcal{L}(\theta)\|_2 \le \epsilon \quad \text{and} \quad \lambda_{\min}(\nabla^2\mathcal{L}(\theta)) \ge -\sqrt{\rho\epsilon}. \qquad (8)$$
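To connect Definitions 9-11 to something directly computable (again using the illustrative toy loss from above, with placeholder values of $\epsilon$ and $\rho$ chosen by us), one can classify a candidate point by its gradient norm and the minimum eigenvalue of its Hessian:

```python
import numpy as np

# Toy loss from above: L(x, y) = x^4 - 2 x^2 + y^2.
grad = lambda th: np.array([4 * th[0]**3 - 4 * th[0], 2 * th[1]])
hess = lambda th: np.array([[12 * th[0]**2 - 4, 0.0], [0.0, 2.0]])

def classify(theta, eps=1e-3, rho=24.0):
    g = np.linalg.norm(grad(theta))                     # ||grad L(theta)||_2
    lam_min = np.linalg.eigvalsh(hess(theta)).min()     # lambda_min of the Hessian
    if g > eps:
        return "not an eps-approximate stationary point"
    if lam_min < -np.sqrt(rho * eps):
        return "strict saddle point"
    return "eps-second-order stationary point"

print(classify(np.array([0.0, 0.0])))   # strict saddle: lambda_min = -4
print(classify(np.array([1.0, 0.0])))   # eps-second-order stationary point: lambda_min = 8
```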
Gradient descent makes a non-zero step only when the gradient is non-zero, and thus in the non-convex setting it can get stuck at saddle points. A simple variant of gradient descent (GD) is the perturbed gradient descent (PGD) method [21], which adds randomness to the iterates at each step.
Definition 12 (Perturbed gradient descent). Given a differentiable function $\mathcal{L}: \mathbb{R}^p \to \mathbb{R}$, the perturbed gradient descent algorithm is defined by the update rule
$$\theta^{t+1}_i = \theta^t_i - \eta \, \nabla_i \mathcal{L}(\theta^t) + \zeta^t_i, \qquad (9)$$
where $\eta > 0$ is the learning rate and $\zeta^t$ is a normally distributed random variable with mean $\mu = 0$ and variance $\sigma^2 = r^2/p$, with $r \in \mathbb{R}$.
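A minimal sketch of this perturbed update, applied to the same toy loss as before: starting from the point at which plain gradient descent stalled at the strict saddle, the Gaussian perturbation (variance $r^2/p$ per coordinate) kicks the iterate off the unstable direction and it ends up near one of the minima, up to fluctuations of order $r$. This is our own classical illustration, not one of the quantum experiments of this work.

```python
import numpy as np

rng = np.random.default_rng(2)

def perturbed_gradient_descent(grad, theta0, eta=0.05, r=0.1, num_steps=500):
    """PGD, Eq. (9): theta_{t+1} = theta_t - eta * grad(theta_t) + zeta_t, with zeta_t ~ N(0, r^2/p * I)."""
    theta = np.asarray(theta0, dtype=float)
    p = theta.size
    for _ in range(num_steps):
        zeta = rng.normal(scale=r / np.sqrt(p), size=p)
        theta = theta - eta * grad(theta) + zeta
    return theta

# Toy loss from above: L(x, y) = x^4 - 2 x^2 + y^2, with a strict saddle at the origin.
grad = lambda th: np.array([4 * th[0]**3 - 4 * th[0], 2 * th[1]])
print(perturbed_gradient_descent(grad, [0.0, 1.0]))   # ends near (+1, 0) or (-1, 0), escaping the saddle
```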
In Ref. [21], the authors show that if we pick $r = \tilde{\Theta}(\epsilon)$, PGD will find an $\epsilon$-second-order stationary point in a number of iterations that has only a poly-logarithmic dependence on the number of free parameters, i.e., it has the same complexity as (standard) gradient descent up to poly-logarithmic factors.
Theorem 13 ([21]). Let the function $\mathcal{L}: \mathbb{R}^p \to \mathbb{R}$ be $\beta$-strongly smooth and have a $\rho$-Lipschitz Hessian. Then, for any $\epsilon, \delta > 0$, the PGD algorithm starting at the point $\theta_0$, with parameters $\eta = \tilde{\Theta}(1/\beta)$ and $r = \tilde{\Theta}(\epsilon)$, will visit an $\epsilon$-second-order stationary point at least once, with probability at least $1 - \delta$, within the following number of iterations:
$$\tilde{O}\!\left(\frac{\beta\,(\mathcal{L}(\theta_0) - \mathcal{L}^\ast)}{\epsilon^2}\right),$$
where $\tilde{O}$ and $\tilde{\Theta}$ hide poly-logarithmic factors in $p$, $\beta$, $\rho$, $1/\epsilon$, $1/\delta$, and $\Delta_{\mathcal{L}} := \mathcal{L}(\theta_0) - \mathcal{L}^\ast$. Here $\theta_0$ is the initial point and $\mathcal{L}^\ast$ is the value of $\mathcal{L}$ at the global minimum.
This theorem has been proven in Ref. [21] for Gaussian distributions, but the authors have pointed out that this is not strictly necessary and that it can be generalized to other types of probability distributions to which appropriate concentration inequalities apply (for a more in-depth discussion, see Ref. [21]).
In Ref. [34], it has been shown that although standard GD (without perturbations) almost always escapes saddle points asymptotically [35], there are (non-pathological) cases in which the optimization requires exponential time to escape. This highlights the importance of using gradient descent with perturbations.
Figure 2. Comparison of the loss evolution with and without noise. The noise levels correspond to manually added Gaussian perturbations, and we keep the same initial conditions. (a) Four different values of the standard deviation $r$. (b) Noiseless case and the noisy case with noise standard deviation $r = 0.1$.