Generative model for learning quantum ensemble
via optimal transport loss
Hiroyuki Tezuka,^{1,2,3} Shumpei Uno,^{2,4} and Naoki Yamamoto^{2,5}
1Sony Group Corporation, 1-7-1 Konan, Minato-ku, Tokyo, 108-0075, Japan
2Quantum Computing Center, Keio University, Hiyoshi 3-14-1, Kohoku-ku, Yokohama 223-8522, Japan
3Graduate School of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa, 223-8522, Japan
4Mizuho Research & Technologies, Ltd., 2-3 Kanda-Nishikicho, Chiyoda-ku, Tokyo, 101-8443, Japan
5Department of Applied Physics and Physico-Informatics, Keio University, Hiyoshi 3-14-1, Kohoku-ku, Yokohama
223-8522, Japan
Abstract
Generative modeling is an unsupervised machine learning framework that exhibits strong performance in various machine learning tasks. Recently, several quantum versions of generative models have been proposed, some of which are even proven to have a quantum advantage. However, those methods are not directly applicable to constructing a generative model for learning a set of quantum states, i.e., a quantum ensemble. In this paper, we propose a quantum generative model that can learn a quantum ensemble in an unsupervised machine learning framework. The key idea is to introduce a new loss function based on the optimal transport loss, which has been widely used in classical machine learning due to its several good properties, e.g., no need to ensure a common support of the two ensembles. We then give an in-depth analysis of this measure, such as the scaling property of the approximation error. We also demonstrate the generative modeling with an application to a quantum anomaly detection problem, which cannot be handled via existing methods. The proposed model paves the way for a wide range of applications, such as the health check of quantum devices and efficient initialization of quantum computation.
1 Introduction
With the recent great progress of quantum algorithms for both noisy near-term and future fault-tolerant quantum devices, quantum machine learning (QML) in particular has attracted huge attention. QML is largely categorized into two regimes in view of the type of data, which can be roughly called classical data and quantum data. The former has the conventional meaning used in the classical case; in the supervised learning scenario, e.g., a quantum system is trained to give a prediction for given classical data such as an image. As for the latter, on the other hand, the task is to predict some property of a given quantum state drawn from a set of states, e.g., the phase of a many-body quantum state, again in the supervised learning scenario. Due to the obvious difficulty of directly representing a huge quantum state classically, some quantum advantages have been proven in QML for quantum data [1-3].
In the above paragraph we used the supervised learning setting to explain the difference between classical and quantum data. But the success of unsupervised learning in classical machine learning, particularly generative modeling, is of course notable; a variety of algorithms have demonstrated strong performance in several applications, such as image generation [4-6], molecular design [7], and anomaly detection [8]. Hence, it is quite reasonable that several quantum unsupervised learning algorithms have been actively developed, such as the quantum circuit Born machine (QCBM) [9,10], the quantum generative adversarial network (QGAN) [11,12], and the quantum autoencoder (QAE) [13,14]. Also, Ref. [12] studied the generative modeling problem for quantum data; the task is to construct a model quantum system producing a set of quantum states, i.e., a quantum ensemble, that approximates a given quantum ensemble. The model quantum system contains latent variables, the change of which corresponds to a change of the output quantum state of the system. In the classical case, such a generative model governed by latent variables is called an implicit model. It is known that, to efficiently train an implicit model, it is often preferable to minimize a distance between the model dataset and the training dataset, rather than minimizing, e.g., the divergence between two probability distributions. The optimal transport loss (OTL), which typically leads to the Wasserstein distance, is suitable for the purpose of measuring the distance between two datasets; actually, a quantum version of the Wasserstein distance was proposed in [15,16] and was applied to construct a generative model for quantum ensembles in the QGAN framework [17,18].
Along this line of research, in this paper we also focus on the generative modeling problem for quantum ensembles. We are motivated by the fact that the above-mentioned existing works employed the Wasserstein distance defined for two mixed quantum states corresponding to the training and model quantum ensembles, where each mixed state
is obtained by compressing all elements of the quantum ensemble into a single mixed state. This is clearly problematic, because this compression process loses a lot of information about the ensemble; for instance, single-qubit pure states uniformly distributed on the equator of the Bloch sphere may be compressed to the maximally mixed state, from which the original ensemble clearly cannot be recovered. Accordingly, it is obvious that learning a single mixed state produced from the training ensemble does not lead to a model system that can approximate the original training ensemble.
In this paper, hence, we propose a new quantum OTL, which directly measures the difference between two quantum ensembles. The generative model can then be obtained by minimizing this quantum OTL between a training quantum ensemble and the ensemble of pure quantum states produced from the model. As the generative model, we use a parameterized quantum circuit (PQC) that contains tuning parameters and latent variables, both of which are encoded in the angles of single-qubit rotation gates. A notable feature of the proposed OTL is that it takes the form of a sum of local functions, each operating on a few neighboring qubits. This condition (i.e., the locality of the cost) is indeed necessary to train the model without suffering from the so-called vanishing gradient issue [19], meaning that the gradient vector with respect to the parameters decreases exponentially fast as the number of qubits increases.
Using the proposed quantum OTL, which will be formally defined in Section 3, we show the following results. The first result is given in Section 4, which provides a performance analysis of the OTL and its gradient from several aspects, e.g., the scaling properties of the OTL as a function of the number of training data and the number of measurements. We also numerically confirm that the gradient of the OTL is certainly free from the vanishing gradient issue. The second result is provided in Section 5, showing some examples of constructing a generative model for a quantum ensemble by minimizing the OTL. This demonstration includes the application of the quantum generative model to an anomaly detection problem for quantum data; that is, once a generative model is constructed by learning a given quantum ensemble, it can be used to detect an anomalous quantum state by measuring the distance of this state to the output ensemble of the model. Section 6 gives concluding remarks, and some supporting materials, including the proofs of theorems, are given in the Appendix.
2 Preliminaries
In this section, we first review the implicit generative model for classical machine learning problems in Sec. 2.1. Next, Sec. 2.2 is devoted to describing the general OTL, which can be effectively used as a cost function to train a generative model.
2.1 Implicit Generative Model
The generative model is used to approximate an unknown probability distribution that produces a given training
dataset. The basic strategy to construct a generative model is as follows: assuming a probability distribution α(x) behind the given training dataset {x_i}_{i=1}^M ∈ X^M, where X denotes the space of random variables, we prepare a parameterized probability distribution β_θ(x) and learn the parameters θ that minimize an appropriate loss function defined on the training dataset.
In general, generative models are categorized into two types: prescribed models and implicit models. Prescribed generative modeling explicitly defines β_θ(x) with the parameters θ. Then we can calculate the log-likelihood function of β_θ(x_i), and the parameters are determined by the maximum likelihood method, which corresponds to minimizing the Kullback-Leibler divergence KL(α‖β_θ). On the other hand, implicit generative modeling does not give us an explicit form of β_θ(x). An important feature of the implicit generative model is that it can easily describe a probability distribution whose random variables are confined to a hidden low-dimensional manifold; also, the data-generation process can be interpreted as a physical process from a latent variable to the data [20]. Examples of implicit generative models include Variational Auto-Encoders [21], Generative Adversarial Networks [22], and flow-based methods [23]. This paper focuses on the implicit generative model.
In the implicit generative model we usually assume that the distribution behind the training data resides on a relatively low-dimensional manifold. That is, an implicit generative model is expressed as a map of a random latent variable z onto X; z resides in a latent space Z whose dimension N_z is significantly smaller than that of the sample space, N_x. The latent random variable z follows a known distribution γ(z) such as a uniform distribution or a Gaussian distribution. That is, the implicit model distribution is given by β_θ = G_θ#γ, where # is called the push-forward operator [24], which moves the distribution γ on Z to a probability distribution on X through the map G_θ. This implicit generative model is trained so that the set of samples generated from the model distribution is close to the set of training data, by adjusting the parameters θ to minimize some appropriate cost function L as follows:
\[
\theta^\star = \arg\min_{\theta} \mathcal{L}\big(\hat{\alpha}_M, \hat{\beta}_{\theta, M_g}\big). \qquad (2.1)
\]
α̂_M(x) and β̂_{θ,M_g} = G_θ#γ̂_{M_g}(z) denote the empirical distributions defined with the sampled data {x_i}_{i=1}^M and {z_i}_{i=1}^{M_g}, which follow the probability distributions α(x) and γ(z), respectively:
\[
\hat{\alpha}_M(x) = \frac{1}{M} \sum_{i=1}^{M} \delta(x - x_i), \qquad
\hat{\gamma}_{M_g}(z) = \frac{1}{M_g} \sum_{i=1}^{M_g} \delta(z - z_i). \qquad (2.2)
\]
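As an illustration (not part of the original formulation), the sampling step of such an implicit model can be sketched classically as a push-forward of Gaussian latent samples through a map G_θ. The map below is a toy nonlinear choice; the dimensions, parameter shapes, and the function name G_theta are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N_z, N_x = 2, 8          # latent and sample-space dimensions (N_z << N_x)
M_g = 100                # number of generated samples

# Hypothetical model parameters theta = (W, b); in practice these are learned.
W = rng.normal(size=(N_x, N_z))
b = rng.normal(size=N_x)

def G_theta(z):
    """Push-forward map G_theta: Z -> X (a toy nonlinear choice)."""
    return np.tanh(W @ z + b)

# Sample latent variables z_j ~ gamma(z) (here a standard Gaussian) and push
# them forward; the resulting set defines the empirical distribution beta_hat.
Z = rng.normal(size=(M_g, N_z))
model_samples = np.array([G_theta(z) for z in Z])   # shape (M_g, N_x)
```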
2.2 Optimal Transport Loss
The OTL is used in various fields such as image analysis, natural language processing, and finance [24-27]. In particular, the OTL is widely used as a loss function in generative modeling, mainly because it is applicable even when the supports of the probability distributions do not match, and it can naturally incorporate the distance in the sample space X [28-33]. The OTL is defined as the minimum cost of moving a probability distribution α to another distribution β:
Definition 1 (Optimal Transport Loss [34]).
\[
\mathcal{L}_c(\alpha, \beta) = \min_{\pi} \int c(x, y)\, d\pi(x, y),
\quad \text{subject to} \quad
\int \pi(x, y)\, dx = \beta(y), \;\;
\int \pi(x, y)\, dy = \alpha(x), \;\;
\pi(x, y) \ge 0, \qquad (2.3)
\]
where c(x, y) ≥ 0 is a non-negative function on X × X that represents the transport cost from x to y, and is called the ground cost. Also, we call the set of couplings π that minimizes L_c(α, β) the optimal transport plan.
In general, the OTL does not meet the axioms of metric between probability distributions; but it does when the
ground cost is represented in terms of a metric function as follows:
Definition 2 (p-Wasserstein distance [35]). When the ground cost c(x, y) is expressed as c(x, y) = d(x, y)^p with a metric function d(x, y) and a real positive constant p, the p-Wasserstein distance is defined as
\[
\mathcal{W}_p(\alpha, \beta) = \mathcal{L}_{d^p}(\alpha, \beta)^{1/p}. \qquad (2.4)
\]
The p-Wasserstein distance satisfies the conditions of a metric between probability distributions. That is, for arbitrary probability distributions α, β, γ, the p-Wasserstein distance W_p satisfies W_p(α, β) ≥ 0 and W_p(α, β) = W_p(β, α); also it satisfies W_p(α, β) = 0 ⇔ α = β and the triangle inequality W_p(α, γ) ≤ W_p(α, β) + W_p(β, γ).
In general it is difficult to directly handle the probability distributions α and β_θ to minimize the OTL L_c(α, β_θ). Instead, as mentioned in Eq. (2.1), we minimize an approximation of the OTL via the empirical distributions (2.2):
Definition 3 (Empirical estimator for optimal transport loss [35]).
\[
\mathcal{L}_c\big(\hat{\alpha}_M, \hat{\beta}_{\theta, M_g}\big)
= \min_{\{\pi_{i,j}\}_{i,j=1}^{M, M_g}} \sum_{i=1}^{M} \sum_{j=1}^{M_g} c\big(x_i, G_\theta(z_j)\big)\, \pi_{i,j},
\quad \text{subject to} \quad
\sum_{i=1}^{M} \pi_{i,j} = \frac{1}{M_g}, \;\;
\sum_{j=1}^{M_g} \pi_{i,j} = \frac{1}{M}, \;\;
\pi_{i,j} \ge 0. \qquad (2.5)
\]
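The estimator (2.5) is a finite linear program and can be solved with standard tools. The following is a minimal sketch (not the authors' implementation) using NumPy and SciPy, with a squared Euclidean ground cost chosen only for illustration; the function name empirical_otl is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def empirical_otl(X, Y, ground_cost):
    """Empirical OT loss (2.5) between samples X (M, d) and Y (M_g, d),
    with uniform weights 1/M and 1/M_g, solved as a linear program."""
    M, Mg = len(X), len(Y)
    # Cost matrix C[i, j] = c(x_i, G_theta(z_j)); here Y plays the role of the
    # generated samples G_theta(z_j).
    C = np.array([[ground_cost(x, y) for y in Y] for x in X])

    # Decision variable: pi flattened row-major, pi[i, j] -> index i*Mg + j.
    A_eq = np.zeros((Mg + M, M * Mg))
    for j in range(Mg):
        A_eq[j, j::Mg] = 1.0                       # sum_i pi[i, j] = 1/Mg
    for i in range(M):
        A_eq[Mg + i, i * Mg:(i + 1) * Mg] = 1.0    # sum_j pi[i, j] = 1/M
    b_eq = np.concatenate([np.full(Mg, 1.0 / Mg), np.full(M, 1.0 / M)])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# Example: squared Euclidean ground cost between two small point clouds.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
Y = rng.normal(loc=0.5, size=(30, 2))
print(empirical_otl(X, Y, lambda x, y: np.sum((x - y) ** 2)))
```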
The empirical estimator converges as L_c(α̂_M, β̂_{θ,M}) → L_c(α, β_θ) in the limit M = M_g → ∞. In general, the speed of this convergence is of the order O(M^{-1/N_x}), with N_x the dimension of the sample space X [36], but the p-Wasserstein distance enjoys the following convergence law [37].
Theorem 4 (Convergence rate of the p-Wasserstein distance). For the upper Wasserstein dimension d*_p(α) (given in Definition 4 of [37]) of the probability distribution α, the following holds when s is larger than d*_p(α):
\[
\mathbb{E}\big[\mathcal{W}_p(\alpha, \hat{\alpha}_M)\big] \lesssim O\big(M^{-1/s}\big), \qquad (2.6)
\]
where the expectation E is taken with respect to the samples drawn from the empirical distribution α̂_M.
Intuitively, the upper Wasserstein dimension d*_p(α) can be interpreted as the support dimension of the probability distribution α, which corresponds to the dimension of the latent space, N_z, in the implicit generative model. Exploiting the metric properties of the p-Wasserstein distance, the following corollaries are immediately derived from Theorem 4:
Corollary 5 (Convergence rate of the p-Wasserstein distance between empirical distributions sampled from a common distribution). Let α̂_{1,M} and α̂_{2,M} be two different empirical distributions sampled from a common distribution α, each consisting of M samples. Then the following holds for s > d*_p(α):
\[
\mathbb{E}\big[\mathcal{W}_p(\hat{\alpha}_{1,M}, \hat{\alpha}_{2,M})\big] \lesssim O\big(M^{-1/s}\big), \qquad (2.7)
\]
where the expectation E is taken with respect to the samples drawn from the empirical distributions α̂_{1,M} and α̂_{2,M}.
Corollary 6 (Convergence rate of the p-Wasserstein distance between different empirical distributions). Suppose that the upper Wasserstein dimensions of the probability distributions α and β_θ are at most d*_p; then the following holds for s > d*_p:
\[
\mathbb{E}\big[\mathcal{W}_p(\alpha, \beta_\theta) - \mathcal{W}_p(\hat{\alpha}_M, \hat{\beta}_{\theta, M})\big] \lesssim O\big(M^{-1/s}\big), \qquad (2.8)
\]
where the expectation E is taken with respect to the samples drawn from the empirical distributions α̂_M and β̂_{θ,M}.
These corollaries indicate that the empirical estimator (2.5) is a good estimator if the intrinsic dimension of the training data and the dimension of the latent space N_z are sufficiently small, because the Wasserstein dimension d*_p is almost the same as the intrinsic dimension of the training data and the latent dimension. In Sec. 4.1, we numerically see that similar convergence laws hold even when the OTL is not the p-Wasserstein distance.
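As a small self-contained check (not taken from the paper) of the qualitative behavior in Corollary 5, one can observe numerically that the 1-Wasserstein distance between two empirical samples of the same one-dimensional distribution decays polynomially in the sample size M; scipy.stats.wasserstein_distance computes the 1D distance between empirical samples directly.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
for M in [100, 1_000, 10_000]:
    # Two independent empirical samples from the same 1D Gaussian.
    vals = [wasserstein_distance(rng.normal(size=M), rng.normal(size=M))
            for _ in range(20)]
    print(f"M = {M:6d}   average W_1 ~ {np.mean(vals):.4f}")
# The averages decay polynomially with M (roughly like M^{-1/2} in this
# one-dimensional example), illustrating the scaling behavior of Corollary 5.
```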
3 Learning algorithm of generative model for quantum ensemble
In Sec. 3.1, we define the new quantum OTL that can be suitably used in the learning algorithm of the generative
model for quantum ensemble. The learning algorithm is provided in Sec. 3.2.
3.1 Optimal transport loss with local ground cost
Our idea is to directly use Eq. (2.5), yet with a ground cost for quantum states, c(|ψ⟩, |φ⟩), rather than that for classical data vectors, c(x, y). This actually enables us to define the OTL between quantum ensembles {|ψ_i⟩} and {|φ_j⟩}, as follows:
\[
\mathcal{L}_c\big(\{|\psi_i\rangle\}, \{|\phi_j\rangle\}\big)
= \min_{\{\pi_{i,j}\}} \sum_{i,j} c\big(|\psi_i\rangle, |\phi_j\rangle\big)\, \pi_{i,j},
\quad \text{subject to} \quad
\sum_{i} \pi_{i,j} = q_j, \;\;
\sum_{j} \pi_{i,j} = p_i, \;\;
\pi_{i,j} \ge 0, \qquad (3.1)
\]
where p_i and q_j are the probabilities that |ψ_i⟩ and |φ_j⟩ appear, respectively. Note that we could instead define the transport loss between the corresponding mixed states Σ_i p_i|ψ_i⟩⟨ψ_i| and Σ_i q_i|φ_i⟩⟨φ_i|, or some modification of them, as discussed in [17]; but as mentioned in Sec. 1, such a mixed state loses the original configuration of the ensemble (e.g., single-qubit pure states uniformly distributed on the equator of the Bloch sphere) and thus is not applicable to our purpose.
Then our question is how to define the ground cost c(|ψ⟩, |φ⟩). An immediate choice might be the trace distance:
Definition 7 (Trace distance for pure states [38]).
\[
c_{\mathrm{tr}}(|\psi\rangle, |\phi\rangle) = \sqrt{1 - |\langle\psi|\phi\rangle|^2}. \qquad (3.2)
\]
Because the trace distance satisfies the axioms of a metric, we can define the p-Wasserstein distance for quantum ensembles, W_p({|ψ_i⟩}, {|φ_j⟩}) = L_{d^p}({|ψ_i⟩}, {|φ_j⟩})^{1/p}, which allows us to use the useful properties described in Corollary 5. It is also notable that the trace distance is relatively easy to compute on a quantum computer, using e.g. the swap test [39] or the inversion test [40].
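For reference, the pure-state trace distance (3.2) can be evaluated classically from statevectors as in the following sketch; on hardware it would instead be estimated via, e.g., the swap test, whose circuit is not reproduced here. The function name trace_distance_pure is an assumption of this illustration.

```python
import numpy as np

def trace_distance_pure(psi, phi):
    """Ground cost (3.2) for pure states given as normalized statevectors."""
    fidelity = np.abs(np.vdot(psi, phi)) ** 2     # |<psi|phi>|^2
    return np.sqrt(max(0.0, 1.0 - fidelity))      # clip tiny negative round-off

# Example: two single-qubit states on the equator of the Bloch sphere.
theta1, theta2 = 0.0, np.pi / 3
psi = np.array([1, np.exp(1j * theta1)]) / np.sqrt(2)
phi = np.array([1, np.exp(1j * theta2)]) / np.sqrt(2)
print(trace_distance_pure(psi, phi))
```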
We now give an important remark. As will be formally described, our goal is to find a quantum circuit that produces a quantum ensemble (via changing latent variables) which best approximates a given quantum ensemble. This task can be executed by the gradient descent method for a parameterized quantum circuit, but a naive setting leads to the vanishing gradient issue, meaning that the gradient vector decays to zero exponentially fast with respect to the number of qubits [19]. There have been several proposals in the literature to mitigate this issue [41,42], but a common prerequisite is that the cost should be a local one. To explain the meaning, let us consider the case where |φ⟩ is given by |φ⟩ = U|0⟩^{⊗n}, where U is a unitary matrix (which will be a parameterized unitary matrix U(θ) defining the generative model) and n is the number of qubits. Then the trace distance is based on the fidelity |⟨ψ|φ⟩|^2 = |⟨ψ|U|0⟩^{⊗n}|^2. This is the probability of getting all zeros via a global measurement on the state U†|ψ⟩ in the computational basis, which thus means that the trace distance is a global cost; accordingly, the fidelity-based learning method suffers from the vanishing gradient issue. On the other hand, we find that the following cost function is based on localized fidelity measurements.
Definition 8 (Ground cost for quantum states with only local measurements [43,44]).
\[
c_{\mathrm{local}}(|\psi\rangle, |\phi\rangle) = c_{\mathrm{local}}\big(|\psi\rangle, U|0\rangle^{\otimes n}\big)
= \sqrt{\frac{1}{n} \sum_{k=1}^{n} \big(1 - p^{(k)}\big)}, \qquad
p^{(k)} = \mathrm{Tr}\big[P_0^k\, U^\dagger |\psi\rangle\langle\psi|\, U\big], \quad
P_0^k = I_1 \otimes I_2 \otimes \cdots \otimes \underbrace{|0\rangle\langle 0|_k}_{k\text{-th qubit}} \otimes \cdots \otimes I_n, \qquad (3.3)
\]
where n is the number of qubits. Also, I_i and |0⟩⟨0|_i denote the identity operator and the projection operator acting on the i-th qubit, respectively; thus p^{(k)} represents the probability of obtaining 0 when observing the k-th qubit.
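As an illustration (not part of the paper), the local cost (3.3) can be evaluated by exact statevector simulation as sketched below; the qubit-ordering convention (qubit 0 as the first tensor factor) and the function name c_local are assumptions of this sketch.

```python
import numpy as np

def c_local(psi, U):
    """Local ground cost (3.3): psi is a 2^n statevector, U the circuit unitary
    with |phi> = U|0...0>. Computed here by exact statevector simulation."""
    n = int(np.log2(len(psi)))
    chi = U.conj().T @ psi                  # state U^dagger |psi>
    probs = np.abs(chi) ** 2                # computational-basis probabilities
    probs = probs.reshape([2] * n)          # one axis per qubit (qubit 0 first)
    total = 0.0
    for k in range(n):
        # p^(k): probability that qubit k is measured in |0>
        p_k = np.take(probs, 0, axis=k).sum()
        total += 1.0 - p_k
    return np.sqrt(total / n)

# Sanity check: if U|0...0> equals |psi>, the cost is zero.
n = 3
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.normal(size=(2**n, 2**n))
                 + 1j * rng.normal(size=(2**n, 2**n)))[0]   # random unitary
psi = U[:, 0]                               # U|0...0> is the first column of U
print(np.round(c_local(psi, U), 10))        # ~0.0
```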
Equation (3.3) is certainly a local cost, and thus it may be used to realize effective learning free from the vanishing gradient issue, provided that some additional conditions (which will be described in Section 5) are satisfied. Importantly, however, c_local(|ψ⟩, |φ⟩) is not a distance between the two quantum states, because it is not symmetric and does not satisfy the triangle inequality, whereas the trace distance (3.2) satisfies the axioms of a distance. Yet c_local(|ψ⟩, |φ⟩) is always non-negative and becomes zero only when |ψ⟩ = |φ⟩, meaning that c_local(|ψ⟩, |φ⟩) functions as a divergence. Then we can prove that, in general, an OTL defined with a divergence ground cost also functions as a divergence, as follows. The proof is given in Appendix A.
Proposition 9. When the ground cost c(x, y) is a divergence satisfying
\[
c(x, y) \ge 0, \qquad c(x, y) = 0 \;\;\text{iff}\;\; x = y, \qquad (3.4)
\]
then the OTL L_c(α, β) with this ground cost is also a divergence. That is, L_c(α, β) satisfies the following properties for arbitrary probability distributions α and β:
\[
\mathcal{L}_c(\alpha, \beta) \ge 0, \qquad \mathcal{L}_c(\alpha, \beta) = 0 \;\;\text{iff}\;\; \alpha = \beta. \qquad (3.5)
\]
Therefore, the OTL L_c({|ψ_i⟩}, {|φ_j⟩}) given in Eq. (3.1) with the local ground cost c_local(|ψ⟩, |φ⟩) given in Eq. (3.3) functions as a divergence. This means that L_c({|ψ_i⟩}, {|φ_j⟩}) can be suitably used for evaluating the difference between a given quantum ensemble and the set of output states of the generative model. At the same time, recall that, for the purpose of avoiding the vanishing gradient issue, we had to give up using the fidelity measure and accordingly the distance property of the OTL. Hence we cannot directly use the desirable properties described in Theorem 4, Corollary 5, and Corollary 6; nonetheless, in Section 4, we will discuss whether similar properties hold even for the divergence measure.
3.2 Learning Algorithm
The goal of our task is to train an implicit generative model so that it outputs a quantum ensemble approximating a given ensemble {|ψ_i⟩}_{i=1}^M; that is, our generative model contains tunable parameters and latent variables, as in the classical case described in Section 2.1. In this paper, we employ the following implicit generative model:
Definition 10 (Implicit generative model on a quantum circuit). Using the initial state |0⟩^{⊗n} and the parameterized quantum circuit U(z, θ), the implicit generative model on a quantum circuit is defined as
\[
|\phi_\theta(z)\rangle = U(z, \theta)\, |0\rangle^{\otimes n}. \qquad (3.6)
\]
Here θ is the vector of tunable parameters and z is the vector of latent variables that follow a known probability distribution; both θ and z are encoded in the rotation angles of rotation gates in U(z, θ).
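A minimal sketch of such a circuit model, written here with Qiskit for concreteness, is shown below; the specific gate layout (one data-reuploading-free layer of RY rotations and CZ entanglers) is an illustrative assumption and not the ansatz used in the paper.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

def model_state(z, theta, n):
    """|phi_theta(z)> = U(z, theta)|0>^n: latent variables z and parameters
    theta are both fed in as single-qubit rotation angles."""
    qc = QuantumCircuit(n)
    # Encode the latent variables z (dimension <= n assumed here).
    for q, zq in enumerate(z):
        qc.ry(zq, q)
    # One hardware-efficient-style layer of trainable rotations + entanglers.
    for q in range(n):
        qc.ry(theta[q], q)
    for q in range(n - 1):
        qc.cz(q, q + 1)
    for q in range(n):
        qc.ry(theta[n + q], q)
    return Statevector(qc)

n = 4
rng = np.random.default_rng(0)
z = rng.uniform(0, 2 * np.pi, size=2)         # latent variables (N_z = 2)
theta = rng.uniform(0, 2 * np.pi, size=2 * n) # tunable parameters
phi = model_state(z, theta, n)                # a 2^n-dimensional pure state
```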
A similar circuit model is found in meta-VQE [45], which uses physical parameters such as the distance between atomic nuclei instead of random latent variables z. Also, the model proposed in [12] introduces the latent variables z as the computational basis of the initial state, in the form |φ_θ(z)⟩ = U(θ)|z⟩; however, in this model, states with different latent variables are always orthogonal to each other, and thus the model cannot capture a small change of state in the Hilbert space via changing the latent variables. In contrast, the model (3.6) fulfills this purpose as long as the expressivity of the state with respect to z is sufficient. In addition, our model is advantageous in that the analytical derivative is available via the parameter-shift rule [46,47] not only for the tunable parameters θ but also for the latent variables z. This feature will be effectively utilized in the anomaly detection problem in Sec. 5.
Next, as for the learning cost, we take the following empirical estimator of the OTL, calculated from the training data {|ψ_i⟩}_{i=1}^M and the samples of latent variables {z_j}_{j=1}^{M_g}:
\[
\mathcal{L}_{c_{\mathrm{local}}}\big(\{|\psi_i\rangle\}_{i=1}^{M}, \{|\phi_\theta(z_j)\rangle\}_{j=1}^{M_g}\big)
= \min_{\{\pi_{i,j}\}_{i,j=1}^{M, M_g}} \sum_{i=1}^{M} \sum_{j=1}^{M_g} c_{\mathrm{local},i,j}\, \pi_{i,j},
\quad \text{subject to} \quad
\sum_{i=1}^{M} \pi_{i,j} = \frac{1}{M_g}, \;\;
\sum_{j=1}^{M_g} \pi_{i,j} = \frac{1}{M}, \;\;
\pi_{i,j} \ge 0, \qquad (3.7)
\]
where c_{local,i,j} is the ground cost given by
\[
c_{\mathrm{local},i,j} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \big(1 - p^{(k)}_{i,j}\big)}, \qquad
p^{(k)}_{i,j} = \mathrm{Tr}\big[P_0^k\, U(z_j, \theta)^\dagger |\psi_i\rangle\langle\psi_i|\, U(z_j, \theta)\big], \quad
P_0^k = I_1 \otimes I_2 \otimes \cdots \otimes \underbrace{|0\rangle\langle 0|_k}_{k\text{-th qubit}} \otimes \cdots \otimes I_n. \qquad (3.8)
\]
Note that in practice c_{local,i,j} is estimated with a finite number of measurements (shots); we denote by c̃^{(N_s)}_{local,i,j} the estimator of the ideal value c_{local,i,j} obtained with N_s shots, and in this case the OTL is denoted as L_{c̃^{(N_s)}_{local}}.
Based on the OTL (3.7), the pseudo-code of the proposed algorithm is shown in Algorithm 1. The total number of training quantum states |ψ_i⟩ required for a parameter update is of the order O(M M_g N_s) in step 3, and O(max(M, M_g) N_s N_p) in step 5, since the parameter-shift rule [46,47] is applicable.
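To make the overall loss evaluation concrete, the following schematic combines the pieces above into one evaluation of L_{c_local} in Eq. (3.7) by exact statevector simulation, with the OT subproblem solved via linear programming. This is a sketch under stated assumptions, not the paper's Algorithm 1: the helper names (c_local, otl, loss, toy_U) are hypothetical, shot noise is ignored, and the parameter-shift gradient step is omitted.

```python
import numpy as np
from scipy.optimize import linprog

def c_local(psi, U):
    """Local ground cost (3.3)/(3.8) by exact statevector simulation."""
    n = int(np.log2(len(psi)))
    probs = (np.abs(U.conj().T @ psi) ** 2).reshape([2] * n)
    return np.sqrt(sum(1.0 - np.take(probs, 0, axis=k).sum()
                       for k in range(n)) / n)

def otl(cost_matrix):
    """Solve the discrete OT problem (3.7) with uniform marginals."""
    M, Mg = cost_matrix.shape
    A_eq = np.zeros((M + Mg, M * Mg))
    for i in range(M):
        A_eq[i, i * Mg:(i + 1) * Mg] = 1.0          # sum_j pi_ij = 1/M
    for j in range(Mg):
        A_eq[M + j, j::Mg] = 1.0                    # sum_i pi_ij = 1/Mg
    b_eq = np.concatenate([np.full(M, 1.0 / M), np.full(Mg, 1.0 / Mg)])
    res = linprog(cost_matrix.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

def loss(training_states, circuit_unitary, theta, latent_samples):
    """One evaluation of L_{c_local} in Eq. (3.7); circuit_unitary(z, theta)
    is any function returning the 2^n x 2^n unitary U(z, theta)."""
    C = np.array([[c_local(psi, circuit_unitary(z, theta))
                   for z in latent_samples] for psi in training_states])
    return otl(C)

# Toy usage with U(z, theta) = RY(theta_q + z) acting on each of n qubits.
def ry(a):
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]])

def toy_U(z, theta):
    U = np.eye(1)
    for t in theta:
        U = np.kron(U, ry(t + z[0]))
    return U

rng = np.random.default_rng(0)
theta_true, theta_model = rng.uniform(0, np.pi, 3), rng.uniform(0, np.pi, 3)
train_states = [toy_U([z], theta_true)[:, 0] for z in rng.uniform(0, np.pi, 5)]
latent = [[z] for z in rng.uniform(0, np.pi, 4)]
print(loss(train_states, toy_U, theta_model, latent))
```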