Generative model for learning quantum ensemble
via optimal transport loss
Hiroyuki Tezuka,^{1,2,3} Shumpei Uno,^{2,4} and Naoki Yamamoto^{2,5}
1Sony Group Corporation, 1-7-1 Konan, Minato-ku, Tokyo, 108-0075, Japan
2Quantum Computing Center, Keio University, Hiyoshi 3-14-1, Kohoku-ku, Yokohama 223-8522, Japan
3Graduate School of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa, 223-8522, Japan
4Mizuho Research & Technologies, Ltd., 2-3 Kanda-Nishikicho, Chiyoda-ku, Tokyo, 101-8443, Japan
5Department of Applied Physics and Physico-Informatics, Keio University, Hiyoshi 3-14-1, Kohoku-ku, Yokohama
223-8522, Japan
Abstract
Generative modeling is an unsupervised machine learning framework that exhibits strong performance in various machine learning tasks. Recently, several quantum versions of generative models have been proposed, some of which are even proven to have a quantum advantage. However, those methods are not directly applicable to constructing a generative model for learning a set of quantum states, i.e., a quantum ensemble. In this paper, we propose a quantum generative model that can learn a quantum ensemble in an unsupervised machine learning framework. The key idea is to introduce a new loss function based on the optimal transport loss, which has been widely used in classical machine learning due to its several good properties, e.g., no need to ensure a common support of the two ensembles. We then give an in-depth analysis of this measure, such as the scaling property of the approximation error. We also demonstrate the generative modeling with an application to a quantum anomaly detection problem, which cannot be handled via existing methods. The proposed model paves the way for a wide range of applications, such as the health check of quantum devices and efficient initialization of quantum computation.
1 Introduction
With the recent great progress of quantum algorithms for both noisy near-term and future fault-tolerant quantum devices, quantum machine learning (QML) in particular has attracted huge attention. QML is largely categorized into two regimes in view of the type of data, which can be roughly called classical data and quantum data. The former has the conventional meaning used in the classical case; in the supervised learning scenario, e.g., a quantum system is trained to give a prediction for given classical data such as an image. As for the latter, on the other hand, the task is to predict some property of a given quantum state drawn from a set of states, e.g., the phase of a many-body quantum state, again in the supervised learning scenario. Due to the obvious difficulty of directly representing a huge quantum state classically, some quantum advantages have been proven in QML for quantum data [1-3].
In the above paragraph we used the supervised learning setting to explain the difference between classical and quantum data. But the success of unsupervised learning in classical machine learning, particularly generative modeling, is of course notable; a variety of algorithms have demonstrated strong performance in several applications, such as image generation [4-6], molecular design [7], and anomaly detection [8]. Hence, it is quite reasonable that several quantum unsupervised learning algorithms have been actively developed, such as the quantum circuit Born machine (QCBM) [9,10], the quantum generative adversarial network (QGAN) [11,12], and the quantum autoencoder (QAE) [13,14]. Also, Ref. [12] studied the generative modeling problem for quantum data; the task is to construct a model quantum system producing a set of quantum states, i.e., a quantum ensemble, that approximates a given quantum ensemble. The model quantum system contains latent variables, the change of which corresponds to a change of the output quantum state of the system. In the classical case, such a generative model governed by latent variables is called an implicit model. It is known that, to efficiently train an implicit model, it is often preferable to minimize a distance between the model dataset and the training dataset, rather than minimizing, e.g., the divergence between two probability distributions. The optimal transport loss (OTL), which typically leads to the Wasserstein distance, is suitable for the purpose of measuring the distance between two datasets; actually, a quantum version of the Wasserstein distance was proposed in [15,16] and was applied to construct a generative model for quantum ensembles in the QGAN framework [17,18].
Along this line of research, in this paper we also focus on the generative modeling problem for quantum ensembles. We are motivated by the fact that the above-mentioned existing works employed the Wasserstein distance defined for two mixed quantum states corresponding to the training and model quantum ensembles, where each mixed state
is obtained by compressing all elements of the quantum ensemble into a single mixed state. This is clearly problematic, because this compression process loses a lot of information about the ensemble; for instance, single-qubit pure states uniformly distributed on the equator of the Bloch sphere may be compressed to the maximally mixed state, from which the original ensemble clearly cannot be recovered. Accordingly, it is obvious that learning a single mixed state produced from the training ensemble does not lead to a model system that can approximate the original training ensemble.
In this paper, hence, we propose a new quantum OTL, which directly measures the difference between two quantum ensembles. The generative model can then be obtained by minimizing this quantum OTL between a training quantum ensemble and the ensemble of pure quantum states produced from the model. As the generative model, we use a parameterized quantum circuit (PQC) that contains tuning parameters and latent variables, both of which are encoded in the angles of single-qubit rotation gates. A notable feature of the proposed OTL is that it takes the form of a sum of local functions, each operating on a few neighboring qubits. This condition (i.e., the locality of the cost) is indeed necessary to train the model without suffering from the so-called vanishing gradient issue [19], meaning that the gradient vector with respect to the parameters decreases exponentially fast as the number of qubits increases.
Using the proposed quantum OTL, which will be formally defined in Section 3, we show the following results. The first result is given in Section 4, which provides a performance analysis of the OTL and its gradient from several aspects, e.g., the scaling properties of the OTL as a function of the number of training data and the number of measurements. We also numerically confirm that the gradient of the OTL is certainly free from the vanishing gradient issue. The second result is provided in Section 5, showing some examples of constructing a generative model for a quantum ensemble by minimizing the OTL. This demonstration includes the application of the quantum generative model to an anomaly detection problem for quantum data; that is, once a generative model is constructed by learning a given quantum ensemble, it can be used to detect an anomalous quantum state by measuring the distance of this state to the output ensemble of the model. Section 6 gives concluding remarks, and some supporting materials, including the proofs of theorems, are given in the Appendix.
2 Preliminaries
In this section, we first review the implicit generative model for classical machine learning problems in Sec. 2.1. Next, Sec. 2.2 is devoted to describing the general OTL, which can be effectively used as a cost function to train a generative model.
2.1 Implicit Generative Model
The generative model is used to approximate an unknown probability distribution that produces a given training
dataset. The basic strategy to construct a generative model is as follows: assuming a probability distribution α(x) behind the given training dataset {x_i}_{i=1}^M ∈ X^M, where X denotes the space of random variables, we prepare a parameterized probability distribution β_θ(x) and learn the parameters θ that minimize an appropriate loss function defined on the training dataset.
In general, generative models are categorized into two types: prescribed models and implicit models. Prescribed generative modeling explicitly defines β_θ(x) with the parameters θ. Then we can calculate the log-likelihood function of β_θ(x_i), and the parameters are determined by the maximum likelihood method, which corresponds to minimizing the Kullback-Leibler divergence KL(α‖β_θ). On the other hand, implicit generative modeling does not give us an explicit form of β_θ(x). An important feature of the implicit generative model is that it can easily describe a probability distribution whose random variables are confined to a hidden low-dimensional manifold; also, the data-generation process can be interpreted as a physical process from a latent variable to the data [20]. Examples of implicit generative models include Variational Auto-Encoders [21], Generative Adversarial Networks [22], and flow-based methods [23]. This paper focuses on the implicit generative model.
In the implicit generative model we usually assume that the distribution behind the training data resides on a relatively low-dimensional manifold. That is, an implicit generative model is expressed as a map of a random latent variable z onto X; z resides in a latent space Z whose dimension N_z is significantly smaller than that of the sample space, N_x. The latent random variable z follows a known distribution γ(z) such as a uniform distribution or a Gaussian distribution. That is, the implicit model distribution is given by β_θ = G_θ#γ, where # is called the push-forward operator [24], which moves the distribution γ on Z to a probability distribution on X through the map G_θ. This implicit generative model is trained so that the set of samples generated from the model distribution is close to the set of training data, by adjusting the parameters θ to minimize some appropriate cost function L as follows:
\[
\theta^\star = \arg\min_{\theta} \mathcal{L}\big(\hat{\alpha}_M, \hat{\beta}_{\theta, M_g}\big). \qquad (2.1)
\]
α̂_M(x) and β̂_{θ,M_g} = G_θ#γ̂_{M_g}(z) denote the empirical distributions defined with the sampled data {x_i}_{i=1}^M and {z_i}_{i=1}^{M_g}, which follow the probability distributions α(x) and γ(z), respectively:
\[
\hat{\alpha}_M(x) = \frac{1}{M} \sum_{i=1}^{M} \delta(x - x_i), \qquad
\hat{\gamma}_{M_g}(z) = \frac{1}{M_g} \sum_{i=1}^{M_g} \delta(z - z_i). \qquad (2.2)
\]
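As an illustration (not part of the original formulation), the sampling step of such an implicit model can be sketched classically as a push-forward of Gaussian latent samples through a map G_θ. The map below is a toy nonlinear choice; the dimensions, parameter shapes, and the function name G_theta are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N_z, N_x = 2, 8          # latent and sample-space dimensions (N_z << N_x)
M_g = 100                # number of generated samples

# Hypothetical model parameters theta = (W, b); in practice these are learned.
W = rng.normal(size=(N_x, N_z))
b = rng.normal(size=N_x)

def G_theta(z):
    """Push-forward map G_theta: Z -> X (a toy nonlinear choice)."""
    return np.tanh(W @ z + b)

# Sample latent variables z_j ~ gamma(z) (here a standard Gaussian) and push
# them forward; the resulting set defines the empirical distribution beta_hat.
Z = rng.normal(size=(M_g, N_z))
model_samples = np.array([G_theta(z) for z in Z])   # shape (M_g, N_x)
```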
2.2 Optimal Transport Loss
The OTL is used in various fields such as image analysis, natural language processing, and finance [24-27]. In particular, the OTL is widely used as a loss function in generative modeling, mainly because it is applicable even when the supports of the probability distributions do not match, and it can naturally incorporate the distance in the sample space X [28-33]. The OTL is defined as the minimum cost of moving a probability distribution α to another distribution β:
Definition 1 (Optimal Transport Loss [34]).
\[
\mathcal{L}_c(\alpha, \beta) = \min_{\pi} \int c(x, y)\, d\pi(x, y),
\quad \text{subject to} \quad
\int \pi(x, y)\, dx = \beta(y), \;\;
\int \pi(x, y)\, dy = \alpha(x), \;\;
\pi(x, y) \ge 0, \qquad (2.3)
\]
where c(x, y) ≥ 0 is a non-negative function on X × X that represents the transport cost from x to y, and is called the ground cost. Also, we call the set of couplings π that minimizes L_c(α, β) the optimal transport plan.
In general, the OTL does not meet the axioms of metric between probability distributions; but it does when the
ground cost is represented in terms of a metric function as follows:
Definition 2 (p-Wasserstein distance [35]). When the ground cost c(x, y) is expressed as c(x, y) = d(x, y)^p with a metric function d(x, y) and a real positive constant p, the p-Wasserstein distance is defined as
\[
\mathcal{W}_p(\alpha, \beta) = \mathcal{L}_{d^p}(\alpha, \beta)^{1/p}. \qquad (2.4)
\]
The p-Wasserstein distance satisfies the conditions of a metric between probability distributions. That is, for arbitrary probability distributions α, β, γ, the p-Wasserstein distance W_p satisfies W_p(α, β) ≥ 0 and W_p(α, β) = W_p(β, α); also it satisfies W_p(α, β) = 0 ⇔ α = β and the triangle inequality W_p(α, γ) ≤ W_p(α, β) + W_p(β, γ).
In general it is difficult to directly handle the probability distributions α and β_θ to minimize the OTL L_c(α, β_θ). Instead, as mentioned in Eq. (2.1), we minimize an approximation of the OTL via the empirical distributions (2.2):
Definition 3 (Empirical estimator for optimal transport loss [35]).
\[
\mathcal{L}_c\big(\hat{\alpha}_M, \hat{\beta}_{\theta, M_g}\big)
= \min_{\{\pi_{i,j}\}_{i,j=1}^{M, M_g}} \sum_{i=1}^{M} \sum_{j=1}^{M_g} c\big(x_i, G_\theta(z_j)\big)\, \pi_{i,j},
\quad \text{subject to} \quad
\sum_{i=1}^{M} \pi_{i,j} = \frac{1}{M_g}, \;\;
\sum_{j=1}^{M_g} \pi_{i,j} = \frac{1}{M}, \;\;
\pi_{i,j} \ge 0. \qquad (2.5)
\]
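The estimator (2.5) is a finite linear program and can be solved with standard tools. The following is a minimal sketch (not the authors' implementation) using NumPy and SciPy, with a squared Euclidean ground cost chosen only for illustration; the function name empirical_otl is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

def empirical_otl(X, Y, ground_cost):
    """Empirical OT loss (2.5) between samples X (M, d) and Y (M_g, d),
    with uniform weights 1/M and 1/M_g, solved as a linear program."""
    M, Mg = len(X), len(Y)
    # Cost matrix C[i, j] = c(x_i, G_theta(z_j)); here Y plays the role of the
    # generated samples G_theta(z_j).
    C = np.array([[ground_cost(x, y) for y in Y] for x in X])

    # Decision variable: pi flattened row-major, pi[i, j] -> index i*Mg + j.
    A_eq = np.zeros((Mg + M, M * Mg))
    for j in range(Mg):
        A_eq[j, j::Mg] = 1.0                       # sum_i pi[i, j] = 1/Mg
    for i in range(M):
        A_eq[Mg + i, i * Mg:(i + 1) * Mg] = 1.0    # sum_j pi[i, j] = 1/M
    b_eq = np.concatenate([np.full(Mg, 1.0 / Mg), np.full(M, 1.0 / M)])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# Example: squared Euclidean ground cost between two small point clouds.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
Y = rng.normal(loc=0.5, size=(30, 2))
print(empirical_otl(X, Y, lambda x, y: np.sum((x - y) ** 2)))
```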
The empirical estimator converges as L_c(α̂_M, β̂_{θ,M}) → L_c(α, β_θ) in the limit M = M_g → ∞. In general, the speed of this convergence is of the order O(M^{-1/N_x}), with N_x the dimension of the sample space X [36], but the p-Wasserstein distance enjoys the following convergence law [37].
Theorem 4 (Convergence rate of the p-Wasserstein distance). For the upper Wasserstein dimension d*_p(α) (given in Definition 4 of [37]) of the probability distribution α, the following holds when s is larger than d*_p(α):
\[
\mathbb{E}\big[\mathcal{W}_p(\alpha, \hat{\alpha}_M)\big] \lesssim O\big(M^{-1/s}\big), \qquad (2.6)
\]
where the expectation E is taken with respect to the samples drawn from the empirical distribution α̂_M.
Intuitively, the upper Wasserstein dimension d*_p(α) can be interpreted as the support dimension of the probability distribution α, which corresponds to the dimension of the latent space, N_z, in the implicit generative model. Exploiting the metric properties of the p-Wasserstein distance, the following corollaries are immediately derived from Theorem 4:
Corollary 5 (Convergence rate of the p-Wasserstein distance between empirical distributions sampled from a common distribution). Let α̂_{1,M} and α̂_{2,M} be two different empirical distributions sampled from a common distribution α, each consisting of M samples. Then the following holds for s > d*_p(α):
\[
\mathbb{E}\big[\mathcal{W}_p(\hat{\alpha}_{1,M}, \hat{\alpha}_{2,M})\big] \lesssim O\big(M^{-1/s}\big), \qquad (2.7)
\]
where the expectation E is taken with respect to the samples drawn from the empirical distributions α̂_{1,M} and α̂_{2,M}.
Corollary 6 (Convergence rate of the p-Wasserstein distance between different empirical distributions). Suppose that the upper Wasserstein dimensions of the probability distributions α and β_θ are at most d*_p; then the following holds for s > d*_p:
\[
\mathbb{E}\big[\mathcal{W}_p(\alpha, \beta_\theta) - \mathcal{W}_p(\hat{\alpha}_M, \hat{\beta}_{\theta, M})\big] \lesssim O\big(M^{-1/s}\big), \qquad (2.8)
\]
where the expectation E is taken with respect to the samples drawn from the empirical distributions α̂_M and β̂_{θ,M}.
These corollaries indicate that the empirical estimator (2.5) is a good estimator if the intrinsic dimension of the training data and the dimension of the latent space N_z are sufficiently small, because the Wasserstein dimension d*_p is almost the same as the intrinsic dimension of the training data and the latent dimension. In Sec. 4.1, we numerically see that similar convergence laws hold even when the OTL is not the p-Wasserstein distance.
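As a small self-contained check (not taken from the paper) of the qualitative behavior in Corollary 5, one can observe numerically that the 1-Wasserstein distance between two empirical samples of the same one-dimensional distribution decays polynomially in the sample size M; scipy.stats.wasserstein_distance computes the 1D distance between empirical samples directly.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
for M in [100, 1_000, 10_000]:
    # Two independent empirical samples from the same 1D Gaussian.
    vals = [wasserstein_distance(rng.normal(size=M), rng.normal(size=M))
            for _ in range(20)]
    print(f"M = {M:6d}   average W_1 ~ {np.mean(vals):.4f}")
# The averages decay polynomially with M (roughly like M^{-1/2} in this
# one-dimensional example), illustrating the scaling behavior of Corollary 5.
```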
3 Learning algorithm of generative model for quantum ensemble
In Sec. 3.1, we define the new quantum OTL that can be suitably used in the learning algorithm of the generative
model for quantum ensemble. The learning algorithm is provided in Sec. 3.2.
3.1 Optimal transport loss with local ground cost
Our idea is to directly use Eq. (2.5), yet with a ground cost for quantum states, c(|ψ⟩, |φ⟩), rather than that for classical data vectors, c(x, y). This actually enables us to define the OTL between quantum ensembles {|ψ_i⟩} and {|φ_j⟩}, as follows:
\[
\mathcal{L}_c\big(\{|\psi_i\rangle\}, \{|\phi_j\rangle\}\big)
= \min_{\{\pi_{i,j}\}} \sum_{i,j} c\big(|\psi_i\rangle, |\phi_j\rangle\big)\, \pi_{i,j},
\quad \text{subject to} \quad
\sum_{i} \pi_{i,j} = q_j, \;\;
\sum_{j} \pi_{i,j} = p_i, \;\;
\pi_{i,j} \ge 0, \qquad (3.1)
\]
where p_i and q_j are the probabilities that |ψ_i⟩ and |φ_j⟩ appear, respectively. Note that we could instead define the transport loss between the corresponding mixed states Σ_i p_i|ψ_i⟩⟨ψ_i| and Σ_i q_i|φ_i⟩⟨φ_i|, or some modification of them, as discussed in [17]; but as mentioned in Sec. 1, such a mixed state loses the original configuration of the ensemble (e.g., single-qubit pure states uniformly distributed on the equator of the Bloch sphere) and thus is not applicable to our purpose.
Then our question is how to define the ground cost c(|ψ⟩, |φ⟩). An immediate choice might be the trace distance:
Definition 7 (Trace distance for pure states [38]).
\[
c_{\mathrm{tr}}(|\psi\rangle, |\phi\rangle) = \sqrt{1 - |\langle\psi|\phi\rangle|^2}. \qquad (3.2)
\]
Because the trace distance satisfies the axioms of a metric, we can define the p-Wasserstein distance for quantum ensembles, W_p({|ψ_i⟩}, {|φ_j⟩}) = L_{d^p}({|ψ_i⟩}, {|φ_j⟩})^{1/p}, which allows us to use the useful properties described in Corollary 5. It is also notable that the trace distance is relatively easy to compute on a quantum computer, using e.g. the swap test [39] or the inversion test [40].
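For reference, the pure-state trace distance (3.2) can be evaluated classically from statevectors as in the following sketch; on hardware it would instead be estimated via, e.g., the swap test, whose circuit is not reproduced here. The function name trace_distance_pure is an assumption of this illustration.

```python
import numpy as np

def trace_distance_pure(psi, phi):
    """Ground cost (3.2) for pure states given as normalized statevectors."""
    fidelity = np.abs(np.vdot(psi, phi)) ** 2     # |<psi|phi>|^2
    return np.sqrt(max(0.0, 1.0 - fidelity))      # clip tiny negative round-off

# Example: two single-qubit states on the equator of the Bloch sphere.
theta1, theta2 = 0.0, np.pi / 3
psi = np.array([1, np.exp(1j * theta1)]) / np.sqrt(2)
phi = np.array([1, np.exp(1j * theta2)]) / np.sqrt(2)
print(trace_distance_pure(psi, phi))
```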
We now give an important remark. As will be formally described, our goal is to find a quantum circuit that produces a quantum ensemble (via changing latent variables) which best approximates a given quantum ensemble. This task can be executed by the gradient descent method for a parameterized quantum circuit, but a naive setting leads to the vanishing gradient issue, meaning that the gradient vector decays to zero exponentially fast with respect to the number of qubits [19]. There have been several proposals in the literature to mitigate this issue [41,42], but a common prerequisite is that the cost should be a local one. To explain the meaning, let us consider the case where |φ⟩ is given by |φ⟩ = U|0⟩^{⊗n}, where U is a unitary matrix (which will be a parameterized unitary matrix U(θ) defining the generative model) and n is the number of qubits. Then the trace distance is based on the fidelity |⟨ψ|φ⟩|^2 = |⟨ψ|U|0⟩^{⊗n}|^2. This is the probability of getting all zeros via a global measurement on the state U†|ψ⟩ in the computational basis, which thus means that the trace distance is a global cost; accordingly, the fidelity-based learning method suffers from the vanishing gradient issue. On the other hand, we find that the following cost function is based on localized fidelity measurements.
Definition 8 (Ground cost for quantum states with only local measurements [43,44]).
\[
c_{\mathrm{local}}(|\psi\rangle, |\phi\rangle) = c_{\mathrm{local}}\big(|\psi\rangle, U|0\rangle^{\otimes n}\big)
= \sqrt{\frac{1}{n} \sum_{k=1}^{n} \big(1 - p^{(k)}\big)}, \qquad
p^{(k)} = \mathrm{Tr}\big[P_0^k\, U^\dagger |\psi\rangle\langle\psi|\, U\big], \quad
P_0^k = I_1 \otimes I_2 \otimes \cdots \otimes \underbrace{|0\rangle\langle 0|_k}_{k\text{-th qubit}} \otimes \cdots \otimes I_n, \qquad (3.3)
\]
where n is the number of qubits. Also, I_i and |0⟩⟨0|_i denote the identity operator and the projection operator acting on the i-th qubit, respectively; thus p^{(k)} represents the probability of obtaining 0 when observing the k-th qubit.
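As an illustration (not part of the paper), the local cost (3.3) can be evaluated by exact statevector simulation as sketched below; the qubit-ordering convention (qubit 0 as the first tensor factor) and the function name c_local are assumptions of this sketch.

```python
import numpy as np

def c_local(psi, U):
    """Local ground cost (3.3): psi is a 2^n statevector, U the circuit unitary
    with |phi> = U|0...0>. Computed here by exact statevector simulation."""
    n = int(np.log2(len(psi)))
    chi = U.conj().T @ psi                  # state U^dagger |psi>
    probs = np.abs(chi) ** 2                # computational-basis probabilities
    probs = probs.reshape([2] * n)          # one axis per qubit (qubit 0 first)
    total = 0.0
    for k in range(n):
        # p^(k): probability that qubit k is measured in |0>
        p_k = np.take(probs, 0, axis=k).sum()
        total += 1.0 - p_k
    return np.sqrt(total / n)

# Sanity check: if U|0...0> equals |psi>, the cost is zero.
n = 3
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.normal(size=(2**n, 2**n))
                 + 1j * rng.normal(size=(2**n, 2**n)))[0]   # random unitary
psi = U[:, 0]                               # U|0...0> is the first column of U
print(np.round(c_local(psi, U), 10))        # ~0.0
```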
Equation (3.3) is certainly a local cost, and thus it may be used to realize effective learning free from the vanishing gradient issue, provided that some additional conditions (which will be described in Section 5) are satisfied. Importantly, however, c_local(|ψ⟩, |φ⟩) is not a distance between the two quantum states, because it is not symmetric and does not satisfy the triangle inequality, whereas the trace distance (3.2) satisfies the axioms of a distance. Yet c_local(|ψ⟩, |φ⟩) is always non-negative and becomes zero only when |ψ⟩ = |φ⟩, meaning that c_local(|ψ⟩, |φ⟩) functions as a divergence. Then we can prove that, in general, an OTL defined with a divergence ground cost also functions as a divergence, as follows. The proof is given in Appendix A.
Proposition 9. When the ground cost c(x, y) is a divergence satisfying
\[
c(x, y) \ge 0, \qquad c(x, y) = 0 \;\;\text{iff}\;\; x = y, \qquad (3.4)
\]
then the OTL L_c(α, β) with this ground cost is also a divergence. That is, L_c(α, β) satisfies the following properties for arbitrary probability distributions α and β:
\[
\mathcal{L}_c(\alpha, \beta) \ge 0, \qquad \mathcal{L}_c(\alpha, \beta) = 0 \;\;\text{iff}\;\; \alpha = \beta. \qquad (3.5)
\]
Therefore, the OTL L_c({|ψ_i⟩}, {|φ_j⟩}) given in Eq. (3.1) with the local ground cost c_local(|ψ⟩, |φ⟩) given in Eq. (3.3) functions as a divergence. This means that L_c({|ψ_i⟩}, {|φ_j⟩}) can be suitably used for evaluating the difference between a given quantum ensemble and the set of output states of the generative model. At the same time, recall that, for the purpose of avoiding the vanishing gradient issue, we had to give up using the fidelity measure and accordingly the distance property of the OTL. Hence we cannot directly use the desirable properties described in Theorem 4, Corollary 5, and Corollary 6; nonetheless, in Section 4, we will discuss whether similar properties hold even for the divergence measure.
3.2 Learning Algorithm
The goal of our task is to train an implicit generative model so that it outputs a quantum ensemble approximating a given ensemble {|ψ_i⟩}_{i=1}^M; that is, our generative model contains tunable parameters and latent variables, as in the classical case described in Section 2.1. In this paper, we employ the following implicit generative model:
Definition 10 (Implicit generative model on a quantum circuit). Using the initial state |0⟩^{⊗n} and the parameterized quantum circuit U(z, θ), the implicit generative model on a quantum circuit is defined as
\[
|\phi_\theta(z)\rangle = U(z, \theta)\, |0\rangle^{\otimes n}. \qquad (3.6)
\]
Here θ is the vector of tunable parameters and z is the vector of latent variables that follow a known probability distribution; both θ and z are encoded in the rotation angles of rotation gates in U(z, θ).
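A minimal sketch of such a circuit model, written here with Qiskit for concreteness, is shown below; the specific gate layout (one data-reuploading-free layer of RY rotations and CZ entanglers) is an illustrative assumption and not the ansatz used in the paper.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

def model_state(z, theta, n):
    """|phi_theta(z)> = U(z, theta)|0>^n: latent variables z and parameters
    theta are both fed in as single-qubit rotation angles."""
    qc = QuantumCircuit(n)
    # Encode the latent variables z (dimension <= n assumed here).
    for q, zq in enumerate(z):
        qc.ry(zq, q)
    # One hardware-efficient-style layer of trainable rotations + entanglers.
    for q in range(n):
        qc.ry(theta[q], q)
    for q in range(n - 1):
        qc.cz(q, q + 1)
    for q in range(n):
        qc.ry(theta[n + q], q)
    return Statevector(qc)

n = 4
rng = np.random.default_rng(0)
z = rng.uniform(0, 2 * np.pi, size=2)         # latent variables (N_z = 2)
theta = rng.uniform(0, 2 * np.pi, size=2 * n) # tunable parameters
phi = model_state(z, theta, n)                # a 2^n-dimensional pure state
```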
A similar circuit model is found in meta-VQE [45], which uses physical parameters such as the distance between atomic nuclei instead of random latent variables z. Also, the model proposed in [12] introduces the latent variables z as the computational basis of the initial state, in the form |φ_θ(z)⟩ = U(θ)|z⟩; however, in this model, states with different latent variables are always orthogonal to each other, and thus the model cannot capture a small change of state in the Hilbert space via changing the latent variables. In contrast, the model (3.6) fulfills this purpose as long as the expressivity of the state with respect to z is sufficient. In addition, our model is advantageous in that the analytical derivative is available via the parameter-shift rule [46,47] not only for the tunable parameters θ but also for the latent variables z. This feature will be effectively utilized in the anomaly detection problem in Sec. 5.
Next, as for the learning cost, we take the following empirical estimator of the OTL, calculated from the training data {|ψ_i⟩}_{i=1}^M and the samples of latent variables {z_j}_{j=1}^{M_g}:
\[
\mathcal{L}_{c_{\mathrm{local}}}\big(\{|\psi_i\rangle\}_{i=1}^{M}, \{|\phi_\theta(z_j)\rangle\}_{j=1}^{M_g}\big)
= \min_{\{\pi_{i,j}\}_{i,j=1}^{M, M_g}} \sum_{i=1}^{M} \sum_{j=1}^{M_g} c_{\mathrm{local},i,j}\, \pi_{i,j},
\quad \text{subject to} \quad
\sum_{i=1}^{M} \pi_{i,j} = \frac{1}{M_g}, \;\;
\sum_{j=1}^{M_g} \pi_{i,j} = \frac{1}{M}, \;\;
\pi_{i,j} \ge 0, \qquad (3.7)
\]
where c_{local,i,j} is the ground cost given by
\[
c_{\mathrm{local},i,j} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \big(1 - p^{(k)}_{i,j}\big)}, \qquad
p^{(k)}_{i,j} = \mathrm{Tr}\big[P_0^k\, U(z_j, \theta)^\dagger |\psi_i\rangle\langle\psi_i|\, U(z_j, \theta)\big], \quad
P_0^k = I_1 \otimes I_2 \otimes \cdots \otimes \underbrace{|0\rangle\langle 0|_k}_{k\text{-th qubit}} \otimes \cdots \otimes I_n. \qquad (3.8)
\]
Note that in practice c_{local,i,j} is estimated with a finite number of measurements (shots); we denote by c̃^{(N_s)}_{local,i,j} the estimator of the ideal value c_{local,i,j} obtained with N_s shots, and in this case the OTL is denoted as L_{c̃^{(N_s)}_{local}}.
Based on the OTL (3.7), the pseudo-code of the proposed algorithm is shown in Algorithm 1. The total number of training quantum states |ψ_i⟩ required for a parameter update is of the order O(M M_g N_s) in step 3, and O(max(M, M_g) N_s N_p) in step 5, since the parameter-shift rule [46,47] is applicable.
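To make the overall loss evaluation concrete, the following schematic combines the pieces above into one evaluation of L_{c_local} in Eq. (3.7) by exact statevector simulation, with the OT subproblem solved via linear programming. This is a sketch under stated assumptions, not the paper's Algorithm 1: the helper names (c_local, otl, loss, toy_U) are hypothetical, shot noise is ignored, and the parameter-shift gradient step is omitted.

```python
import numpy as np
from scipy.optimize import linprog

def c_local(psi, U):
    """Local ground cost (3.3)/(3.8) by exact statevector simulation."""
    n = int(np.log2(len(psi)))
    probs = (np.abs(U.conj().T @ psi) ** 2).reshape([2] * n)
    return np.sqrt(sum(1.0 - np.take(probs, 0, axis=k).sum()
                       for k in range(n)) / n)

def otl(cost_matrix):
    """Solve the discrete OT problem (3.7) with uniform marginals."""
    M, Mg = cost_matrix.shape
    A_eq = np.zeros((M + Mg, M * Mg))
    for i in range(M):
        A_eq[i, i * Mg:(i + 1) * Mg] = 1.0          # sum_j pi_ij = 1/M
    for j in range(Mg):
        A_eq[M + j, j::Mg] = 1.0                    # sum_i pi_ij = 1/Mg
    b_eq = np.concatenate([np.full(M, 1.0 / M), np.full(Mg, 1.0 / Mg)])
    res = linprog(cost_matrix.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

def loss(training_states, circuit_unitary, theta, latent_samples):
    """One evaluation of L_{c_local} in Eq. (3.7); circuit_unitary(z, theta)
    is any function returning the 2^n x 2^n unitary U(z, theta)."""
    C = np.array([[c_local(psi, circuit_unitary(z, theta))
                   for z in latent_samples] for psi in training_states])
    return otl(C)

# Toy usage with U(z, theta) = RY(theta_q + z) acting on each of n qubits.
def ry(a):
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]])

def toy_U(z, theta):
    U = np.eye(1)
    for t in theta:
        U = np.kron(U, ry(t + z[0]))
    return U

rng = np.random.default_rng(0)
theta_true, theta_model = rng.uniform(0, np.pi, 3), rng.uniform(0, np.pi, 3)
train_states = [toy_U([z], theta_true)[:, 0] for z in rng.uniform(0, np.pi, 5)]
latent = [[z] for z in rng.uniform(0, np.pi, 4)]
print(loss(train_states, toy_U, theta_model, latent))
```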