Deep Neural Networks as the Semi-classical Limit of Topological Quantum Neural
Networks:
The problem of generalisation
Antonino Marcianò,1 Emanuele Zappala,2 Tommaso Torda,3 Matteo Lulli,4 Stefano Giagu,3 Chris Fields,5 Deen Chen,6 and Filippo Fabrocini7

1 Center for Field Theory and Particle Physics & Department of Physics, Fudan University, Jingwan campus, Jingsan Rd, 200433 Shanghai, China; Laboratori Nazionali di Frascati INFN, Via Enrico Fermi, 54, 00044 Frascati (Rome), Italy, EU; and INFN sezione Roma “Tor Vergata”, 00133 Rome, Italy, EU
2 Department of Mathematics and Statistics, Idaho State University, Physical Science Complex, 921 S. 8th Ave., Stop 8085, Pocatello, ID 83209
3 Department of Physics, University of Rome “La Sapienza”, Piazzale Aldo Moro, 5, 00185 Rome, Italy; and Laboratori Nazionali di Frascati INFN, Via Enrico Fermi, 54, 00044 Frascati (Rome), Italy
4 Department of Mechanics and Aerospace Engineering, Southern University of Science and Technology, 1088 Xueyuan Avenue, 518055 Shenzhen, China
5 Allen Discovery Center, Tufts University, 200 College Avenue, 02155 Medford, MA, US
6 College of Design and Innovation, Tongji University, 281 Fuxin Rd, 200092 Shanghai, China
7 College of Design and Innovation, Tongji University, 281 Fuxin Rd, 200092 Shanghai, China; and Institute for Computing Applications "Mario Picone", Italy National Research Council, Via dei Taurini, 19, 00185 Rome, Italy, EU

Electronic addresses: marciano@fudan.edu.cn, emanuelezappala@isu.edu, tommaso.torda@uniroma1.it, lulli@sustech.edu.cn, fieldsres@gmail.com, deenchen@tongji.edu.cn, fabrocini@tongji.edu.cn
Deep Neural Networks lack a principled model of their operation. A novel framework for supervised learning based on Topological Quantum Field Theory, one that looks particularly well suited for implementation on quantum processors, has recently been explored. We propose using this framework to understand the problem of generalisation in Deep Neural Networks. More specifically, in this approach Deep Neural Networks are viewed as the semi-classical limit of Topological Quantum Neural Networks. A framework of this kind explains the overfitting behavior of Deep Neural Networks during the training step and their corresponding generalisation capabilities. We explore the paradigmatic case of the perceptron, which we implement as the semi-classical limit of Topological Quantum Neural Networks. We apply a novel algorithm we developed, showing that it obtains results similar to those of standard neural networks, but without the need for training (optimisation).
I. INTRODUCTION
The problem of generalisation for deep neural networks
(DNNs), i.e. neural networks with several hidden layers,
is the problem of understanding how DNNs can success-
fully generalise. DNNs can achieve test errors that are very close to their training errors even when they have sufficiently many model parameters to demonstrably “memorise” the training data [1–3]. In other words, the prob-
lem of generalisation is the problem of understanding why
(at least some) overparametrised DNNs do not fail to
generalise when they display an overfitting regime. Gen-
eralisation failure would be expected from the standard
bias-variance trade-off; however, many DNNs evince a
“double-descent” behavior, in which the generalisation
performance increases, instead of decreasing, as the num-
ber of model parameters grows [4]. As shown by Zhang
et al. [1], the methods of conventional generalisation the-
ory, i.e. statistical learning theory, demonstrably fail to
explain this behavior. The problem of generalisation has
generated a rich literature [5–17], but remains unsolved.
As Kevin Hartnett [18] has written in Quanta Magazine:
When we design a skyscraper, we expect
it will perform to specification: that the
tower will support so much weight and be
able to withstand an earthquake of a certain
strength. But with one of the most important
technologies of the modern world, we are ef-
fectively building blind. We play with differ-
ent designs, tinker with different setups, but
until we take it out for a test run, we do not
really know what it can do or where it will
fail.
We do not really know, in fact, after one or even sev-
eral test runs: it always remains the case that the “next”
generalisation problem will reveal failure. A principled
approach for studying generalisation in DNNs that goes
beyond conventional generalisation theory is, therefore,
needed.
Here we bring to completion our previous conjecture
[19] that DNN architectures can be considered the semi-
classical limit of a generalised quantum neural-network
(QNN) architecture, the topological quantum neural net-
work (TQNN). To this purpose, we elaborate on the
theoretical framework we proposed in [19], and provide
empirical evidence to support our thesis, ultimately de-
riving a novel answer to the problem of generalisation.
TQNNs differ from conventional QNNs [20,21] by allow-
ing the number of “layers” and their connection-topology
to vary arbitrarily, provided only that the input and out-
put boundary conditions are preserved. The full gen-
erality of the TQNN framework as a representation of
computational processes acting on classical data to yield
classical outputs has been recently proven [22]. We pro-
pose, in particular, that DNNs generalise because they
are classical limits of TQNNs. DNNs are, from this per-
spective, implementations of distributed, high-entropy,
error-correcting codes that can be seen as classical lim-
its of the quantum error-correcting codes (QECCs) [23]
implemented by TQNNs [24]. Increasing the model ca-
pacity of a DNN increases its generalisation ability be-
cause it increases the redundancy available in the code
space. We demonstrate, using the perceptron architec-
ture as a simplified example, that a TQNN can be used
to directly compute the parameters needed to generalise
from a training set, in the absence of any actual train-
ing of the network. This raises the possibility of replacing
data-intensive training of DNNs with quantum computa-
tions, with a significant increase in efficiency and decrease
in operational costs.
The plan of the paper is the following. In §II, we
provide a formal definition of generalisation (§II A), and
briefly review some expectations of conventional general-
isation theory and how they have been observed to fail
in DNNs (§II B). We then show, via a simple thought
experiment, how generalisation can be expected to im-
prove as parameters are added when a QECC is employed
(§II C). We provide a brief review of topological quantum
field theory (TQFT), the theoretical framework under-
lying TQNNs, in §III, and summarise the TQNN con-
struction in §IV. In §V we discuss the theoretical foun-
dations of the correspondence between generalisation and
the semi-classical limit of TQNNs. Specifically, in §V A
we introduce the concept of generalisation as the semi-
classical limit of TQNNs. In §V B we discuss the paradig-
matic case of the sum over quantum histories of a one-
particle state, considered in analogy with a perceptron.
We show that generalisation arises naturally from the
path-integral formalism. In §V C we extend this con-
struction to any TQNN, by rephrasing the path-integral
realising the sum over quantum histories for the percep-
tron, with generic spin-network states. In §VI we computationally showcase the theory of the perceptron as a semi-classical limit of a TQNN. In §VII we discuss possible per-
spectives to develop the current theoretical framework.
Finally, in §VIII we spell out our conclusions.
II. BACKGROUND AND MOTIVATION
A. The generalisation bound problem
Informally, generalisation is the ability to identify
members of some set (or class) D after being exposed to the members of some proper subset S ⊂ D. The generalisation ability of machine learning (ML) systems is tested by exposing them to new members of D after training via backpropagation (or a similar algorithm) on the members of S. Whether they have identified new members of D correctly is determined by human knowledge of D.
This knowledge is, however, not explicitly characterised:
if we had an algorithmic specification of the members of
D, we could encode that specification directly and ML
would not be necessary. We can make this dependence
on non-algorithmic human knowledge explicit by defining
generalisation as follows.
We can, without loss of generality, treat D as a set of N-bit strings, i.e. D = {0,1}^N. Let S ⊂ D be the training set. Let Y = {0,1}^M, with M ≤ N, be the set of possible “classes” or “answers” to be obtained by acting on D. In most cases of interest, M ≪ N, e.g. classifying handwritten characters, distinguishing cats from dogs, or identifying individual people from photographs.
Now consider two agents A and B, who implement functions f_A, f_B : D → Y and g_A, g_B : S → Y respectively, where we assume g_A = f_A|_S and g_B = f_B|_S. This latter assumption is true even if A and/or B is trained on a set S′ ⊇ S that also contains noise or distractors, as in [2]. We can call A the “ground truth” observer (e.g. a human) and B the “test” observer (e.g. a ML system). The “training error” is d(g_A, g_B), where d is some metric on the relevant function space, and the “test error” or “generalisation error” is d(f_A, f_B). We can measure d(g_A, g_B) because we can explicitly list every instance of g_A and g_B acting on S. Both f_A and f_B are unknown, so we cannot compute d(f_A, f_B).
We can now state the generalisation bound problem
(GBP) as: given d(g_A, g_B) ≤ ϵ, determine an upper bound β ≥ ϵ such that d(f_A, f_B) ≤ β. There is an a priori upper bound on β: 2^N, corresponding to A and B disagreeing on every instance in D. In this case, ϵ = 2^M. Our goal is to make β → ϵ as ϵ → 0, which corresponds to both optimal learning and S being optimally representative of D.
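To make the definitions above concrete, the following toy sketch instantiates D = {0,1}^N with N = 8, fixes an arbitrary ground-truth rule f_A and an arbitrary imperfect model f_B (both are illustrative assumptions, not part of the construction above), and uses the disagreement rate as one possible choice of the metric d. The training error is computable by listing S; the test error is computable here only because this toy D is fully enumerable.

```python
import itertools
import random

random.seed(0)

N = 8                                    # D = {0,1}^N
D = list(itertools.product([0, 1], repeat=N))

def f_A(x):                              # "ground truth" observer A (illustrative rule)
    return sum(x) % 2                    # parity of the bit string

def f_B(x):                              # "test" observer B (an imperfect model)
    return sum(x[:6]) % 2                # ignores the last two bits

S = random.sample(D, 32)                 # training set S, a proper subset of D

def d(u, v, X):                          # disagreement rate: one choice of the metric d
    return sum(u(x) != v(x) for x in X) / len(X)

train_error = d(f_A, f_B, S)             # d(g_A, g_B): computable, since S is finite and listed
test_error = d(f_A, f_B, D)              # d(f_A, f_B): computable only because this toy D is known

print(f"training error d(g_A, g_B) = {train_error:.3f}")
print(f"test error     d(f_A, f_B) = {test_error:.3f}")
```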
It is worth noting that the GBP cannot be solved in
the general case. The set D may not be computable; we know, for example, that the generalisation from some set S of programs that halt to the set D of all programs that
halt is not computable [25]. There is, moreover, every
reason to expect that human judges may disagree about
edge cases for some classes of objects, and hence about
membership in some candidate sets D, rendering test er-
ror in such cases observer-dependent. It is also worth
noting that humans are every bit as much black boxes
as any ML system — the “explanation problem” [26,27]
for humans is even harder than it is for ML systems,
because experimenting on humans is more difficult than
experimenting on machines.
B. How conventional generalisation theory fails to
explain DNN generalisation
The goal of generalisation theory is to explain and
justify why and how minimising the empirically mea-
surable training error d(g) = d(g_A, g_B) is a reasonable approach to minimising the test error d(f) = d(f_A, f_B) by analysing the generalisation gap, i.e. d(f) − d(g). In practice, the generalisation gap is typically taken to be f_B − g_B, i.e. the “gap” is taken to characterise the performance of the ML system being tested. This implicitly assumes that g_A = f_A, i.e. that the “ground truth” observer A can identify all elements of D; we have seen above that this assumption can be incorrect. With this assumption, the gap results entirely from the dependence of the trained predictor g_S (here we drop the “agent” sub-
script for convenience) on the training set S. However,
as noted above, S might not be sufficiently representative of D to direct the learner towards a good classifier for all
of D. In a ML context one is, moreover, more interested
in the generalisation error than in the empirical train-
ing error; the main goal is evaluating the performance
of the model on the unseen cases, i.e. on D\S. If we
look at the problem from this point of view, then the
challenge comes from the mismatch between the optimi-
sation task of minimising the empirical or training risk
and the machine learning task of minimising the true or
test risk. Treating D as a probability distribution over possible objects x, we can represent this mismatch as:

L_{D,f}(g_S) = P_{x∼D}[g_S(x) ≠ f(x)] = D[{x : g_S(x) ≠ f(x)}],

where the error due to using g_S is the probability of randomly drawing an example x, according to the distribution D, for which g_S(x) ≠ f(x), measured with respect to the probability distribution D and the correct labeling function f. Here P is the probability of a random variable, and x ∼ D represents sampling x according to D.
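As a minimal numerical illustration of this mismatch (with an invented distribution D, target f and predictor g_S, chosen only for the sketch), the snippet below memorises a finite training set perfectly, so the empirical risk vanishes, while a Monte Carlo estimate of L_{D,f}(g_S) stays close to chance level:

```python
import random

random.seed(1)

def sample_D():                              # x ~ D: here, uniform over {0,1}^16 (an assumption)
    return tuple(random.randint(0, 1) for _ in range(16))

f = lambda x: sum(x) % 2                     # correct labelling function f (illustrative)

S = [sample_D() for _ in range(200)]         # training set S
lookup = {x: f(x) for x in S}                # g_S memorises ("interpolates") S ...
g_S = lambda x: lookup.get(x, 0)             # ... and guesses 0 on unseen inputs

train_risk = sum(g_S(x) != f(x) for x in S) / len(S)
fresh = [sample_D() for _ in range(100_000)]
true_risk = sum(g_S(x) != f(x) for x in fresh) / len(fresh)   # Monte Carlo estimate of L_{D,f}(g_S)

print(f"empirical (training) risk = {train_risk:.3f}")        # 0.000: S is fitted perfectly
print(f"estimated true risk      ~= {true_risk:.3f}")         # ~0.5: no generalisation at all
```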
The dependence of the predictor g_S on the training data set S has led to the definition of several complexity measures providing bounds on the test error or on the sample complexity. Instances of such measures are the VC-dimension and the Rademacher complexity. In particular, while the VC-dimension provides an upper bound on the test error, the Rademacher complexity measures the richness of a class of predictors with respect to a probability distribution. These results provide bounds on the test error that depend on the complexity or capacity of the class of functions G from which g_S is drawn. In this sense, generalisation theory and capacity control are strictly related. Capacity control consists in using models that are rich enough to fit the data well, while avoiding models so rich that they overfit.
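For reference, one standard form of such bounds (the constants vary across references, so the expressions below should be read as indicative rather than as the sharpest available statements) is: with probability at least 1 − δ over the draw of an n-element training sample S, uniformly over predictors g in the class G,

```latex
L_{D}(g) \;\le\; L_{S}(g) + \sqrt{\frac{8\left(d_{\mathrm{VC}}(G)\,\ln(2en/d_{\mathrm{VC}}(G)) + \ln(4/\delta)\right)}{n}},
\qquad
L_{D}(g) \;\le\; L_{S}(g) + 2\,\mathfrak{R}_{n}(G) + \sqrt{\frac{\ln(1/\delta)}{2n}},
```

where L_D and L_S denote the true and empirical risks, d_VC(G) is the VC-dimension, and R_n(G) is the Rademacher complexity of G on samples of size n. Both bounds degrade as the capacity term grows relative to n, which is precisely the behaviour that overparametrised DNNs appear to defy.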
The empirical success of DNNs, which are typically
overparametrised models [28], challenges the traditional
complexity measures. Indeed, according, for instance, to the VC-dimension, the discrepancy between training error and generalisation error is bounded from above by a quantity that grows at least linearly as the number of adjustable parameters grows, but shrinks as the number of training examples increases. As a consequence, the traditional complexity measures tend to control the capacity by minimising the number of parameters [29–31]. Moreover, experimental results show that DNNs
that have been trained to interpolate the training data
achieve a near-optimal test result even when the train-
ing data have been corrupted by a massive amount of
noise [2]. As Poggio et al. [3] write, the main puzzle of DNNs revolves around the absence of overfitting despite
large overparametrisation and despite the large capacity
demonstrated by zero training error on randomly labeled
data. The puzzle is, in other words, the fact that DNNs can implement an operation L : (r, g_S) ↦ f, where r is some initial, e.g. random, network state, which (at least approximately) correctly generalises to the desired f from a function g that is not the restriction of f to the total training set S, but is rather a noise function, namely a function of randomised data, on S.
Historically, this puzzle was raised by a seminal paper
of Zhang et al. [1] mentioned earlier; see also [32]. This
paper showed that successful deep model classes have suf-
ficient capacity to memorise randomised data sets while
having the ability to produce zero training error for par-
ticular natural datasets, e.g. CIFAR-10. The authors
also empirically observed that explicit regularisation on
the norm of weights seemed to be unnecessary to obtain
small test errors, in contradiction to the “traditional wis-
dom” of generalisation theory. In the case of DNNs, the
large capacity of the model looks sufficient to memorise
the training data by brute force. This behavior conflicts
with conventional generalisation theory, since learning by
explicit memorisation of training examples should not
imply generalisation capabilities. Generalising is tradi-
tionally understood as learning some underlying rule as-
sociated with the data generation process, i.e. learning
some compact representation of the training function g_S,
and therefore being able to extrapolate that rule from
the training data to new unseen data. Moreover, as we
have already mentioned, a result of this kind is a chal-
lenge to traditional complexity measures and, in general,
to computational learning theory, since none of the exist-
ing bounds produces non-trivial results for interpolating
solutions of the sort generated by such DNN models.
Several efforts have recently focused on achieving non-
vacuous generalisation bounds for DNN models [11,33,
34]. In the light of this situation, Belkin et al. [4] pro-
pose to subsume the traditional U-shaped risk curve de-
scribing the trade-off between underfitting and overfitting with a double-descent risk curve (Figure 1).

FIG. 1: Traditional U-shaped risk curve, describing the trade-off between underfitting and overfitting, and double-descent risk curve, from [4, Figure 1].

From the
standard viewpoint, when the number of parameters N
is much smaller than the sample size n, i.e. N ≪ n, the traditional complexity measures assume that the training risk is close to the test risk. However, according to the double-descent risk curve that the authors propose, progressively increasing N, and thus the class capacity, enriches the function classes until they are rich enough to achieve zero training error. As a consequence, near-perfect-fit functions can be progressively constructed. These typically have a smaller norm and are thus simpler in the sense of Occam’s razor. A view of this kind contradicts the traditional framework, according to which, by increasing the class capacity, the predictors will perfectly interpolate the data; in this case, once an overfitting regime is reached, a very high generalisation risk is incurred as well. Yet, as the authors state, increasing the function class capacity further beyond this point leads to decreasing test risk, typically below the risk achieved when balancing underfitting and overfitting under the traditional approach.
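The double-descent behaviour can be reproduced in a few lines with a random-feature regression experiment in the spirit of [4]; everything below (target function, noise level, feature counts, ReLU features, minimum-norm least-squares fit) is an illustrative choice, not the authors' protocol. The test error typically peaks near the interpolation threshold, where the number of features roughly equals the number of training points, and then decreases again as the model is overparametrised:

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_test, d = 40, 1000, 5
f_true = lambda X: np.sin(X @ np.ones(d))            # unknown target function (assumed)

X_tr = rng.normal(size=(n_train, d))
y_tr = f_true(X_tr) + 0.1 * rng.normal(size=n_train) # noisy training labels
X_te = rng.normal(size=(n_test, d))
y_te = f_true(X_te)

def features(X, W):
    return np.maximum(X @ W, 0.0)                    # random ReLU features

for n_feat in [5, 10, 20, 40, 80, 160, 640, 2560]:   # crosses the interpolation threshold at 40
    W = rng.normal(size=(d, n_feat)) / np.sqrt(d)
    Phi_tr, Phi_te = features(X_tr, W), features(X_te, W)
    beta = np.linalg.pinv(Phi_tr) @ y_tr             # minimum-norm (pseudo-inverse) solution
    train_mse = np.mean((Phi_tr @ beta - y_tr) ** 2)
    test_mse = np.mean((Phi_te @ beta - y_te) ** 2)
    print(f"{n_feat:5d} features: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The use of the pseudo-inverse is the key design choice here: past the interpolation threshold it selects the minimum-norm interpolating solution, which is what produces the second descent.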
This view matches recent results from Poggio et al. [3]
and Bartlett et al. [35] according to which overparametri-
sation leads to “benign overfitting” in which an accurate
level of generalisation is achieved despite a near-perfect
fit to training data. In particular, according to these au-
thors, overparametrisation enables gradient techniques to
impose regularisation implicitly while leading to accurate
predictions despite their overfitting behaviour. Such behaviour favours weight matrices with a smaller norm, and hence a more distributed representation. This is a result
that matches the thesis that the generalisation perfor-
mance also depends on the size of the weights [12]. Both
of these assumptions guide our approach as well, even
though they will be embedded into a theoretical frame-
work totally different from classical complexity theory.
C. A quantum approach to generalisation
Quantum computing introduces a new and fundamen-
tally nonclassical resource for the representation of data:
quantum coherence, typically in the form of quantum
entanglement [36]. Just as classical computers employ
additional bits — from single-bit parity checks to full
backup copies — to detect and correct errors and hence
enable robustness in the face of noise or other perturba-
tions, quantum computers employ additional quantum
bits (qbits) to implement QECCs. The extreme sensi-
tivity of quantum states to environmental perturbations,
which induce decoherence [37] and hence “erase” data,
requires that effective QECCs have large dimensional-
ity, typically at least two qbits for every qbit to be pro-
tected [38]. Adding dimensionality, i.e. adding qbits to
the code-space of the implemented QECC, to a quantum
computer increases the range of perturbations against
which the encoded data will be protected [23].
The relevance of a high-dimensionality QECC to the
generalisation ability of a QNN can be made obvious
with a simple thought experiment. Consider a two-player
game in which A and B share a set Y of classical tokens, and share a quantum channel X. A selects a token y ∈ Y and encodes it into X using some quantum operator Q_A, which is nondeterministic in the sense that Q_A(y) depends on the immediately-previous state of X. B then uses an operator Q_B to read a token y′ from X. B sends y′ to A, and A sends back either ‘yes’ or ‘no’. B’s objective is to vary Q_B so as to receive ‘yes’ answers from A. Hence asymptotically, B’s objective is to make Q_B → Q_A.
This game is an example of local operations, classical
communication (LOCC) protocol [39]. The local opera-
tions are the encoding and measurement steps that em-
ploy Q_A and Q_B, respectively; the classical communication is the exchange of classical data (y′) and ‘yes’ or ‘no’ answers. The quantum channel subserving any successful LOCC protocol must implement a QECC to protect the quantum encoding of classical data [24]. In this game, X implements a QECC, with Q_A and Q_B the encoding
and decoding operators. The dimension dim(X) must,
therefore, be much larger than dim(Y); how much larger
depends on what range of perturbations the code protects
against.
Suppose A and B play this game for n rounds, at the end of which B is consistently getting ‘yes’ from A. What can be said about further rounds of the game? This is, clearly, the question of generalisation: the question of how similar Q_B is to Q_A after n rounds of “training”. If the QECC implemented by X protects against some set B_α of perturbations, and any encoding x = Q_A(y) is such that there exists B_α such that x = B_α(x_i), with x_i ∈ {x_1, . . . , x_n}, then B will receive a ‘yes’ after decoding x. This corresponds to the x_1 . . . x_n being representative of Q_A(y) for arbitrary y and arbitrary previous |X⟩, given the set B_α. If we know the B_α, we can find a lower bound on dim(X). Increasing dim(X) increases the set of perturbations that X can correct against. Here it is clear that the generalisation problem is not solvable in the general case: a general solution of the generalisation problem would be a QECC that protects against a provably-complete set of perturbations mapping S to D for arbitrary S and D. If D is unknown, no set of perturbations can be proved to be complete.
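A purely classical analogue of this redundancy argument, using a repetition code rather than a genuine QECC (the code length r and the flip probability p below are arbitrary choices), already shows how enlarging the code space enlarges the set of perturbations that can be corrected:

```python
import random

random.seed(2)

# A length-r repetition code protects one logical bit against up to floor((r-1)/2)
# bit flips, so increasing the redundancy r increases the range of correctable noise.
def decode_failure_rate(r, p, trials=20000):
    failures = 0
    for _ in range(trials):
        bit = random.randint(0, 1)
        codeword = [bit] * r                                   # encode with redundancy r
        noisy = [b ^ (random.random() < p) for b in codeword]  # independent bit-flip noise
        decoded = int(sum(noisy) > r / 2)                      # majority-vote decoding
        failures += decoded != bit
    return failures / trials

for r in [1, 3, 5, 9, 17]:
    print(f"r = {r:2d}: decoding failure rate ~= {decode_failure_rate(r, p=0.2):.4f}")
```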
In this thought experiment, the training and test sets
S and D are, effectively, measurements of the state in the
“middle” of X, reflecting the fact that from a quantum-
computing perspective, the “world” outside the commu-
nicating agents is a quantum information channel. We
can think of A and B — a human and a ML system undergoing training — as converging on a shared language — a shared set of concepts. The goal is for B’s concepts — elements of Y — to refer to the same things as A’s concepts; they approach agreement asymptotically by playing the above game. They each employ a quantum channel, the physical world, to generate a classical encoding, D. They take turns showing each other an instance of D and naming a concept. This is how, for
example, teaching a child to speak a language works.
In this quantum setting, there is no overfitting prob-
lem, because X does not separate into components that individually encode particular inferences. Instead, X em-
ploys a high-dimensional entangled state to maximally
distribute the representation of both the data and the
implemented computation (in the above case, an Iden-
tity map). A QECC is, in other words, as far from a
classical look-up table as one can get; it effectively takes
such a table and fully entangles it. Hence for any channel that implements a QECC, the test risk goes to zero as the
dimensionality, and hence the range of perturbations the
QECC protects against, increases.
As mentioned earlier, DNNs are the classical limits
of TQNNs, a generalisation of conventional quantum-
gate based, layered QNNs [19]. As quantum comput-
ers, TQNNs employ QECCs to achieve robustness [24].
Hence TQNNs, and DNNs as their classical limits, gen-
eralise better as their dimensionality increases. Indeed, minimising the norm of a DNN’s weight matrix — maximising
the classical entropy of the representation — is the clas-
sical limit of the use of entanglement to achieve a maxi-
mally distributed quantum encoding.
III. BASICS ON TOPOLOGICAL QUANTUM
FIELD THEORY
In this section we recall some basic facts about TQFTs
and give a general overview of the mathematical formal-
ism used in this article. While TQFTs use the language
of category theory, we will not delve into categorical ap-
proaches, but simply explain the meaning of certain cat-
egorical notions used in TQFT. We will also provide a
physical perspective on TQFTs, which is the driving motivation of this work, namely the application of this framework to machine learning (in the semi-classical limit).
A. Mathematical foundation
In general mathematical terms, using the Atiyah-Segal
axioms [40,41], a TQFT is a functor from the category of
cobordisms to the category of vector spaces. This defini-
tion, while very concise, is not particularly illuminating.
We will therefore explain the most important points of
this definition in a simplified manner.
First, a cobordism is a manifold of dimension n + 1
whose boundary is a union of manifolds of dimension n.
As a simple example one can think of a cylinder. In fact,
a cylinder is a manifold of dimension 2, while a circle is a
manifold of dimension 1, and the boundary of a cylinder
is a union of two circles. By category, in this context,
it is meant that it is possible to perform operations be-
tween manifolds, such as glueing manifolds along their
boundaries. One can therefore imagine, for example, glueing a cylinder to another cylinder along circles that constitute part of their boundaries. In this way, we can com-
pose manifolds by glueing along homeomorphic bound-
aries. This procedure is understood to be orientation-
preserving. In addition, we have another type of opera-
tion, the “tensor product”, which consists of taking the
disjoint union of manifolds.
The category of vector spaces in the context of TQFT
simply refers to the class of vector spaces over a given
field, e.g. C, and the linear maps between them. Com-
position is the usual composition of maps. The tensor
product, which is the second operation defined for cobor-
disms, is just the regular tensor product of vector spaces
and linear maps.
With these definitions in place, we are in the posi-
tion to state what a functor between cobordisms and
vector spaces is, and therefore to give a more descrip-
tive definition of TQFT. Such a functor is an assignment
of n-dimensional manifolds to vector spaces, and cobor-
disms between manifolds to linear maps between the cor-
responding vector spaces. One can think of a functor as
a translation of the terminology from a category (the one
of cobordisms in our case) to another category (the one
of vector spaces in our case). The correspondence should
satisfy certain coherence axioms, in the sense that com-
positions are sent to compositions, and tensor products
are sent to tensor products.
If we denote by Cob the category of cobordisms, and by Vec_k the category of vector spaces over a field k, then a TQFT is a functor F : Cob → Vec_k, meaning that F(M) is a vector space for each n-dimensional manifold M, and any cobordism Σ with boundary M_1 ⊔ M_2 corresponds to a linear map F(Σ) : F(M_1) → F(M_2).
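The functorial bookkeeping can be mimicked in a few lines of linear algebra; the sketch below only illustrates the composition and tensor-product rules (the vector space assigned to a circle and the matrices standing in for cobordisms are arbitrary choices), and is not a construction of an actual TQFT:

```python
import numpy as np

# Each circle is assigned the vector space C^2; a disjoint union of circles gets the
# tensor product of the factors; a cobordism is recorded as a linear map; glueing of
# cobordisms is composition of maps; disjoint union of cobordisms is the Kronecker product.
circle = 2                                        # dim F(S^1), chosen arbitrarily

def disjoint_union(A, B):                         # F(M1 ⊔ M2) = F(M1) ⊗ F(M2)
    return np.kron(A, B)

def glue(second, first):                          # F(Σ2 ∘ Σ1) = F(Σ2) F(Σ1)
    return second @ first

cylinder = np.eye(circle)                         # the cylinder acts as the identity cobordism
pair_of_pants = np.random.default_rng(0).normal(size=(circle, circle**2))  # two circles -> one circle

# Functoriality checks: glueing a cylinder onto a cobordism changes nothing, and the
# tensor product of identities is the identity on the tensor-product space.
assert np.allclose(glue(cylinder, pair_of_pants), pair_of_pants)
assert np.allclose(disjoint_union(cylinder, cylinder), np.eye(circle**2))
print("composition and tensor-product axioms hold in this toy model")
```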