A simple probabilistic neural network for machine
understanding
Rongrong Xie
Key Laboratory of Quark and Lepton Physics (MOE) and Institute of Particle Physics,
Central China Normal University (CCNU), Wuhan, China
and
Matteo Marsili
Quantitative Life Sciences Section
The Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy
(marsili@ictp.it)

arXiv:2210.13179v5 [cond-mat.dis-nn] 6 Dec 2023
Abstract
We discuss probabilistic neural networks with a fixed internal representation as models for machine understanding. Here understanding is intended as mapping data to an already existing representation which encodes an a priori organisation of the feature space.
We derive the internal representation by requiring that it satisfies the principles of maximal relevance and of maximal ignorance about how different features are combined. We show that, when hidden units are binary variables, these two principles identify a unique model – the Hierarchical Feature Model (HFM) – which is fully solvable and provides a natural interpretation in terms of features.
We argue that learning machines with this architecture enjoy a number of interesting properties, like the continuity of the representation with respect to changes in parameters and data, the possibility to control the level of compression and the ability to support functions that go beyond generalisation.
We explore the behaviour of the model with extensive numerical experiments and argue that models where the internal representation is fixed reproduce a learning modality which is qualitatively different from that of traditional models such as Restricted Boltzmann Machines.
“What I cannot create, I do not understand”
(Richard P. Feynman)
The advent of machine learning has expanded our ability to “create”, i.e. to generalise
from a set of examples, much beyond what we understand. In the classical view, the ability
to create data entails extracting compressed representations of complex data that capture
their regularities. In this view, the simpler the algorithm [1] or the statistical model [2]
that captures the regularities of the data, the more we understand. Yet even machines like auto-encoders or deep belief networks extract compressed representations from complex data; theirs, however, is a form of understanding that is unintelligible to us.
Furthermore, the many triumphs of machine intelligence, from automatic translation to the generation of texts and images, have shown that the ability to “create” does not require simplicity. The accuracy of deep neural networks can increase with complexity (i.e. with the number of parameters), without overfitting [3, 4].
In this paper “understanding” will be interpreted as “representing new knowledge or data within a preexisting framework or model$^1$”. Our aim is to explore the properties of a learning modality in which the internal representation is fixed, within an unsupervised learning framework of simple one-layer probabilistic neural networks. When the internal representation is fixed a priori, only the conditional distribution of the data, given the hidden variables, needs to be learned.

$^1$We thank Alessandro Ingrosso for suggesting this interpretation of “understanding”.
The models we shall study are similar in spirit to variational auto-encoders [5, 6].
Yet they differ in how the internal representation is chosen. Rather than insisting on
computational manageability or on independence of the hidden features [6, 7], we derive the
distribution of hidden variables from first principles. We do this by drawing from previous
work [8] that introduces an absolute notion of relevance for representations and argues
that relevance should be maximal for internal representations of neural networks trained
on datasets with a rich statistical structure. Indeed, the principle of maximal relevance [8]
has been shown to characterise both trained models in machine learning [9, 10] and efficient
coding [11] (see Section 1.1 for more details). For models with binary hidden variables,
which are those we focus on, we show that the principle of maximal relevance and the
requirement that the occurrence of features be as unbiased as possible, uniquely identify a
model. This model exhibits a hierarchical organisation of the feature space, which is why
we call it the Hierarchical Feature Model (HFM).
There are several reasons why such an approach may be desirable, besides being supported by first principles: it allows the internal representation to be chosen as a simple model, in agreement with Occam's razor, thus providing a transparent interpretation of the features. Models where the internal representation is fixed operate in a statistical regime characterised by the classical bias–variance tradeoff, which is qualitatively different from the regime in which over-parametrised (probabilistic) neural networks, such as Restricted Boltzmann Machines (RBMs) [12, 13], operate. Many more properties differ markedly from those that hold in RBMs, which we shall take as the natural reference point for our model: The compression level of the internal representation can be fixed at the outset. There is no need to resort to training based on stochastic gradient descent.
Features can be introduced one by one in the training protocol and, as we shall see, already learned features need not be learned anew when new features are introduced. These properties do not hold in RBMs$^2$. Also, sampling the equilibrium distribution of data is trivial when the internal representation is chosen a priori, while it is not in RBMs [15]. Finally, a fixed internal representation makes it possible to explore functions that go beyond learning and generalisation, as we shall discuss later.

$^2$We refer to the classical RBM defined in Appendix D. Côté and Larochelle [14] introduce a version of the RBM where hidden units are hierarchically ordered and in which features can be learned one by one, and their number does not need to be set a priori.
After illustrating these points on a few case studies, we will argue that learning, in our model, follows a qualitatively different modality with respect to that of RBMs. In RBMs, data-specific information is transferred to the internal representation, with weights which encode generic features [13]. In contrast, when the internal representation is fixed, data-specific information is necessarily incorporated in the weights. Hence weights can be thought of as feature vectors extracted from the data. In a design where the internal representation is fixed, training is akin to learning how to organise the data into an abstract, predetermined archive. We speculate that the abstraction implicit in this training modality may support higher “cognitive” functions, such as making up correlations between data learned separately. We shall comment further on the difference between these two learning modalities in the concluding Section, besides discussing further research avenues.
1 Single layer probabilistic neural networks
We focus on unsupervised learning and, in particular, on probabilistic single-layer neural networks composed of a vector of $m$ binary variables $x = (x_1, \ldots, x_m)$ coupled to a vector $s = (s_1, \ldots, s_n)$ of $n$ binary variables ($x_j, s_i \in \{0,1\}$). $x$ is called the visible layer whereas $s$ is the hidden layer. The network is defined by a joint probability distribution $p(x, s|\theta)$ over the visible and the hidden variables, that depends on a vector $\theta$ of parameters.
Given a dataset $\hat{x} = (x^{(1)}, \ldots, x^{(N)})$ of $N$ samples of the visible variables, the network is trained by maximising the log-likelihood

$$L(\hat{x}|\theta) = \sum_{i=1}^{N} \log p(x^{(i)}|\theta), \qquad (1)$$

over $\theta$, where

$$p(x|\theta) = \sum_{s} p(x, s|\theta) \qquad (2)$$
is the marginal distribution of $x$. We'll denote by $\hat{\theta} = \arg\max_\theta L(\hat{x}|\theta)$ the learned parameters (after training). Training maps the dataset $\hat{x}$ onto an internal representation

$$p(s|\hat{\theta}) = \sum_{x} p(x, s|\hat{\theta}), \qquad (3)$$

which is the marginal distribution over hidden states $s$.
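To make Eqs. (1)–(3) concrete, the following is a minimal sketch (not taken from the paper) of how the log-likelihood can be evaluated by brute-force marginalisation over the hidden states; the function names and signatures are illustrative, and the explicit sum over the $2^n$ hidden states is feasible only for small $n$.

```python
import itertools
import numpy as np

def log_likelihood(X, log_p_s, log_p_x_given_s, n):
    """Log-likelihood of Eq. (1), with the marginal of Eq. (2) obtained by
    summing over all 2^n hidden states (feasible only for small n).

    X               : (N, m) array of binary visible samples (the dataset x-hat)
    log_p_s         : callable s -> log p(s), the internal representation
    log_p_x_given_s : callable (x, s) -> log p(x | s, theta)
    """
    hidden_states = [np.array(s) for s in itertools.product([0, 1], repeat=n)]
    total = 0.0
    for x in X:
        # log p(x|theta) = log sum_s exp( log p(x|s,theta) + log p(s) )
        log_terms = np.array([log_p_x_given_s(x, s) + log_p_s(s)
                              for s in hidden_states])
        total += np.logaddexp.reduce(log_terms)
    return total
```

When the internal representation is fixed, `log_p_s` is held constant and training only adjusts the parameters entering `log_p_x_given_s`.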
For example, in an RBM (see Appendix D) the joint distribution $p(x, s|\theta)$ is defined in such a way that the variables $x$ and $s$ are easy to sample by Markov chain Monte Carlo methods. This allows one to estimate the gradients of the log-likelihood. During training of an RBM both the internal representation $p(s|\theta)$ and the conditional distribution $p(x|s, \theta)$ are learned from scratch.
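As an illustration of why sampling is easy in this case, here is a minimal sketch (not from the paper) of one block Gibbs sweep for an RBM with binary units; the parameter names `W`, `a`, `b` and the sign conventions are assumptions and may differ from the definition used in Appendix D.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_gibbs_step(x, W, a, b, rng):
    """One block Gibbs sweep for a binary RBM: sample the hidden layer given
    the visible one, then resample the visible layer given the hidden one.
    Chains of such sweeps provide the samples used to estimate the gradients
    of the log-likelihood (e.g. in contrastive divergence).

    x : (m,) binary visible configuration
    W : (n, m) weight matrix;  a : (m,) visible biases;  b : (n,) hidden biases
    """
    p_hidden = sigmoid(b + W @ x)            # p(s_i = 1 | x)
    s = (rng.random(p_hidden.shape) < p_hidden).astype(float)
    p_visible = sigmoid(a + W.T @ s)         # p(x_j = 1 | s)
    x_new = (rng.random(p_visible.shape) < p_visible).astype(float)
    return x_new, s
```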
This paper explores architectures

$$p(x, s|\theta) = p(x|s, \theta)\, p(s), \qquad (4)$$

where the internal representation $p(s)$ is fixed at the outset, and only the conditional distribution $p(x|s, \theta)$ is learned.
We assume that all statistical dependencies between the $x_j$ are “explained” by the variable $s$. This implies that, conditional on $s$, all components of $x$ are independent. For binary variables $x_j \in \{0,1\}$ this implies$^3$

$$p(x|s, \theta) = \prod_{j=1}^{m} \frac{e^{h_j(s)\, x_j}}{1 + e^{h_j(s)}}, \qquad h_j(s) = a_j + \sum_{i=1}^{n} s_i w_{i,j}. \qquad (5)$$
We shall interpret the vector $w_i = (w_{i,1}, \ldots, w_{i,m})$ as feature $i$. Hence points $x$ drawn from $p(x|s, \theta)$ with $s_i = 1$ are characterised by feature $i$, whereas if $s_i = 0$ the distribution $p(x|s, \theta)$ generates points without feature $i$. The internal configuration $s$ then specifies a profile of features. The distribution $p(s)$ encodes the way in which the space of features is populated.
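The generative role of a feature profile can be illustrated with a short sketch of Eq. (5) (illustrative code, not from the paper); the array shapes and the toy parameters in the usage example are assumptions.

```python
import numpy as np

def sample_visible(s, w, a, rng):
    """Draw x from the conditional of Eq. (5): given the feature profile s,
    each visible unit x_j is an independent Bernoulli variable with
    p(x_j = 1 | s) = e^{h_j(s)} / (1 + e^{h_j(s)}),  h_j(s) = a_j + sum_i s_i w_ij.

    s : (n,) binary feature profile
    w : (n, m) matrix whose i-th row is the feature vector w_i
    a : (m,) visible biases
    """
    h = a + s @ w                        # fields h_j(s)
    p = 1.0 / (1.0 + np.exp(-h))         # logistic probabilities p(x_j = 1 | s)
    return (rng.random(p.shape) < p).astype(int)

# Illustrative usage: a data point carrying feature 1 but none of the others
rng = np.random.default_rng(0)
n, m = 4, 10
w, a = rng.normal(size=(n, m)), np.zeros(m)
x = sample_visible(np.array([1, 0, 0, 0]), w, a, rng)
```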
Architectures where the internal representation is fixed at the outset and only the output layer is learned have been proposed for supervised learning tasks (see e.g. [16, 17, 18]) and for unsupervised learning, e.g. in auto-encoders [6, 5]. Their success suggests that the internal representation may be largely independent of the data being learned. The choice of $p(s)$ in these examples is dictated mostly by computational efficiency and/or by requirements of interpretability$^4$. Our focus is not on computational efficiency but rather on deriving $p(s)$ from first principles rooted in information theory.

$^3$A straightforward generalisation to continuous variables is possible, taking $x_j$ as Gaussian variables with mean $a_j + \sum_i s_i w_{i,j}$ and variance $\sigma_j^2$.

$^4$Variational auto-encoders generally assume an internal representation where the components of $s$ are independent Gaussian variables with unit variance. This enforces a representation of “disentangled” features, i.e. of features which are independent of each other. Locatello et al. [7] discuss the limits of this approach.
1.1 The principle of maximal relevance
The first requirement that we shall impose on $p(s)$ is that it should obey a principle of maximal relevance. Relevance has been recently proposed [8, 19] as a context- and model-free measure of informativeness of a dataset or of a representation. It is defined as the entropy of the distribution of coding costs $E_s \equiv -\log_2 p(s)$, i.e.

$$H[E] = -\sum_{E} p(E) \log_2 p(E), \qquad p(E) = \sum_{s} p(s)\, \delta\!\left(E + \log_2 p(s)\right). \qquad (6)$$
The relevance differs from the Shannon entropy of $p(s)$,

$$H[s] = -\sum_{s} p(s) \log_2 p(s) = E[E_s], \qquad (7)$$

which is the average coding cost$^5$. $H[s]$ quantifies the compression level and, following Ref. [8], it is called resolution.
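For a concrete reading of Eqs. (6) and (7), the following sketch (illustrative, not from the paper) computes both quantities from an explicit probability vector over hidden states; grouping states by rounded coding cost is a simplification.

```python
import numpy as np
from collections import defaultdict

def resolution_and_relevance(p):
    """Given a probability vector p over hidden states, return the resolution
    H[s] of Eq. (7) and the relevance H[E] of Eq. (6), with E_s = -log2 p(s).
    States sharing the same coding cost are grouped (here up to rounding)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    E = -np.log2(p)                        # coding costs E_s
    H_s = float(np.sum(p * E))             # resolution: average coding cost
    p_of_E = defaultdict(float)
    for cost, prob in zip(np.round(E, 10), p):
        p_of_E[cost] += prob               # p(E) of Eq. (6)
    q = np.array(list(p_of_E.values()))
    H_E = float(-np.sum(q * np.log2(q)))   # relevance
    return H_s, H_E
```

For instance, the uniform distribution over $2^n$ states gives $H[s] = n$ bits and $H[E] = 0$, since all states then have the same coding cost.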
Ref. [8] provides several arguments that support the hypothesis that efficient representations satisfy the principle of maximal relevance, i.e. that $p(s)$ maximises $H[E]$ at a given average coding cost $H[s]$. For example, it shows that $H[E]$ lower bounds the mutual information between the hidden state $s$ of the network and the hidden features that the training process extracts from the data. So representations of maximal relevance are those that, in theory, extract the largest amount of information from data.
The principle of maximal relevance dictates that the number $W(E)$ of states $s$ with coding cost $E = -\log_2 p(s)$ should follow an exponential law, $W(E) = W_0\, e^{gE}$, where the constant $g$ depends on the resolution $H[s]$. The exponential law entails an efficient use of information resources. In order to see this, remember that the coding cost $E$ measures the level of detail of a state, i.e. the (minimal) number of bits needed to represent it. There are $2^E$ possible states with a level of detail $E$, hence $W(E) \propto 2^E$ corresponds to a situation in which the feature space is exploited optimally, in the sense that it is occupied as uniformly as possible$^6$. For a generic value of $H[s]$, the same principle leads to $W(E) = W_0\, e^{gE}$, as shown in [8], with $g \neq \log 2$, which depends on the resolution $H[s]$.
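To spell out why $W(E) \propto 2^E$ corresponds to the most uniform occupation, note that all states with coding cost $E$ have probability $2^{-E}$, so the definition in Eq. (6) gives

$$p(E) = \sum_{s} p(s)\, \delta\!\left(E + \log_2 p(s)\right) = W(E)\, 2^{-E} \propto 2^{E}\, 2^{-E} = \mathrm{const},$$

i.e. the distribution of coding costs is uniform over its support, which is the condition that maximises $H[E]$ for a fixed range of coding costs.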
As a function of $H[s]$, the maximal value of the relevance $H[E]$ has a non-monotonic behaviour which distinguishes two learning regimes. For large values of $H[s]$, $H[E]$ is a decreasing function of $H[s]$. In other words, compression, i.e. a reduction of $H[s]$, brings an increase in relevance in this regime. Learning in this regime is akin to compressing out irrelevant details (i.e. noise) from data. Upon decreasing $H[s]$ further, $H[E]$ reaches a maximum and then starts decreasing. The maximum of $H[E]$ corresponds to the most compressed lossless representation. This point coincides with the optimum discussed above where $W(E) \propto 2^E$. A further reduction of $H[s]$ beyond this point leads to lossy compressed representations.
$^5$We follow the standard notation [20] for the entropy $H[X]$ of a random variable $X$.

$^6$Notice that $\log_2 W(E)$ is the number of bits needed to distinguish between states with the same level of detail $E$. Hence $W(E) \propto 2^E$ implies that the information cost (in bits) of retrieving a state is the same, apart from an offset, as the number of bits needed to describe it, which is the description length $E$ (in bits) of that state.