A simple probabilistic neural network for machine
understanding
Rongrong Xie
Key Laboratory of Quark and Lepton Physics (MOE) and Institute of Particle Physics,
Central China Normal University (CCNU), Wuhan, China
and
Matteo Marsili
Quantitative Life Sciences Section
The Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy
(marsili@ictp.it)

arXiv:2210.13179v5 [cond-mat.dis-nn] 6 Dec 2023
Abstract
We discuss probabilistic neural networks with a fixed internal representation as models for machine understanding. Here understanding is intended as mapping data to an already existing representation which encodes an a priori organisation of the feature space.
We derive the internal representation by requiring that it satisfies the principles of maximal relevance and of maximal ignorance about how different features are combined. We show that, when hidden units are binary variables, these two principles identify a unique model – the Hierarchical Feature Model (HFM) – which is fully solvable and provides a natural interpretation in terms of features.
We argue that learning machines with this architecture enjoy a number of interesting properties, like the continuity of the representation with respect to changes in parameters and data, the possibility to control the level of compression and the ability to support functions that go beyond generalisation.
We explore the behaviour of the model with extensive numerical experiments and argue that models where the internal representation is fixed reproduce a learning modality which is qualitatively different from that of traditional models such as Restricted Boltzmann Machines.
“What I cannot create, I do not understand”
(Richard P. Feynman)
The advent of machine learning has expanded our ability to “create”, i.e. to generalise
from a set of examples, much beyond what we understand. In the classical view, the ability
to create data entails extracting compressed representations of complex data that capture
their regularities. In this view, the simpler the algorithm [1] or the statistical model [2]
that captures the regularities of the data, the more we understand. Yet even machines like auto-encoders or deep belief networks extract compressed representations from complex data; theirs, however, is a form of understanding that is unintelligible to us.
Furthermore, the many triumphs of machine intelligence, from automatic translation to the generation of texts and images, have shown that the ability to “create” does not require simplicity. The accuracy of deep neural networks can increase with complexity (i.e. with the number of parameters), without overfitting [3, 4].
In this paper “understanding” will be interpreted as “representing new knowledge or data within a preexisting framework or model$^1$”. Our aim is to explore the properties of a learning modality in which the internal representation is fixed, within an unsupervised learning framework of simple one-layer probabilistic neural networks. When the internal representation is fixed a priori, only the conditional distribution of the data, given the hidden variables, needs to be learned.

$^1$We thank Alessandro Ingrosso for suggesting this interpretation of “understanding”.
The models we shall study are similar in spirit to variational auto-encoders [5, 6].
Yet they differ in how the internal representation is chosen. Rather than insisting on
computational manageability or on independence of the hidden features [6, 7], we derive the
distribution of hidden variables from first principles. We do this by drawing from previous
work [8] that introduces an absolute notion of relevance for representations and argues
that relevance should be maximal for internal representations of neural networks trained
on datasets with a rich statistical structure. Indeed, the principle of maximal relevance [8]
has been shown to characterise both trained models in machine learning [9, 10] and efficient
coding [11] (see Section 1.1 for more details). For models with binary hidden variables,
which are those we focus on, we show that the principle of maximal relevance and the
requirement that the occurrence of features be as unbiased as possible, uniquely identify a
model. This model exhibits a hierarchical organisation of the feature space, which is why
we call it the Hierarchical Feature Model (HFM).
There are several reasons why such an approach may be desirable, besides being supported by first principles: it allows the internal representation to be chosen as a simple model, in agreement with Occam's razor, thus providing a transparent interpretation of the features. Models where the internal representation is fixed operate in a statistical regime characterised by the classical bias–variance tradeoff, which is qualitatively different from the regime in which over-parametrised (probabilistic) neural networks, such as Restricted Boltzmann Machines (RBMs) [12, 13], operate. Many more properties differ markedly from those that hold in RBMs, which we shall take as the natural reference point for our model: The compression level of the internal representation can be fixed at the outset. There is no need to resort to training based on stochastic gradient descent.
Features can be introduced one by one in the training protocol and, as we shall see, already learned features need not be learned anew when new features are introduced. These properties do not hold in RBMs$^2$. Also, sampling the equilibrium distribution of data is trivial when the internal representation is chosen a priori, while it is not in RBMs [15]. Finally, a fixed internal representation makes it possible to explore functions that go beyond learning and generalisation, as we shall discuss later.

$^2$We refer to the classical RBM defined in Appendix D. Côté and Larochelle [14] introduce a version of the RBM where hidden units are hierarchically ordered and in which features can be learned one by one, and their number does not need to be set a priori.
After illustrating these points on a few case studies, we will argue that learning, in our model, follows a qualitatively different modality with respect to that of RBMs. In RBMs, data-specific information is transferred to the internal representation, with weights which encode generic features [13]. In contrast, when the internal representation is fixed, data-specific information is necessarily incorporated in the weights. Hence weights can be thought of as feature vectors extracted from the data. In a design where the internal representation is fixed, training is akin to learning how to organise the data into an abstract, predetermined archive. We speculate that the abstraction implicit in this training modality may support higher “cognitive” functions, such as making up correlations between data learned separately. We shall comment further on the difference between these two learning modalities in the concluding Section, besides discussing further research avenues.
1 Single layer probabilistic neural networks
We focus on unsupervised learning and, in particular, on probabilistic single-layer neural networks composed of a vector of $m$ binary variables $x = (x_1, \ldots, x_m)$ coupled to a vector $s = (s_1, \ldots, s_n)$ of $n$ binary variables ($x_j, s_i \in \{0,1\}$). $x$ is called the visible layer whereas $s$ is the hidden layer. The network is defined by a joint probability distribution $p(x, s|\theta)$ over the visible and the hidden variables, that depends on a vector $\theta$ of parameters.
Given a dataset $\hat{x} = (x^{(1)}, \ldots, x^{(N)})$ of $N$ samples of the visible variables, the network is trained by maximising the log-likelihood

$$L(\hat{x}|\theta) = \sum_{i=1}^{N} \log p(x^{(i)}|\theta), \qquad (1)$$

over $\theta$, where

$$p(x|\theta) = \sum_{s} p(x, s|\theta) \qquad (2)$$
is the marginal distribution of $x$. We'll denote by $\hat{\theta} = \arg\max_\theta L(\hat{x}|\theta)$ the learned parameters (after training). Training maps the dataset $\hat{x}$ onto an internal representation

$$p(s|\hat{\theta}) = \sum_{x} p(x, s|\hat{\theta}), \qquad (3)$$

which is the marginal distribution over hidden states $s$.
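To make Eqs. (1)–(3) concrete, the following is a minimal sketch (not taken from the paper) of how the log-likelihood can be evaluated by brute-force marginalisation over the hidden states; the function names and signatures are illustrative, and the explicit sum over the $2^n$ hidden states is feasible only for small $n$.

```python
import itertools
import numpy as np

def log_likelihood(X, log_p_s, log_p_x_given_s, n):
    """Log-likelihood of Eq. (1), with the marginal of Eq. (2) obtained by
    summing over all 2^n hidden states (feasible only for small n).

    X               : (N, m) array of binary visible samples (the dataset x-hat)
    log_p_s         : callable s -> log p(s), the internal representation
    log_p_x_given_s : callable (x, s) -> log p(x | s, theta)
    """
    hidden_states = [np.array(s) for s in itertools.product([0, 1], repeat=n)]
    total = 0.0
    for x in X:
        # log p(x|theta) = log sum_s exp( log p(x|s,theta) + log p(s) )
        log_terms = np.array([log_p_x_given_s(x, s) + log_p_s(s)
                              for s in hidden_states])
        total += np.logaddexp.reduce(log_terms)
    return total
```

When the internal representation is fixed, `log_p_s` is held constant and training only adjusts the parameters entering `log_p_x_given_s`.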
For example, in an RBM (see Appendix D) the joint distribution $p(x, s|\theta)$ is defined in such a way that the variables $x$ and $s$ are easy to sample by Markov chain Monte Carlo methods. This allows one to estimate the gradients of the log-likelihood. During training of an RBM both the internal representation $p(s|\theta)$ and the conditional distribution $p(x|s, \theta)$ are learned from scratch.
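As an illustration of why sampling is easy in this case, here is a minimal sketch (not from the paper) of one block Gibbs sweep for an RBM with binary units; the parameter names `W`, `a`, `b` and the sign conventions are assumptions and may differ from the definition used in Appendix D.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_gibbs_step(x, W, a, b, rng):
    """One block Gibbs sweep for a binary RBM: sample the hidden layer given
    the visible one, then resample the visible layer given the hidden one.
    Chains of such sweeps provide the samples used to estimate the gradients
    of the log-likelihood (e.g. in contrastive divergence).

    x : (m,) binary visible configuration
    W : (n, m) weight matrix;  a : (m,) visible biases;  b : (n,) hidden biases
    """
    p_hidden = sigmoid(b + W @ x)            # p(s_i = 1 | x)
    s = (rng.random(p_hidden.shape) < p_hidden).astype(float)
    p_visible = sigmoid(a + W.T @ s)         # p(x_j = 1 | s)
    x_new = (rng.random(p_visible.shape) < p_visible).astype(float)
    return x_new, s
```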
This paper explores architectures

$$p(x, s|\theta) = p(x|s, \theta)\, p(s), \qquad (4)$$

where the internal representation $p(s)$ is fixed at the outset, and only the conditional distribution $p(x|s, \theta)$ is learned.
We assume that all statistical dependencies between the $x_j$ are “explained” by the variable $s$. This implies that, conditional on $s$, all components of $x$ are independent. For binary variables $x_j \in \{0,1\}$ this implies$^3$

$$p(x|s, \theta) = \prod_{j=1}^{m} \frac{e^{h_j(s)\, x_j}}{1 + e^{h_j(s)}}, \qquad h_j(s) = a_j + \sum_{i=1}^{n} s_i w_{i,j}. \qquad (5)$$
We shall interpret the vector $w_i = (w_{i,1}, \ldots, w_{i,m})$ as feature $i$. Hence points $x$ drawn from $p(x|s, \theta)$ with $s_i = 1$ are characterised by feature $i$, whereas if $s_i = 0$ the distribution $p(x|s, \theta)$ generates points without feature $i$. The internal configuration $s$ then specifies a profile of features. The distribution $p(s)$ encodes the way in which the space of features is populated.
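The generative role of a feature profile can be illustrated with a short sketch of Eq. (5) (illustrative code, not from the paper); the array shapes and the toy parameters in the usage example are assumptions.

```python
import numpy as np

def sample_visible(s, w, a, rng):
    """Draw x from the conditional of Eq. (5): given the feature profile s,
    each visible unit x_j is an independent Bernoulli variable with
    p(x_j = 1 | s) = e^{h_j(s)} / (1 + e^{h_j(s)}),  h_j(s) = a_j + sum_i s_i w_ij.

    s : (n,) binary feature profile
    w : (n, m) matrix whose i-th row is the feature vector w_i
    a : (m,) visible biases
    """
    h = a + s @ w                        # fields h_j(s)
    p = 1.0 / (1.0 + np.exp(-h))         # logistic probabilities p(x_j = 1 | s)
    return (rng.random(p.shape) < p).astype(int)

# Illustrative usage: a data point carrying feature 1 but none of the others
rng = np.random.default_rng(0)
n, m = 4, 10
w, a = rng.normal(size=(n, m)), np.zeros(m)
x = sample_visible(np.array([1, 0, 0, 0]), w, a, rng)
```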
Architectures where the internal representation is fixed at the outset and only the output layer is learned have been proposed for supervised learning tasks (see e.g. [16, 17, 18]) and for unsupervised learning, e.g. in auto-encoders [6, 5]. Their success suggests that the internal representation may be largely independent of the data being learned. The choice of $p(s)$ in these examples is dictated mostly by computational efficiency and/or by requirements of interpretability$^4$. Our focus is not on computational efficiency but rather on deriving $p(s)$ from first principles rooted in information theory.

$^3$A straightforward generalisation to continuous variables is possible, taking $x_j$ as Gaussian variables with mean $a_j + \sum_i s_i w_{i,j}$ and variance $\sigma_j^2$.

$^4$Variational auto-encoders generally assume an internal representation where the components of $s$ are independent Gaussian variables with unit variance. This enforces a representation of “disentangled” features, i.e. of features which are independent of each other. Locatello et al. [7] discuss the limits of this approach.
1.1 The principle of maximal relevance
The first requirement that we shall impose on $p(s)$ is that it should obey a principle of maximal relevance. Relevance has been recently proposed [8, 19] as a context- and model-free measure of informativeness of a dataset or of a representation. It is defined as the entropy of the distribution of coding costs $E_s \equiv -\log_2 p(s)$, i.e.

$$H[E] = -\sum_{E} p(E) \log_2 p(E), \qquad p(E) = \sum_{s} p(s)\, \delta\!\left(E + \log_2 p(s)\right). \qquad (6)$$
The relevance differs from the Shannon entropy of $p(s)$,

$$H[s] = -\sum_{s} p(s) \log_2 p(s) = E[E_s], \qquad (7)$$

which is the average coding cost$^5$. $H[s]$ quantifies the compression level and, following Ref. [8], it is called resolution.
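For a concrete reading of Eqs. (6) and (7), the following sketch (illustrative, not from the paper) computes both quantities from an explicit probability vector over hidden states; grouping states by rounded coding cost is a simplification.

```python
import numpy as np
from collections import defaultdict

def resolution_and_relevance(p):
    """Given a probability vector p over hidden states, return the resolution
    H[s] of Eq. (7) and the relevance H[E] of Eq. (6), with E_s = -log2 p(s).
    States sharing the same coding cost are grouped (here up to rounding)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    E = -np.log2(p)                        # coding costs E_s
    H_s = float(np.sum(p * E))             # resolution: average coding cost
    p_of_E = defaultdict(float)
    for cost, prob in zip(np.round(E, 10), p):
        p_of_E[cost] += prob               # p(E) of Eq. (6)
    q = np.array(list(p_of_E.values()))
    H_E = float(-np.sum(q * np.log2(q)))   # relevance
    return H_s, H_E
```

For instance, the uniform distribution over $2^n$ states gives $H[s] = n$ bits and $H[E] = 0$, since all states then have the same coding cost.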
Ref. [8] provides several arguments that support the hypothesis that efficient representations satisfy the principle of maximal relevance, i.e. that $p(s)$ maximises $H[E]$ at a given average coding cost $H[s]$. For example, it shows that $H[E]$ lower bounds the mutual information between the hidden state $s$ of the network and the hidden features that the training process extracts from the data. So representations of maximal relevance are those that, in theory, extract the largest amount of information from data.
The principle of maximal relevance dictates that the number $W(E)$ of states $s$ with coding cost $E = -\log_2 p(s)$ should follow an exponential law, $W(E) = W_0\, e^{gE}$, where the constant $g$ depends on the resolution $H[s]$. The exponential law entails an efficient use of information resources. In order to see this, remember that the coding cost $E$ measures the level of detail of a state, i.e. the (minimal) number of bits needed to represent it. There are $2^E$ possible states with a level of detail $E$, hence $W(E) \propto 2^E$ corresponds to a situation in which the feature space is exploited optimally, in the sense that it is occupied as uniformly as possible$^6$. For a generic value of $H[s]$, the same principle leads to $W(E) = W_0\, e^{gE}$, as shown in [8], with $g \neq \log 2$, which depends on the resolution $H[s]$.
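To spell out why $W(E) \propto 2^E$ corresponds to the most uniform occupation, note that all states with coding cost $E$ have probability $2^{-E}$, so the definition in Eq. (6) gives

$$p(E) = \sum_{s} p(s)\, \delta\!\left(E + \log_2 p(s)\right) = W(E)\, 2^{-E} \propto 2^{E}\, 2^{-E} = \mathrm{const},$$

i.e. the distribution of coding costs is uniform over its support, which is the condition that maximises $H[E]$ for a fixed range of coding costs.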
As a function of $H[s]$, the maximal value of the relevance $H[E]$ has a non-monotonic behaviour which distinguishes two learning regimes. For large values of $H[s]$, $H[E]$ is a decreasing function of $H[s]$. In other words, compression, i.e. a reduction of $H[s]$, brings an increase in relevance in this regime. Learning in this regime is akin to compressing out irrelevant details (i.e. noise) from data. Upon decreasing $H[s]$ further, $H[E]$ reaches a maximum and then starts decreasing. The maximum of $H[E]$ corresponds to the most compressed lossless representation. This point coincides with the optimum discussed above where $W(E) \propto 2^E$. A further reduction of $H[s]$ beyond this point leads to lossy compressed representations.
$^5$We follow the standard notation [20] for the entropy $H[X]$ of a random variable $X$.

$^6$Notice that $\log_2 W(E)$ is the number of bits needed to distinguish between states with the same level of detail $E$. Hence $W(E) \propto 2^E$ implies that the information cost (in bits) of retrieving a state is the same, apart from an offset, as the number of bits needed to describe it, which is the description length $E$ (in bits) of that state.