
to create data entails extracting compressed representations of complex data that capture
their regularities. In this view, the simpler the algorithm [1] or the statistical model [2]
that captures the regularities of the data, the more we understand. Yet machines such as
auto-encoders or deep belief networks also extract compressed representations from complex
data, and theirs is a form of understanding that is unintelligible to us.
Furthermore, the many triumphs of machine intelligence, from automatic translation
to the generation of texts and images, have shown that the ability to “create” does not
require simplicity. The accuracy of deep neural networks can increase with complexity (i.e.
with the number of parameters) without overfitting [3, 4].
In this paper “understanding” will be interpreted as “representing new knowledge or
data within a preexisting framework or model”¹. Our aim is to explore the properties of
a learning modality in which the internal representation is fixed, within an unsupervised
learning framework of simple one-layer probabilistic neural networks. When the internal
representation is fixed a priori, only the conditional distribution of the data, given the
hidden variables, needs to be learned.
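As a minimal sketch of this setting (the symbols x, s, q and w below are our own illustrative notation and are not fixed by the discussion above), let x denote a data point and s its internal representation. The generative model then reads

p_w(x) = \sum_{s} p_w(x \mid s)\, q(s),

where the distribution q(s) of the hidden variables is fixed a priori, and only the parameters w of the conditional p_w(x \mid s) are learned, for instance by maximising the log-likelihood \sum_i \log p_w(x^{(i)}) over the dataset.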
The models we shall study are similar in spirit to variational auto-encoders [5, 6].
Yet they differ in how the internal representation is chosen. Rather than insisting on
computational manageability or on independence of the hidden features [6, 7], we derive the
distribution of hidden variables from first principles. We do this by drawing from previous
work [8] that introduces an absolute notion of relevance for representations and argues
that relevance should be maximal for internal representations of neural networks trained
on datasets with a rich statistical structure. Indeed, the principle of maximal relevance [8]
has been shown to characterise both trained models in machine learning [9, 10] and efficient
coding [11] (see Section 1.1 for more details). For models with binary hidden variables,
which are those we focus on, we show that the principle of maximal relevance and the
requirement that the occurrence of features be as unbiased as possible uniquely identify a
model. This model exhibits a hierarchical organisation of the feature space, which is why
we call it the Hierarchical Feature Model (HFM).
There are several reasons why such an approach may be desirable, besides being supported
by first principles: it allows the internal representation to be chosen as a simple
model, in agreement with Occam's razor, thus providing a transparent interpretation
of the features. Models where the internal representation is fixed operate in a statistical
regime characterised by the classical bias–variance tradeoff, which is qualitatively different
from the regime in which over-parametrised (probabilistic) neural networks, such as
Restricted Boltzmann Machines [12, 13] (RBMs), operate. Many other properties differ
markedly from those that hold in RBMs, which we shall take as the natural reference
point for our model: the compression level of the internal representation can be
fixed at the outset; there is no need to resort to training based on stochastic gradient descent;
features can be introduced one by one in the training protocol and, as we shall see, already
¹ We thank Alessandro Ingrosso for suggesting this interpretation of “understanding”.