Constrained Predictive Coding as a Biologically Plausible Model of the Cortical Hierarchy

Siavash Golkar¹, Tiberiu Tesileanu¹, Yanis Bahroun¹, Anirvan M. Sengupta²,³,⁴, Dmitri Chklovskii¹,⁵
¹Center for Computational Neuroscience, Flatiron Institute
²Center for Computational Mathematics, Flatiron Institute
³Center for Computational Quantum Physics, Flatiron Institute
⁴Department of Physics and Astronomy, Rutgers University
⁵Neuroscience Institute, NYU Medical Center
{sgolkar,ttesileanu,ybahroun,dchklovskii}@flatironinstitute.org
anirvans.physics@gmail.com
Abstract
Predictive coding (PC) has emerged as an influential normative model of neural
computation, with numerous extensions and applications. As such, much effort has
been put into mapping PC faithfully onto the cortex, but there are issues that remain
unresolved or controversial. In particular, current implementations often involve
separate value and error neurons and require symmetric forward and backward
weights across different brain regions. These features have not been experimentally
confirmed. In this work, we show that the PC framework in the linear regime can
be modified to map faithfully onto the cortical hierarchy in a manner compatible
with empirical observations. By employing a disentangling-inspired constraint
on hidden-layer neural activities, we derive an upper bound for the PC objective.
Optimization of this upper bound leads to an algorithm that shows the same performance as the original objective and maps onto a biologically plausible network.
The units of this network can be interpreted as multi-compartmental neurons with
non-Hebbian learning rules, with a remarkable resemblance to recent experimental
findings. There exist prior models which also capture these features, but they
are phenomenological, while our work is a normative derivation. Notably, the
network we derive does not involve one-to-one connectivity or signal multiplexing,
which the phenomenological models required, indicating that these features are not
necessary for learning in the cortex. The normative nature of our algorithm in the
simplified linear case allows us to prove interesting properties of the framework and
analytically understand the computational role of our network’s components. The
parameters of our network have natural interpretations as physiological quantities
in a multi-compartmental model of pyramidal neurons, providing a concrete link
between PC and experimental measurements carried out in the cortex.
1 Introduction
Over the past decades, predictive coding (PC), a normative framework for learning representations
that maximize predictive power, has played an important role in computational neuroscience [1,2].
Initially proposed as an unsupervised learning paradigm in the retina [3], it has since been expanded
to the supervised regime [4] with arbitrary graph topologies [5,6]. The PC framework has been
analyzed in many contexts [7,8] and has found many applications, from clinical neuroscience [9,10]
to memory storage and retrieval [11]. We refer the reader to [12,13] for recent reviews.
Figure 1: Schematic architecture of the predictive coding network (PCN) and our covariance-constrained network (BioCCPC). (a) PCN from [4], figure adapted from [13]; the network contains separate value and error neurons, the intra-layer connectivity is required to be one-to-one, and the inter-layer connectivity is symmetric. (b) Our BioCCPC network, with interneurons mediating within-layer interactions; there is no requirement for symmetric weights across layers, and the connectivity within layers can be diffuse.
Predictive coding is viewed as a possible theory of cortical computation, and many parallels have
been drawn with the known neurophysiology of cortex [12]. While the initial works proposed a
biologically plausible network [1,2], the connection with cortex was more closely examined in [14],
where the PC module was mapped onto a cortical-column microcircuit. However, there are aspects of
this mapping that have proved controversial. Among these are the requirement of multiple redundant
cortical operations, the symmetric connectivity pattern, the one-to-one connectivity of value and error
neurons, and the requirement that feedback connections be inhibitory [12,15,16], as sketched in Fig. 1a. The presence of separate error and value neurons has itself been called into question [15].
The PC-based neural circuits also do not account for more recent experimental findings highlighting
the details of computation in the cortex [17–26]. For example, the learning dynamics of multi-compartmental pyramidal neurons have been closely investigated.
the plasticity of the synapses in the basal compartment is driven by activity in the apical tuft of the neuron via so-called calcium plateau potentials [19,20,24], leading to non-Hebbian learning
rules [25]. These experiments have motivated the development of several models of microcircuits with
multi-compartmental neurons [27–33]. In a number of cases, it has been shown that these models can
replicate learning similar to the backpropagation algorithm under specific assumptions [30,34,35].
However, because of their rather phenomenological nature, detailed analysis of these models is
in many cases challenging and one must resort to numerical simulations, rendering the task of
understanding the role of various neurophysiological quantities difficult. To date, a normative framework (PC or otherwise) that can explain these experimental findings is still lacking.
In this work, we show how the PC framework can be made compatible with the aforementioned
experimental observations. Inspired by prior work which explored the effects of finding decorrelated
representations [36–41], we add a decorrelating inequality constraint on the covariance of the latent representations of PC. Using this constraint, we derive an upper bound for the PC objective. By
working in the linear regime, we can prove interesting properties of our algorithm. We show
that the learning algorithm derived from this upper bound does not suffer from the issues of prior
implementations and naturally maps onto the hierarchical structure of the cortical pyramidal neurons.
Contributions
- By imposing a decorrelation-inspired inequality constraint on the latent space, we find an analytic upper bound to the PC objective.
- We introduce BioCCPC, a novel biologically plausible algorithm derived from this upper bound, and show that it closely matches the known physiology of the cortical hierarchy.
- We interpret the different parameters of the algorithm in terms of the conductances and leaks of the separate compartments of the pyramidal neuron. We find that the neural compartmental conductances encode the variances of the PC framework, and the somatic leak maps onto a thresholding mechanism of the associated eigenvalue problem.
2 Related work and review of predictive coding
The backpropagation algorithm [34,42] is the predominant tool in the training of deep neural
networks but is generally not considered biologically plausible [43]. Over the years, many authors
have explored biologically plausible approximations or alternatives to backpropagation [30,31,44–58] (for a more complete review see [59]). These approaches generally fall into two categories. First are normative approaches, such as Predictive Coding [1–3], Target Propagation and variations [44,45,47], Equilibrium Propagation [49,53] and others [51,54–56], where one starts from a mathematically
motivated optimization objective. These methods, by virtue of their normative derivation, have a
firm grounding in theory; however, they do not fully conform to the experimental observations in
the brain (see below and Sec. 4). The second approach is driven by biology, with network structures
and learning rules inspired by experimental findings [28,30,31,48,60]. While these works mostly conform to experimental observations, they are more challenging to analyze because of their conjectured, phenomenological nature.
The goal of the present paper is not to propose yet another biologically plausible alternative to
backpropagation. It is rather to demonstrate that the normative framework of predictive coding,
when combined with a constraint, can indeed closely match experimental observations. For this
reason, in this work, we focus on comparing our method with previous implementations of predictive
coding and do not concern ourselves with other biologically plausible alternatives to backpropagation.
The relationship between the PC framework and backpropagation was explored in [4,61–64]. The advantages of PC over backpropagation were highlighted in [13].
Notation.
Bold upper case $\mathbf{M}$ and lower case $\mathbf{v}$ variables denote matrices and vectors, respectively. By upper case letters $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$, we denote data matrices in $\mathbb{R}^{d\times T}$, where $d$ and $T$ are the dimension of the relevant variable and the number of samples. Lower case $\mathbf{x}, \mathbf{y}, \mathbf{z}$ denote the corresponding quantities for a single sample, and $\mathbf{x}_t, \mathbf{y}_t, \mathbf{z}_t$ denote the $t$-th sample. $\|\mathbf{M}\|_F^2$ denotes the squared Frobenius norm of $\mathbf{M}$.
2.1 Review of predictive coding
Probabilistic model.
In this section, we review the supervised predictive coding algorithm [4]. The derivation starts from a probabilistic model for supervised learning, which parallels the architecture of an artificial neural network (ANN) with $n+1$ layers. In this model, the neurons of each layer are random variables $\mathbf{z}^{(l)}$ (denoting the vector of activations in the $l$-th layer), with layers $0$ and $n$, respectively, denoting the input and output layers of the network. We assume that the joint probability of the latent variables factorizes in a Markovian structure
$$p(\mathbf{z}^{(0)},\mathbf{z}^{(1)},\dots,\mathbf{z}^{(n)}) = p(\mathbf{z}^{(n)}\mid\mathbf{z}^{(n-1)})\,p(\mathbf{z}^{(n-1)}\mid\mathbf{z}^{(n-2)})\cdots p(\mathbf{z}^{(0)}),$$
with the relationship between the random variables of adjacent layers given by
$$p\big(\mathbf{z}^{(l)}\mid\mathbf{z}^{(l-1)}\big)=\mathcal{N}\big(\mathbf{z}^{(l)};\,\boldsymbol{\mu}^{(l)},\boldsymbol{\Sigma}^{(l)}\big),\qquad \boldsymbol{\mu}^{(l)}=\mathbf{W}^{(l-1)}f(\mathbf{z}^{(l-1)}),\quad \boldsymbol{\Sigma}^{(l)}=\sigma^{(l)2}\,\mathbf{I}. \tag{1}$$
The mean of the probability density on layer $l$ mirrors the activity of the analogous ANN, given by $\boldsymbol{\mu}^{(l)}=\mathbf{W}^{(l-1)}f(\mathbf{z}^{(l-1)})$, where $\mathbf{W}^{(l-1)}$ are the weights connecting layers $l-1$ and $l$. The objective function is then given by the negative log-likelihood of the joint distribution:
$$L=-\sum_t \log p(\mathbf{z}^{(0)}_t,\dots,\mathbf{z}^{(n)}_t) = \frac{1}{2}\sum_l \frac{\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}f(\mathbf{Z}^{(l-1)})\big\|_F^2}{\sigma^{(l)2}} + \text{const}, \tag{2}$$
where we have switched to the data matrix notation for brevity and assumed that the variances $\sigma^{(l)2}$ are fixed hyperparameters. In the following, we refer to $L$, Eq. (2), without the constant term.
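To make the objective concrete, the following NumPy sketch evaluates Eq. (2) in the linear regime ($f(\mathbf{z})=\mathbf{z}$) for a toy three-layer hierarchy. The function and variable names are ours; this is an illustrative sketch, not part of the original implementation.

```python
import numpy as np

def pc_objective(Z, W, sigma2):
    """Eq. (2) without the constant, linear regime f(z) = z.

    Z      : list of data matrices, Z[l] of shape (d_l, T)
    W      : list of weights, W[l] maps layer l to layer l + 1
    sigma2 : per-layer variances sigma^{(l)2}, indexed for l = 1..n
    """
    loss = 0.0
    for l in range(1, len(Z)):
        pred = W[l - 1] @ Z[l - 1]
        loss += 0.5 * np.sum((Z[l] - pred) ** 2) / sigma2[l]
    return loss

rng = np.random.default_rng(0)
dims, T = [4, 3, 2], 100
Z = [rng.standard_normal((d, T)) for d in dims]
W = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
print(pc_objective(Z, W, sigma2=[None, 1.0, 1.0]))
```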
Learning.
Learning takes place in two steps. First, the values of the random variables are determined by finding the most probable configuration of the joint distribution when both the input and output layers are conditioned on the given input and output ($\mathbf{z}^{(0)}=\mathbf{x}$ and $\mathbf{z}^{(n)}=\mathbf{y}$):
$$\mathbf{z}^{(1)},\dots,\mathbf{z}^{(n-1)} = \arg\min_{\mathbf{z}^{(1)},\dots,\mathbf{z}^{(n-1)}} L\big(\mathbf{z}^{(0)}=\mathbf{x},\,\mathbf{z}^{(n)}=\mathbf{y}\big). \tag{3}$$
The solution to this minimization problem can be found via gradient descent, which we evaluate component-wise for clarity as
$$\dot{z}^{(l)}_j = -\eta\,\partial_{z^{(l)}_j}L = \eta\Big(-\varepsilon^{(l)}_j + \sum_i \varepsilon^{(l+1)}_i W^{(l)}_{ij}\, f'(z^{(l)}_j)\Big),\qquad \varepsilon^{(l)}_i = \frac{z^{(l)}_i-\mu^{(l)}_i}{\sigma^{(l)2}}, \tag{4}$$
where $\eta$ is the gradient descent step size. The second step is to minimize the objective with respect to the weights while keeping the previously obtained neuron values fixed. This corresponds to optimizing the value of the loss at the MAP estimate and can also be carried out by gradient descent. This algorithm can be implemented by a biologically plausible network as described in [4]; see Figure 1a. However, as discussed in Section 1, its mapping onto the cortex has proved controversial. For further details regarding these steps, see [4].
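For concreteness, the sketch below implements the two-step procedure of Eqs. (3)–(4) in the linear regime for a single input–output pair: the hidden activities are relaxed by gradient descent with the input and output clamped, and the weights are then updated at the resulting (approximate) MAP estimate. Step sizes, iteration counts, and names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def pc_infer_and_learn(x, y, W, sigma2, n_inf=200, eta_z=0.05, eta_w=0.01):
    """One supervised PC step in the linear regime (f(z) = z), following Eqs. (3)-(4).

    x, y   : input and target vectors (clamp z[0] = x and z[n] = y)
    W      : list of weight matrices; W[l] connects layer l to layer l + 1
    sigma2 : per-layer variances sigma2[l] for l = 1..n
    """
    n = len(W)                                   # the network has n + 1 layers
    # initialize hidden layers with a feedforward pass
    z = [x] + [None] * (n - 1) + [y]
    for l in range(1, n):
        z[l] = W[l - 1] @ z[l - 1]
    # step 1: relax hidden-layer activities by gradient descent on L, Eq. (4)
    for _ in range(n_inf):
        eps = [None] + [(z[l] - W[l - 1] @ z[l - 1]) / sigma2[l] for l in range(1, n + 1)]
        for l in range(1, n):
            z[l] = z[l] + eta_z * (-eps[l] + W[l].T @ eps[l + 1])
    # step 2: update the weights at the (approximate) MAP estimate
    for l in range(n):
        eps_next = (z[l + 1] - W[l] @ z[l]) / sigma2[l + 1]
        W[l] = W[l] + eta_w * np.outer(eps_next, z[l])
    return z, W
```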
3 A constrained predictive coding framework
In this section, we introduce and discuss our novel, covariance-constrained predictive coding (CCPC)
model within the supervised PC paradigm of [4]. Our model also straightforwardly extends to the unsupervised learning paradigm discussed in [1,2]. For simplicity, we work in the linear regime ($f(\mathbf{x})=\mathbf{x}$), which allows us to prove several properties of our framework.
3.1 Derivation of upper bound objective
Reduction to a sum of objectives.
We start by reducing the optimization problem, Eq. (2), into a set of overlapping sub-problems, which will allow us to break the symmetry between feedforward and feedback weights. To do so, we first introduce a copy of the terms containing the weights $\mathbf{W}^{(1)}$ to $\mathbf{W}^{(n-2)}$, denoted by $\mathbf{W}^{(l)}_a$ and $\mathbf{W}^{(l)}_b$ respectively, as
$$\min_{\mathbf{Z},\mathbf{W}} L = \min_{\mathbf{Z},\mathbf{W}} \frac{1}{2}\sum_{l=1}^{n} \frac{\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}\mathbf{Z}^{(l-1)}\big\|_F^2}{\sigma^{(l)2}} = \min_{\mathbf{Z},\mathbf{W}_a,\mathbf{W}_b} \hat L,$$
$$\hat L = \frac{1}{2}\frac{\big\|\mathbf{Z}^{(1)}-\mathbf{W}^{(0)}_b\mathbf{Z}^{(0)}\big\|_F^2}{\sigma^{(1)2}} + \frac{1}{2}\frac{\big\|\mathbf{Z}^{(n)}-\mathbf{W}^{(n-1)}_a\mathbf{Z}^{(n-1)}\big\|_F^2}{\sigma^{(n)2}} + \frac{1}{4}\sum_{l=2}^{n-1}\left[\frac{\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}_b\mathbf{Z}^{(l-1)}\big\|_F^2}{\sigma^{(l)2}} + \frac{\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}_a\mathbf{Z}^{(l-1)}\big\|_F^2}{\sigma^{(l)2}}\right]. \tag{5}$$
For consistency, we rename $\mathbf{W}^{(0)}$, $\mathbf{W}^{(n-1)}$ to $\mathbf{W}^{(0)}_b$, $\mathbf{W}^{(n-1)}_a$, respectively. Introducing these copies does not change the optimization² but will help us avoid weight sharing in the steps below. We now pair the terms two by two as
$$\min_{\mathbf{Z},\mathbf{W}_a,\mathbf{W}_b} \hat L = \min_{\mathbf{Z},\mathbf{W}_a,\mathbf{W}_b} \frac{1}{2}\sum_{l=1}^{n-1}\Big[g^{(l)}_b\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}_b\mathbf{Z}^{(l-1)}\big\|_F^2 + g^{(l)}_a\big\|\mathbf{Z}^{(l+1)}-\mathbf{W}^{(l)}_a\mathbf{Z}^{(l)}\big\|_F^2\Big], \tag{6}$$
with $g^{(n-1)}_a = 1/\sigma^{(n)2}$, $g^{(1)}_b = 1/\sigma^{(1)2}$, and $g^{(l-1)}_a = g^{(l)}_b = 1/(2\sigma^{(l)2})$ for $l=2,\dots,n-1$.
Weight sharing occurs here from terms like $\mathbf{z}^{(l+1)\top}\mathbf{W}\mathbf{z}^{(l)}$, obtained from expanding the squared norms in Eq. (6). Indeed, the gradient descent dynamics with respect to $\mathbf{z}^{(l+1)}$ (resp. $\mathbf{z}^{(l)}$) leads to terms of the form $\mathbf{W}\mathbf{z}^{(l)}$ in $\dot{\mathbf{z}}^{(l+1)}$ (resp. $\mathbf{W}^\top\mathbf{z}^{(l+1)}$ in $\dot{\mathbf{z}}^{(l)}$), which use the same weights $\mathbf{W}$. Thanks to the doubling of the weights, we can avoid this problem by optimizing each term in the sum in Eq. (6) separately. In other words,
$$\min_{\mathbf{Z},\mathbf{W}_a,\mathbf{W}_b} \hat L \;\le\; \sum_{l=1}^{n-1}\ \min_{\mathbf{Z}^{(l)},\,\mathbf{W}^{(l)}_a,\,\mathbf{W}^{(l-1)}_b} \hat L^{(l)}, \tag{7}$$
where $\hat L^{(l)} \equiv \frac{1}{2}\Big[g^{(l)}_b\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}_b\mathbf{Z}^{(l-1)}\big\|_F^2 + g^{(l)}_a\big\|\mathbf{Z}^{(l+1)}-\mathbf{W}^{(l)}_a\mathbf{Z}^{(l)}\big\|_F^2\Big]$.

² This can be directly verified by finding the optima for the $\mathbf{W}$s before and after the change. We have $\mathbf{W}^{(l)} = \mathbf{W}^{(l)}_a = \mathbf{W}^{(l)}_b = \mathbf{Z}^{(l+1)}\mathbf{Z}^{(l)\top}\big(\mathbf{Z}^{(l)}\mathbf{Z}^{(l)\top}\big)^{-1}$. Plugging these back into Eq. (5), we see that the equality holds. However, in the next step, since we treat the $\mathbf{W}_a$s and $\mathbf{W}_b$s differently, $\mathbf{W}_a = \mathbf{W}_b$ will no longer hold.
This inequality holds simply because we are no longer finding the minimum of the full objective $\hat L$; we are instead finding the minimum of each component separately and then evaluating $\sum_l \hat L^{(l)} = \hat L$. This splits the $(n+1)$-layer optimization problem into a set of $3$-layer optimizations, in each of which only the middle layer is being optimized. Note, however, that these are overlapping, so the different optimization problems need to be solved self-consistently. We make this precise in the supplementary materials and show that it provides an upper bound for our objective $L$ (SM Sec. A). Separating the objective function in this manner eliminates the weight sharing problem for $\mathbf{W}_b$, but the problem remains for $\mathbf{W}_a$. We address this problem in the following.
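As a quick numerical sanity check of the weight-doubling and splitting steps (ours, not from the paper), the sketch below builds a random linear hierarchy, sets $\mathbf{W}_a=\mathbf{W}_b=\mathbf{W}$, and verifies that the sum of the per-layer components $\hat L^{(l)}$ of Eq. (7), with the coefficients of Eq. (6), reproduces the original objective $L$ of Eq. (2).

```python
import numpy as np

rng = np.random.default_rng(1)
dims, T = [5, 4, 3, 2], 200                          # n + 1 = 4 layers, so n = 3
n = len(dims) - 1
sigma2 = [None, 1.0, 0.5, 2.0]                       # sigma^{(l)2} for l = 1..n
Z = [rng.standard_normal((d, T)) for d in dims]
# set W_a = W_b = W; here W is the least-squares optimum of footnote 2, but any common value works
W = [Z[l + 1] @ Z[l].T @ np.linalg.inv(Z[l] @ Z[l].T) for l in range(n)]
sq = lambda A: float(np.sum(A ** 2))

# original objective L of Eq. (2), linear regime
L = 0.5 * sum(sq(Z[l] - W[l - 1] @ Z[l - 1]) / sigma2[l] for l in range(1, n + 1))

# coefficients g_a, g_b of Eq. (6)
g_a = {n - 1: 1.0 / sigma2[n]}
g_b = {1: 1.0 / sigma2[1]}
for l in range(2, n):
    g_a[l - 1] = g_b[l] = 1.0 / (2.0 * sigma2[l])

# per-layer components L_hat^(l) of Eq. (7); their sum recovers L_hat = L
L_hat_l = [0.5 * (g_b[l] * sq(Z[l] - W[l - 1] @ Z[l - 1])
                  + g_a[l] * sq(Z[l + 1] - W[l] @ Z[l])) for l in range(1, n)]
print(np.isclose(L, sum(L_hat_l)))                   # True
```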
Whitening constraint.
The idea of decorrelating internal representations has been widely used for unsupervised tasks, often motivated by neuroscience [36–38]. In the case of deep learning, the main motivations were improved convergence speed and generalization [39–41]. Decorrelation has also been used to circumvent the weight transport problem [55]. Inspired by these observations, we introduce the constraint $\frac{1}{T}\mathbf{Z}^{(l)}\mathbf{Z}^{(l)\top} \preceq \mathbf{I}$, imposing an upper bound on the eigenvalues of the covariance matrix. The inequality can be implemented by using a positive-definite Lagrange multiplier $\mathbf{Q}^{(l)\top}\mathbf{Q}^{(l)}$ (for details see SM Sec. E):
$$\min \hat L^{(l)} \le \min_{\mathbf{Z}^{(l)},\mathbf{W}^{(l)}_a,\mathbf{W}^{(l-1)}_b}\ \max_{\mathbf{Q}^{(l)}}\ \frac{1}{2}\Big[g^{(l)}_b\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}_b\mathbf{Z}^{(l-1)}\big\|_F^2 + g^{(l)}_a T\,\mathrm{Tr}\,\mathbf{W}^{(l)\top}_a\mathbf{W}^{(l)}_a - 2 g^{(l)}_a\,\mathrm{Tr}\,\mathbf{Z}^{(l+1)\top}\mathbf{W}^{(l)}_a\mathbf{Z}^{(l)} + \mathrm{Tr}\,\mathbf{Q}^{(l)\top}\mathbf{Q}^{(l)}\big(\mathbf{Z}^{(l)}\mathbf{Z}^{(l)\top} - T\,\mathbf{I}\big) + c^{(l)}\big\|\mathbf{Z}^{(l)}\big\|_F^2\Big], \tag{8}$$
where we have added an additional quadratic term in $\mathbf{Z}$ as a regularizer.
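The role of the constraint in producing an upper bound can be illustrated numerically: whenever $\frac{1}{T}\mathbf{Z}\mathbf{Z}^\top \preceq \mathbf{I}$, the term quadratic in $\mathbf{Z}$ obtained from expanding the prediction error in Eq. (7) is bounded by the weight-norm term appearing in Eq. (8). The sketch below checks this inequality on random data; it reflects our reading of Eq. (8), and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, T = 6, 4, 500

# build a hidden representation whose covariance satisfies (1/T) Z Z^T <= I
Z = rng.standard_normal((d, T))
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
s = np.minimum(s, 0.9 * np.sqrt(T))      # cap singular values so all covariance eigenvalues are below 1
Z = U @ np.diag(s) @ Vt

W = rng.standard_normal((k, d))          # stands in for W_a^(l)
lhs = np.trace(W @ Z @ Z.T @ W.T)        # quadratic-in-Z term from expanding the prediction error
rhs = T * np.trace(W.T @ W)              # the weight-norm term that replaces it in Eq. (8)
print(lhs <= rhs)                        # True whenever the covariance constraint holds
```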
3.2 Neural dynamics and learning rules
Similar to the PCN of [4], the dynamics of our network during learning proceeds in two steps. First, the neural dynamics is derived by taking gradient steps of $\hat L^{(l)}$ from Eq. (8) with respect to $\mathbf{z}^{(l)}$:
$$\dot{\mathbf{z}}^{(l)} = g^{(l)}_b\mathbf{W}^{(l-1)}_b\mathbf{z}^{(l-1)} + g^{(l)}_a\mathbf{W}^{(l)\top}_a\mathbf{z}^{(l+1)} - \big(g^{(l)}_b + c^{(l)}\big)\mathbf{z}^{(l)} - g^{(l)}_a\mathbf{Q}^{(l)\top}\mathbf{n}^{(l)}, \tag{9}$$
where we have defined the variables $\mathbf{n}^{(l)} = (1/g^{(l)}_a)\,\mathbf{Q}^{(l)}\mathbf{z}^{(l)}$ for each layer of the network.
The weight updates are derived via stochastic gradient descent of the loss given in Eq. (8) after the neural dynamics have reached equilibrium. These are given by
$$\delta\mathbf{W}^{(l)}_b \propto \Big[g^{(l+1)}_a\mathbf{W}^{(l+1)\top}_a\mathbf{z}^{(l+2)} - \mathbf{Q}^{(l+1)\top}\mathbf{n}^{(l+1)} - c^{(l+1)}\mathbf{z}^{(l+1)}\Big]\mathbf{z}^{(l)\top}, \tag{10a}$$
$$\delta\mathbf{W}^{(l)}_a \propto \mathbf{z}^{(l+1)}\mathbf{z}^{(l)\top} - \mathbf{W}^{(l)}_a, \tag{10b}$$
$$\delta\mathbf{Q}^{(l)} \propto \mathbf{n}^{(l)}\mathbf{z}^{(l)\top} - \mathbf{Q}^{(l)}. \tag{10c}$$
We used the neural dynamics equilibrium equation for $\mathbf{z}^{(l)}$ to simplify the weight update for $\mathbf{W}^{(l)}_b$. This yields our online algorithm (Alg. 1), with the architecture shown in Fig. 1b. The algorithm can be implemented in a biologically plausible neural network as in Fig. 2; see Sec. 4.
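To summarize the online algorithm, the sketch below runs one BioCCPC step for a network with a single hidden layer: the neural dynamics of Eq. (9) are iterated to approximate equilibrium, after which the synaptic updates of Eqs. (10a)–(10c) are applied. This is a schematic reading of Alg. 1 rather than the authors' implementation; learning rates, iteration counts, and names are ours.

```python
import numpy as np

def bioccpc_step(x, y, Wb, Wa, Q, g_b, g_a, c, n_dyn=200, dt=0.05, lr=0.01):
    """One online BioCCPC step for a single hidden layer (layers 0, 1, 2).

    x  : input (layer 0),  y : supervision (layer 2)
    Wb : feedforward weights (layer 0 -> 1)
    Wa : feedback/prediction weights (layer 1 -> 2)
    Q  : interneuron weights enforcing the covariance constraint on layer 1
    """
    z = Wb @ x                                        # initialize the hidden layer
    for _ in range(n_dyn):                            # neural dynamics, Eq. (9)
        n_int = (1.0 / g_a) * (Q @ z)                 # interneuron activity n = Q z / g_a
        z = z + dt * (g_b * (Wb @ x) + g_a * (Wa.T @ y)
                      - (g_b + c) * z - g_a * (Q.T @ n_int))
    n_int = (1.0 / g_a) * (Q @ z)
    # synaptic updates at equilibrium, Eqs. (10a)-(10c)
    topdown = g_a * (Wa.T @ y) - Q.T @ n_int - c * z  # top-down drive gating the feedforward update
    Wb = Wb + lr * np.outer(topdown, x)               # non-Hebbian: presynaptic x paired with top-down signal
    Wa = Wa + lr * (np.outer(y, z) - Wa)              # Hebbian update with decay
    Q = Q + lr * (np.outer(n_int, z) - Q)             # interneuron plasticity
    return Wb, Wa, Q
```

Note that the $\mathbf{W}_b$ update is non-Hebbian: it pairs presynaptic activity $\mathbf{z}^{(l)}$ with a top-down signal rather than with the postsynaptic rate, in line with the calcium-plateau-driven basal plasticity discussed in Section 1.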