Constrained Predictive Coding as a Biologically Plausible Model of the Cortical Hierarchy

Siavash Golkar¹, Tiberiu Tesileanu¹, Yanis Bahroun¹, Anirvan M. Sengupta²,³,⁴, Dmitri Chklovskii¹,⁵
¹Center for Computational Neuroscience, Flatiron Institute
²Center for Computational Mathematics, Flatiron Institute
³Center for Computational Quantum Physics, Flatiron Institute
⁴Department of Physics and Astronomy, Rutgers University
⁵Neuroscience Institute, NYU Medical Center
{sgolkar,ttesileanu,ybahroun,dchklovskii}@flatironinstitute.org
anirvans.physics@gmail.com
Abstract
Predictive coding (PC) has emerged as an influential normative model of neural
computation, with numerous extensions and applications. As such, much effort has
been put into mapping PC faithfully onto the cortex, but there are issues that remain
unresolved or controversial. In particular, current implementations often involve
separate value and error neurons and require symmetric forward and backward
weights across different brain regions. These features have not been experimentally
confirmed. In this work, we show that the PC framework in the linear regime can
be modified to map faithfully onto the cortical hierarchy in a manner compatible
with empirical observations. By employing a disentangling-inspired constraint
on hidden-layer neural activities, we derive an upper bound for the PC objective.
Optimization of this upper bound leads to an algorithm that shows the same performance as the original objective and maps onto a biologically plausible network.
The units of this network can be interpreted as multi-compartmental neurons with
non-Hebbian learning rules, with a remarkable resemblance to recent experimental
findings. There exist prior models which also capture these features, but they
are phenomenological, while our work is a normative derivation. Notably, the
network we derive does not involve one-to-one connectivity or signal multiplexing,
which the phenomenological models required, indicating that these features are not
necessary for learning in the cortex. The normative nature of our algorithm in the
simplified linear case allows us to prove interesting properties of the framework and
analytically understand the computational role of our network’s components. The
parameters of our network have natural interpretations as physiological quantities
in a multi-compartmental model of pyramidal neurons, providing a concrete link
between PC and experimental measurements carried out in the cortex.
1 Introduction
Over the past decades, predictive coding (PC), a normative framework for learning representations
that maximize predictive power, has played an important role in computational neuroscience [1,2].
Initially proposed as an unsupervised learning paradigm in the retina [3], it has since been expanded
to the supervised regime [4] with arbitrary graph topologies [5,6]. The PC framework has been
analyzed in many contexts [7,8] and has found many applications, from clinical neuroscience [9,10]
to memory storage and retrieval [11]. We refer the reader to [12,13] for recent reviews.
Figure 1: Schematic architecture of the predictive coding network (PCN) and our covariance-constrained network (BioCCPC). (a) PCN from [4], figure adapted from [13]; the network contains separate value and error neurons, the intra-layer connectivity is required to be one-to-one, and the inter-layer connectivity is symmetric. (b) Our BioCCPC network, with interneurons mediating within-layer interactions; there is no requirement for symmetric weights across layers, and the connectivity within layers can be diffuse.
Predictive coding is viewed as a possible theory of cortical computation, and many parallels have
been drawn with the known neurophysiology of cortex [12]. While the initial works proposed a
biologically plausible network [1,2], the connection with cortex was more closely examined in [14],
where the PC module was mapped onto a cortical-column microcircuit. However, there are aspects of
this mapping that have proved controversial. Among these are the requirement of multiple redundant
cortical operations, the symmetric connectivity pattern, the one-to-one connectivity of value and error
neurons, and the requirement that feedback connections be inhibitory [12,15,16], as sketched in Fig. 1a. The presence of separate error and value neurons has itself been called into question [15].
The PC-based neural circuits also do not account for more recent experimental findings highlighting
the details of computation in the cortex [17–26]. For example, the learning dynamics of multi-compartmental pyramidal neurons have been closely investigated.
the plasticity of the synapses in the basal compartment is driven by activity in the apical tuft of the neuron via so-called calcium plateau potentials [19,20,24], leading to non-Hebbian learning
rules [25]. These experiments have motivated the development of several models of microcircuits with
multi-compartmental neurons [27–33]. In a number of cases, it has been shown that these models can
replicate learning similar to the backpropagation algorithm under specific assumptions [30,34,35].
However, because of their rather phenomenological nature, detailed analysis of these models is
in many cases challenging and one must resort to numerical simulations, rendering the task of
understanding the role of various neurophysiological quantities difficult. To date, a normative framework (PC or otherwise) that can explain these experimental findings is still lacking.
In this work, we show how the PC framework can be made compatible with the aforementioned
experimental observations. Inspired by prior work which explored the effects of finding decorrelated
representations [36–41], we add a decorrelating inequality constraint on the covariance of the latent representations of PC. Using this constraint, we derive an upper bound for the PC objective. By
working in the linear regime, we can prove interesting properties of our algorithm. We show
that the learning algorithm derived from this upper bound does not suffer from the issues of prior
implementations and naturally maps onto the hierarchical structure of the cortical pyramidal neurons.
Contributions
- By imposing a decorrelation-inspired inequality constraint on the latent space, we find an analytic upper bound to the PC objective.
- We introduce BioCCPC, a novel biologically plausible algorithm derived from this upper bound, and show that it closely matches the known physiology of the cortical hierarchy.
- We interpret the different parameters of the algorithm in terms of the conductances and leaks of the separate compartments of the pyramidal neuron. We find that the neural compartmental conductances encode the variances of the PC framework, and the somatic leak maps onto a thresholding mechanism of the associated eigenvalue problem.
2 Related work and review of predictive coding
The backpropagation algorithm [34,42] is the predominant tool in the training of deep neural
networks but is generally not considered biologically plausible [43]. Over the years, many authors
have explored biologically plausible approximations or alternatives to backpropagation [30,31,44–58] (for a more complete review see [59]). These approaches generally fall into two categories. First are normative approaches, such as Predictive Coding [1–3], Target Propagation and variations [44,45,47], Equilibrium Propagation [49,53] and others [51,54–56], where one starts from a mathematically
motivated optimization objective. These methods, by virtue of their normative derivation, have a
firm grounding in theory; however, they do not fully conform to the experimental observations in
the brain (see below and Sec. 4). The second approach is driven by biology, with network structures
and learning rules inspired by experimental findings [28,30,31,48,60]. While these works mostly conform to experimental observations, they are more challenging to analyze because of their conjectured, phenomenological nature.
The goal of the present paper is not to propose yet another biologically plausible alternative to
backpropagation. It is rather to demonstrate that the normative framework of predictive coding,
when combined with a constraint, can indeed closely match experimental observations. For this
reason, in this work, we focus on comparing our method with previous implementations of predictive
coding and do not concern ourselves with other biologically plausible alternatives to backpropagation.
The relationship between the PC framework and backpropagation was explored in [4,61–64]. The advantages of PC over backpropagation were highlighted in [13].
Notation.
Bold upper case $\mathbf{M}$ and lower case $\mathbf{v}$ variables denote matrices and vectors, respectively. By upper case letters $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$, we denote data matrices in $\mathbb{R}^{d\times T}$, where $d$ and $T$ are the dimension of the relevant variable and the number of samples. Lower case $\mathbf{x}, \mathbf{y}, \mathbf{z}$ denote the corresponding quantities for a single sample, and $\mathbf{x}_t, \mathbf{y}_t, \mathbf{z}_t$ denote the $t$-th sample. $\|\mathbf{M}\|_F^2$ denotes the squared Frobenius norm of $\mathbf{M}$.
2.1 Review of predictive coding
Probabilistic model.
In this section, we review the supervised predictive coding algorithm [4]. The derivation starts from a probabilistic model for supervised learning, which parallels the architecture of an artificial neural network (ANN) with $n+1$ layers. In this model, the neurons of each layer are random variables $\mathbf{z}^{(l)}$ (denoting the vector of activations in the $l$-th layer), with layers $0$ and $n$, respectively, denoting the input and output layers of the network. We assume that the joint probability of the latent variables factorizes in a Markovian structure
$$p(\mathbf{z}^{(0)},\mathbf{z}^{(1)},\dots,\mathbf{z}^{(n)}) = p(\mathbf{z}^{(n)}\mid\mathbf{z}^{(n-1)})\,p(\mathbf{z}^{(n-1)}\mid\mathbf{z}^{(n-2)})\cdots p(\mathbf{z}^{(0)}),$$
with the relationship between the random variables of adjacent layers given by
$$p\big(\mathbf{z}^{(l)}\mid\mathbf{z}^{(l-1)}\big)=\mathcal{N}\big(\mathbf{z}^{(l)};\,\boldsymbol{\mu}^{(l)},\boldsymbol{\Sigma}^{(l)}\big),\qquad \boldsymbol{\mu}^{(l)}=\mathbf{W}^{(l-1)}f(\mathbf{z}^{(l-1)}),\quad \boldsymbol{\Sigma}^{(l)}=\sigma^{(l)2}\,\mathbf{I}. \tag{1}$$
The mean of the probability density on layer $l$ mirrors the activity of the analogous ANN, given by $\boldsymbol{\mu}^{(l)}=\mathbf{W}^{(l-1)}f(\mathbf{z}^{(l-1)})$, where $\mathbf{W}^{(l-1)}$ are the weights connecting layers $l-1$ and $l$. The objective function is then given by the negative log-likelihood of the joint distribution:
$$L=-\sum_t \log p(\mathbf{z}^{(0)}_t,\dots,\mathbf{z}^{(n)}_t) = \frac{1}{2}\sum_l \frac{\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}f(\mathbf{Z}^{(l-1)})\big\|_F^2}{\sigma^{(l)2}} + \text{const}, \tag{2}$$
where we have switched to the data matrix notation for brevity and assumed that the variances $\sigma^{(l)2}$ are fixed hyperparameters. In the following, we refer to $L$, Eq. (2), without the constant term.
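To make the objective concrete, the following NumPy sketch evaluates Eq. (2) in the linear regime ($f(\mathbf{z})=\mathbf{z}$) for a toy three-layer hierarchy. The function and variable names are ours; this is an illustrative sketch, not part of the original implementation.

```python
import numpy as np

def pc_objective(Z, W, sigma2):
    """Eq. (2) without the constant, linear regime f(z) = z.

    Z      : list of data matrices, Z[l] of shape (d_l, T)
    W      : list of weights, W[l] maps layer l to layer l + 1
    sigma2 : per-layer variances sigma^{(l)2}, indexed for l = 1..n
    """
    loss = 0.0
    for l in range(1, len(Z)):
        pred = W[l - 1] @ Z[l - 1]
        loss += 0.5 * np.sum((Z[l] - pred) ** 2) / sigma2[l]
    return loss

rng = np.random.default_rng(0)
dims, T = [4, 3, 2], 100
Z = [rng.standard_normal((d, T)) for d in dims]
W = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
print(pc_objective(Z, W, sigma2=[None, 1.0, 1.0]))
```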
Learning.
Learning takes place in two steps. First, the values of the random variables are determined by finding the most probable configuration of the joint distribution when both the input and output layers are conditioned on the given input and output ($\mathbf{z}^{(0)}=\mathbf{x}$ and $\mathbf{z}^{(n)}=\mathbf{y}$):
$$\mathbf{z}^{(1)},\dots,\mathbf{z}^{(n-1)} = \arg\min_{\mathbf{z}^{(1)},\dots,\mathbf{z}^{(n-1)}} L\big(\mathbf{z}^{(0)}=\mathbf{x},\,\mathbf{z}^{(n)}=\mathbf{y}\big). \tag{3}$$
The solution to this minimization problem can be found via gradient descent, which we evaluate component-wise for clarity as
$$\dot{z}^{(l)}_j = -\eta\,\partial_{z^{(l)}_j}L = \eta\Big(-\varepsilon^{(l)}_j + \sum_i \varepsilon^{(l+1)}_i W^{(l)}_{ij}\, f'(z^{(l)}_j)\Big),\qquad \varepsilon^{(l)}_i = \frac{z^{(l)}_i-\mu^{(l)}_i}{\sigma^{(l)2}}, \tag{4}$$
where $\eta$ is the gradient descent step size. The second step is to minimize the objective with respect to the weights while keeping the previously obtained neuron values fixed. This corresponds to optimizing the value of the loss at the MAP estimate and can also be carried out by gradient descent. This algorithm can be implemented by a biologically plausible network as described in [4]; see Figure 1a. However, as discussed in Section 1, its mapping onto the cortex has proved controversial. For further details regarding these steps, see [4].
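For concreteness, the sketch below implements the two-step procedure of Eqs. (3)–(4) in the linear regime for a single input–output pair: the hidden activities are relaxed by gradient descent with the input and output clamped, and the weights are then updated at the resulting (approximate) MAP estimate. Step sizes, iteration counts, and names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def pc_infer_and_learn(x, y, W, sigma2, n_inf=200, eta_z=0.05, eta_w=0.01):
    """One supervised PC step in the linear regime (f(z) = z), following Eqs. (3)-(4).

    x, y   : input and target vectors (clamp z[0] = x and z[n] = y)
    W      : list of weight matrices; W[l] connects layer l to layer l + 1
    sigma2 : per-layer variances sigma2[l] for l = 1..n
    """
    n = len(W)                                   # the network has n + 1 layers
    # initialize hidden layers with a feedforward pass
    z = [x] + [None] * (n - 1) + [y]
    for l in range(1, n):
        z[l] = W[l - 1] @ z[l - 1]
    # step 1: relax hidden-layer activities by gradient descent on L, Eq. (4)
    for _ in range(n_inf):
        eps = [None] + [(z[l] - W[l - 1] @ z[l - 1]) / sigma2[l] for l in range(1, n + 1)]
        for l in range(1, n):
            z[l] = z[l] + eta_z * (-eps[l] + W[l].T @ eps[l + 1])
    # step 2: update the weights at the (approximate) MAP estimate
    for l in range(n):
        eps_next = (z[l + 1] - W[l] @ z[l]) / sigma2[l + 1]
        W[l] = W[l] + eta_w * np.outer(eps_next, z[l])
    return z, W
```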
3 A constrained predictive coding framework
In this section, we introduce and discuss our novel, covariance-constrained predictive coding (CCPC)
model within the supervised PC paradigm of [4]. Our model also straightforwardly extends to the unsupervised learning paradigm discussed in [1,2]. For simplicity, we work in the linear regime ($f(\mathbf{x})=\mathbf{x}$), which allows us to prove several properties of our framework.
3.1 Derivation of upper bound objective
Reduction to a sum of objectives.
We start by reducing the optimization problem, Eq. (2), into a set of overlapping sub-problems, which will allow us to break the symmetry between feedforward and feedback weights. To do so, we first introduce a copy of the terms containing the weights $\mathbf{W}^{(1)}$ to $\mathbf{W}^{(n-2)}$, denoted by $\mathbf{W}^{(l)}_a$ and $\mathbf{W}^{(l)}_b$ respectively, as
$$\min_{\mathbf{Z},\mathbf{W}} L = \min_{\mathbf{Z},\mathbf{W}} \frac{1}{2}\sum_{l=1}^{n} \frac{\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}\mathbf{Z}^{(l-1)}\big\|_F^2}{\sigma^{(l)2}} = \min_{\mathbf{Z},\mathbf{W}_a,\mathbf{W}_b} \hat L,$$
$$\hat L = \frac{1}{2}\frac{\big\|\mathbf{Z}^{(1)}-\mathbf{W}^{(0)}_b\mathbf{Z}^{(0)}\big\|_F^2}{\sigma^{(1)2}} + \frac{1}{2}\frac{\big\|\mathbf{Z}^{(n)}-\mathbf{W}^{(n-1)}_a\mathbf{Z}^{(n-1)}\big\|_F^2}{\sigma^{(n)2}} + \frac{1}{4}\sum_{l=2}^{n-1}\left[\frac{\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}_b\mathbf{Z}^{(l-1)}\big\|_F^2}{\sigma^{(l)2}} + \frac{\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}_a\mathbf{Z}^{(l-1)}\big\|_F^2}{\sigma^{(l)2}}\right]. \tag{5}$$
For consistency, we rename $\mathbf{W}^{(0)}$, $\mathbf{W}^{(n-1)}$ to $\mathbf{W}^{(0)}_b$, $\mathbf{W}^{(n-1)}_a$, respectively. Introducing these copies does not change the optimization² but will help us avoid weight sharing in the steps below. We now pair the terms two by two as
$$\min_{\mathbf{Z},\mathbf{W}_a,\mathbf{W}_b} \hat L = \min_{\mathbf{Z},\mathbf{W}_a,\mathbf{W}_b} \frac{1}{2}\sum_{l=1}^{n-1}\Big[g^{(l)}_b\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}_b\mathbf{Z}^{(l-1)}\big\|_F^2 + g^{(l)}_a\big\|\mathbf{Z}^{(l+1)}-\mathbf{W}^{(l)}_a\mathbf{Z}^{(l)}\big\|_F^2\Big], \tag{6}$$
with $g^{(n-1)}_a = 1/\sigma^{(n)2}$, $g^{(1)}_b = 1/\sigma^{(1)2}$, and $g^{(l-1)}_a = g^{(l)}_b = 1/(2\sigma^{(l)2})$ for $l=2,\dots,n-1$.
Weight sharing occurs here from terms like $\mathbf{z}^{(l+1)\top}\mathbf{W}\mathbf{z}^{(l)}$, obtained from expanding the squared norms in Eq. (6). Indeed, the gradient descent dynamics with respect to $\mathbf{z}^{(l+1)}$ (resp. $\mathbf{z}^{(l)}$) leads to terms of the form $\mathbf{W}\mathbf{z}^{(l)}$ in $\dot{\mathbf{z}}^{(l+1)}$ (resp. $\mathbf{W}^\top\mathbf{z}^{(l+1)}$ in $\dot{\mathbf{z}}^{(l)}$), which use the same weights $\mathbf{W}$. Thanks to the doubling of the weights, we can avoid this problem by optimizing each term in the sum in Eq. (6) separately. In other words,
$$\min_{\mathbf{Z},\mathbf{W}_a,\mathbf{W}_b} \hat L \;\le\; \sum_{l=1}^{n-1}\ \min_{\mathbf{Z}^{(l)},\,\mathbf{W}^{(l)}_a,\,\mathbf{W}^{(l-1)}_b} \hat L^{(l)}, \tag{7}$$
where $\hat L^{(l)} \equiv \frac{1}{2}\Big[g^{(l)}_b\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}_b\mathbf{Z}^{(l-1)}\big\|_F^2 + g^{(l)}_a\big\|\mathbf{Z}^{(l+1)}-\mathbf{W}^{(l)}_a\mathbf{Z}^{(l)}\big\|_F^2\Big]$.

² This can be directly verified by finding the optima for the $\mathbf{W}$s before and after the change. We have $\mathbf{W}^{(l)} = \mathbf{W}^{(l)}_a = \mathbf{W}^{(l)}_b = \mathbf{Z}^{(l+1)}\mathbf{Z}^{(l)\top}\big(\mathbf{Z}^{(l)}\mathbf{Z}^{(l)\top}\big)^{-1}$. Plugging these back into Eq. (5), we see that the equality holds. However, in the next step, since we treat the $\mathbf{W}_a$s and $\mathbf{W}_b$s differently, $\mathbf{W}_a = \mathbf{W}_b$ will no longer hold.
This inequality holds simply because we are no longer finding the minimum of the full objective $\hat L$; we are instead finding the minimum of each component separately and then evaluating $\sum_l \hat L^{(l)} = \hat L$. This splits the $(n+1)$-layer optimization problem into a set of $3$-layer optimizations, in each of which only the middle layer is being optimized. Note, however, that these are overlapping, so the different optimization problems need to be solved self-consistently. We make this precise in the supplementary materials and show that it provides an upper bound for our objective $L$ (SM Sec. A). Separating the objective function in this manner eliminates the weight sharing problem for $\mathbf{W}_b$, but the problem remains for $\mathbf{W}_a$. We address this problem in the following.
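As a quick numerical sanity check of the weight-doubling and splitting steps (ours, not from the paper), the sketch below builds a random linear hierarchy, sets $\mathbf{W}_a=\mathbf{W}_b=\mathbf{W}$, and verifies that the sum of the per-layer components $\hat L^{(l)}$ of Eq. (7), with the coefficients of Eq. (6), reproduces the original objective $L$ of Eq. (2).

```python
import numpy as np

rng = np.random.default_rng(1)
dims, T = [5, 4, 3, 2], 200                          # n + 1 = 4 layers, so n = 3
n = len(dims) - 1
sigma2 = [None, 1.0, 0.5, 2.0]                       # sigma^{(l)2} for l = 1..n
Z = [rng.standard_normal((d, T)) for d in dims]
# set W_a = W_b = W; here W is the least-squares optimum of footnote 2, but any common value works
W = [Z[l + 1] @ Z[l].T @ np.linalg.inv(Z[l] @ Z[l].T) for l in range(n)]
sq = lambda A: float(np.sum(A ** 2))

# original objective L of Eq. (2), linear regime
L = 0.5 * sum(sq(Z[l] - W[l - 1] @ Z[l - 1]) / sigma2[l] for l in range(1, n + 1))

# coefficients g_a, g_b of Eq. (6)
g_a = {n - 1: 1.0 / sigma2[n]}
g_b = {1: 1.0 / sigma2[1]}
for l in range(2, n):
    g_a[l - 1] = g_b[l] = 1.0 / (2.0 * sigma2[l])

# per-layer components L_hat^(l) of Eq. (7); their sum recovers L_hat = L
L_hat_l = [0.5 * (g_b[l] * sq(Z[l] - W[l - 1] @ Z[l - 1])
                  + g_a[l] * sq(Z[l + 1] - W[l] @ Z[l])) for l in range(1, n)]
print(np.isclose(L, sum(L_hat_l)))                   # True
```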
Whitening constraint.
The idea of decorrelating internal representations has been widely used for unsupervised tasks, often motivated by neuroscience [36–38]. In the case of deep learning, the main motivations were improved convergence speed and generalization [39–41]. Decorrelation has also been used to circumvent the weight transport problem [55]. Inspired by these observations, we introduce the constraint $\frac{1}{T}\mathbf{Z}^{(l)}\mathbf{Z}^{(l)\top} \preceq \mathbf{I}$, imposing an upper bound on the eigenvalues of the covariance matrix. The inequality can be implemented by using a positive-definite Lagrange multiplier $\mathbf{Q}^{(l)\top}\mathbf{Q}^{(l)}$ (for details see SM Sec. E):
$$\min \hat L^{(l)} \le \min_{\mathbf{Z}^{(l)},\mathbf{W}^{(l)}_a,\mathbf{W}^{(l-1)}_b}\ \max_{\mathbf{Q}^{(l)}}\ \frac{1}{2}\Big[g^{(l)}_b\big\|\mathbf{Z}^{(l)}-\mathbf{W}^{(l-1)}_b\mathbf{Z}^{(l-1)}\big\|_F^2 + g^{(l)}_a T\,\mathrm{Tr}\,\mathbf{W}^{(l)\top}_a\mathbf{W}^{(l)}_a - 2 g^{(l)}_a\,\mathrm{Tr}\,\mathbf{Z}^{(l+1)\top}\mathbf{W}^{(l)}_a\mathbf{Z}^{(l)} + \mathrm{Tr}\,\mathbf{Q}^{(l)\top}\mathbf{Q}^{(l)}\big(\mathbf{Z}^{(l)}\mathbf{Z}^{(l)\top} - T\,\mathbf{I}\big) + c^{(l)}\big\|\mathbf{Z}^{(l)}\big\|_F^2\Big], \tag{8}$$
where we have added an additional quadratic term in $\mathbf{Z}$ as a regularizer.
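The role of the constraint in producing an upper bound can be illustrated numerically: whenever $\frac{1}{T}\mathbf{Z}\mathbf{Z}^\top \preceq \mathbf{I}$, the term quadratic in $\mathbf{Z}$ obtained from expanding the prediction error in Eq. (7) is bounded by the weight-norm term appearing in Eq. (8). The sketch below checks this inequality on random data; it reflects our reading of Eq. (8), and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, T = 6, 4, 500

# build a hidden representation whose covariance satisfies (1/T) Z Z^T <= I
Z = rng.standard_normal((d, T))
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
s = np.minimum(s, 0.9 * np.sqrt(T))      # cap singular values so all covariance eigenvalues are below 1
Z = U @ np.diag(s) @ Vt

W = rng.standard_normal((k, d))          # stands in for W_a^(l)
lhs = np.trace(W @ Z @ Z.T @ W.T)        # quadratic-in-Z term from expanding the prediction error
rhs = T * np.trace(W.T @ W)              # the weight-norm term that replaces it in Eq. (8)
print(lhs <= rhs)                        # True whenever the covariance constraint holds
```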
3.2 Neural dynamics and learning rules
Similar to the PCN of [4], the dynamics of our network during learning proceeds in two steps. First, the neural dynamics is derived by taking gradient steps of $\hat L^{(l)}$ from Eq. (8) with respect to $\mathbf{z}^{(l)}$:
$$\dot{\mathbf{z}}^{(l)} = g^{(l)}_b\mathbf{W}^{(l-1)}_b\mathbf{z}^{(l-1)} + g^{(l)}_a\mathbf{W}^{(l)\top}_a\mathbf{z}^{(l+1)} - \big(g^{(l)}_b + c^{(l)}\big)\mathbf{z}^{(l)} - g^{(l)}_a\mathbf{Q}^{(l)\top}\mathbf{n}^{(l)}, \tag{9}$$
where we have defined the variables $\mathbf{n}^{(l)} = (1/g^{(l)}_a)\,\mathbf{Q}^{(l)}\mathbf{z}^{(l)}$ for each layer of the network.
The weight updates are derived via stochastic gradient descent of the loss given in Eq. (8) after the neural dynamics have reached equilibrium. These are given by
$$\delta\mathbf{W}^{(l)}_b \propto \Big[g^{(l+1)}_a\mathbf{W}^{(l+1)\top}_a\mathbf{z}^{(l+2)} - \mathbf{Q}^{(l+1)\top}\mathbf{n}^{(l+1)} - c^{(l+1)}\mathbf{z}^{(l+1)}\Big]\mathbf{z}^{(l)\top}, \tag{10a}$$
$$\delta\mathbf{W}^{(l)}_a \propto \mathbf{z}^{(l+1)}\mathbf{z}^{(l)\top} - \mathbf{W}^{(l)}_a, \tag{10b}$$
$$\delta\mathbf{Q}^{(l)} \propto \mathbf{n}^{(l)}\mathbf{z}^{(l)\top} - \mathbf{Q}^{(l)}. \tag{10c}$$
We used the neural dynamics equilibrium equation for $\mathbf{z}^{(l)}$ to simplify the weight update for $\mathbf{W}^{(l)}_b$. This yields our online algorithm (Alg. 1), with the architecture shown in Fig. 1b. The algorithm can be implemented in a biologically plausible neural network as in Fig. 2; see Sec. 4.
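To summarize the online algorithm, the sketch below runs one BioCCPC step for a network with a single hidden layer: the neural dynamics of Eq. (9) are iterated to approximate equilibrium, after which the synaptic updates of Eqs. (10a)–(10c) are applied. This is a schematic reading of Alg. 1 rather than the authors' implementation; learning rates, iteration counts, and names are ours.

```python
import numpy as np

def bioccpc_step(x, y, Wb, Wa, Q, g_b, g_a, c, n_dyn=200, dt=0.05, lr=0.01):
    """One online BioCCPC step for a single hidden layer (layers 0, 1, 2).

    x  : input (layer 0),  y : supervision (layer 2)
    Wb : feedforward weights (layer 0 -> 1)
    Wa : feedback/prediction weights (layer 1 -> 2)
    Q  : interneuron weights enforcing the covariance constraint on layer 1
    """
    z = Wb @ x                                        # initialize the hidden layer
    for _ in range(n_dyn):                            # neural dynamics, Eq. (9)
        n_int = (1.0 / g_a) * (Q @ z)                 # interneuron activity n = Q z / g_a
        z = z + dt * (g_b * (Wb @ x) + g_a * (Wa.T @ y)
                      - (g_b + c) * z - g_a * (Q.T @ n_int))
    n_int = (1.0 / g_a) * (Q @ z)
    # synaptic updates at equilibrium, Eqs. (10a)-(10c)
    topdown = g_a * (Wa.T @ y) - Q.T @ n_int - c * z  # top-down drive gating the feedforward update
    Wb = Wb + lr * np.outer(topdown, x)               # non-Hebbian: presynaptic x paired with top-down signal
    Wa = Wa + lr * (np.outer(y, z) - Wa)              # Hebbian update with decay
    Q = Q + lr * (np.outer(n_int, z) - Q)             # interneuron plasticity
    return Wb, Wa, Q
```

Note that the $\mathbf{W}_b$ update is non-Hebbian: it pairs presynaptic activity $\mathbf{z}^{(l)}$ with a top-down signal rather than with the postsynaptic rate, in line with the calcium-plateau-driven basal plasticity discussed in Section 1.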