Polysemanticity and Capacity in Neural Networks
Adam Scherlis1, Kshitij Sachan1, Adam S. Jermyn2, Joe Benton, Buck Shlegeris1
1Redwood Research
2Flatiron Institute
Abstract
Individual neurons in neural networks often represent a mixture of unrelated features. This
phenomenon, called polysemanticity, can make interpreting neural networks more difficult
and so we aim to understand its causes. We propose doing so through the lens of feature
capacity, which is the fractional dimension each feature consumes in the embedding space.
We show that in a toy model the optimal capacity allocation tends to monosemantically
represent the most important features, polysemantically represent less important features
(in proportion to their impact on the loss), and entirely ignore the least important features.
Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity and
more prevalent in some architectures than others. Given an optimal allocation of capacity,
we go on to study the geometry of the embedding space. We find a block-semi-orthogonal
structure, with differing block sizes in different models, highlighting the impact of model
architecture on the interpretability of its neurons.
1 Introduction
Individual neurons in neural networks often represent multiple unrelated features in the input [OMS17,
OCS+20]. This phenomenon is known as polysemanticity, and makes it more difficult to interpret neural
networks [OCS+20]. While "feature" is a somewhat fuzzy concept [EHO+22b], there are at least some
cases where we “know it when we see it”. For example, when the input features are independent random
variables that do not interact in the data-generating process, neurons that represent combinations of these
input features can be confidently called polysemantic. In this work we explore how loss functions incentivize
polysemanticity in this setting, and the structure of the learned solutions.
Fittingly, there are multiple ways that polysemanticity can manifest. Here we focus on one form that seems
particularly fundamental, namely superposition [EHO+22b]. Suppose we have a linear layer that embeds
features which then pass through a layer with a nonlinear activation function. The feature embedding vectors
might not be orthogonal, in which case multiple neurons (nonlinear units) are involved in representing each
feature. When there are at least as many features as neurons this means that some neurons represent multiple
features, and so are polysemantic (Figure 1, right). There are other causes of polysemanticity, e.g. feature
embedding vectors could be rotated relative to the neuron basis (Figure 1, left), but we do not study these in
this work.
Here we build on the work of [EHO+22b], who studied polysemanticity in the context of toy models of
autoencoders. They found that models can support both monosemantic and polysemantic neurons, that polysemantic neurons can perform certain kinds of computations, and that the embedding vectors of features often formed repeating motifs of a few features symmetrically embedded in a low-dimensional subspace. Moreover, in their models they found distinct "phases" where superposition was either significant or completely absent. Sparser inputs resulted in more superposition. Features with similar importance were more
likely to be in superposition. This reflects an abundance of unexpected structure, and gives new handles on
the phenomenon of polysemanticity.
Figure 1: Feature embedding vectors are shown in two dimensions. The neuron basis corresponds to the coordinate axes. Left: rotated embeddings. Right: non-orthogonal embeddings. In both cases the result is polysemanticity because each neuron receives some input when either feature is present.

Figure 2: The marginal loss reduction $\partial L / \partial C_i$ is shown for several features as a function of feature capacity in our toy model. Circles represent the optimal capacity allocation for a particular total embedding dimension. Colors vary to make individual curves more distinguishable.

We study these phenomena through the lens of capacity, or the fraction of an embedding dimension allocated to each feature (Section 2, also termed "dimensionality" by [EHO+22b]). This ranges from 0 to 1 for each feature, and the total capacity across all features is bounded by the dimension of the embedding space.
Because the model has a limited number of neurons and so a limited number of embedding dimensions,
there is a trade-off between representing different features. We find that the capacity constraint on individual
features (0-1) means that many features are either ignored altogether (not embedded) or else allocated a full
dimension orthogonal to all the other features in the embedding space, depending on the relative importance
of each feature to the loss. Features are represented polysemantically only when the marginal loss reduction from assigning more capacity is equal across those features (Figure 2). This neatly explains the sharp "pinning" of features to
either 0 or 1 capacity noted by [EHO+22b], and gives us a framework for understanding the circumstances
under which features are represented polysemantically.
To explore capacity allocation in a concrete model, we instantiate our theory for a one-layer model with
quadratic activations (Section 3). Our model differs from the Anthropic toy model in that ours uses a different
activation function to make the math more tractable, and, more importantly, ours is focused on polysemantic
computation rather than data compression. We contrast these toy models in Figure 3.
For our toy model we can analytically determine the capacity allocation as a function of feature sparsity
and importance (i.e. weight in the loss), and so construct a “phase diagram” (Figure 4). While the details
of our phase diagram differ from those of [EHO+22b], reflecting our different toy model, there are three
qualitative features that are in good agreement. First, when a feature is much more important than the rest,
Figure 3: Comparison between the Anthropic toy model of [EHO+22b] (left) and our toy model (right).
Model inputs are at the bottom of the diagram and outputs are at the top. The key difference is that the
Anthropic model studies the compression and recovery of high-dimensional vectors, while ours examines
how a smaller number of polysemantic neurons can simulate the computation done by a larger number of
monosemantic ones. Figure kindly provided by Chris Olah.
it is always represented fully with its own embedding dimension. Second, when a feature is much less
important than the rest, it is ignored entirely. Finally, in a sparsity-dependent intermediate region features are
partially represented, sharing embedding dimensions. In addition, this confirms our theoretical expectation
that capacity is allocated according to how much each feature matters to the loss (a mixture of importance
and sparsity) and that it is often allocated to fully ignore some features while fully representing others. We
supplement this with empirical results for a variety of activation functions showing that the phase diagram
predicts the behavior of a broad family of 2-layer models.
We then turn to study the geometry of the embedding space (Section 4). When embedding matrices fully
utilize the available capacity we call them “efficient”. We find that every efficient embedding matrix has a
block-semi-orthogonal structure, with features partitioned into different blocks. When multiple features in a
block are present they interfere with each other, causing spurious correlations in the output and hence greater
loss. Features do not, however, interfere across blocks.
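To make the cross-block claim concrete, consider a small hand-made example (our own illustration, not a matrix taken from the experiments): two features share one embedding dimension and form one block, while a third feature occupies a second dimension as its own block, so the first two interfere with each other but neither interferes with the third.

import numpy as np

# Columns are feature embedding vectors; rows are embedding dimensions.
# Features 0 and 1 share dimension 0 (one block); feature 2 sits alone in dimension 1 (a second block).
W = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

G = W.T @ W        # Gram matrix of the feature embeddings
print(G)
# [[1. 1. 0.]
#  [1. 1. 0.]
#  [0. 0. 1.]]
# The nonzero off-diagonal entries couple features 0 and 1 (within-block interference);
# every entry linking feature 2 to the others is zero, so there is no interference across blocks.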
The blocks in efficient matrices correspond to the polytope structure found by [EHO+22b], with small blocks
corresponding to features embedded as regular polytopes and large blocks corresponding to less-ordered
structures. Large- and small-block arrangements come with different advantages. With large blocks there
is significant freedom to allocate capacity across features, whereas with small blocks there is the additional
constraint that the capacity of each block be an integer and that the block capacities add up to the total
capacity. On the other hand, with small blocks the lengths of embedding vectors can be chosen more freely
because blocks can be scaled independently of each other without affecting the capacity allocation.
In our quadratic toy model the embedding matrices always have one large block, which is correspondingly less structured. We expect that differences in architecture can lead to different sizes of blocks,
which could provide a way to control the extent of polysemanticity in models, alongside other approaches
such as changing the activation function [EHO+22a].
2 Capacity and Superposition
2.1 Definitions
Suppose we have a model composed of stacks of linear layers with nonlinear activation functions. In each
layer, the model applies a linear transform to the input vector x to produce an embedding vector e, and then
performs an element-wise non-linear calculation on those embeddings to produce the non-linear activation
vector h. For instance, we might have
$e = W \cdot x$  (1)
$h = \mathrm{ReLU}(e)$  (2)

with $W \in \mathbb{R}^{d \times p}$, $x \in \mathbb{R}^p$, and $e, h \in \mathbb{R}^d$. We associate each dimension of the input vector $x$ with a feature, and we call each dimension of the non-linear layer a neuron.

Figure 4: Upper: Analytical and empirical phase diagrams for our toy model with 6 features and 3 neurons. In both panels one feature has a different importance from the rest, and colors show the resulting capacity allocation for that feature as a function of sparsity and relative importance. Lower: Plots of marginal loss reduction $\partial L / \partial C_i$ as a function of feature capacity for each labelled point in the analytical phase diagram. The blue curve represents the feature with varied importance and the black one represents the feature with constant importance. Black dots are optimal allocations of capacity.

Figure 5: Left: An embedding matrix with two blocks. Center: The relationship between features and (principal-component-aligned) neurons for this matrix. Right: Embedding vector geometry for this matrix.

Figure 6: Example capacity allocations for different embeddings.
For simplicity, in the rest of this paper we work with a one-layer model, but our capacity definition should be
valid for any layer in a multi-layer model.
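For concreteness, here is a minimal numpy sketch of this one-layer setup, equations (1) and (2); the dimensions, random weights, and input below are arbitrary illustrations rather than values from our experiments.

import numpy as np

rng = np.random.default_rng(0)
p, d = 6, 3                        # p input features, d neurons (illustrative sizes)
W = rng.normal(size=(d, p))        # embedding matrix, W in R^{d x p}

def layer(x, W):
    """One linear-then-nonlinear layer: e = W x (eq. 1), h = ReLU(e) (eq. 2)."""
    e = W @ x                      # embedding vector
    h = np.maximum(e, 0.0)         # element-wise ReLU gives the neuron activations
    return e, h

x = rng.normal(size=p)             # one input; each coordinate is one feature
e, h = layer(x, W)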
When a model represents a feature in the input space, it is convenient to think that it expends some capacity
to do so. Our intuition here is that as we ask a model to represent more and more features we eventually
exhaust its ability to do so, resulting in features interfering. We study the superposition phenomenon by asking the question: "How do models allocate limited representation capacity to input features?" In what follows we assume that each input feature is assigned a unique dimension in the input space (e.g. feature $i$ is input dimension $i$), and we define capacity below.
Let $W_{\cdot,i} \in \mathbb{R}^d$ be the embedding vector for feature $i$. The capacity allocated to feature $i$ is

$C_i = \dfrac{(W_{\cdot,i} \cdot W_{\cdot,i})^2}{\sum_j (W_{\cdot,i} \cdot W_{\cdot,j})^2}$  (3)

We can think of $C_i$ as "the fraction of a dimension" allocated to feature $i$ ([EHO+22b])¹. The numerator measures the size of the embedding and the denominator tracks the interference from other features. By this definition, $C_i$ is bounded between 0 and 1. In the case $W_{\cdot,i} = 0$, where this expression is undefined, we set $C_i = 0$.²
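Equation (3) can be computed directly from the Gram matrix of the embedding vectors. A minimal sketch, reusing the numpy setup above (the zero-division convention below implements the $W_{\cdot,i} = 0$ case):

def capacity(W):
    """Per-feature capacities from eq. (3): C_i = (w_i . w_i)^2 / sum_j (w_i . w_j)^2."""
    G = W.T @ W                              # Gram matrix: G[i, j] = W_{.,i} . W_{.,j}
    numer = np.diag(G) ** 2                  # (w_i . w_i)^2
    denom = (G ** 2).sum(axis=1)             # sum_j (w_i . w_j)^2
    return np.divide(numer, denom, out=np.zeros_like(denom), where=denom > 0)  # C_i = 0 if w_i = 0

C = capacity(W)
assert np.all((0.0 <= C) & (C <= 1.0 + 1e-12))   # each C_i lies in [0, 1]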
We define the total model capacity to be $C = \sum_i C_i$ (in a multi-layer model, this would be a single layer-pair's capacity). This is bounded between 1 and the embedding dimension $D$ (see Appendix F for a proof of the upper bound).

Note that a set of capacities does not uniquely specify a weight matrix. For example, capacity is invariant to the overall scaling and rotation of $W$. In what follows it will be useful to have a full parameterization of $W$ that includes $C_i$, so we define $S$ to be a set of additional parameters that uniquely specify a weight matrix $W$ given its capacities $C_1, \ldots, C_N$. We can then parameterize the loss using $(C_1, \ldots, C_N, S)$ rather than $W$.
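Both claims are easy to check numerically. A small sketch continuing the example above, where Q is a random rotation of the embedding space (the specific scale factor is arbitrary):

from scipy.stats import ortho_group

C_total = capacity(W).sum()                         # total capacity C = sum_i C_i
assert 1.0 - 1e-9 <= C_total <= d + 1e-9            # 1 <= C <= D for this W (here D = d)

Q = ortho_group.rvs(dim=d, random_state=0)          # random d x d orthogonal matrix
assert np.allclose(capacity(W), capacity(2.5 * Q @ W))   # capacity is unchanged by rescaling and rotating W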
2.2 Loss Minimization
We are interested in how loss minimization allocates capacity among different features. Because the capacity
of each feature lies in [0,1] and there is also a constraint on the total capacity of a model, this is a constrained
optimization problem:
$\min_{C_{1:n},\, S} \; L(C_{1:n}, S)$
$\text{s.t.} \quad 0 \le C_i \le 1$
$\qquad\;\; 1 \le \sum_i C_i \le D$
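To make the constraint structure concrete, here is a hedged sketch using scipy; the quadratic per-feature loss below is a made-up placeholder standing in for $L(C_{1:n}, S)$ (with $S$ suppressed) and the importances are hypothetical, so only the constraint handling should be read as meaningful.

import numpy as np
from scipy.optimize import minimize

n, D = 6, 3                                            # illustrative: six features, three embedding dimensions
importance = np.array([3.0, 2.0, 1.5, 1.0, 0.5, 0.2])  # hypothetical per-feature importances

def loss(C):
    # Placeholder: each feature's loss shrinks as it receives more capacity, weighted by importance.
    return np.sum(importance * (1.0 - C) ** 2)

result = minimize(
    loss,
    x0=np.full(n, D / n),                              # start from a uniform allocation
    bounds=[(0.0, 1.0)] * n,                           # 0 <= C_i <= 1
    constraints=[{"type": "ineq", "fun": lambda C: D - C.sum()},      # sum_i C_i <= D
                 {"type": "ineq", "fun": lambda C: C.sum() - 1.0}],   # sum_i C_i >= 1
    method="SLSQP",
)
print(result.x)   # optimal allocation for this placeholder loss; the least important feature is driven to 0

For this placeholder loss the optimum equalizes the marginal loss reduction $\partial L / \partial C_i$ across the features that receive intermediate capacity, which is the condition illustrated in Figure 2.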
¹We can also interpret $C_i$ as the squared correlation coefficient between $x_i$ and $(W^T W x)_i$ – see Appendix E.
²This is the limit of the expression from almost all directions, so in practice $W_{\cdot,i} \to 0$ implies $C_i \to 0$.