Polysemanticity and Capacity in Neural Networks
Adam Scherlis1, Kshitij Sachan1, Adam S. Jermyn2, Joe Benton, Buck Shlegeris1
1Redwood Research
2Flatiron Institute
Abstract
Individual neurons in neural networks often represent a mixture of unrelated features. This
phenomenon, called polysemanticity, can make interpreting neural networks more difficult
and so we aim to understand its causes. We propose doing so through the lens of feature
capacity, which is the fractional dimension each feature consumes in the embedding space.
We show that in a toy model the optimal capacity allocation tends to monosemantically
represent the most important features, polysemantically represent less important features
(in proportion to their impact on the loss), and entirely ignore the least important features.
Polysemanticity is more prevalent when the inputs have higher kurtosis or sparsity and
more prevalent in some architectures than others. Given an optimal allocation of capacity,
we go on to study the geometry of the embedding space. We find a block-semi-orthogonal
structure, with differing block sizes in different models, highlighting the impact of model
architecture on the interpretability of its neurons.
1 Introduction
Individual neurons in neural networks often represent multiple unrelated features in the input [OMS17,
OCS+20]. This phenomenon is known as polysemanticity, and makes it more difficult to interpret neural
networks [OCS+20]. While "feature" is a somewhat fuzzy concept [EHO+22b], there are at least some
cases where we “know it when we see it”. For example, when the input features are independent random
variables that do not interact in the data-generating process, neurons that represent combinations of these
input features can be confidently called polysemantic. In this work we explore how loss functions incentivize
polysemanticity in this setting, and the structure of the learned solutions.
Fittingly, there are multiple ways that polysemanticity can manifest. Here we focus on one form that seems
particularly fundamental, namely superposition [EHO+22b]. Suppose we have a linear layer that embeds
features which then pass through a layer with a nonlinear activation function. The feature embedding vectors
might not be orthogonal, in which case multiple neurons (nonlinear units) are involved in representing each
feature. When there are at least as many features as neurons this means that some neurons represent multiple
features, and so are polysemantic (Figure 1, right). There are other causes of polysemanticity, e.g. feature
embedding vectors could be rotated relative to the neuron basis (Figure 1, left), but we do not study these in
this work.
Here we build on the work of [EHO+22b], who studied polysemanticity in the context of toy models of
autoencoders. They found that models can support both monosemantic and polysemantic neurons, that polysemantic neurons can perform certain kinds of computations, and that the embedding vectors of features often formed repeating motifs of a few features symmetrically embedded in a low-dimensional subspace. Moreover, in their models they found distinct "phases" where superposition was either significant or completely absent. Sparser inputs resulted in more superposition. Features with similar importance were more
likely to be in superposition. This reflects an abundance of unexpected structure, and gives new handles on
the phenomenon of polysemanticity.
Figure 1: Feature embedding vectors are shown in two dimensions. The neuron basis corresponds to the coordinate axes. Left: rotated embeddings. Right: non-orthogonal embeddings. In both cases the result is polysemanticity because each neuron receives some input when either feature is present.

Figure 2: The marginal loss reduction $\partial L / \partial C_i$ is shown for several features as a function of feature capacity in our toy model. Circles represent the optimal capacity allocation for a particular total embedding dimension. Colors vary to make individual curves more distinguishable.

We study these phenomena through the lens of capacity, or the fraction of an embedding dimension allocated to each feature (Section 2, also termed "dimensionality" by [EHO+22b]). This ranges from 0 to 1 for each feature, and the total capacity across all features is bounded by the dimension of the embedding space.
Because the model has a limited number of neurons and so a limited number of embedding dimensions,
there is a trade-off between representing different features. We find that the capacity constraint on individual
features (0-1) means that many features are either ignored altogether (not embedded) or else allocated a full
dimension orthogonal to all the other features in the embedding space, depending on the relative importance
of each feature to the loss. Features are represented polysemantically only when the marginal loss reduction from assigning more capacity is equal across those features (Figure 2). This neatly explains the sharp "pinning" of features to
either 0 or 1 capacity noted by [EHO+22b], and gives us a framework for understanding the circumstances
under which features are represented polysemantically.
To explore capacity allocation in a concrete model, we instantiate our theory for a one-layer model with
quadratic activations (Section 3). Our model differs from the Anthropic toy model in that ours uses a different
activation function to make the math more tractable, and, more importantly, ours is focused on polysemantic
computation rather than data compression. We contrast these toy models in Figure 3.
For our toy model we can analytically determine the capacity allocation as a function of feature sparsity
and importance (i.e. weight in the loss), and so construct a “phase diagram” (Figure 4). While the details
of our phase diagram differ from those of [EHO+22b], reflecting our different toy model, there are three
qualitative features that are in good agreement. First, when a feature is much more important than the rest,
Figure 3: Comparison between the Anthropic toy model of [EHO+22b] (left) and our toy model (right).
Model inputs are at the bottom of the diagram and outputs are at the top. The key difference is that the
Anthropic model studies the compression and recovery of high-dimensional vectors, while ours examines
how a smaller number of polysemantic neurons can simulate the computation done by a larger number of
monosemantic ones. Figure kindly provided by Chris Olah.
it is always represented fully with its own embedding dimension. Second, when a feature is much less
important than the rest, it is ignored entirely. Finally, in a sparsity-dependent intermediate region features are
partially represented, sharing embedding dimensions. In addition, this confirms our theoretical expectation
that capacity is allocated according to how much each feature matters to the loss (a mixture of importance
and sparsity) and that it is often allocated to fully ignore some features while fully representing others. We
supplement this with empirical results for a variety of activation functions showing that the phase diagram
predicts the behavior of a broad family of 2-layer models.
We then turn to study the geometry of the embedding space (Section 4). When embedding matrices fully
utilize the available capacity we call them “efficient”. We find that every efficient embedding matrix has a
block-semi-orthogonal structure, with features partitioned into different blocks. When multiple features in a
block are present they interfere with each other, causing spurious correlations in the output and hence greater
loss. Features do not, however, interfere across blocks.
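To make the cross-block claim concrete, consider a small hand-made example (our own illustration, not a matrix taken from the experiments): two features share one embedding dimension and form one block, while a third feature occupies a second dimension as its own block, so the first two interfere with each other but neither interferes with the third.

import numpy as np

# Columns are feature embedding vectors; rows are embedding dimensions.
# Features 0 and 1 share dimension 0 (one block); feature 2 sits alone in dimension 1 (a second block).
W = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

G = W.T @ W        # Gram matrix of the feature embeddings
print(G)
# [[1. 1. 0.]
#  [1. 1. 0.]
#  [0. 0. 1.]]
# The nonzero off-diagonal entries couple features 0 and 1 (within-block interference);
# every entry linking feature 2 to the others is zero, so there is no interference across blocks.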
The blocks in efficient matrices correspond to the polytope structure found by [EHO+22b], with small blocks
corresponding to features embedded as regular polytopes and large blocks corresponding to less-ordered
structures. Large- and small-block arrangements come with different advantages. With large blocks there
is significant freedom to allocate capacity across features, whereas with small blocks there is the additional
constraint that the capacity of each block be an integer and that the block capacities add up to the total
capacity. On the other hand, with small blocks the lengths of embedding vectors can be chosen more freely
because blocks can be scaled independently of each other without affecting the capacity allocation.
In our quadratic toy model the embedding matrices always have one large block, which is correspondingly less structured. We expect that differences in architecture can lead to different sizes of blocks,
which could provide a way to control the extent of polysemanticity in models, alongside other approaches
such as changing the activation function [EHO+22a].
2 Capacity and Superposition
2.1 Definitions
Suppose we have a model composed of stacks of linear layers with nonlinear activation functions. In each
layer, the model applies a linear transform to the input vector x to produce an embedding vector e, and then
performs an element-wise non-linear calculation on those embeddings to produce the non-linear activation
vector h. For instance, we might have
$e = W \cdot x$  (1)
$h = \mathrm{ReLU}(e)$  (2)

with $W \in \mathbb{R}^{d \times p}$, $x \in \mathbb{R}^p$, and $e, h \in \mathbb{R}^d$. We associate each dimension of the input vector $x$ with a feature, and we call each dimension of the non-linear layer a neuron.

Figure 4: Upper: Analytical and empirical phase diagrams for our toy model with 6 features and 3 neurons. In both panels one feature has a different importance from the rest, and colors show the resulting capacity allocation for that feature as a function of sparsity and relative importance. Lower: Plots of marginal loss reduction $\partial L / \partial C_i$ as a function of feature capacity for each labelled point in the analytical phase diagram. The blue curve represents the feature with varied importance and the black one represents the feature with constant importance. Black dots are optimal allocations of capacity.

Figure 5: Left: An embedding matrix with two blocks. Center: The relationship between features and (principal-component-aligned) neurons for this matrix. Right: Embedding vector geometry for this matrix.

Figure 6: Example capacity allocations for different embeddings.
For simplicity, in the rest of this paper we work with a one-layer model, but our capacity definition should be
valid for any layer in a multi-layer model.
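For concreteness, here is a minimal numpy sketch of this one-layer setup, equations (1) and (2); the dimensions, random weights, and input below are arbitrary illustrations rather than values from our experiments.

import numpy as np

rng = np.random.default_rng(0)
p, d = 6, 3                        # p input features, d neurons (illustrative sizes)
W = rng.normal(size=(d, p))        # embedding matrix, W in R^{d x p}

def layer(x, W):
    """One linear-then-nonlinear layer: e = W x (eq. 1), h = ReLU(e) (eq. 2)."""
    e = W @ x                      # embedding vector
    h = np.maximum(e, 0.0)         # element-wise ReLU gives the neuron activations
    return e, h

x = rng.normal(size=p)             # one input; each coordinate is one feature
e, h = layer(x, W)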
When a model represents a feature in the input space, it is convenient to think that it expends some capacity
to do so. Our intuition here is that as we ask a model to represent more and more features we eventually
exhaust its ability to do so, resulting in features interfering. We study the superposition phenomenon by asking the question: "How do models allocate limited representation capacity to input features?" In what follows we assume that each input feature is assigned a unique dimension in the input space (e.g. feature $i$ is input dimension $i$), and we define capacity below.
Let $W_{\cdot,i} \in \mathbb{R}^d$ be the embedding vector for feature $i$. The capacity allocated to feature $i$ is

$C_i = \dfrac{(W_{\cdot,i} \cdot W_{\cdot,i})^2}{\sum_j (W_{\cdot,i} \cdot W_{\cdot,j})^2}$  (3)

We can think of $C_i$ as "the fraction of a dimension" allocated to feature $i$ ([EHO+22b])¹. The numerator measures the size of the embedding and the denominator tracks the interference from other features. By this definition, $C_i$ is bounded between 0 and 1. In the case $W_{\cdot,i} = 0$, where this expression is undefined, we set $C_i = 0$.²
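Equation (3) can be computed directly from the Gram matrix of the embedding vectors. A minimal sketch, reusing the numpy setup above (the zero-division convention below implements the $W_{\cdot,i} = 0$ case):

def capacity(W):
    """Per-feature capacities from eq. (3): C_i = (w_i . w_i)^2 / sum_j (w_i . w_j)^2."""
    G = W.T @ W                              # Gram matrix: G[i, j] = W_{.,i} . W_{.,j}
    numer = np.diag(G) ** 2                  # (w_i . w_i)^2
    denom = (G ** 2).sum(axis=1)             # sum_j (w_i . w_j)^2
    return np.divide(numer, denom, out=np.zeros_like(denom), where=denom > 0)  # C_i = 0 if w_i = 0

C = capacity(W)
assert np.all((0.0 <= C) & (C <= 1.0 + 1e-12))   # each C_i lies in [0, 1]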
We define the total model capacity to be $C = \sum_i C_i$ (in a multi-layer model, this would be a single layer-pair's capacity). This is bounded between 1 and the embedding dimension $D$ (see Appendix F for a proof of the upper bound).

Note that a set of capacities does not uniquely specify a weight matrix. For example, capacity is invariant to the overall scaling and rotation of $W$. In what follows it will be useful to have a full parameterization of $W$ that includes $C_i$, so we define $S$ to be a set of additional parameters that uniquely specify a weight matrix $W$ given its capacities $C_1, \ldots, C_N$. We can then parameterize the loss using $(C_1, \ldots, C_N, S)$ rather than $W$.
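Both claims are easy to check numerically. A small sketch continuing the example above, where Q is a random rotation of the embedding space (the specific scale factor is arbitrary):

from scipy.stats import ortho_group

C_total = capacity(W).sum()                         # total capacity C = sum_i C_i
assert 1.0 - 1e-9 <= C_total <= d + 1e-9            # 1 <= C <= D for this W (here D = d)

Q = ortho_group.rvs(dim=d, random_state=0)          # random d x d orthogonal matrix
assert np.allclose(capacity(W), capacity(2.5 * Q @ W))   # capacity is unchanged by rescaling and rotating W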
2.2 Loss Minimization
We are interested in how loss minimization allocates capacity among different features. Because the capacity
of each feature lies in [0,1] and there is also a constraint on the total capacity of a model, this is a constrained
optimization problem:
$\min_{C_{1:n},\, S} \; L(C_{1:n}, S)$
$\text{s.t.} \quad 0 \le C_i \le 1$
$\qquad\;\; 1 \le \sum_i C_i \le D$
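To make the constraint structure concrete, here is a hedged sketch using scipy; the quadratic per-feature loss below is a made-up placeholder standing in for $L(C_{1:n}, S)$ (with $S$ suppressed) and the importances are hypothetical, so only the constraint handling should be read as meaningful.

import numpy as np
from scipy.optimize import minimize

n, D = 6, 3                                            # illustrative: six features, three embedding dimensions
importance = np.array([3.0, 2.0, 1.5, 1.0, 0.5, 0.2])  # hypothetical per-feature importances

def loss(C):
    # Placeholder: each feature's loss shrinks as it receives more capacity, weighted by importance.
    return np.sum(importance * (1.0 - C) ** 2)

result = minimize(
    loss,
    x0=np.full(n, D / n),                              # start from a uniform allocation
    bounds=[(0.0, 1.0)] * n,                           # 0 <= C_i <= 1
    constraints=[{"type": "ineq", "fun": lambda C: D - C.sum()},      # sum_i C_i <= D
                 {"type": "ineq", "fun": lambda C: C.sum() - 1.0}],   # sum_i C_i >= 1
    method="SLSQP",
)
print(result.x)   # optimal allocation for this placeholder loss; the least important feature is driven to 0

For this placeholder loss the optimum equalizes the marginal loss reduction $\partial L / \partial C_i$ across the features that receive intermediate capacity, which is the condition illustrated in Figure 2.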
¹We can also interpret $C_i$ as the squared correlation coefficient between $x_i$ and $(W^T W x)_i$ – see Appendix E.
²This is the limit of the expression from almost all directions, so in practice $W_{\cdot,i} \to 0$ implies $C_i \to 0$.