Transformers generalize differently
from information stored in context vs weights
Stephanie C.Y. Chan, Ishita Dasgupta, Junkyung Kim, Dharshan Kumaran, Andrew K. Lampinen, Felix Hill
DeepMind
Abstract
Transformer models can use two fundamentally different kinds of information:
information stored in weights during training, and information provided “in-context”
at inference time. In this work, we show that transformers exhibit different inductive
biases in how they represent and generalize from the information in these two
sources. In particular, we characterize whether they generalize via parsimonious
rules (rule-based generalization) or via direct comparison with observed examples
(exemplar-based generalization). This has important practical consequences, as
it informs whether to encode information in weights or in context, depending
on how we want models to use that information. In transformers trained on
controlled stimuli, we find that generalization from weights is more rule-based
whereas generalization from context is largely exemplar-based. In contrast, we
find that in transformers pre-trained on natural language, in-context learning is
significantly rule-based, with larger models showing more rule-basedness. We
hypothesise that rule-based generalization from in-context information might be an emergent
consequence of large-scale training on language, which has sparse rule-like structure. Using
controlled stimuli, we verify that transformers pretrained on data containing sparse rule-like
structure exhibit more rule-based generalization.
1 Introduction
Transformer-based architectures have an impressive ability to use both information stored in weights
during training (“in-weights learning”) and information provided only in the inputs at inference time,
without any gradient updates to the model’s weights (“in-context learning”) [Chan et al., 2022].
In-context learning enables pretrained models to learn efficiently from a few examples (“few-shot
learning”) [Brown et al., 2020], or even to compress a large dataset efficiently (“prompt tuning”)
[Li and Liang, 2021, Lester et al., 2021, Sun et al., 2022]. Given the evident current and future
potential of this learning paradigm, it is important to understand its inductive biases, especially how
they differ from those of in-weights learning.
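As a toy illustration of the distinction (our own schematic, not an example from the paper), in-context learning supplies the task mapping entirely in the prompt, with no gradient updates:

```python
# Toy illustration of in-context learning (schematic, not the paper's setup):
# the mapping from made-up words to labels is given entirely in the prompt,
# and no weights are updated.
prompt = (
    "fep -> A\n"
    "wug -> B\n"
    "fep -> A\n"
    "wug -> "  # a pretrained model is expected to complete this with "B"
)
# In-weights learning, by contrast, would train the same mapping into the
# model's parameters via gradient descent, so no demonstrations would be
# needed at inference time.
```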
One way to understand inductive bias is by examining how models generalize to held-out data. In this
work, we adapt the experimental paradigm of Dasgupta et al. [2022], which poses a classification task
that distinguishes between two previously defined kinds of generalization behavior (see Fig. 1). A
“rule-based” decision is made on the basis of the minimal features that support the category boundary
[Ashby and Townsend, 1986], while an “exemplar-based” decision generalizes on the basis of
similarity to examples from the training data [Shepard and Chang, 1963], invoking many or all
available features.
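The two behaviors can be made concrete with a minimal sketch (our own illustration; the one-hot feature encodings and both toy classifiers are hypothetical, not the paper's models):

```python
import numpy as np

# One-hot codes for the two feature dimensions (hypothetical encodings):
# A/B are the values of Feature 1, W/X the values of Feature 2.
A, B = np.array([1.0, 0.0]), np.array([0.0, 1.0])
W, X = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Partial exposure: training shows AX, AW, BW; BX is held out.
train = [
    (np.concatenate([A, X]), 0),  # AX -> label 0 ("*")
    (np.concatenate([A, W]), 0),  # AW -> label 0 ("*")
    (np.concatenate([B, W]), 1),  # BW -> label 1 ("o")
]
query = np.concatenate([B, X])    # held-out combination BX

def rule_based(q):
    # Uses only the minimal feature that separates the labels (Feature 1).
    return int(np.allclose(q[:2], B))

def exemplar_based(q):
    # Similarity-weighted vote over training examples, using all features.
    sims = np.array([x @ q for x, _ in train])
    labels = np.array([y for _, y in train])
    return float(sims[labels == 1].sum() / sims.sum())

print(rule_based(query))      # 1: confidently classifies BX as "o"
print(exemplar_based(query))  # 0.5: BX matches AX and BW equally -> ambiguous
```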
Equal contributions
Preprint. Under review.
[Figure 1: three schematic panels plotting Feature 2 against Feature 1 — (a) Partial exposure test, (b) Rule-based, (c) Exemplar-based.]

Figure 1: Partial exposure test for differentiating rule-based vs exemplar-based generalization. Stimuli have two features. The model sees three combinations (AX, AW, and BW) in training or in context (depending on experiment), and is evaluated on a held-out (test) combination BX. (b) A rule-based model uses a parsimonious decision boundary that explains the data (here, based only on Feature 1), classifying the test as o. (c) An exemplar-based model computes the similarity between test and training examples using all features. Since BX is equally similar to AX and BW, it is equally likely to classify it as * or o.
This distinction is particularly interesting when comparing in-weights vs in-context learning.
Exemplar-based generalization (which uses all available features) is useful in a low-data regime,
where there is not enough information to form an abstract sparse rule [Feldman, 2020]. On the other
hand, sparser rule-based generalization may help avoid sensitivity to spurious correlations when
training on the large, noisy, naturalistic datasets commonly used for in-weights learning.
We find that transformers exhibit a striking difference in their generalization from in-context vs
in-weights information: they display a strong inductive bias towards exemplar-based generalization
from in-context information, but towards rule-based generalization from in-weights information.
However, when we pose a similar task to large transformer models pretrained on language, they
exhibit stronger rule-based generalization from in-context information. One interpretation of these
results is that the distribution of natural language is more compatible with rule-based generalization
from context (rule-based generalization is in fact optimal in compositional domains like language
[Arjovsky et al., 2019]), and such patterns might exert strong enough learning pressure to overcome
– and even reverse – transformers’ inherent bias towards exemplar-based generalization from context.
2 Experimental Design
We adapted the “partial exposure” paradigm from Dasgupta et al. [2022], in which each stimulus has
two features but only one of them predicts the label. We evaluate how the model generalizes to a
held-out feature combination (via sparse rules or via similarity to exemplars); see Fig. 1.
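A minimal sketch of how a partial-exposure episode might be assembled (our own illustration; the helper name, label symbols, and uniform sampling of exposed combinations are assumptions, not the paper's exact procedure):

```python
import random

def make_partial_exposure_episode(n_pairs=12, seed=None):
    """Assemble one hypothetical partial-exposure episode: the model sees
    only three of the four feature combinations, then is probed on the
    held-out one."""
    rng = random.Random(seed)
    # Only Feature 1 (A vs B) predicts the label; BX is never shown.
    labels = {"AX": "*", "AW": "*", "BW": "o"}
    # Sample the exposed combinations (uniformly, for simplicity).
    shown = rng.choices(list(labels), k=n_pairs)
    context = [(combo, labels[combo]) for combo in shown]
    query = "BX"  # held-out combination; the rule-based answer is "o"
    return context, query
```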
First, we explore generalization in transformers trained on controlled synthetic data, where we can
examine generalization from both in-weights and in-context information and directly compare the two.
Second, we repeat this experiment on pretrained language models and characterize their in-context
generalization. Finally, we compare the resulting patterns and investigate factors that explain the
differences.
3 Results
3.1 Trained-from-scratch transformers
For the trained-from-scratch transformers, we passed sequences of stimulus-label pairs as inputs to
the transformer model [Vaswani et al., 2017]. Each sequence consisted of two parts: a context (24
tokens, i.e. 12 stimulus-label pairs) and a query (a stimulus). The model was trained to minimize a
softmax cross-entropy loss on its prediction for the final (query) stimulus. Each stimulus consists of
two subvectors concatenated into a single token (Fig 4c) – these subvectors comprise the two features
of the partial exposure paradigm. See Appendix A for further details.
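A sketch of how such an input sequence could be built (our own reconstruction under stated assumptions: the subvector width and random embeddings are hypothetical, and the paper's exact encoding may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32  # hypothetical subvector width; each stimulus token is then 2*D wide

# Fixed random embeddings for feature values and labels (assumptions).
feat = {k: rng.standard_normal(D) for k in ["A", "B", "W", "X"]}
label_tok = {"*": rng.standard_normal(2 * D), "o": rng.standard_normal(2 * D)}

def stimulus(f1, f2):
    # Two feature subvectors concatenated into a single token (cf. Fig 4c).
    return np.concatenate([feat[f1], feat[f2]])

def build_sequence(context_pairs, query):
    # 12 stimulus-label pairs (24 tokens) followed by the query stimulus;
    # the model is trained with softmax cross-entropy on its prediction
    # of the label for this final token.
    tokens = []
    for (f1, f2), y in context_pairs:
        tokens.append(stimulus(f1, f2))
        tokens.append(label_tok[y])
    tokens.append(stimulus(*query))
    return np.stack(tokens)  # shape (25, 2 * D)

# Example: a context of 12 AX-pairs followed by the held-out query BX.
seq = build_sequence([(("A", "X"), "*")] * 12, ("B", "X"))
```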
Generalization from in-weights information is rule-based.
To investigate generalization from in-weights information, we trained the model on partial exposure
data and evaluated it on the held-out combination. During training, the label for each stimulus class
was fixed, so that the stimulus-label mappings were stored in weights; the context tokens were
uninformative for the query. After training,