
[Figure 1: three-panel schematic. (a) Partial exposure test. (b) Rule-based. (c) Exemplar-based.]
Figure 1: Partial exposure test for differentiating rule-based vs exemplar-based generalization. Stimuli have two features. The model sees three combinations (AX, AW, and BW) in training or in context (depending on experiment), and is evaluated on a held-out (test) combination BX. (b) A rule-based model uses a parsimonious decision boundary that explains the data (here, based only on Feature 1), classifying the test as o. (c) An exemplar-based model computes the similarity between test and training examples using all features. Since BX is equally similar to AX and BW, it is equally likely to classify it as * or o.
This distinction is particularly interesting when com-
paring in-weights vs in-context learning. Exemplar-
based generalization (which uses all available features)
is useful in a low-data regime where there is not
enough information to form an abstract sparse rule
[Feldman, 2020]. On the other hand, sparser rule-
based generalization may help avoid sensitivity to
spurious correlation when training with large, noisy,
naturalistic datasets (that are commonly used to train
in-weights learning).
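To make the distinction concrete, here is a minimal sketch (not the paper's code; stimulus encodings and the tie-breaking scheme are our assumptions) contrasting the two generalization strategies on the partial exposure test of Figure 1:

```python
# Training combinations from the partial exposure test: AX, AW -> *, BW -> o.
# The held-out test combination is BX.
train = {("A", "X"): "*", ("A", "W"): "*", ("B", "W"): "o"}
test = ("B", "X")

def rule_based(stimulus):
    # A parsimonious rule explains all training data using Feature 1 alone:
    # A -> *, B -> o.
    return "*" if stimulus[0] == "A" else "o"

def exemplar_similarity(stimulus):
    # Score each label by the best feature overlap with a training exemplar.
    scores = {"*": 0, "o": 0}
    for exemplar, label in train.items():
        overlap = sum(a == b for a, b in zip(stimulus, exemplar))
        scores[label] = max(scores[label], overlap)
    return scores

print(rule_based(test))           # 'o': Feature 1 of BX is B
print(exemplar_similarity(test))  # BX matches AX and BW equally: a tie
```

The rule-based classifier commits to o, while the exemplar-based score ties between * and o, exactly the signature the partial exposure test is designed to detect.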
We find that transformers exhibit a striking difference
in their generalization from in-context vs in-weights
information. Transformers display a strong inductive
bias towards exemplar-based generalization from in-
context information. In contrast, transformers display
a strong inductive bias towards rule-based generaliza-
tion from in-weights information.
However, when we pose a similar task to large trans-
former models pretrained on language, they exhibit
stronger rule-based generalization from in-context in-
formation. One interpretation of these results is that
the distribution of natural language is more compati-
ble with rule-based generalization from context (rule-
based generalization is in fact optimal in composi-
tional domains like language [Arjovsky et al., 2019]),
and such patterns might present strong enough learn-
ing pressure to overcome – and even reverse – trans-
formers’ inherent bias towards exemplar-based gen-
eralization from context.
2 Experimental Design
We adapted the “partial exposure” paradigm from Dasgupta et al. [2022], in which each stimulus has
two features and only one of the features predicts the label. We evaluate how the model generalizes
to a held-out combination (using sparse rules or similarity to exemplars); see Fig. 1.
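The split can be sketched as follows (a hypothetical helper, not the paper's code; feature names follow Figure 1):

```python
# Generate one partial-exposure episode: each stimulus has two features,
# only Feature 1 predicts the label, and the combination BX is held out.
def partial_exposure_episode(f1_values=("A", "B"), f2_values=("W", "X")):
    a, b = f1_values
    w, x = f2_values
    labels = {a: "*", b: "o"}  # Feature 1 alone determines the label
    train = [((a, x), labels[a]),   # AX
             ((a, w), labels[a]),   # AW
             ((b, w), labels[b])]   # BW
    test = ((b, x), labels[b])      # held-out combination BX
    return train, test

train, test = partial_exposure_episode()
```

Note that Feature 2 is partially correlated with the label in training (X only co-occurs with A), which is what lets the held-out BX query distinguish rule-based from exemplar-based generalization.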
First, we explored generalization in transformers trained on controlled synthetic data, where we can
examine generalization from both in-weights and in-context information and directly compare them.
Second, we repeat this experiment on pretrained language models and characterize their in-context
generalization. Finally, we compare the patterns observed and investigate factors that explain the
differences.
3 Results
3.1 Trained-from-scratch transformers
For the trained-from-scratch transformers, we passed sequences of stimulus-label pairs as inputs to
the transformer model [Vaswani et al., 2017]. The sequences consisted of two parts: a context (24
tokens; i.e. 12 stimulus-label pairs) and a query (stimulus). The model was trained to minimize a
softmax cross-entropy loss on the prediction for the final (query) stimulus. Each stimulus consists of
two subvectors concatenated together into a single token (Fig 4c) – these subvectors comprise the
two features of the partial exposure paradigm. See Appendix A for further details.
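A sketch of the input construction described above (dimensions and random feature encodings are our assumptions; the paper's actual tokenization may differ):

```python
import numpy as np

# Build one input sequence: 24 context tokens (12 stimulus-label pairs)
# followed by a query stimulus. Each stimulus token concatenates two
# feature subvectors into a single token.
rng = np.random.default_rng(0)
d = 32        # assumed subvector dimension
n_pairs = 12  # 12 stimulus-label pairs = 24 context tokens

def make_stimulus(f1_vec, f2_vec):
    # The two subvectors are the two features of the partial exposure
    # paradigm, concatenated into one token.
    return np.concatenate([f1_vec, f2_vec])

features = {name: rng.normal(size=d) for name in ("A", "B", "W", "X")}
labels = {"*": rng.normal(size=2 * d), "o": rng.normal(size=2 * d)}

context = []
for _ in range(n_pairs):
    f1 = rng.choice(["A", "B"])
    f2 = rng.choice(["W", "X"])
    lab = "*" if f1 == "A" else "o"  # Feature 1 determines the label
    context += [make_stimulus(features[f1], features[f2]), labels[lab]]

query = make_stimulus(features["B"], features["X"])
sequence = np.stack(context + [query])  # shape: (25, 2 * d)
```

The model would then be trained with softmax cross-entropy on its prediction for the final (query) position only.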
Generalization from in-weights information is rule-based.
To investigate generalization from
in-weights information, we trained the model on partial exposure data, and evaluated on the held-out
combination. During training, the label for each stimulus class was fixed, so that the stimulus-label
mappings were stored in weights; the context tokens were uninformative for the query. After training,