
[Figure 1: three-panel schematic. (a) Partial exposure test. (b) Rule-based. (c) Exemplar-based.]
Figure 1: Partial exposure test for differentiating rule-based vs exemplar-based generalization. Stimuli have two features. The model sees three combinations (AX, AW, and BW) in training or in context (depending on experiment), and is evaluated on a held-out (test) combination BX. (b) A rule-based model uses a parsimonious decision boundary that explains the data (here, based only on Feature 1), classifying the test as o. (c) An exemplar-based model computes the similarity between test and training examples using all features. Since BX is equally similar to AX and BW, it is equally likely to classify it as * or o.
This distinction is particularly interesting when com-
paring in-weights vs in-context learning. Exemplar-
based generalization (which uses all available features)
is useful in a low-data regime where there is not
enough information to form an abstract sparse rule
[Feldman, 2020]. On the other hand, sparser rule-
based generalization may help avoid sensitivity to
spurious correlation when training with large, noisy,
naturalistic datasets (that are commonly used to train
in-weights learning).
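To make the distinction concrete, here is a minimal sketch (not the paper's code; stimulus encodings and the tie-breaking scheme are our assumptions) contrasting the two generalization strategies on the partial exposure test of Figure 1:

```python
# Training combinations from the partial exposure test: AX, AW -> *, BW -> o.
# The held-out test combination is BX.
train = {("A", "X"): "*", ("A", "W"): "*", ("B", "W"): "o"}
test = ("B", "X")

def rule_based(stimulus):
    # A parsimonious rule explains all training data using Feature 1 alone:
    # A -> *, B -> o.
    return "*" if stimulus[0] == "A" else "o"

def exemplar_similarity(stimulus):
    # Score each label by the best feature overlap with a training exemplar.
    scores = {"*": 0, "o": 0}
    for exemplar, label in train.items():
        overlap = sum(a == b for a, b in zip(stimulus, exemplar))
        scores[label] = max(scores[label], overlap)
    return scores

print(rule_based(test))           # 'o': Feature 1 of BX is B
print(exemplar_similarity(test))  # BX matches AX and BW equally: a tie
```

The rule-based classifier commits to o, while the exemplar-based score ties between * and o, exactly the signature the partial exposure test is designed to detect.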
We find that transformers exhibit a striking difference
in their generalization from in-context vs in-weights
information. Transformers display a strong inductive
bias towards exemplar-based generalization from in-
context information. In contrast, transformers display
a strong inductive bias towards rule-based generaliza-
tion from in-weights information.
However, when we pose a similar task to large trans-
former models pretrained on language, they exhibit
stronger rule-based generalization from in-context in-
formation. One interpretation of these results is that
the distribution of natural language is more compati-
ble with rule-based generalization from context (rule-
based generalization is in fact optimal in composi-
tional domains like language [Arjovsky et al., 2019]),
and such patterns might present strong enough learn-
ing pressure to overcome – and even reverse – trans-
formers’ inherent bias towards exemplar-based gen-
eralization from context.
2 Experimental Design
We adapted the “partial exposure” paradigm from Dasgupta et al. [2022], in which each stimulus has
two features and only one of the features predicts the label. We evaluate how the model generalizes
to a held-out combination (using sparse rules or similarity to exemplars); see Fig. 1.
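The split can be sketched as follows (a hypothetical helper, not the paper's code; feature names follow Figure 1):

```python
# Generate one partial-exposure episode: each stimulus has two features,
# only Feature 1 predicts the label, and the combination BX is held out.
def partial_exposure_episode(f1_values=("A", "B"), f2_values=("W", "X")):
    a, b = f1_values
    w, x = f2_values
    labels = {a: "*", b: "o"}  # Feature 1 alone determines the label
    train = [((a, x), labels[a]),   # AX
             ((a, w), labels[a]),   # AW
             ((b, w), labels[b])]   # BW
    test = ((b, x), labels[b])      # held-out combination BX
    return train, test

train, test = partial_exposure_episode()
```

Note that Feature 2 is partially correlated with the label in training (X only co-occurs with A), which is what lets the held-out BX query distinguish rule-based from exemplar-based generalization.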
First, we explored generalization in transformers trained on controlled synthetic data, where we can
examine generalization from both in-weights and in-context information and directly compare them.
Second, we repeat this experiment on pretrained language models and characterize their in-context
generalization. Finally, we compare the patterns observed and investigate factors that explain the
differences.
3 Results
3.1 Trained-from-scratch transformers
For the trained-from-scratch transformers, we passed sequences of stimulus-label pairs as inputs to
the transformer model [Vaswani et al., 2017]. The sequences consisted of two parts: a context (24
tokens; i.e. 12 stimulus-label pairs) and a query (stimulus). The model was trained to minimize a
softmax cross-entropy loss on the prediction for the final (query) stimulus. Each stimulus consists of
two subvectors concatenated together into a single token (Fig 4c) – these subvectors comprise the
two features of the partial exposure paradigm. See Appendix A for further details.
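A sketch of the input construction described above (dimensions and random feature encodings are our assumptions; the paper's actual tokenization may differ):

```python
import numpy as np

# Build one input sequence: 24 context tokens (12 stimulus-label pairs)
# followed by a query stimulus. Each stimulus token concatenates two
# feature subvectors into a single token.
rng = np.random.default_rng(0)
d = 32        # assumed subvector dimension
n_pairs = 12  # 12 stimulus-label pairs = 24 context tokens

def make_stimulus(f1_vec, f2_vec):
    # The two subvectors are the two features of the partial exposure
    # paradigm, concatenated into one token.
    return np.concatenate([f1_vec, f2_vec])

features = {name: rng.normal(size=d) for name in ("A", "B", "W", "X")}
labels = {"*": rng.normal(size=2 * d), "o": rng.normal(size=2 * d)}

context = []
for _ in range(n_pairs):
    f1 = rng.choice(["A", "B"])
    f2 = rng.choice(["W", "X"])
    lab = "*" if f1 == "A" else "o"  # Feature 1 determines the label
    context += [make_stimulus(features[f1], features[f2]), labels[lab]]

query = make_stimulus(features["B"], features["X"])
sequence = np.stack(context + [query])  # shape: (25, 2 * d)
```

The model would then be trained with softmax cross-entropy on its prediction for the final (query) position only.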
Generalization from in-weights information is rule-based.
To investigate generalization from
in-weights information, we trained the model on partial exposure data, and evaluated on the held-out
combination. During training, the label for each stimulus class was fixed, so that the stimulus-label
mappings were stored in weights; the context tokens were uninformative for the query. After training,