Systematic Generalization and Emergent Structures
in Transformers Trained on Structured Tasks
Yuxuan Li
Department of Psychology
Stanford University
Stanford, CA 94305
liyuxuan@stanford.edu
James L. McClelland
Department of Psychology
Stanford University
Stanford, CA 94305
jlmcc@stanford.edu
Abstract
Transformer networks have seen great success in natural language processing and
machine vision, where task objectives such as next word prediction and image
classification benefit from nuanced context sensitivity across high-dimensional
inputs. However, there is an ongoing debate about how and when transformers can
acquire highly structured behavior and achieve systematic generalization. Here,
we explore how well a causal transformer can perform a set of algorithmic tasks,
including copying, sorting, and hierarchical compositions of these operations. We
demonstrate strong generalization to sequences longer than those used in training
by replacing the standard positional encoding typically used in transformers with
labels arbitrarily paired with items in the sequence. We search for the layer and
head configuration sufficient to solve these tasks, then probe for signs of systematic
processing in latent representations and attention patterns. We show that two-layer
transformers learn reliable solutions to multi-level problems, develop signs of task
decomposition, and encode input items in a way that encourages the exploitation
of shared computation across related tasks. These results provide key insights into
how attention layers support structured computation both within a task and across
multiple tasks.
1 Introduction
Since their introduction (Vaswani et al., 2017), transformer-based models have become the new norm in natural language modeling (Brown et al., 2020; Devlin et al., 2018) and are being leveraged for
machine vision tasks as well as in reinforcement learning contexts (Chen et al., 2021; Dosovitskiy
et al., 2020; Janner et al., 2021; Ramesh et al., 2021). Transformers trained on large amounts of data
under simple self-supervised, sequence modeling objectives are capable of subsequent generalization
to a wide variety of tasks, making them an appealing option for building multi-modal, multi-task,
generalist agents (Bommasani et al., 2021; Reed et al., 2022).
Central to this success is the ability to represent each part of the input in the context of other
parts through the self-attention mechanism. This may be especially important for task objectives
such as next word prediction and image classification at scale with naturalistic data, which benefit
from nuanced context sensitivity across high-dimensional inputs. Interestingly, transformer-based
language models seem to also acquire some knowledge of syntactic structures without being explicitly
trained to do so and display few-shot learning capabilities (Brown et al., 2020; Linzen and Baroni,
2021; Manning et al., 2020). These insights have led to ongoing work assessing broader reasoning
capabilities in these models (Binz and Schulz, 2022; Dasgupta et al., 2022).
Despite success in learning large-scale, naturalistic data and signs of generalizable behavior or
sensitivity to structures, how transformer models support systematic generalization remains to
be better understood. Recent work has demonstrated that large language models struggle with longer problems and fail to robustly reason beyond the training data (Anil et al., 2022; Razeghi et al.,
2022). Different architectural variations have been proposed to improve length generalization in
transformers, highlighting the role of variants of position-based encodings (Csordás et al., 2021a,b;
Ontanón et al., 2021; Press et al., 2021). Indeed, whether neural networks will ever be capable of
systematic generalization without building in explicit symbolic components remains an open question
(Fodor and Pylyshyn, 1988; Smolensky et al., 2022).
Here, we approach this question by training a causal transformer to perform a set of algorithmic
operations, including copy, reverse, and hierarchical group or sort tasks. We explicitly sought the minimal transformer that would reliably solve these simple tasks, then thoroughly analyzed this minimal solution through attention ablation and representation analysis to understand its internal computational dynamics. Exploring how a transformer with no predefined task-aligned structure
could adapt to structures in these algorithmic tasks provides a starting point for understanding how
self-attention can tune to structures in more complex problems, e.g., those with the kinds of exceptions
and partial regularities of natural datasets, where the exploitation of task structures may occur in a
more approximate and graded manner. Our main contributions are:
1. We highlight a simple label-based order encoding method in place of the positional encoding methods typically used in transformers, and show that it helps our models achieve strong length generalization performance across the set of algorithmic tasks we examine.
2. We thoroughly analyze simple, two-layer causal transformers that learn these algorithmic tasks, and show that the attention layers develop signs of systematic decomposition within tasks and exploitation of shared structures across tasks.
2 Method
Figure 1: Task and model design. (The figure illustrates the six training tasks applied to an example input sequence: copy, reverse, group [shape], group [color], sort [shape, color, texture], and sort [color, shape, texture]; and the causal transformer, in which task, item, and label embeddings feed two blocks of masked multi-head attention and feed-forward sublayers with add & norm, followed by item and label output heads.)
Dataset. We created an item pool covering all combinations of 5 shapes, 5 colors, and 5 textures, and generated a sequence dataset by sampling 100k sequences of 5–50 items randomly selected from the item pool. The tasks we used to train the models are shown in Fig 1A. Each task corresponds to one of the following rules, each of which relies on item feature and/or item order information to rearrange an input sequence (grouping or sorting items by a particular feature is done with respect to a pre-defined feature sort order, e.g., circles < squares < pentagons, or red < purple < blue):
COPY (C): copy the input sequence.
REVERSE (R): reverse the input sequence.
GROUP[SHAPE] (G[S]): group the items by shape, preserve the input order within each shape group.
GROUP[COLOR] (G[C]): group the items by color, preserve the input order within each color group.
SORT[SHAPE,COLOR,TEXTURE] (S[S]): sort the items first by shape, then by color, then by texture.
SORT[COLOR,SHAPE,TEXTURE] (S[C]): sort the items first by color, then by shape, then by texture.
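To make these rules concrete, the following is a minimal Python sketch of the grouping and sorting operations over items represented as (shape, color, texture) index triples; the integer-triple representation and helper names are illustrative assumptions, not the authors' code.

```python
# Items are represented as (shape, color, texture) index triples, where each
# index encodes the item's rank in the pre-defined feature sort order
# (e.g., circles < squares < pentagons maps to 0 < 1 < 2).

FEATURE_INDEX = {"shape": 0, "color": 1, "texture": 2}

def copy_task(items):
    return list(items)

def reverse_task(items):
    return list(reversed(items))

def group_task(items, feature):
    # Stable sort on one feature: items with the same feature value keep
    # their original input order, as GROUP requires.
    return sorted(items, key=lambda item: item[FEATURE_INDEX[feature]])

def sort_task(items, feature_order):
    # Lexicographic sort over the given feature priority, e.g.
    # ["shape", "color", "texture"] for SORT[SHAPE,COLOR,TEXTURE].
    return sorted(items, key=lambda item: tuple(item[FEATURE_INDEX[f]] for f in feature_order))

seq = [(2, 0, 4), (0, 1, 1), (2, 3, 0), (0, 0, 2)]
print(group_task(seq, "shape"))                       # [(0, 1, 1), (0, 0, 2), (2, 0, 4), (2, 3, 0)]
print(sort_task(seq, ["shape", "color", "texture"]))  # [(0, 0, 2), (0, 1, 1), (2, 0, 4), (2, 3, 0)]
```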
We instantiated the token vocabularies as onehot or multihot vectors. The task tokens were onehot vectors with the corresponding task category set to one, with one additional task dimension corresponding to the end-of-sequence (EOS) token. The item tokens were multihot vectors whose units indicated the item's value in each feature dimension (equivalent to concatenated onehot feature vectors). As such, the model receives disentangled feature information in the input, though in principle it can learn to disentangle feature information given onehot encodings for each unique item.
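As an illustration of this encoding, here is one way such multihot item tokens could be constructed by concatenating onehot feature vectors (5 values per feature, as in the item pool above); the function name and array layout are assumptions made for this sketch.

```python
import numpy as np

N_VALUES = 5  # 5 shapes, 5 colors, 5 textures

def multihot_item(shape, color, texture, n_values=N_VALUES):
    """Concatenate onehot shape, color, and texture vectors into one item token."""
    vec = np.zeros(3 * n_values)
    vec[shape] = 1.0                    # shape block
    vec[n_values + color] = 1.0         # color block
    vec[2 * n_values + texture] = 1.0   # texture block
    return vec

print(multihot_item(2, 0, 4))
# -> a 15-dimensional vector with ones at positions 2, 5, and 14
```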
Label-based order encoding. Using position-based order encodings, models trained with sequences up to length L encounter an out-of-distribution problem when tested on longer sequences, as position encodings beyond L are unfamiliar to the model. We introduce label-based encoding, which instead pairs items in each sequence with ascending random integer labels to communicate order information (Fig 1B). This allows models to encode longer sequences of tokens with familiar labels seen during training. In our model, these labels were embedded with learnable weights, and we contrast the random label encoding method with sinusoidal and learnable encodings based on item positions. Concurrent work also explored the random position method and tested other types of encodings (Anonymous, 2022). In all reported results, we pre-generated item labels sampled from a range up to the maximum generalization length (50) for all sequences in the dataset, and these labels were shared across training steps and model seeds. In practice, the labels for each sequence can be sampled online and from a larger range to accommodate generalization to even longer sequences.
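The following is a minimal sketch of how such ascending random labels could be sampled for a sequence, assuming labels are drawn without replacement from the range [0, 50) and then sorted; the exact sampling scheme is not spelled out above, so treat this as one plausible instantiation.

```python
import numpy as np

def sample_order_labels(seq_len, max_label=50, rng=None):
    """Draw distinct random integers from [0, max_label) and sort them, so that
    short training sequences already see labels from the full range that longer
    generalization sequences will use."""
    rng = rng or np.random.default_rng()
    labels = rng.choice(max_label, size=seq_len, replace=False)
    return np.sort(labels)

print(sample_order_labels(5))   # e.g. [ 3 11 26 38 49]
```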
Model. The main model architecture is shown in Fig 1B. Each input sequence consisted of a task token and the paired item and label tokens, with the EOS token serving as the first query for tokens in the output sequence. The input tokens were first embedded into the model's latent representational space through a set of embedding layers depending on the token type (task, item, or label). The item and label embeddings were then added to form a composite item embedding. These embedded tokens were fed into a causal transformer, which contained one or two layers of alternating future-masked attention sublayers and MLP sublayers. Residual connections and layer normalization were applied after each sublayer as in Vaswani et al. (2017). We tested architectural variations in the number of attention heads in different layers of the model while controlling for the total number of learnable parameters (see detailed hyperparameters in Appendix B). The state of the query token at the output of the causal transformer was passed through two linear heads to predict the next output token (the task token, or an item and its associated label).
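To make the architecture concrete, here is a rough PyTorch sketch of this setup, assuming 15-dimensional multihot items (3 features × 5 values), 7 task/EOS categories, and a maximum label of 50; the class name, hidden sizes, and the use of nn.TransformerEncoder with a causal mask are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LabelEncodedTransformer(nn.Module):
    """Rough sketch of the setup described above: task, item, and label
    embeddings feed a small causal transformer with separate item and label
    output heads. Dimensions and layer choices are illustrative assumptions."""

    def __init__(self, d_model=64, n_heads=1, n_layers=2,
                 n_task_tokens=7, item_dim=15, max_label=50):
        super().__init__()
        self.task_embed = nn.Embedding(n_task_tokens, d_model)   # 6 tasks + EOS
        self.item_embed = nn.Linear(item_dim, d_model)           # multihot items
        self.label_embed = nn.Embedding(max_label, d_model)      # order labels
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, n_layers)    # post-norm, as in Vaswani et al.
        self.item_head = nn.Linear(d_model, item_dim)            # feature logits
        self.label_head = nn.Linear(d_model, max_label)          # label logits

    def forward(self, task_ids, items, labels):
        # Composite item embedding: item embedding + label embedding,
        # with the task token prepended to the sequence.
        tokens = self.item_embed(items) + self.label_embed(labels)
        x = torch.cat([self.task_embed(task_ids).unsqueeze(1), tokens], dim=1)
        # Future-masked (causal) self-attention over the whole sequence.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.encoder(x, mask=mask)
        return self.item_head(h), self.label_head(h)

model = LabelEncodedTransformer()
items = torch.zeros(2, 10, 15)             # batch of 2 sequences, 10 multihot items each
labels = torch.randint(0, 50, (2, 10))     # random order labels
task_ids = torch.tensor([0, 3])            # one task token per sequence
item_logits, label_logits = model(task_ids, items, labels)
print(item_logits.shape, label_logits.shape)   # (2, 11, 15) and (2, 11, 50)
```

In the setup described above, predictions are read out from the state of the current query token during decoding; for simplicity, this sketch applies the two heads at every position.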
Training and evaluation. The models were trained using full teacher forcing (where we always feed the model the correct tokens) on all sequences of lengths 5 to 25 in the dataset (~46k) and evaluated for length generalization on sequences of lengths 26 to 50 (~54k). We trained models in both single-task and multi-task settings. In both cases, the output sequence consisted of the correctly ordered items and their labels given the task being trained, followed by an EOS token. In single-task learning, we did not include the task token in training or evaluation. In multi-task learning, the task token was used and the models were trained to first output the task token before predicting the output sequence. The training sequences used in multi-task learning remained the same ones between lengths 5–25, but each sequence corresponded to a different output sequence under each task. The models were trained using softmax cross-entropy loss on the prediction of feature classes, labels, and task/EOS categories for tokens in the output sequence. Item prediction accuracy was computed as average feature prediction accuracy, i.e., if the model predicted 2 of 3 features correctly, its token-level item accuracy was 2/3. Training stopped at 32k gradient updates for single-task models and 38k gradient updates for multi-task models. Below, we report both token-level and sequence-level accuracy, under both teacher forcing and top-1 rollout (i.e., greedy decoding). Results were aggregated over four random seeds for each task type × architecture pair. Unless otherwise specified, results were taken from the checkpoint with the highest generalization accuracy within each seed. Error shades and error bars indicate the standard error of the mean across models.
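As a small worked example of this accuracy bookkeeping, the sketch below computes token-level item accuracy as the average fraction of correctly predicted features, and sequence-level accuracy as the proportion of sequences meeting a per-token correctness threshold; the function and variable names are our own.

```python
import numpy as np

def token_item_accuracy(pred_features, true_features):
    """Average fraction of features predicted correctly per output token
    (predicting 2 of 3 features right on a token contributes 2/3)."""
    return (pred_features == true_features).mean()

def sequence_level_accuracy(token_correct_per_seq, min_frac=1.0):
    """Proportion of sequences with at least min_frac of their tokens fully
    correct (min_frac=1.0 for exact match, 0.95 for the relaxed criterion)."""
    return float(np.mean([np.mean(seq) >= min_frac for seq in token_correct_per_seq]))

# Toy example: one sequence of 3 output tokens, each a (shape, color, texture) triple.
pred = np.array([[2, 0, 4], [0, 1, 1], [2, 3, 0]])
true = np.array([[2, 0, 4], [0, 1, 2], [2, 3, 0]])
print(token_item_accuracy(pred, true))   # 0.888... (one of nine features is wrong)
```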
3 Results
3.1 Single-task learning
Two-layer models with label encoding learn the SORT task and generalize to longer sequences. We first trained the model with the SORT[SHAPE,COLOR,TEXTURE] task. Using our label encoding
method, models with two single-headed layers (indicated as [1,1]) were able to achieve near-ceiling
accuracy on training sequences and generalize to longer sequences (Fig 2; also see quantitative
results in Appendix C). The predictions of the EOS token were also highly accurate in these models
(see Fig S1A in Appendix A.1). Item prediction was more accurate than label prediction in this
task, reflecting that the models represented item feature information more accurately in order to sort
the input tokens.

Figure 2: Token- and sequence-level accuracy for single-task models. A. Token-level accuracy on training and generalization sequences over learning. B. Token-level accuracy over sequence length. C. Token-level accuracy over sequence positions. D. Proportion of sequences for which the model predicted 100% of tokens correctly (upper) or more than 95% of tokens correctly (lower). In B, C, and D, results were taken from 5k novel sequences in the training length range (in B and D) and 5k generalization sequences (B, C, and D). Legends indicate the number of attention heads in each layer and the order encoding used (in A). Gray shades indicate the range of lengths used in training.

The two-layer models showed some degradation in sequence-level accuracy as a
function of sequence length, but the failures on longer sequences were not catastrophic, as these
models scored very well on longer sequences when up to 5% prediction errors were allowed (Fig 2D;
also see Fig S1B, and Fig S2 for accuracy under rollout in Appendix A.1). In contrast, two-layer
models trained with sinusoidal or learnable position encodings performed worse across both training
and generalization sequences (Fig 2A).
The two-layer models were also much better than single-layer models with either one or two attention
heads. While these single-layer models were able to exploit some correlations between items and
output positions (e.g., item [0,0,0] always came first, and item [4,4,4] always came last), they failed
to sort items in the middle positions (Fig 2C). In contrast, a single-layer, single-headed model was
sufficient to learn the COPY or the REVERSE task (see Fig S3A in Appendix A.1), suggesting that
multiple layers strongly benefit successful learning of multi-level problems.
Figure 3: Attention patterns in two-layer models. A. Attention maps for an example generalization sequence. Items in the input sequence were reordered to match their output order for visualization purposes. Numbers 1–5 mark the beginning of each shape group. Label e indicates the EOS token. B. First-layer attention from query items to source items within shape groups. C. Attention to EOS as a function of item index within each shape group (indicated by labels s1–s5). Results in B and C were aggregated across 1k generalization sequences and across seeds.
Distinct two-stage processing across attention layers. The attention weights in the two-layer
models revealed signs of task decomposition (Fig 3A). The attention head in the first layer tended
to distribute attention to the unsorted items that share the same shape as the current query item.
The attention head in the second layer then almost exclusively attended to the next output token
in preparation for feature and label readout. This pattern appeared robustly across sequences and
across different seeds (Fig 3B). Interestingly, there was an increase in the attention weights to the EOS
token as the model received query items towards the end of each shape group. This attention to EOS
increased to similar degrees in early or late shape groups (Fig 3C), again suggesting that the model
learned to systematically process items within each shape group, even though generating the EOS