Systematic Generalization and Emergent Structures
in Transformers Trained on Structured Tasks
Yuxuan Li
Department of Psychology
Stanford University
Stanford, CA 94305
liyuxuan@stanford.edu
James L. McClelland
Department of Psychology
Stanford University
Stanford, CA 94305
jlmcc@stanford.edu
Abstract
Transformer networks have seen great success in natural language processing and
machine vision, where task objectives such as next word prediction and image
classification benefit from nuanced context sensitivity across high-dimensional
inputs. However, there is an ongoing debate about how and when transformers can
acquire highly structured behavior and achieve systematic generalization. Here,
we explore how well a causal transformer can perform a set of algorithmic tasks,
including copying, sorting, and hierarchical compositions of these operations. We
demonstrate strong generalization to sequences longer than those used in training
by replacing the standard positional encoding typically used in transformers with
labels arbitrarily paired with items in the sequence. We search for the layer and
head configuration sufficient to solve these tasks, then probe for signs of systematic
processing in latent representations and attention patterns. We show that two-layer
transformers learn reliable solutions to multi-level problems, develop signs of task
decomposition, and encode input items in a way that encourages the exploitation
of shared computation across related tasks. These results provide key insights into
how attention layers support structured computation both within a task and across
multiple tasks.
1 Introduction
Since their introduction (Vaswani et al., 2017), transformer-based models have become the new norm in natural language modeling (Brown et al., 2020; Devlin et al., 2018) and are being leveraged for
machine vision tasks as well as in reinforcement learning contexts (Chen et al., 2021; Dosovitskiy
et al., 2020; Janner et al., 2021; Ramesh et al., 2021). Transformers trained on large amounts of data
under simple self-supervised, sequence modeling objectives are capable of subsequent generalization
to a wide variety of tasks, making them an appealing option for building multi-modal, multi-task,
generalist agents (Bommasani et al., 2021; Reed et al., 2022).
Central to this success is the ability to represent each part of the input in the context of other
parts through the self-attention mechanism. This may be especially important for task objectives
such as next word prediction and image classification at scale with naturalistic data, which benefit
from nuanced context sensitivity across high-dimensional inputs. Interestingly, transformer-based
language models seem to also acquire some knowledge of syntactic structures without being explicitly
trained to do so and display few-shot learning capabilities (Brown et al., 2020; Linzen and Baroni,
2021; Manning et al., 2020). These insights have led to ongoing work assessing broader reasoning
capabilities in these models (Binz and Schulz, 2022; Dasgupta et al., 2022).
Despite success in learning large-scale, naturalistic data and signs of generalizable behavior or
sensitivity to structures, how transformer models support systematic generalization remains to
be better understood. Recent work has demonstrated that large language models struggle with longer problems and fail to robustly reason beyond the training data (Anil et al., 2022; Razeghi et al.,
2022). Different architectural variations have been proposed to improve length generalization in
transformers, highlighting the role of variants of position-based encodings (Csordás et al., 2021a,b;
Ontanón et al., 2021; Press et al., 2021). Indeed, whether neural networks will ever be capable of
systematic generalization without building in explicit symbolic components remains an open question
(Fodor and Pylyshyn, 1988; Smolensky et al., 2022).
Here, we approach this question by training a causal transformer to perform a set of algorithmic
operations, including copy, reverse, and hierarchical group or sort tasks. We explicitly sought the minimal transformer that would reliably solve these simple tasks, then thoroughly analyzed this minimal solution through attention ablation and representation analysis to understand its internal computational dynamics. Exploring how a transformer with no predefined task-aligned structure
could adapt to structures in these algorithmic tasks provides a starting point for understanding how
self-attention can tune to structures in more complex problems, e.g., those with the kinds of exceptions
and partial regularities of natural datasets, where the exploitation of task structures may occur in a
more approximate and graded manner. Our main contributions are:
1. We highlight a simple label-based order encoding method in place of the positional encoding methods typically used in transformers, and show that it helps our models achieve strong length generalization performance across the set of algorithmic tasks we examine.
2. We thoroughly analyze simple, two-layer causal transformers that learn these algorithmic tasks, and show that the attention layers develop signs of systematic decomposition within tasks and exploitation of shared structures across tasks.
2 Method
Figure 1: Task and model design. (The figure illustrates the six training tasks applied to an example input sequence: copy, reverse, group [shape], group [color], sort [shape, color, texture], and sort [color, shape, texture]; and the causal transformer, in which task, item, and label embeddings feed two blocks of masked multi-head attention and feed-forward sublayers with add & norm, followed by item and label output heads.)
Dataset. We created an item pool covering all combinations of 5 shapes, 5 colors, and 5 textures, and generated a sequence dataset by sampling 100k sequences of 5–50 items randomly selected from the item pool. The tasks we used to train the models are shown in Fig 1A. Each task corresponds to one of the following rules, each of which relies on item feature and/or item order information to rearrange an input sequence (grouping or sorting items by a particular feature is done with respect to a pre-defined feature sort order, e.g., circles < squares < pentagons, or red < purple < blue):
COPY (C): copy the input sequence.
REVERSE (R): reverse the input sequence.
GROUP[SHAPE] (G[S]): group the items by shape, preserve the input order within each shape group.
GROUP[COLOR] (G[C]): group the items by color, preserve the input order within each color group.
SORT[SHAPE,COLOR,TEXTURE] (S[S]): sort the items first by shape, then by color, then by texture.
SORT[COLOR,SHAPE,TEXTURE] (S[C]): sort the items first by color, then by shape, then by texture.
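To make these rules concrete, the following is a minimal Python sketch of the grouping and sorting operations over items represented as (shape, color, texture) index triples; the integer-triple representation and helper names are illustrative assumptions, not the authors' code.

```python
# Items are represented as (shape, color, texture) index triples, where each
# index encodes the item's rank in the pre-defined feature sort order
# (e.g., circles < squares < pentagons maps to 0 < 1 < 2).

FEATURE_INDEX = {"shape": 0, "color": 1, "texture": 2}

def copy_task(items):
    return list(items)

def reverse_task(items):
    return list(reversed(items))

def group_task(items, feature):
    # Stable sort on one feature: items with the same feature value keep
    # their original input order, as GROUP requires.
    return sorted(items, key=lambda item: item[FEATURE_INDEX[feature]])

def sort_task(items, feature_order):
    # Lexicographic sort over the given feature priority, e.g.
    # ["shape", "color", "texture"] for SORT[SHAPE,COLOR,TEXTURE].
    return sorted(items, key=lambda item: tuple(item[FEATURE_INDEX[f]] for f in feature_order))

seq = [(2, 0, 4), (0, 1, 1), (2, 3, 0), (0, 0, 2)]
print(group_task(seq, "shape"))                       # [(0, 1, 1), (0, 0, 2), (2, 0, 4), (2, 3, 0)]
print(sort_task(seq, ["shape", "color", "texture"]))  # [(0, 0, 2), (0, 1, 1), (2, 0, 4), (2, 3, 0)]
```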
We instantiated the token vocabularies as onehot or multihot vectors. The task tokens were onehot vectors with the corresponding task category set to one, with one additional task dimension corresponding to the end-of-sequence (EOS) token. The item tokens were multihot vectors whose units indicated the item's value in each feature dimension (equivalent to concatenated onehot feature vectors). As such, the model receives disentangled feature information in the input, though in principle it can learn to disentangle feature information given onehot encodings for each unique item.
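As an illustration of this encoding, here is one way such multihot item tokens could be constructed by concatenating onehot feature vectors (5 values per feature, as in the item pool above); the function name and array layout are assumptions made for this sketch.

```python
import numpy as np

N_VALUES = 5  # 5 shapes, 5 colors, 5 textures

def multihot_item(shape, color, texture, n_values=N_VALUES):
    """Concatenate onehot shape, color, and texture vectors into one item token."""
    vec = np.zeros(3 * n_values)
    vec[shape] = 1.0                    # shape block
    vec[n_values + color] = 1.0         # color block
    vec[2 * n_values + texture] = 1.0   # texture block
    return vec

print(multihot_item(2, 0, 4))
# -> a 15-dimensional vector with ones at positions 2, 5, and 14
```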
Label-based order encoding. Using position-based order encodings, models trained with sequences up to length L encounter an out-of-distribution problem when tested on longer sequences, as position encodings beyond L are unfamiliar to the model. We introduce label-based encoding, which instead pairs items in each sequence with ascending random integer labels to communicate order information (Fig 1B). This allows models to encode longer sequences of tokens with familiar labels seen during training. In our model, these labels were embedded with learnable weights, and we contrast the random label encoding method with sinusoidal and learnable encodings based on item positions. Concurrent work also explored the random position method and tested other types of encodings (Anonymous, 2022). In all reported results, we pre-generated item labels sampled from a range up to the maximum generalization length (50) for all sequences in the dataset, and these labels were shared across training steps and model seeds. In practice, the labels for each sequence can be sampled online and from a larger range to accommodate generalization to even longer sequences.
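The following is a minimal sketch of how such ascending random labels could be sampled for a sequence, assuming labels are drawn without replacement from the range [0, 50) and then sorted; the exact sampling scheme is not spelled out above, so treat this as one plausible instantiation.

```python
import numpy as np

def sample_order_labels(seq_len, max_label=50, rng=None):
    """Draw distinct random integers from [0, max_label) and sort them, so that
    short training sequences already see labels from the full range that longer
    generalization sequences will use."""
    rng = rng or np.random.default_rng()
    labels = rng.choice(max_label, size=seq_len, replace=False)
    return np.sort(labels)

print(sample_order_labels(5))   # e.g. [ 3 11 26 38 49]
```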
Model. The main model architecture is shown in Fig 1B. Each input sequence consisted of a task token and the paired item and label tokens, with the EOS token serving as the first query for tokens in the output sequence. The input tokens were first embedded into the model's latent representational space through a set of embedding layers depending on the token type (task, item, or label). The item and label embeddings were then added to form a composite item embedding. These embedded tokens were fed into a causal transformer, which contained one or two layers of alternating future-masked attention sublayers and MLP sublayers. Residual connections and layer normalization were applied after each sublayer as in Vaswani et al. (2017). We tested architectural variations in the number of attention heads in different layers of the model while controlling for the total number of learnable parameters (see detailed hyperparameters in Appendix B). The state of the query token at the output of the causal transformer was passed through two linear heads to predict the next output token (the task token, or an item and its associated label).
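To make the architecture concrete, here is a rough PyTorch sketch of this setup, assuming 15-dimensional multihot items (3 features × 5 values), 7 task/EOS categories, and a maximum label of 50; the class name, hidden sizes, and the use of nn.TransformerEncoder with a causal mask are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LabelEncodedTransformer(nn.Module):
    """Rough sketch of the setup described above: task, item, and label
    embeddings feed a small causal transformer with separate item and label
    output heads. Dimensions and layer choices are illustrative assumptions."""

    def __init__(self, d_model=64, n_heads=1, n_layers=2,
                 n_task_tokens=7, item_dim=15, max_label=50):
        super().__init__()
        self.task_embed = nn.Embedding(n_task_tokens, d_model)   # 6 tasks + EOS
        self.item_embed = nn.Linear(item_dim, d_model)           # multihot items
        self.label_embed = nn.Embedding(max_label, d_model)      # order labels
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, n_layers)    # post-norm, as in Vaswani et al.
        self.item_head = nn.Linear(d_model, item_dim)            # feature logits
        self.label_head = nn.Linear(d_model, max_label)          # label logits

    def forward(self, task_ids, items, labels):
        # Composite item embedding: item embedding + label embedding,
        # with the task token prepended to the sequence.
        tokens = self.item_embed(items) + self.label_embed(labels)
        x = torch.cat([self.task_embed(task_ids).unsqueeze(1), tokens], dim=1)
        # Future-masked (causal) self-attention over the whole sequence.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.encoder(x, mask=mask)
        return self.item_head(h), self.label_head(h)

model = LabelEncodedTransformer()
items = torch.zeros(2, 10, 15)             # batch of 2 sequences, 10 multihot items each
labels = torch.randint(0, 50, (2, 10))     # random order labels
task_ids = torch.tensor([0, 3])            # one task token per sequence
item_logits, label_logits = model(task_ids, items, labels)
print(item_logits.shape, label_logits.shape)   # (2, 11, 15) and (2, 11, 50)
```

In the setup described above, predictions are read out from the state of the current query token during decoding; for simplicity, this sketch applies the two heads at every position.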
Training and evaluation. The models were trained using full teacher forcing (where we always feed the model the correct tokens) on all sequences of lengths 5 to 25 in the dataset (~46k) and evaluated for length generalization on sequences of lengths 26 to 50 (~54k). We trained models in both single-task and multi-task settings. In both cases, the output sequence consisted of the correctly ordered items and their labels given the task being trained, followed by an EOS token. In single-task learning, we did not include the task token in training or evaluation. In multi-task learning, the task token was used and the models were trained to first output the task token before predicting the output sequence. The training sequences used in multi-task learning remained the same ones between lengths 5–25, but each sequence corresponded to a different output sequence under each task. The models were trained using softmax cross-entropy loss on the prediction of feature classes, labels, and task/EOS categories for tokens in the output sequence. Item prediction accuracy was computed as average feature prediction accuracy, i.e., if the model predicted 2 of 3 features correctly, its token-level item accuracy was 2/3. Training stopped at 32k gradient updates for single-task models and 38k gradient updates for multi-task models. Below, we report both token-level and sequence-level accuracy, under both teacher forcing and top-1 rollout (i.e., greedy decoding). Results were aggregated over four random seeds for each task type × architecture pair. Unless otherwise specified, results were taken from the checkpoint with the highest generalization accuracy within each seed. Error shades and error bars indicate the standard error of the mean across models.
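As a small worked example of this accuracy bookkeeping, the sketch below computes token-level item accuracy as the average fraction of correctly predicted features, and sequence-level accuracy as the proportion of sequences meeting a per-token correctness threshold; the function and variable names are our own.

```python
import numpy as np

def token_item_accuracy(pred_features, true_features):
    """Average fraction of features predicted correctly per output token
    (predicting 2 of 3 features right on a token contributes 2/3)."""
    return (pred_features == true_features).mean()

def sequence_level_accuracy(token_correct_per_seq, min_frac=1.0):
    """Proportion of sequences with at least min_frac of their tokens fully
    correct (min_frac=1.0 for exact match, 0.95 for the relaxed criterion)."""
    return float(np.mean([np.mean(seq) >= min_frac for seq in token_correct_per_seq]))

# Toy example: one sequence of 3 output tokens, each a (shape, color, texture) triple.
pred = np.array([[2, 0, 4], [0, 1, 1], [2, 3, 0]])
true = np.array([[2, 0, 4], [0, 1, 2], [2, 3, 0]])
print(token_item_accuracy(pred, true))   # 0.888... (one of nine features is wrong)
```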
3 Results
3.1 Single-task learning
Two-layer models with label encoding learn the SORT task and generalize to longer sequences. We first trained the model with the SORT[SHAPE,COLOR,TEXTURE] task. Using our label encoding
method, models with two single-headed layers (indicated as [1,1]) were able to achieve near-ceiling
accuracy on training sequences and generalize to longer sequences (Fig 2; also see quantitative
results in Appendix C). The predictions of the EOS token were also highly accurate in these models
(see Fig S1A in Appendix A.1). Item prediction was more accurate than label prediction in this
task, reflecting that the models represented item feature information more accurately in order to sort
the input tokens.

Figure 2: Token- and sequence-level accuracy for single-task models. A. Token-level accuracy on training and generalization sequences over learning. B. Token-level accuracy over sequence length. C. Token-level accuracy over sequence positions. D. Proportion of sequences for which the model predicted 100% of tokens correctly (upper) or more than 95% of tokens correctly (lower). In B, C, and D, results were taken from 5k novel sequences in the training length range (in B and D) and 5k generalization sequences (B, C, and D). Legends indicate the number of attention heads in each layer and the order encoding used (in A). Gray shades indicate the range of lengths used in training.

The two-layer models showed some degradation in sequence-level accuracy as a
function of sequence length, but the failures on longer sequences were not catastrophic, as these
models scored very well on longer sequences when up to 5% prediction errors were allowed (Fig 2D;
also see Fig S1B, and Fig S2 for accuracy under rollout in Appendix A.1). In contrast, two-layer
models trained with sinusoidal or learnable position encodings performed worse across both training
and generalization sequences (Fig 2A).
The two-layer models were also much better than single-layer models with either one or two attention
heads. While these single-layer models were able to exploit some correlations between items and
output positions (e.g., item [0,0,0] always came first, and item [4,4,4] always came last), they failed
to sort items in the middle positions (Fig 2C). In contrast, a single-layer, single-headed model was
sufficient to learn the COPY or the REVERSE task (see Fig S3A in Appendix A.1), suggesting that
multiple layers strongly benefit successful learning of multi-level problems.
Figure 3: Attention patterns in two-layer models. A. Attention maps for an example generalization sequence. Items in the input sequence were reordered to match their output order for visualization purposes. Numbers 1–5 mark the beginning of each shape group. Label e indicates the EOS token. B. First-layer attention from query items to source items within shape groups. C. Attention to EOS as a function of item index within each shape group (indicated by labels s1–s5). Results in B and C were aggregated across 1k generalization sequences and across seeds.
Distinct two-stage processing across attention layers. The attention weights in the two-layer
models revealed signs of task decomposition (Fig 3A). The attention head in the first layer tended
to distribute attention to the unsorted items that share the same shape as the current query item.
The attention head in the second layer then almost exclusively attended to the next output token
in preparation for feature and label readout. This pattern appeared robustly across sequences and
across different seeds (Fig 3B). Interestingly, there was an increase in the attention weights to the EOS
token as the model received query items towards the end of each shape group. This attention to EOS
increased to similar degrees in early or late shape groups (Fig 3C), again suggesting that the model
learned to systematically process items within each shape group, even though generating the EOS