WIDE ATTENTION IS THE WAY FORWARD FOR TRANSFORMERS?
Jason Ross Brown
University of Cambridge
jrb239@cam.ac.uk
Yiren Zhao
Imperial College London and University of Cambridge
a.zhao@imperial.ac.uk
Ilia Shumailov
University of Oxford
ilia.shumailov@chch.ox.ac.uk
Robert D Mullins
University of Cambridge
robert.mullins@cl.cam.ac.uk
ABSTRACT
The Transformer is an extremely powerful and prominent deep learning architecture. In this work,
we challenge the commonly held belief in deep learning that going deeper is better, and show an
alternative approach: building wider attention Transformers. We demonstrate that wide single layer Transformer models can typically equal or sometimes outperform deeper ones in a variety
of Natural Language Processing (NLP) tasks when both are trained from scratch. The impact of
changing the model aspect ratio on Transformers is studied systematically. This ratio balances
the number of layers and the number of attention heads per layer, while keeping the total number
of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10
attention types, single layer wide models perform 0.3% better than their deep counterparts. We show
an in-depth evaluation and demonstrate how wide models require a far smaller memory footprint and can run faster on commodity hardware; in addition, these wider models are also more interpretable. For example, a single layer Transformer on the IMDb byte level text classification task has 3.1× faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. We
therefore put forward wider and shallower models as a viable and desirable alternative for small
models on NLP tasks, and as an important area of research for domains beyond this.
1 Introduction
Since Vaswani et al. [2017], Transformer-based architectures have become widespread due to their advantages over
previous architectures such as recurrent neural networks (RNNs) and sometimes even convolutional neural networks
(CNNs). Many new X-formers have also been proposed that improve on the original Transformer by overcoming
its limitation on sequence length by providing a more scalable attention mechanism [Beltagy et al., 2020, Choromanski et al., 2020, Wang et al., 2020b]. However, little research has been done on the relevance of the size of the
attention computation in each layer, the number of attention layers, and how these parameters relate to the resulting
Transformer’s characteristics.
The primary sources of parameters in a Transformer network are the feed-forward network (FFN) in each encoder or decoder layer and the linear layers that convert from the sequence feature dimension (often equal to the initial embedding dimension) to the attention feature dimension, and back again after attention is applied. Each attention head typically has an equal number of attention features. Consider an input sequence $X \in \mathbb{R}^{S \times E}$, where $S$ is the sequence length and $E$ is the embedding dimension. Multi-head attention with $H$ heads is used, and each head operates on a learned projection with dimension $A$. After the attention mechanism there is an FFN with a single hidden dimension of size $M$. These layers are then stacked $L$ times, as illustrated on the left of Figure 1. Often, in a typical Transformer, $E = AH$. The total number of parameters in a Transformer encoder is given by:
Figure 1: A comparison of a deep Transformer-based classifier (left, with $L$ layers and $H$ heads per layer) vs an equivalent wide one (right, with a single layer and $L \times H$ heads). Layer norms and residual connections have been omitted from the diagram for clarity; for details of the full Transformer architecture see Vaswani et al. [2017].
\[
\text{Encoder Parameters} = L(3EAH + AHE + EM + ME) = 2LE(2AH + M)
\]
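As a sanity check on this count, the following is a minimal sketch (not the authors' code) that evaluates the formula. It counts weight matrices only, ignoring biases, layer norms, and embeddings, and the example values of L, E, A, H, and M are illustrative rather than the paper's exact configurations.

```python
def encoder_params(L: int, E: int, A: int, H: int, M: int) -> int:
    """Weight-matrix parameters of L encoder layers: QKV projections (3EAH),
    output projection (AHE), and the two FFN matrices (EM + ME)."""
    return L * (3 * E * A * H + A * H * E + E * M + M * E)  # == 2 * L * E * (2 * A * H + M)

# Illustrative deep vs wide configurations with the same total number of heads (L * H = 48).
deep = encoder_params(L=6, E=128, A=64, H=8, M=256)
wide = encoder_params(L=1, E=128, A=64, H=48, M=256)
print(deep, wide, round(deep / wide, 2))  # 1966080 1638400 1.2

# With L * H held constant, the attention terms (4 * L * E * A * H) do not change,
# so only the per-layer FFN cost (2 * L * E * M) shrinks as the model gets wider and shallower.
```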
In this paper, we investigate the effects of changing $L$ and $H$ while keeping their product, the total number of heads, the same. We start with typical values for $L$ and $H$ and then move down to a single layer. A diagram illustrating our design space and the differences between our widest and deepest models is given in Figure 1. We refer to the ratio of layers to heads as the model aspect ratio. This naturally leads to an intriguing question: what is the best model aspect ratio for the growing number of X-former models? We consider the impact of model aspect ratio on accuracy, run-time performance, model size, and interpretability.
Based on the question above, we investigate the influence of various model aspect ratios on 9 X-former models, each with its own attention mechanism, in addition to the original Transformer. Prior work on Transformer architectures has mainly focused on designing more efficient attention styles [Choromanski et al., 2020, Wang et al., 2020b] or using Neural Architecture Search (NAS) to discover an optimal combination of operators [So et al., 2019]. By changing the model aspect ratio, we consider a more coarse-grained design space. This design space is not commonly explored by NAS algorithms for Transformers, and we evaluate some interesting model architectures, such as a single layer model with many parallel heads.
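To make the widest point in this design space concrete, below is a minimal PyTorch sketch, not the authors' released implementation, of a single encoder block with $H$ parallel heads of fixed per-head dimension $A$; the module name and the hyperparameter values in the usage lines are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WideEncoderBlock(nn.Module):
    """One encoder layer with H parallel heads of fixed per-head dimension A
    (symbols follow the text: E embedding, A attention features, H heads, M FFN hidden size)."""

    def __init__(self, E: int, A: int, H: int, M: int):
        super().__init__()
        self.H, self.A = H, A
        self.qkv = nn.Linear(E, 3 * A * H)   # Q, K, V projections (~3EAH weights)
        self.out = nn.Linear(A * H, E)       # concatenation + linear layer (~AHE weights)
        self.ffn = nn.Sequential(nn.Linear(E, M), nn.ReLU(), nn.Linear(M, E))  # ~2EM weights
        self.norm1, self.norm2 = nn.LayerNorm(E), nn.LayerNorm(E)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, S, E)
        B, S, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)                        # each (B, S, A*H)
        q, k, v = (t.view(B, S, self.H, self.A).transpose(1, 2)       # split heads: (B, H, S, A)
                   for t in (q, k, v))
        heads = F.scaled_dot_product_attention(q, k, v)               # dot-product attention per head
        heads = heads.transpose(1, 2).reshape(B, S, self.H * self.A)  # concatenate heads
        x = self.norm1(x + self.out(heads))                           # residual + layer norm
        return self.norm2(x + self.ffn(x))

# A deep classifier stacks this block L times with H heads each; the wide alternative
# uses a single block with L * H heads, keeping the per-head size A unchanged.
wide_block = WideEncoderBlock(E=128, A=64, H=48, M=256)
print(wide_block(torch.randn(2, 500, 128)).shape)  # torch.Size([2, 500, 128])
```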
For each model aspect ratio we run our experiments with each X-former across a number of text classification tasks
with various input sequence lengths ranging from 500 to 4000. We empirically observe that wider and shallower
models can typically equal or sometimes beat the accuracy of deeper models. This observation challenges the common
design paradigm of trying to build deeper neural networks. We show several other major advantages of a shallower and wider model. First, it is more latency-friendly on commodity hardware. Second, wider models are smaller in terms of the number of parameters. Third, the outputs of wider models are more interpretable.
To summarise, we make the following contributions in this paper:
• We demonstrate that wider and shallower models can typically equal or sometimes beat the accuracy of deeper models when there is no pretraining of weights or embeddings. Across all 4 tasks, average accuracy for the vanilla Transformer increases by 0.4% between normal deep models and our single layer wide models.
• We show that our results are consistent across a variety of different attention mechanisms and input sequence lengths, and thus there is a general design equivalence between increasing the depth of a Transformer model and increasing its width. Averaged across all non-vanilla attention types and tasks, accuracy increases by 0.3% from deepest to widest.
• We show that widening the models while fixing the attention computation size results in fewer parameters overall and faster inference. We show that wider models are on average 1.4× smaller and have 3.1× faster inference latency on a CPU and 1.9× on a GPU, compared to deep models.
• We demonstrate how single layer networks can have more interpretable predictions by inspecting the attention weights of each head in a single layer (a sketch of this kind of inspection follows this list).
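As a hint of what such an inspection can look like, here is a small illustrative helper, an assumption about the workflow rather than the authors' code, that recovers each head's attention map from a single wide layer's query and key projections so the heads can be visualised side by side.

```python
import math
import torch

def per_head_attention_maps(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: (batch, heads, seq, head_dim) -> softmax attention maps (batch, heads, seq, seq)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1)

# In a single layer model these maps are the only attention stage, so a prediction can be
# traced directly to the tokens each of the L * H heads attended to (shapes are illustrative).
maps = per_head_attention_maps(torch.randn(1, 48, 128, 64), torch.randn(1, 48, 128, 64))
print(maps.shape)  # torch.Size([1, 48, 128, 128])
```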
2 Related Work
2.1 Wider Networks
Zagoruyko & Komodakis [2016] and Wu et al. [2019] respectively show and investigate how making ResNet CNNs [He et al., 2016] wider and shallower can improve their performance. Transformers also use residual connections, so this helps motivate our investigation.
Xue et al. [2022b] find that wider layers in Transformers (in both attention and sequence features) can improve performance on vision and natural language tasks. However, they use a mixture-of-experts layer instead of the typical feed-forward network at the output of the Transformer encoder. They do this because a much larger dimension of sequence features means the number of parameters in the feed-forward layer would increase greatly. As we are only increasing the attention feature dimension, we do not face this issue.
Xue et al. [2022a] investigate masked autoencoder training and how it can help reduce the problem of over-smoothing when training deep Transformer networks (token embeddings converging to be similar at deeper layers). They explore the optimal configuration of the Transformer when using masked autoencoding to pretrain it and then fine-tuning it on specific tasks. They find that for vision Transformers, it is better to pretrain with a deep autoencoder rather than a wide one. As we are not using pretraining or pretrained embeddings, their results are not directly comparable to our own.
2.2 Architecture optimization for Transformers
A lot of research in the Transformer field has focused on finding a more efficient attention mechanism [Beltagy et al., 2020, Child et al., 2019, Choromanski et al., 2020, Katharopoulos et al., 2020, Kitaev et al., 2020, Liu et al., 2018, Tay et al., 2020a, 2021, Wang et al., 2020b, Zaheer et al., 2020], as the original dot-product attention has quadratic complexity with respect to input sequence length. Long Range Arena (LRA, Tay et al. [2020b]) compares and contrasts these methods, noting that in the absence of pretrained embeddings and model parameters, the best attention mechanism is task-dependent. Thus there is no clear best attention mechanism. Because of this, we test different attention mechanisms on a variety of tasks with no pretrained embeddings or model parameters. This makes clear that our findings are largely independent of attention mechanism and task, and contextualises going wide as a general design decision.
The application of Neural Architecture Search (NAS) to Transformer models has also been explored. Both Neural
Architecture Transformer (NAT) [Guo et al., 2019] and Hardware-aware Transformers [Wang et al., 2020a] use NAS to search for models that are more hardware-efficient. So et al. [2019] use evolutionary search to optimize components and connections inside an encoder or decoder layer. Liu et al. [2022] apply RankNAS [Hu et al., 2021] to cosFormer [Qin et al., 2022] and standard Transformers [Vaswani et al., 2017]. They perform a search over the hyperparameters of a cosFormer network and compare the results to the same method applied to the standard Transformer. Tsai et al. [2020] search the hyperparameter space of a BERT [Devlin et al., 2018] model architecture heterogeneously to find an
optimal efficient network. These previous experiments have not varied or searched over different numbers of layers in
the Transformer encoder, which is a key factor in what we investigate.