WIDE ATTENTION IS THE WAY FORWARD FOR TRANSFORMERS?
Jason Ross Brown
University of Cambridge
jrb239@cam.ac.uk
Yiren Zhao
Imperial College London and University of Cambridge
a.zhao@imperial.ac.uk
Ilia Shumailov
University of Oxford
ilia.shumailov@chch.ox.ac.uk
Robert D Mullins
University of Cambridge
robert.mullins@cl.cam.ac.uk
ABSTRACT
The Transformer is an extremely powerful and prominent deep learning architecture. In this work,
we challenge the commonly held belief in deep learning that going deeper is better, and show an
alternative approach: building wider attention Transformers. We demonstrate that wide single layer Transformer models can typically equal or sometimes outperform deeper ones in a variety
of Natural Language Processing (NLP) tasks when both are trained from scratch. The impact of
changing the model aspect ratio on Transformers is studied systematically. This ratio balances
the number of layers and the number of attention heads per layer, while keeping the total number
of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10
attention types, single layer wide models perform 0.3% better than their deep counterparts. We show
an in-depth evaluation and demonstrate how wide models require a far smaller memory footprint and can run faster on commodity hardware; in addition, these wider models are also more interpretable. For example, a single layer Transformer on the IMDb byte level text classification task has 3.1× faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. We
therefore put forward wider and shallower models as a viable and desirable alternative for small
models on NLP tasks, and as an important area of research for domains beyond this.
1 Introduction
Since Vaswani et al. [2017], Transformer-based architectures have become widespread due to their advantages over
previous architectures such as recurrent neural networks (RNNs) and sometimes even convolutional neural networks
(CNNs). Many new X-formers have also been proposed that improve on the original Transformer by overcoming
its limitation on sequence length by providing a more scalable attention mechanism [Beltagy et al., 2020, Choromanski et al., 2020, Wang et al., 2020b]. However, little research has been done on the relevance of the size of the
attention computation in each layer, the number of attention layers, and how these parameters relate to the resulting
Transformer’s characteristics.
The primary sources of parameters in a Transformer network are the feed-forward network (FFN) in each encoder or decoder layer and the linear layers that convert from the sequence feature dimension (often equal to the initial embedding dimension) to the attention feature dimension, and back again after attention is applied. Each attention head typically has an equal number of attention features. Consider an input sequence $X \in \mathbb{R}^{S \times E}$, where $S$ is the sequence length and $E$ is the embedding dimension. Multi-head attention with $H$ heads is used, and each head operates on a learned projection with dimension $A$. After the attention mechanism there is an FFN with a single hidden dimension of size $M$. These layers are then stacked $L$ times, as illustrated on the left of Figure 1. Often, in a typical Transformer, $E = AH$. The total number of parameters in a Transformer encoder is given by:
Figure 1: A comparison of a deep Transformer-based classifier (left, with $L$ layers and $H$ heads per layer) vs an equivalent wide one (right, with a single layer and $L \times H$ heads). Layer norms and residual connections have been omitted from the diagram for clarity; for details of the full Transformer architecture see Vaswani et al. [2017].
\[
\text{Encoder Parameters} = L(3EAH + AHE + EM + ME) = 2LE(2AH + M)
\]
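As a sanity check on this count, the following is a minimal sketch (not the authors' code) that evaluates the formula. It counts weight matrices only, ignoring biases, layer norms, and embeddings, and the example values of L, E, A, H, and M are illustrative rather than the paper's exact configurations.

```python
def encoder_params(L: int, E: int, A: int, H: int, M: int) -> int:
    """Weight-matrix parameters of L encoder layers: QKV projections (3EAH),
    output projection (AHE), and the two FFN matrices (EM + ME)."""
    return L * (3 * E * A * H + A * H * E + E * M + M * E)  # == 2 * L * E * (2 * A * H + M)

# Illustrative deep vs wide configurations with the same total number of heads (L * H = 48).
deep = encoder_params(L=6, E=128, A=64, H=8, M=256)
wide = encoder_params(L=1, E=128, A=64, H=48, M=256)
print(deep, wide, round(deep / wide, 2))  # 1966080 1638400 1.2

# With L * H held constant, the attention terms (4 * L * E * A * H) do not change,
# so only the per-layer FFN cost (2 * L * E * M) shrinks as the model gets wider and shallower.
```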
In this paper, we investigate the effects of changing $L$ and $H$ while keeping their product, the total number of heads, the same. We start with typical values for $L$ and $H$ and then move down to a single layer. A diagram illustrating our design space and the differences between our widest and deepest models is given in Figure 1. We refer to the ratio of layers to heads as the model aspect ratio. This naturally leads to an intriguing question: what is the best model aspect ratio for the growing number of X-former models? We consider the impact of model aspect ratio on accuracy, run-time performance, model size, and interpretability.
Based on the question above, we investigate the influence of various model aspect ratios on 9 X-former models, each with its own attention mechanism, in addition to the original Transformer. Prior work on Transformer architectures has mainly focused on designing more efficient attention styles [Choromanski et al., 2020, Wang et al., 2020b] or using Neural Architecture Search (NAS) to discover an optimal combination of operators [So et al., 2019]. By changing the model aspect ratio, we consider a more coarse-grained design space. This design space is not commonly explored by NAS algorithms for Transformers, and we evaluate some interesting model architectures, such as a single layer model with many parallel heads.
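To make the widest point in this design space concrete, below is a minimal PyTorch sketch, not the authors' released implementation, of a single encoder block with $H$ parallel heads of fixed per-head dimension $A$; the module name and the hyperparameter values in the usage lines are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WideEncoderBlock(nn.Module):
    """One encoder layer with H parallel heads of fixed per-head dimension A
    (symbols follow the text: E embedding, A attention features, H heads, M FFN hidden size)."""

    def __init__(self, E: int, A: int, H: int, M: int):
        super().__init__()
        self.H, self.A = H, A
        self.qkv = nn.Linear(E, 3 * A * H)   # Q, K, V projections (~3EAH weights)
        self.out = nn.Linear(A * H, E)       # concatenation + linear layer (~AHE weights)
        self.ffn = nn.Sequential(nn.Linear(E, M), nn.ReLU(), nn.Linear(M, E))  # ~2EM weights
        self.norm1, self.norm2 = nn.LayerNorm(E), nn.LayerNorm(E)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, S, E)
        B, S, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)                        # each (B, S, A*H)
        q, k, v = (t.view(B, S, self.H, self.A).transpose(1, 2)       # split heads: (B, H, S, A)
                   for t in (q, k, v))
        heads = F.scaled_dot_product_attention(q, k, v)               # dot-product attention per head
        heads = heads.transpose(1, 2).reshape(B, S, self.H * self.A)  # concatenate heads
        x = self.norm1(x + self.out(heads))                           # residual + layer norm
        return self.norm2(x + self.ffn(x))

# A deep classifier stacks this block L times with H heads each; the wide alternative
# uses a single block with L * H heads, keeping the per-head size A unchanged.
wide_block = WideEncoderBlock(E=128, A=64, H=48, M=256)
print(wide_block(torch.randn(2, 500, 128)).shape)  # torch.Size([2, 500, 128])
```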
For each model aspect ratio we run our experiments with each X-former across a number of text classification tasks
with various input sequence lengths ranging from 500 to 4000. We empirically observe that wider and shallower
models can typically equal or sometimes beat the accuracy of deeper models. This observation challenges the common
design paradigm of trying to build deeper neural networks. We show several other major advantages of a shallower and wider model. First, it is more latency-friendly on commodity hardware. Second, wider models are smaller in terms of the number of parameters. Third, the outputs of wider models are more interpretable.
To summarise, we make the following contributions in this paper:
• We demonstrate that wider and shallower models can typically equal or sometimes beat the accuracy of deeper models when there is no pretraining of weights or embeddings. Across all 4 tasks, average accuracy for the vanilla Transformer increases by 0.4% between normal deep models and our single layer wide models.
• We show that our results are consistent across a variety of different attention mechanisms and input sequence lengths, and thus there is a general design equivalence between increasing the depth of a Transformer model and increasing its width. Averaged across all non-vanilla attention types and tasks, accuracy increases by 0.3% from deepest to widest.
• We show that widening the models while fixing the attention computation size results in fewer parameters overall and faster inference. We show that wider models are on average 1.4× smaller and have 3.1× faster inference latency on a CPU and 1.9× on a GPU, compared to deep models.
• We demonstrate how single layer networks can have more interpretable predictions by inspecting the attention weights of each head in a single layer (a sketch of this kind of inspection follows this list).
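As a hint of what such an inspection can look like, here is a small illustrative helper, an assumption about the workflow rather than the authors' code, that recovers each head's attention map from a single wide layer's query and key projections so the heads can be visualised side by side.

```python
import math
import torch

def per_head_attention_maps(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: (batch, heads, seq, head_dim) -> softmax attention maps (batch, heads, seq, seq)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1)

# In a single layer model these maps are the only attention stage, so a prediction can be
# traced directly to the tokens each of the L * H heads attended to (shapes are illustrative).
maps = per_head_attention_maps(torch.randn(1, 48, 128, 64), torch.randn(1, 48, 128, 64))
print(maps.shape)  # torch.Size([1, 48, 128, 128])
```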
2 Related Work
2.1 Wider Networks
Zagoruyko & Komodakis [2016] and Wu et al. [2019] respectively show and investigate how making ResNet CNNs [He et al., 2016] wider and shallower can improve their performance. Transformers also use residual connections, so this helps motivate our investigation.
Xue et al. [2022b] find that wider layers in Transformers (in both attention and sequence features) can improve performance on vision and natural language tasks. However, they use a mixture-of-experts layer instead of the typical feed-forward network at the output of the Transformer encoder. They do this because a much larger dimension of sequence features means the number of parameters in the feed-forward layer would increase greatly. As we are only increasing the attention feature dimension, we do not face this issue.
Xue et al. [2022a] investigate masked autoencoder training and how it can help reduce the problem of over-smoothing when training deep Transformer networks (token embeddings converging to be similar at deeper layers). They explore the optimal configuration of the Transformer when using masked autoencoding to pretrain it and then fine-tuning it on specific tasks. They find that for vision Transformers, it is better to pretrain with a deep autoencoder rather than a wide one. As we are not using pretraining or pretrained embeddings, their results are not directly comparable to our own.
2.2 Architecture optimization for Transformers
A lot of research in the Transformer field has focused on finding a more efficient attention mechanism [Beltagy et al., 2020, Child et al., 2019, Choromanski et al., 2020, Katharopoulos et al., 2020, Kitaev et al., 2020, Liu et al., 2018, Tay et al., 2020a, 2021, Wang et al., 2020b, Zaheer et al., 2020], as the original dot-product attention has quadratic complexity with respect to input sequence length. Long Range Arena (LRA, Tay et al. [2020b]) compares and contrasts these methods, noting that in the absence of pretrained embeddings and model parameters, the best attention mechanism is task-dependent. Thus there is no clear best attention mechanism. Because of this, we test different attention mechanisms on a variety of tasks with no pretrained embeddings or model parameters. This makes clear that our findings are largely independent of attention mechanism and task, and contextualises going wide as a general design decision.
The application of Neural Architecture Search (NAS) to Transformer models has also been explored. Both Neural
Architecture Transformer (NAT) [Guo et al., 2019] and Hardware-aware Transformers [Wang et al., 2020a] use NAS to search for models that are more hardware-efficient. So et al. [2019] use evolutionary search to optimize components and connections inside an encoder or decoder layer. Liu et al. [2022] apply RankNAS [Hu et al., 2021] to cosFormer [Qin et al., 2022] and standard Transformers [Vaswani et al., 2017]. They perform a search over the hyperparameters of a cosFormer network and compare the results to the same method applied to the standard Transformer. Tsai et al. [2020] search the hyperparameter space of a BERT [Devlin et al., 2018] model architecture heterogeneously to find an
optimal efficient network. These previous experiments have not varied or searched over different numbers of layers in
the Transformer encoder, which is a key factor in what we investigate.