2.2 NAS on Transformers
Recent work by Liu et al. [2022] applies RankNAS [Hu et al., 2021] to cosFormer [Qin et al., 2022] and standard
Transformers [Vaswani et al., 2017]. They search the hyperparameters of a cosFormer network and compare the results
with the same method applied to the standard Transformer. Tsai et al. [2020] search the hyperparameter
space of a BERT [Devlin et al., 2018] model architecture heterogeneously to find an optimal efficient network. These
methods search hyperparameters, whereas we fix the hyperparameters and search over the space of attention mechanisms.
3 Method
3.1 DARTS Style NAS
We train a supernetwork that contains attention heads of each attention type. This is analogous to the edges in a DARTS
cell containing all possible operations for that edge. We also use ‘fixed α’ with masked validation accuracy as the
metric for the strength of each edge, as detailed in Wang et al. [2021]. Standard DARTS assigns a weight to each
edge in the computation graph and applies a softmax to these weights at the output of each cell (see Figure 2, left, in Appendix C). Since
we use a ‘fixed α’ approach, we do not need to do this and simply average our edges without learnable
weights. This allows us to train the supernetwork until convergence and then select the best attention mechanism (or prune out the
worst). Note that this removes the need for the bi-level optimisation required in the original DARTS
paradigm, which uses validation performance to train the edge weights. Our approach is illustrated in Figure 1. The
DARTS supernetwork cell and the standard Transformer architecture are given in Appendix C in Figure 2, to show how
our architecture relates to them.
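For concreteness, a minimal PyTorch-style sketch of one supernetwork edge under the ‘fixed α’ scheme is given below; the class and attribute names are illustrative rather than taken from any particular implementation.

```python
import torch
import torch.nn as nn


class MixedOp(nn.Module):
    """One supernetwork edge holding every candidate operation.

    Standard DARTS mixes the candidates with a softmax over learnable
    architecture weights (alpha); with 'fixed alpha' the outputs are
    simply averaged, so no edge weights are trained.
    """

    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)

    def forward(self, x):
        outputs = torch.stack([op(x) for op in self.candidates], dim=0)
        # Plain average: no learnable edge weights, no bi-level optimisation.
        return outputs.mean(dim=0)
```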
3.2 Architecture and Search Space
Our Transformer encoder supernetwork consists of an embedding layer, an attention block supernetwork, a feed-forward
network (FFN), and a linear classifier layer. In a standard Transformer, each attention block consists of: Q, K, and V
linear projections from the feature dimension to (number of heads × head dimension), scaled dot-product attention on each head,
concatenation of the heads, and a final dense projection back to the feature dimension [Vaswani et al., 2017]. In theory,
we want to search over the space of alternatives to the scaled dot-product attention operation independently for each
head. This would replace scaled dot-product attention in a standard Transformer with an average of candidate attention
mechanisms in a DARTS-like paradigm. The averages from each head would then be concatenated and linearly
projected back to the feature dimension as usual.
However, some attention mechanisms, such as Reformer [Kitaev et al., 2020], modify the linear projections or
concatenation operations. Because of this, we instead search over candidate multi-head attention blocks, each of which
implements its own projections, attention, and concatenation operations. This is illustrated in Figure 4 in Appendix C.
When the candidate blocks are single-head and there are H blocks per candidate attention mechanism, the computation
graph after the architecture search is complete is equivalent to the one obtained by searching over the attention mechanisms
themselves with H heads. Searching at the block level with single-head blocks allows each attention mechanism to learn its
own linear projections, whereas a search over just the attention mechanisms would share these within each head.
The disadvantage is increased memory and computation during the search, since heads no longer share their
linear projections.
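The block-level search can be sketched in PyTorch as below; scaled dot-product attention is shown only as a stand-in for the candidate mechanisms, and the recombination of head slots (summing blocks that each project back to the model dimension) is a simplification of the scheme described above.

```python
import torch
import torch.nn as nn


def scaled_dot_product(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v


class SingleHeadAttentionBlock(nn.Module):
    """A candidate block: its own Q/K/V projections, an attention
    mechanism, and a projection back to the model dimension."""

    def __init__(self, d_model, d_head, attention_fn=scaled_dot_product):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_head)
        self.out = nn.Linear(d_head, d_model)
        # A candidate mechanism would substitute e.g. an efficient attention here.
        self.attention_fn = attention_fn

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return self.out(self.attention_fn(q, k, v))


class BlockSearchLayer(nn.Module):
    """Supernetwork attention layer searched at the block level: for each
    of the H head slots, the outputs of one block per candidate mechanism
    are averaged ('fixed alpha' mixing), and the head slots are summed."""

    def __init__(self, d_model, d_head, num_heads, candidate_fns):
        super().__init__()
        self.head_slots = nn.ModuleList(
            nn.ModuleList(
                SingleHeadAttentionBlock(d_model, d_head, fn) for fn in candidate_fns
            )
            for _ in range(num_heads)
        )

    def forward(self, x):
        out = torch.zeros_like(x)
        for slot in self.head_slots:
            out = out + torch.stack([blk(x) for blk in slot], dim=0).mean(dim=0)
        return out
```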
3.3 Experiments
3.3.1 Finding Optimal Homogeneous Attention
For the first experiment we train a single-layer network with a block for each attention mechanism. These blocks
are each initialized with the desired number of final heads for that task. After training we perform a single masked
validation accuracy trial and pick the highest-scoring mechanism as a good candidate for homogeneous use in a full
Transformer model on that task. This paradigm is summarized as Algorithm 1 in Appendix D.
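A minimal sketch of this selection step is given below; `set_active` and `evaluate` are hypothetical helpers that mask out all but one candidate mechanism and compute validation accuracy, respectively, and the exact procedure is the one given in Algorithm 1.

```python
import torch


@torch.no_grad()
def select_best_mechanism(supernet, candidate_names, val_loader, evaluate):
    """Single masked-validation trial: enable one candidate mechanism at
    a time, score it on the validation set, and return the best one."""
    scores = {}
    for name in candidate_names:
        supernet.set_active(name)  # hypothetical: mask every other candidate's blocks
        scores[name] = evaluate(supernet, val_loader)  # validation accuracy under the mask
    return max(scores, key=scores.get), scores
```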
3.3.2 Finding Optimal Heterogeneous Attention
In this paradigm we try to learn an optimal single layer with a fixed number of heads and stack it, analogous to searching
for an optimal cell in DARTS.
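As one possible reading of this step, the sketch below keeps the highest-scoring single-head blocks, which may be of different attention types, to form a single heterogeneous layer and then stacks copies of it; the selection rule and the container class are assumptions made for illustration.

```python
import copy

import torch.nn as nn


class HeterogeneousAttentionLayer(nn.Module):
    """Sums the retained single-head blocks, which may implement
    different attention mechanisms."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        return sum(blk(x) for blk in self.blocks)


def build_and_stack(blocks, scores, num_heads, num_layers):
    """Keep the `num_heads` best-scoring blocks and stack copies of the
    resulting layer, analogous to stacking the best DARTS cell."""
    ranked = [blk for _, blk in sorted(zip(scores, blocks), key=lambda p: p[0], reverse=True)]
    layer = HeterogeneousAttentionLayer(ranked[:num_heads])
    # FFN, normalisation, and classifier layers are omitted for brevity.
    return nn.Sequential(*(copy.deepcopy(layer) for _ in range(num_layers)))
```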