DARTFORMER: FINDING THE BEST TYPE OF ATTENTION
Jason Ross Brown
University of Cambridge
jrb239@cam.ac.uk
Yiren Zhao
Imperial College London and University of Cambridge
a.zhao@imperial.ac.uk
Ilia Shumailov
University of Oxford
ilia.shumailov@chch.ox.ac.uk
Robert D Mullins
University of Cambridge
robert.mullins@cl.cam.ac.uk
ABSTRACT
Given the wide and ever-growing range of different efficient Transformer attention mechanisms, it
is important to identify which attention is most effective for a given task. In this work, we are
also interested in combining different attention types to build heterogeneous Transformers. We first
propose a DARTS-like Neural Architecture Search (NAS) method to find the best attention for a
given task; in this setup, all heads use the same attention (homogeneous models). Our results suggest
that NAS is highly effective on this task, and it identifies the best attention mechanisms for IMDb
byte-level text classification and ListOps. We then extend our framework to search for and build
Transformers with multiple different attention types, and call them heterogeneous Transformers. We
show that whilst these heterogeneous Transformers are better than the average homogeneous model,
they cannot outperform the best. We explore the reasons why heterogeneous attention makes sense,
and why it ultimately fails.
1 Introduction
Since the first proposal of the Transformer architecture by Vaswani et al. [2017], many alternatives have been proposed
for the attention mechanism [Beltagy et al., 2020, Child et al., 2019, Choromanski et al., 2020, Katharopoulos et al.,
2020, Kitaev et al., 2020, Liu et al., 2018b, Tay et al., 2021, Wang et al., 2020, Zaheer et al., 2020] due to the original
dot-product attention mechanism having quadratic complexities in time and space with respect to the length of the input
sequence. Recent work [Tay et al., 2020b] showed that these alternative architectures each perform well at different
tasks when there is no pretraining; thus there is no single attention mechanism that is best at every type of task. Therefore we ask:
Can we efficiently learn the best attention for a given long range task?
It is thought that each attention head in the Transformer can learn a different relationship, much like how in Convolutional
Neural Networks (CNNs) each kernel learns a different feature. Tay et al. [2020b] hypothesizes that each attention
mechanism represents a different functional bias for which attention relationships should be learned, and that the utility
of this bias is dependent on the task and its processing requirements. Thus if we had many different attention types in a
Transformer, they could each more easily learn different types of relationship, and thus make the Transformer more
effective overall. So is the optimal attention for a task a mixture of different attentions?
In this paper we apply Neural Architecture Search (NAS) techniques to the Transformer attention search space and
propose DARTFormer, a DARTS-like method [Liu et al., 2018a] for finding the best attention; a high-level illustration
of the method is presented in Figure 1. To do this, we use multiple candidate attention types in parallel and sum their
outputs in the attention mechanism of the Transformer. Following Wang et al. [2021] we use masked validation accuracy
drop as our metric for determining the performance of each attention type.
For clarity, we refer to the computation of the QKV linear projections, multi-head attention, and then concatenation and
linear projection back down to the sequence features as the attention block. In a standard Transformer there is only a
single attention block that has multiple heads. When training mixed attention models (either supernetworks or a final
heterogeneous model) we use multiple attention blocks, each containing only a single attention type.
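To make this concrete, below is a minimal PyTorch sketch (ours, not the paper's code) of such a mixed attention layer: several attention blocks, each of a single attention type, are run in parallel on the same input and their outputs are averaged, as in the supernetwork of Figure 1. The `AttentionBlock` here wraps standard scaled dot-product attention purely as a stand-in for the efficient candidates (Bigbird, Performer, Reformer, and so on); the class and variable names are illustrative, not taken from the authors' implementation.

```python
# Minimal sketch of a mixed-attention layer: parallel single-type attention
# blocks whose outputs are averaged ('fixed alpha', no learnable edge weights).
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """QKV projections -> multi-head attention -> concat -> output projection."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Stand-in for any candidate mechanism; an efficient attention would
        # replace the scaled dot-product inside nn.MultiheadAttention.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)
        return out


class MixedAttentionLayer(nn.Module):
    """Runs several single-type attention blocks in parallel and averages them."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Plain mean over block outputs, without learnable mixing weights.
        return torch.stack([blk(x) for blk in self.blocks], dim=0).mean(dim=0)


if __name__ == "__main__":
    d_model = 64
    blocks = [AttentionBlock(d_model, n_heads=1) for _ in range(4)]  # 4 candidates
    layer = MixedAttentionLayer(blocks)
    x = torch.randn(2, 128, d_model)   # (batch, sequence length, features)
    print(layer(x).shape)              # torch.Size([2, 128, 64])
```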
[Figure 1: two architecture diagrams; the candidate blocks shown include Bigbird, Synthesizer, Reformer, and Performer, placed between the initial and post-attention layers and repeated over the number of layers.]
Figure 1: Our supernetwork architecture (left) that we search over, and an example final derived architecture (right) that is heterogeneous across heads and homogeneous across layers. The supernetwork is a single layer Transformer with several different multi-head attention blocks whose outputs are averaged.
Following Tay et al. [2020b], we use a representative mixture of different attention mechanisms as part of our search
space in order to cover the main methods of achieving efficient attention. The specific attentions we use are: Bigbird
[Zaheer et al., 2020], Linear Transformer [Katharopoulos et al., 2020], Linformer [Wang et al., 2020], Local attention
[Liu et al., 2018b], Longformer [Beltagy et al., 2020], Performer [Choromanski et al., 2020], Reformer [Kitaev et al.,
2020], Sparse Transformer [Child et al., 2019], and Synthesizer [Tay et al., 2021].
We use our setup to investigate two key paradigms. The first is learning the best attention for a new task with a single
layer Transformer; this means only one full-scale Transformer needs to be trained after a good attention mechanism
is found. The second paradigm, illustrated in Figure 1, is using a single layer Transformer to find the best head-wise
heterogeneous attention mixture for that task, and then using that mixture in each layer of a full Transformer model,
making it layer-wise homogeneous. We test these paradigms on three different tasks. They are taken from Tay et al.
[2020b] and were specifically designed to test the capabilities of efficient long range Transformers.
In this paper we make the following contributions:
• We propose a DARTS-like framework to efficiently find the best attention for a task.
• We extend this framework to building and searching for optimal heterogeneous attention Transformer models.
• We empirically show that heterogeneous Transformers cannot outperform the best homogeneous Transformer for our selected long range NLP tasks.
2 Related Work
2.1 Transformer Attention
Since Vaswani et al. [2017], a large variety of replacement attention mechanisms have been proposed. These use different
approaches, such as low-rank approximations and kernel-based methods [Choromanski et al., 2020, Katharopoulos
et al., 2020, Tay et al., 2021, Wang et al., 2020], fixed/factorised/random patterns [Beltagy et al., 2020, Child et al., 2019,
Tay et al., 2020a, 2021, Zaheer et al., 2020], learnable patterns [Kitaev et al., 2020, Tay et al., 2020a], recurrence [Dai
et al., 2019], and more [Lee et al., 2019]. Tay et al. [2020c] gives a detailed survey of different attention mechanisms.
Tay et al. [2020b] compares the performance, speed and memory usage of many of these different attention mechanisms
on a variety of tasks, including NLP and image processing. Their main finding is that the performance of each attention
mechanism is highly dependent on the nature of the task being learned when pretrained embeddings or weights aren’t
used. This motivates the initial part of our research.
2.2 NAS on Transformers
Recent work of Liu et al. [2022] applies RankNAS [Hu et al., 2021] to cosFormer [Qin et al., 2022] and standard
Transformers [Vaswani et al., 2017]. They perform a search of the hyperparameters on a cosFormer network and
compare these to the same method applied to the standard Transformer. Tsai et al. [2020] searches the hyperparameter
space of a BERT [Devlin et al., 2018] model architecture heterogeneously to find an optimal efficient network. These
methods search hyperparameters, whereas we fix hyperparameters and search over the space of attention mechanisms.
3 Method
3.1 DARTS Style NAS
We train a supernetwork that contains attention heads of each attention type. This is analogous to the edges in a DARTS
cell containing all possible operations for that edge. We also use ‘fixed α’ with masked validation accuracy as our
metric for the strength of each edge, as detailed in Wang et al. [2021]. Standard DARTS assigns a weight to each
edge in the computation graph and softmaxes these weights on the output of each cell (see Figure 2, left, in Appendix C);
since we are using a ‘fixed α’ approach we do not need to do this and simply average our edges without using learnable
weights. This allows us to train the supernetwork until convergence and then select the best attention (or prune out the
worst). Note that this removes the need to carry out the bi-level optimisation that is required in the original DARTS
paradigm, which uses validation performance to train the edge weights. Our approach is illustrated in Figure 1. The
DARTS supernetwork cell and the standard Transformer architecture are given in Appendix C in Figure 2, to show how
our architecture relates to them.
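The sketch below illustrates how this ‘fixed α’ metric could be computed; it is our own hedged approximation, not the authors' code. It assumes a trained supernetwork whose forward pass accepts a hypothetical `masked_idx` keyword that drops one candidate block from the averaged output; each candidate is then scored by how much validation accuracy falls when its block is masked.

```python
# Hedged sketch of the masked validation accuracy metric: mask each candidate
# block in turn and record the accuracy drop; larger drops mark stronger candidates.
import torch


@torch.no_grad()
def validation_accuracy(supernet, val_loader, masked_idx=None) -> float:
    supernet.eval()
    correct, total = 0, 0
    for x, y in val_loader:
        # Assumed interface: masked_idx drops that block from the averaged output.
        logits = supernet(x, masked_idx=masked_idx)
        correct += (logits.argmax(dim=-1) == y).sum().item()
        total += y.numel()
    return correct / total


def score_candidates(supernet, val_loader, num_blocks: int) -> list:
    """Return one masked-accuracy drop per candidate attention block."""
    base = validation_accuracy(supernet, val_loader)
    return [base - validation_accuracy(supernet, val_loader, masked_idx=i)
            for i in range(num_blocks)]
```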
3.2 Architecture and Search Space
Our Transformer encoder supernetwork consists of an embedding layer, an attention block supernetwork, a feed-forward
network (FFN), and a linear classifier layer. In normal Transformer attention, each attention block consists of: QKV
linear projections from the feature dimension to (number of heads × head dimension), scaled dot-product attention on each head,
concatenation of heads, and a final dense projection back to the feature dimension [Vaswani et al., 2017]. In theory,
we want to search over the space of alternatives for the scaled dot-product attention operation independently for each
head. This would replace scaled dot-product attention in a normal Transformer with an average of candidate attention
mechanisms in a DARTS-like paradigm. The averages from each head would then be concatenated and linearly
projected back to the feature dimension as normal.
However, some attention mechanisms, such as Reformer [Kitaev et al., 2020], modify the linear projections or
concatenation operations. Because of this, we instead search over candidate multi-head attention blocks, where each
one implements the projections, attention, and concatenation operations. This is illustrated in Figure 4 in Appendix C.
When the candidate blocks are single head and we have H blocks per candidate attention mechanism, the computation
graph after the architecture search is complete is equivalent to the one obtained if we had searched over the attention
mechanisms themselves with H heads. With the search at the block level with single heads, each attention mechanism can learn its
own linear projections, whereas in a search over just the attention mechanisms these would be shared within each head.
The disadvantage of this is increased memory and computation during the search, since the heads no longer share the
linear projections.
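As a small illustration of this block-level search space, the sketch below builds H single-head blocks per candidate mechanism. It assumes a hypothetical factory `make_block(name, d_model, n_heads)` that constructs one attention block of the named type; the candidate list mirrors the mechanisms named in Section 1, and none of the names come from the authors' code.

```python
# Sketch of assembling the block-level search space: H single-head blocks per
# candidate, so every head learns its own QKV and output projections. Keeping
# only the winning mechanism's blocks is then equivalent to an H-head block of it.
CANDIDATES = ["bigbird", "linear", "linformer", "local", "longformer",
              "performer", "reformer", "sparse", "synthesizer"]


def build_search_blocks(make_block, d_model: int, num_heads: int) -> list:
    """make_block(name, d_model, n_heads) is a hypothetical per-type factory."""
    return [make_block(name, d_model, n_heads=1)
            for name in CANDIDATES
            for _ in range(num_heads)]  # H = num_heads single-head blocks per candidate
```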
3.3 Experiments
3.3.1 Finding Optimal Homogeneous Attention
For the first experiment we train a single layer network with a block for each attention mechanism. These blocks
are each initialized with the desired number of final heads for that task. After training we perform a single masked
validation accuracy trial and pick the highest scoring mechanism as a good candidate mechanism for that task when
used in a full Transformer model homogeneously. This paradigm is summarized in an algorithmic form in Algorithm 1
in Appendix D.
3.3.2 Finding Optimal Heterogeneous Attention
In this paradigm we try to learn an optimal single layer with a fixed number of heads and stack it, analogous to searching
for an optimal cell in DARTS.
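As a rough illustration of this paradigm, the sketch below stacks copies of a discovered head-wise heterogeneous layer so that the same mixture of attention types is repeated in every layer (heterogeneous across heads, homogeneous across layers, as on the right of Figure 1). The `layer_factory` callable, which would build one mixed-attention layer plus its FFN from the chosen per-head attention types, is hypothetical.

```python
# Hedged sketch of stacking the discovered heterogeneous layer. `layer_factory`
# is a hypothetical callable building one layer from a list of chosen per-head
# attention types, e.g. ["reformer", "performer"].
import torch.nn as nn


def stack_heterogeneous_layers(layer_factory, chosen_head_types, num_layers: int) -> nn.Module:
    # Every layer gets fresh parameters but the same mixture of attention types.
    return nn.Sequential(*[layer_factory(chosen_head_types) for _ in range(num_layers)])
```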