Mixture of Attention Heads: Selecting Attention Heads Per Token
Xiaofeng Zhang1,2, Yikang Shen3,4, Zeyu Huang1,2, Jie Zhou4, Wenge Rong1, Zhang Xiong1
1State Key Laboratory of Software Development Environment,
School of Computer Science and Engineering, Beihang University, China
2Sino-French Engineer School, Beihang University, China
3Mila, University of Montreal, Canada
4Wechat AI, Tencent, China
Abstract
Mixture-of-Experts (MoE) networks have been proposed as an efficient way to scale up model capacity and implement conditional computation. However, the study of MoE components has mostly focused on the feed-forward layer in the Transformer architecture. This paper proposes the Mixture of Attention Heads (MoA), a new architecture that combines multi-head attention with the MoE mechanism. MoA includes a set of attention heads, each with its own set of parameters. Given an input, a router dynamically selects a subset of k attention heads per token. This conditional computation scheme allows MoA to achieve stronger performance than the standard multi-head attention layer. Furthermore, the sparsely gated MoA can easily scale up the number of attention heads and the number of parameters while preserving computational efficiency. In addition to the performance improvements, MoA also automatically differentiates heads' utilities, providing a new perspective for discussing the model's interpretability. We conducted experiments on several important tasks, including Machine Translation and Masked Language Modeling. The experiments show promising results on several tasks against strong baselines that involve large and very deep models¹.
1 Introduction
In recent years, large models have become a popular trend in Natural Language Processing research, especially large-scale Transformers (Vaswani et al., 2017). Model capacity has grown from millions of parameters (Devlin et al., 2019; Liu et al., 2019), to billions of parameters (Shoeybi et al., 2019; Raffel et al., 2020; Wang et al., 2022), and even to trillions of parameters (Du et al., 2021; Fedus et al., 2021). However, these large-scale models demand substantially more computation than small-scale models. A popular trend is to utilize conditional computation with a sparsely activated model to seek greater computational efficiency: only a part of the model's parameters is used for a specific input during the forward computation, which alleviates the computational load.

Equal contribution. xiaofeng_z@buaa.edu.cn, yikang.shn@gmail.com
¹The code can be found at https://github.com/yikangshen/MoA.

Figure 1: Simple illustration of MoA. MoA consists of a set of attention heads named attention experts. For each token in the input, a Router selects k attention heads among all attention experts with different confidences. The output is a weighted sum of the selected attention heads, weighted by the confidences calculated by the Router.
Among these attempts, the Mixture of Experts (MoE) (Jacobs et al., 1991; Jordan and Jacobs, 1994) is an essential technique. Since the mixture of experts was first applied to the Transformer architecture (Shazeer et al., 2018), researchers have mainly focused on combining the Feed-Forward Network layer with the Mixture of Experts. Recent works have discussed how to obtain a better routing strategy (Shazeer et al., 2017; Dua et al., 2021; Lewis et al., 2021; Nie et al., 2021) or how to scale up the Mixture of Experts across different GPU nodes (Lepikhin et al., 2021; Fedus et al., 2021). However, few attempts have explored the possibility of combining MoE with the Multi-Head Attention (MHA) mechanism. Since MHA is another essential module in the Transformer architecture, combining MoE with the attention mechanism could also help achieve better performance while restraining the computational cost.
Besides, previous research has investigated the utility of different attention heads. Peng et al. (2020) found that reallocating a subset of attention heads helps machine translation, since useless attention heads are pruned. In the field of dependency parsing, researchers have unveiled that some attention heads in BERT-like language models (Devlin et al., 2019; Liu et al., 2019) model individual dependency types (Htut et al., 2019) and syntactic functions (Shen et al., 2022). Voita et al. (2019) claimed that attention heads have different functions that can be categorized into three types. An input token does not need to pass through all attention heads if we can select the relevant heads whose functions suit it. Thus, we conceive an attention mechanism that selects different attention heads per token.
Based on the above discussion, we propose the Mixture of Attention Heads (MoA) (Section 4), an attention mechanism that selects different attention heads for different inputs. A simple illustration of this idea is shown in Figure 1. MoA includes a set of attention heads with different parameters. Given an input, a routing network dynamically selects a subset of k attention heads for each token. The output is a weighted sum of the selected attention heads, weighted by the confidences calculated by the routing network.
We conducted experiments on two tasks: Machine Translation and Masked Language Modeling (Section 5). The experiments show promising results against several strong baselines. In all tasks, our proposed Mixture of Attention Heads outperforms the original Transformer architecture (Vaswani et al., 2017). Our model surpasses many large models or achieves comparable results with only half the computational cost. Our contributions are threefold: 1) We propose a new attention mechanism called Mixture of Attention Heads, combining the idea of Mixture of Experts with the attention mechanism. 2) MoA can improve the model's performance without substantially adding parameters or computational cost. 3) MoA is easy to scale up while keeping the computational complexity restrained, resulting in further performance gains.
2 Related Work
Mixture of Experts
The Mixture of Experts (MoE) was first introduced in the 1990s (Jacobs et al., 1991; Jordan and Jacobs, 1994). Shazeer et al. (2017) adopted this method in modern deep learning architectures (LSTM; Hochreiter and Schmidhuber 1997) and proved its effectiveness on Language Modeling and Machine Translation. The MoE was used to substitute the FFN layers in the Transformer architecture (Vaswani et al., 2017) through the Mesh Tensorflow library (Shazeer et al., 2018). GShard (Lepikhin et al., 2021) is a lightweight module that helps scale up a multilingual neural machine translation Transformer with a Sparsely-Gated Mixture of Experts beyond 600 billion parameters. In Switch Transformer (Fedus et al., 2021), the authors scaled the MoE-integrated Transformer architecture toward trillion-parameter models. GLaM (Du et al., 2021) utilized a decoder-only architecture for language model pre-training. Rajbhandari et al. (2022) proposed a Pyramid-Residual-MoE for smaller model size and faster inference.

Various routing strategies (Shazeer et al., 2017; Dua et al., 2021; Lewis et al., 2021; Nie et al., 2021) have been investigated for stabilizing MoE training and balancing the expert loads. Chi et al. (2022) pointed out the representation collapse issue in sparse Mixture of Experts models and addressed it with a two-stage routing strategy.
Machine Translation Architectures
With the original Transformer architecture (Vaswani et al., 2017), Ott et al. (2018) found that training with reduced precision and large batches could improve translation performance. Some models obtain better translation performance by using a larger-scale Transformer. Liu et al. (2020a) deepened the encoder and decoder of the Transformer by adequately initializing the model. DeepNet (Wang et al., 2022) scaled Transformers up to 1,000 layers by introducing a new normalization function. However, these methods require a great amount of computation. Other models make changes to the self-attention module. Peng et al. (2020) proposed the MAE model, in which reallocating attention heads yields better translation performance because useless attention heads are pruned. However, their method is difficult to scale up for further improvements because it needs to use all the attention heads in the model rather than sparsely activate them; it also requires complicated block coordinate descent training steps. Wu et al. (2019) proposed DynamicConv and LightConv, replacing the self-attention mechanism with a lightweight convolution.
Specialization of Attention Heads
Since the publication of the Transformer architecture (Vaswani et al., 2017), many researchers have been interested in analyzing how the attention mechanism works. Voita et al. (2019) systematically analyzed the attention heads in the encoder and categorized them into three functional subsets: positional, syntactic, and rare words. In dependency parsing, researchers have observed the same phenomenon: different heads capture different syntactic functions (Htut et al., 2019; Shen et al., 2022).
3 Preliminaries
3.1 Mixture of Experts
MoE (Shazeer et al., 2017) contains a set of expert networks E_1, E_2, ..., E_N and a routing network G. The output of the MoE is the weighted sum of the outputs of the experts. The routing network calculates the probability for each expert. Formally, the output of the MoE can be written as:

y = \sum_{i=1}^{N} G(x)_i E_i(x)    (1)

The routing network G is a Noisy Top-k Routing network. Before the softmax function, it adds Gaussian noise to the gating logits (Equation 3). Then, it keeps only the top-k values and sets the remaining gate values to 0 (Equation 2):

G(x) = \mathrm{Softmax}(\mathrm{TopK}(H(x), k))    (2)

H(x)_i = (x \cdot W_g)_i + \sigma(0, 1) \cdot \mathrm{Softplus}((x \cdot W_{noise})_i)    (3)
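To make Equations 1-3 concrete, the following is a minimal PyTorch sketch of noisy top-k gating, written under the usual reading of Shazeer et al. (2017); class and variable names such as NoisyTopKGate and num_experts are illustrative, not from the paper or its released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Noisy top-k gating, a sketch of Equations 2-3 (illustrative, not the authors' code)."""
    def __init__(self, d_model: int, num_experts: int, k: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)   # W_g
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)  # W_noise
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # H(x)_i = (x . W_g)_i + N(0,1) * Softplus((x . W_noise)_i)   (Eq. 3)
        clean_logits = self.w_gate(x)
        noise_scale = F.softplus(self.w_noise(x))
        logits = clean_logits + torch.randn_like(clean_logits) * noise_scale
        # Keep only the top-k logits; masked positions become 0 after the softmax (Eq. 2)
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float('-inf')).scatter(-1, topk_idx, topk_val)
        return F.softmax(masked, dim=-1)  # G(x)

# Equation 1, with `experts` as an nn.ModuleList of expert networks (assumed defined elsewhere):
# gates = gate(x)
# y = sum(gates[..., i:i + 1] * expert(x) for i, expert in enumerate(experts))
```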
3.2 Multi-head Attention
Vaswani et al. (2017) proposed the Transformer, an encoder-decoder architecture that contains the multi-head attention module. Different heads of the multi-head attention module attend to information from different representation subspaces, which learn the input from various perspectives.

To perform multi-head attention with k heads, Q, K, and V are linearly projected k times with different, learned linear projections to subspaces. On each projected Q and K, the attention scores are calculated via Equation 4. The values derived from the different heads are projected back to the model dimension and summed up, following Equation 5:

W^{att}_i = \mathrm{Softmax}\left(\frac{Q W^q_i (K W^k_i)^T}{\sqrt{d_k}}\right)    (4)

y = \sum_{i=1}^{k} W^{att}_i V W^v_i W^o_i    (5)

where W^q_i, W^k_i, W^v_i \in \mathbb{R}^{d_m \times d_h}, W^o_i \in \mathbb{R}^{d_h \times d_m}, and d_k is the dimension of the key K.
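For reference, here is a compact PyTorch sketch of Equations 4-5. It loops over heads for clarity rather than speed, assumes unmasked attention, and all names (MultiHeadAttention, d_head, etc.) are illustrative choices rather than notation from the paper.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Per-head form of Equations 4-5 (illustrative sketch)."""
    def __init__(self, d_model: int, n_heads: int, d_head: int):
        super().__init__()
        self.w_q = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)])
        self.w_k = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)])
        self.w_v = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)])
        self.w_o = nn.ModuleList([nn.Linear(d_head, d_model, bias=False) for _ in range(n_heads)])
        self.scale = math.sqrt(d_head)  # plays the role of sqrt(d_k)

    def forward(self, q, k, v):  # each: (seq_len, d_model)
        y = 0
        for w_q, w_k, w_v, w_o in zip(self.w_q, self.w_k, self.w_v, self.w_o):
            # W^att_i = Softmax(Q W^q_i (K W^k_i)^T / sqrt(d_k))   (Eq. 4)
            att = torch.softmax(w_q(q) @ w_k(k).transpose(-2, -1) / self.scale, dim=-1)
            # y = sum_i W^att_i V W^v_i W^o_i                      (Eq. 5)
            y = y + w_o(att @ w_v(v))
        return y
```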
4 Mixture of Attention Heads
In this work, we propose a variant of multi-head attention for the Transformer called Mixture of Attention Heads (MoA), illustrated in Figure 2. MoA consists of two major components: the routing network G and a group of N attention experts {E_1, ..., E_N}. Similar to standard multi-head self-attention, the input of MoA includes three sequences: the query sequence Q, the key sequence K, and the value sequence V. We denote by q_t the query vector at time step t. For each q_t, the routing network G selects a subset of k experts G(q_t) ⊆ {E_i} based on q_t and assigns a weight w_i to each selected expert. These selected experts then take q_t, K, and V as inputs and compute an output E_i(q_t, K, V). The output of the MoA is the weighted sum of the selected experts' outputs. Formally, the MoA output at time step t can be written as:

y_t = \sum_{i \in G(q_t)} w_{i,t} \cdot E_i(q_t, K, V)    (6)
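The sketch below illustrates Equation 6 in PyTorch, assuming the selected expert indices and their weights have already been produced by the routing network of Section 4.1; the AttentionExpert class and tensor layout are simplified assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttentionExpert(nn.Module):
    """A single attention expert E_i with its own projections (illustrative)."""
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_head, bias=False)
        self.w_k = nn.Linear(d_model, d_head, bias=False)
        self.w_v = nn.Linear(d_model, d_head, bias=False)
        self.w_o = nn.Linear(d_head, d_model, bias=False)

    def forward(self, q_t, k, v):  # q_t: (d_model,), k and v: (seq_len, d_model)
        # Single-query scaled dot-product attention for this expert
        att = torch.softmax(self.w_q(q_t) @ self.w_k(k).T / self.w_k.out_features ** 0.5, dim=-1)
        return self.w_o(att @ self.w_v(v))

def moa_output(q_t, k, v, experts, selected_idx, weights):
    """y_t = sum_{i in G(q_t)} w_{i,t} * E_i(q_t, K, V)   (Eq. 6)."""
    return sum(w * experts[i](q_t, k, v) for i, w in zip(selected_idx, weights))
```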
4.1 Routing Network
Similar to previous mixture-of-experts methods, the routing network assigns attention experts to the input query. In order to select k experts for query q_t, we compute a routing probability p_i for each expert E_i. The routing probability is modeled with a linear layer W_g and a softmax function:

p_{i,t} = \mathrm{Softmax}_i(q_t \cdot W_g)    (7)

Based on the routing probability p, we select the top-k attention experts among all N attention experts with the largest probabilities. Formally, the routing network is defined as:

G(Q) = \mathrm{TopK}(p_{i,t}, k)    (8)
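A minimal sketch of Equations 7-8 in PyTorch follows. It simply returns the top-k routing probabilities as the expert weights; how the paper normalizes w_{i,t} from p_{i,t} is not specified in this section, so treat the returned weights as an assumption of this illustration. The outputs plug directly into the moa_output sketch above.

```python
import torch
import torch.nn as nn

class MoARouter(nn.Module):
    """Top-k routing over N attention experts (sketch of Eqs. 7-8)."""
    def __init__(self, d_model: int, num_experts: int, k: int):
        super().__init__()
        self.w_g = nn.Linear(d_model, num_experts, bias=False)  # W_g
        self.k = k

    def forward(self, q_t: torch.Tensor):
        # p_{i,t} = Softmax_i(q_t . W_g)   (Eq. 7)
        p = torch.softmax(self.w_g(q_t), dim=-1)
        # G(q_t): the k experts with the largest routing probabilities (Eq. 8)
        weights, selected_idx = p.topk(self.k, dim=-1)
        return selected_idx, weights
```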