Mixture of Attention Heads: Selecting Attention Heads Per Token
Xiaofeng Zhang1,2, Yikang Shen3,4, Zeyu Huang1,2, Jie Zhou4, Wenge Rong1, Zhang Xiong1
1State Key Laboratory of Software Development Environment,
School of Computer Science and Engineering, Beihang University, China
2Sino-French Engineer School, Beihang University, China
3Mila, University of Montreal, Canada
4Wechat AI, Tencent, China
Abstract
Mixture-of-Experts (MoE) networks have been proposed as an efficient way to scale up model capacity and implement conditional computation. However, the study of MoE components has mostly focused on the feed-forward layer in the Transformer architecture. This paper proposes the Mixture of Attention Heads (MoA), a new architecture that combines multi-head attention with the MoE mechanism. MoA includes a set of attention heads, each with its own set of parameters. Given an input, a router dynamically selects a subset of k attention heads per token. This conditional computation scheme allows MoA to achieve stronger performance than the standard multi-head attention layer. Furthermore, the sparsely gated MoA can easily scale up the number of attention heads and the number of parameters while preserving computational efficiency. In addition to the performance improvements, MoA also automatically differentiates heads' utilities, providing a new perspective for discussing the model's interpretability. We conducted experiments on several important tasks, including Machine Translation and Masked Language Modeling. The experiments show promising results on several tasks against strong baselines that involve large and very deep models¹.
1 Introduction
In recent years, large models have become a popular trend in Natural Language Processing research, especially large-scale Transformers (Vaswani et al., 2017). Model capacity has grown from millions of parameters (Devlin et al., 2019; Liu et al., 2019), to billions of parameters (Shoeybi et al., 2019; Raffel et al., 2020; Wang et al., 2022), and even to trillions of parameters (Du et al., 2021; Fedus et al., 2021). However, these large-scale models demand substantially more computation than small-scale models. A popular trend is to utilize conditional computation with a sparsely activated model to seek greater computational efficiency: only a part of the model's parameters is used for a specific input during the forward computation, which alleviates the computational load.

Equal contribution. xiaofeng_z@buaa.edu.cn, yikang.shn@gmail.com
¹The code can be found at https://github.com/yikangshen/MoA.

Figure 1: Simple illustration of MoA. MoA consists of a set of attention heads named attention experts. For each token in the input, a Router selects k attention heads among all attention experts with different confidences. The output is a weighted sum of the selected attention heads, weighted by the confidences calculated by the Router.
Among these attempts, the Mixture of Experts (MoE) (Jacobs et al., 1991; Jordan and Jacobs, 1994) is an essential technique. Since the mixture of experts was first applied to the Transformer architecture (Shazeer et al., 2018), researchers have mainly focused on combining the Feed-Forward Network layer with the Mixture of Experts. Recent works have discussed how to obtain a better routing strategy (Shazeer et al., 2017; Dua et al., 2021; Lewis et al., 2021; Nie et al., 2021) or how to scale up the Mixture of Experts across different GPU nodes (Lepikhin et al., 2021; Fedus et al., 2021). However, few attempts have explored the possibility of combining MoE with the Multi-Head Attention (MHA) mechanism. Since MHA is another essential module in the Transformer architecture, combining MoE with the attention mechanism could also help achieve better performance while restraining the computational cost.
Besides, previous research has investigated the utility of different attention heads. Peng et al. (2020) found that reallocating a subset of attention heads helps machine translation, since useless attention heads are pruned. In the field of dependency parsing, researchers have unveiled that some attention heads in BERT-like language models (Devlin et al., 2019; Liu et al., 2019) model individual dependency types (Htut et al., 2019) and syntactic functions (Shen et al., 2022). Voita et al. (2019) claimed that attention heads have different functions that can be categorized into three types. An input token does not need to pass through all attention heads if we can select the relevant heads whose functions suit it. Thus, we conceive an attention mechanism that selects different attention heads per token.
Based on the above discussion, we propose the Mixture of Attention Heads (MoA) (Section 4), an attention mechanism that selects different attention heads for different inputs. A simple illustration of this idea is shown in Figure 1. MoA includes a set of attention heads with different parameters. Given an input, a routing network dynamically selects a subset of k attention heads for each token. The output is a weighted sum of the selected attention heads, weighted by the confidences calculated by the routing network.
We conducted experiments on two tasks: Machine Translation and Masked Language Modeling (Section 5). The experiments show promising results against several strong baselines. In all tasks, our proposed Mixture of Attention Heads outperforms the original Transformer architecture (Vaswani et al., 2017). Our model surpasses many large models or achieves comparable results with only half the computational cost. Our contributions are threefold: 1) We propose a new attention mechanism called Mixture of Attention Heads, combining the idea of Mixture of Experts with the attention mechanism. 2) MoA can improve the model's performance without substantially adding parameters or computational cost. 3) MoA is easy to scale up while keeping the computational complexity restrained, resulting in further performance gains.
2 Related Work
Mixture of Experts
The Mixture of Experts (MoE) was first introduced in the 1990s (Jacobs et al., 1991; Jordan and Jacobs, 1994). Shazeer et al. (2017) adopted this method in modern deep learning architectures (LSTM; Hochreiter and Schmidhuber 1997) and proved its effectiveness on Language Modeling and Machine Translation. The MoE was used to substitute the FFN layers in the Transformer architecture (Vaswani et al., 2017) through the Mesh Tensorflow library (Shazeer et al., 2018). GShard (Lepikhin et al., 2021) is a lightweight module that helps scale up a multilingual neural machine translation Transformer with a Sparsely-Gated Mixture of Experts beyond 600 billion parameters. In Switch Transformer (Fedus et al., 2021), the authors scaled the MoE-integrated Transformer architecture toward trillion-parameter models. GLaM (Du et al., 2021) utilized a decoder-only architecture for language model pre-training. Rajbhandari et al. (2022) proposed a Pyramid-Residual-MoE for smaller model size and faster inference.

Various routing strategies (Shazeer et al., 2017; Dua et al., 2021; Lewis et al., 2021; Nie et al., 2021) have been investigated for stabilizing MoE training and balancing the expert loads. Chi et al. (2022) pointed out the representation collapse issue in sparse Mixture of Experts models and addressed it with a two-stage routing strategy.
Machine Translation Architectures
With the original Transformer architecture (Vaswani et al., 2017), Ott et al. (2018) found that training with reduced precision and large batches could improve translation performance. Some models obtain better translation performance by using a larger-scale Transformer. Liu et al. (2020a) deepened the encoder and decoder of the Transformer by adequately initializing the model. DeepNet (Wang et al., 2022) scaled Transformers up to 1,000 layers by introducing a new normalization function. However, these methods require a great amount of computation. Other models make changes to the self-attention module. Peng et al. (2020) proposed the MAE model, in which reallocating attention heads yields better translation performance because useless attention heads are pruned. However, their method is difficult to scale up for further improvements because it needs to use all the attention heads in the model rather than sparsely activate them; it also requires complicated block coordinate descent training steps. Wu et al. (2019) proposed DynamicConv and LightConv, replacing the self-attention mechanism with a lightweight convolution.
Specialization of Attention Heads
Since the publication of the Transformer architecture (Vaswani et al., 2017), many researchers have been interested in analyzing how the attention mechanism works. Voita et al. (2019) systematically analyzed the attention heads in the encoder and categorized them into three functional subsets: positional, syntactic, and rare words. In dependency parsing, researchers have observed the same phenomenon: different heads capture different syntactic functions (Htut et al., 2019; Shen et al., 2022).
3 Preliminaries
3.1 Mixture of Experts
MoE (Shazeer et al., 2017) contains a set of expert networks E_1, E_2, ..., E_N and a routing network G. The output of the MoE is the weighted sum of the outputs of the experts. The routing network calculates the probability for each expert. Formally, the output of the MoE can be written as:

y = \sum_{i=1}^{N} G(x)_i E_i(x)    (1)

The routing network G is a Noisy Top-k Routing network. Before the softmax function, it adds Gaussian noise to the gating logits (Equation 3). Then, it keeps only the top-k values and sets the remaining gate values to 0 (Equation 2):

G(x) = \mathrm{Softmax}(\mathrm{TopK}(H(x), k))    (2)

H(x)_i = (x \cdot W_g)_i + \sigma(0, 1) \cdot \mathrm{Softplus}((x \cdot W_{noise})_i)    (3)
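To make Equations 1-3 concrete, the following is a minimal PyTorch sketch of noisy top-k gating, written under the usual reading of Shazeer et al. (2017); class and variable names such as NoisyTopKGate and num_experts are illustrative, not from the paper or its released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Noisy top-k gating, a sketch of Equations 2-3 (illustrative, not the authors' code)."""
    def __init__(self, d_model: int, num_experts: int, k: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)   # W_g
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)  # W_noise
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # H(x)_i = (x . W_g)_i + N(0,1) * Softplus((x . W_noise)_i)   (Eq. 3)
        clean_logits = self.w_gate(x)
        noise_scale = F.softplus(self.w_noise(x))
        logits = clean_logits + torch.randn_like(clean_logits) * noise_scale
        # Keep only the top-k logits; masked positions become 0 after the softmax (Eq. 2)
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        masked = torch.full_like(logits, float('-inf')).scatter(-1, topk_idx, topk_val)
        return F.softmax(masked, dim=-1)  # G(x)

# Equation 1, with `experts` as an nn.ModuleList of expert networks (assumed defined elsewhere):
# gates = gate(x)
# y = sum(gates[..., i:i + 1] * expert(x) for i, expert in enumerate(experts))
```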
3.2 Multi-head Attention
Vaswani et al. (2017) proposed the Transformer, an encoder-decoder architecture that contains the multi-head attention module. Different heads of the multi-head attention module attend to information from different representation subspaces, which learn the input from various perspectives.

To perform multi-head attention with k heads, Q, K, and V are linearly projected k times with different, learned linear projections to subspaces. On each projected Q and K, the attention scores are calculated via Equation 4. The values derived from the different heads are projected back to the model dimension and summed up, following Equation 5:

W^{att}_i = \mathrm{Softmax}\left(\frac{Q W^q_i (K W^k_i)^T}{\sqrt{d_k}}\right)    (4)

y = \sum_{i=1}^{k} W^{att}_i V W^v_i W^o_i    (5)

where W^q_i, W^k_i, W^v_i \in \mathbb{R}^{d_m \times d_h}, W^o_i \in \mathbb{R}^{d_h \times d_m}, and d_k is the dimension of the key K.
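For reference, here is a compact PyTorch sketch of Equations 4-5. It loops over heads for clarity rather than speed, assumes unmasked attention, and all names (MultiHeadAttention, d_head, etc.) are illustrative choices rather than notation from the paper.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Per-head form of Equations 4-5 (illustrative sketch)."""
    def __init__(self, d_model: int, n_heads: int, d_head: int):
        super().__init__()
        self.w_q = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)])
        self.w_k = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)])
        self.w_v = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(n_heads)])
        self.w_o = nn.ModuleList([nn.Linear(d_head, d_model, bias=False) for _ in range(n_heads)])
        self.scale = math.sqrt(d_head)  # plays the role of sqrt(d_k)

    def forward(self, q, k, v):  # each: (seq_len, d_model)
        y = 0
        for w_q, w_k, w_v, w_o in zip(self.w_q, self.w_k, self.w_v, self.w_o):
            # W^att_i = Softmax(Q W^q_i (K W^k_i)^T / sqrt(d_k))   (Eq. 4)
            att = torch.softmax(w_q(q) @ w_k(k).transpose(-2, -1) / self.scale, dim=-1)
            # y = sum_i W^att_i V W^v_i W^o_i                      (Eq. 5)
            y = y + w_o(att @ w_v(v))
        return y
```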
4 Mixture of Attention Heads
In this work, we propose a variant of multi-head attention for the Transformer called Mixture of Attention Heads (MoA), illustrated in Figure 2. MoA consists of two major components: the routing network G and a group of N attention experts {E_1, ..., E_N}. Similar to standard multi-head self-attention, the input of MoA includes three sequences: the query sequence Q, the key sequence K, and the value sequence V. We denote by q_t the query vector at time step t. For each q_t, the routing network G selects a subset of k experts G(q_t) ⊆ {E_i} based on q_t and assigns a weight w_i to each selected expert. These selected experts then take q_t, K, and V as inputs and compute an output E_i(q_t, K, V). The output of the MoA is the weighted sum of the selected experts' outputs. Formally, the MoA output at time step t can be written as:

y_t = \sum_{i \in G(q_t)} w_{i,t} \cdot E_i(q_t, K, V)    (6)
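The sketch below illustrates Equation 6 in PyTorch, assuming the selected expert indices and their weights have already been produced by the routing network of Section 4.1; the AttentionExpert class and tensor layout are simplified assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttentionExpert(nn.Module):
    """A single attention expert E_i with its own projections (illustrative)."""
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_head, bias=False)
        self.w_k = nn.Linear(d_model, d_head, bias=False)
        self.w_v = nn.Linear(d_model, d_head, bias=False)
        self.w_o = nn.Linear(d_head, d_model, bias=False)

    def forward(self, q_t, k, v):  # q_t: (d_model,), k and v: (seq_len, d_model)
        # Single-query scaled dot-product attention for this expert
        att = torch.softmax(self.w_q(q_t) @ self.w_k(k).T / self.w_k.out_features ** 0.5, dim=-1)
        return self.w_o(att @ self.w_v(v))

def moa_output(q_t, k, v, experts, selected_idx, weights):
    """y_t = sum_{i in G(q_t)} w_{i,t} * E_i(q_t, K, V)   (Eq. 6)."""
    return sum(w * experts[i](q_t, k, v) for i, w in zip(selected_idx, weights))
```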
4.1 Routing Network
Similar to previous mixture-of-experts methods, the routing network assigns attention experts to the input query. In order to select k experts for query q_t, we compute a routing probability p_i for each expert E_i. The routing probability is modeled with a linear layer W_g and a softmax function:

p_{i,t} = \mathrm{Softmax}_i(q_t \cdot W_g)    (7)

Based on the routing probability p, we select the top-k attention experts among all N attention experts with the largest probabilities. Formally, the routing network is defined as:

G(Q) = \mathrm{TopK}(p_{i,t}, k)    (8)
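A minimal sketch of Equations 7-8 in PyTorch follows. It simply returns the top-k routing probabilities as the expert weights; how the paper normalizes w_{i,t} from p_{i,t} is not specified in this section, so treat the returned weights as an assumption of this illustration. The outputs plug directly into the moa_output sketch above.

```python
import torch
import torch.nn as nn

class MoARouter(nn.Module):
    """Top-k routing over N attention experts (sketch of Eqs. 7-8)."""
    def __init__(self, d_model: int, num_experts: int, k: int):
        super().__init__()
        self.w_g = nn.Linear(d_model, num_experts, bias=False)  # W_g
        self.k = k

    def forward(self, q_t: torch.Tensor):
        # p_{i,t} = Softmax_i(q_t . W_g)   (Eq. 7)
        p = torch.softmax(self.w_g(q_t), dim=-1)
        # G(q_t): the k experts with the largest routing probabilities (Eq. 8)
        weights, selected_idx = p.topk(self.k, dim=-1)
        return selected_idx, weights
```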