tion (MHA) mechanism. Since MHA is another essential module in the Transformer architecture, combining MoE with the attention mechanism could also help achieve better performance while restraining the computational cost.
In addition, previous research has investigated the utility of different attention heads. Peng et al. (2020) found that reallocating a subset of attention heads benefits the translation task, since useless attention heads are pruned. In the field of dependency parsing, researchers have revealed that some attention heads in BERT-like language models (Devlin et al., 2019; Liu et al., 2019) model individual dependency types (Htut et al., 2019) and syntactic functions (Shen et al., 2022). Voita et al. (2019) claimed that attention heads have different functions that can be categorized into three types. An input token need not pass through all attention heads if we can select only the relevant heads with suitable functions. Thus, we conceive of an attention mechanism that selects different attention heads per token.
Based on the above discussion, we propose Mixture of Attention Heads (MoA) (Section 4), an attention mechanism that selects different attention heads for different inputs. A simple illustration of this idea is shown in Figure 1. MoA includes a set of attention heads with different parameters. Given an input, a routing network dynamically selects a subset of k attention heads for each token. The output is a weighted sum of the outputs of the selected attention heads, weighted by the confidence scores computed by the routing network.
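To make this routing idea concrete, the following is a minimal PyTorch sketch of token-wise top-k head selection with a confidence-weighted mixture. The module and parameter names (MoASelfAttention, num_heads, router, etc.) are illustrative assumptions, and all candidate heads are computed densely for readability; it is not the exact formulation given in Section 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoASelfAttention(nn.Module):
    """Illustrative Mixture-of-Attention-Heads layer (self-attention case):
    a router scores the candidate heads per token, keeps the top-k, and mixes
    the selected heads' outputs with the router's confidence scores."""

    def __init__(self, d_model: int, d_head: int, num_heads: int, k: int):
        super().__init__()
        self.k, self.d_head = k, d_head
        # One set of projection matrices per candidate attention head.
        self.w_q = nn.Parameter(0.02 * torch.randn(num_heads, d_model, d_head))
        self.w_k = nn.Parameter(0.02 * torch.randn(num_heads, d_model, d_head))
        self.w_v = nn.Parameter(0.02 * torch.randn(num_heads, d_model, d_head))
        self.w_o = nn.Parameter(0.02 * torch.randn(num_heads, d_head, d_model))
        self.router = nn.Linear(d_model, num_heads)  # token-wise routing scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        probs = F.softmax(self.router(x), dim=-1)       # (B, T, E)
        conf, idx = probs.topk(self.k, dim=-1)          # top-k heads per token
        conf = conf / conf.sum(dim=-1, keepdim=True)    # renormalized confidences

        # For readability every head is computed densely; an efficient
        # implementation would dispatch each token only to its selected heads.
        queries = torch.einsum('btd,edh->beth', x, self.w_q)
        keys = torch.einsum('btd,edh->beth', x, self.w_k)
        values = torch.einsum('btd,edh->beth', x, self.w_v)
        attn = F.softmax(queries @ keys.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = torch.einsum('beth,ehd->betd', attn @ values, self.w_o)  # (B, E, T, D)

        # Gather the k selected heads for each token and take the weighted sum.
        heads = heads.permute(0, 2, 1, 3)                                # (B, T, E, D)
        sel = torch.gather(heads, 2, idx.unsqueeze(-1).expand(-1, -1, -1, heads.size(-1)))
        return (conf.unsqueeze(-1) * sel).sum(dim=2)                     # (B, T, d_model)


# Example usage: 8 candidate heads, 2 selected per token.
layer = MoASelfAttention(d_model=512, d_head=64, num_heads=8, k=2)
out = layer(torch.randn(4, 16, 512))   # -> shape (4, 16, 512)
```

Because only k of the candidate heads contribute to each token's output, the per-token computation stays close to that of a standard k-head attention layer even when the pool of candidate heads grows.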
We conducted experiments on two tasks: Machine Translation and Masked Language Modeling (Section 5). Experiments show promising results against several strong baselines. In all tasks, our proposed mixture of attention heads outperforms the original Transformer architecture (Vaswani et al., 2017). Our model surpasses many large models or achieves comparable results with only half the computational cost. Our contributions can be summarized as threefold: 1) We propose a new attention mechanism called Mixture of Attention Heads, combining the idea of Mixture of Experts with the attention mechanism. 2) MoA can improve the model's performance without substantially adding parameters or computational cost. 3) MoA is easy to scale up while maintaining restrained computational complexity, resulting in further performance improvements.
2 Related Work
Mixture of Experts
The Mixture of Experts (MoE) was first introduced in the 1990s (Jacobs et al., 1991; Jordan and Jacobs, 1994). Shazeer et al. (2017) adapted this method to modern deep learning architectures (LSTM; Hochreiter and Schmidhuber 1997) and proved its effectiveness in Language Modeling and Machine Translation. MoE was used to substitute the FFN layers in the Transformer architecture (Vaswani et al., 2017) via the Mesh TensorFlow library (Shazeer et al., 2018). GShard (Lepikhin et al., 2021) is a lightweight module that helps scale up a multilingual neural machine translation Transformer with a Sparsely-Gated Mixture of Experts beyond 600 billion parameters. In Switch Transformer (Fedus et al., 2021), the authors scaled the MoE-integrated Transformer architecture toward trillion-parameter models. GLaM (Du et al., 2021) utilized a decoder-only architecture for language model pre-training. Rajbhandari et al. (2022) proposed a Pyramid-Residual-MoE for smaller model size and faster inference.
Various routing strategies (Shazeer et al., 2017; Dua et al., 2021; Lewis et al., 2021; Nie et al., 2021) have been investigated for stabilizing MoE training and balancing the expert loads. Chi et al. (2022) pointed out the representation collapse issue in sparse Mixture of Experts models and addressed it with a two-stage routing strategy.
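For context, the sketch below illustrates the generic sparsely-gated MoE pattern discussed above: a gate routes each token to its top-k expert FFNs, and an auxiliary term encourages balanced expert loads. The class, names, and the exact form of the balancing loss are illustrative assumptions, not a reproduction of any of the cited systems.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFFN(nn.Module):
    """Sparsely-gated MoE feed-forward layer with top-k routing and a
    Switch-style load-balancing term (illustrative sketch only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model), i.e. sequences flattened before routing.
        probs = F.softmax(self.gate(x), dim=-1)              # (N, E)
        gate_vals, expert_idx = probs.topk(self.k, dim=-1)   # top-k experts per token

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Dispatch each token only to the experts it was routed to.
            token_pos, slot = (expert_idx == e).nonzero(as_tuple=True)
            if token_pos.numel() > 0:
                out[token_pos] += gate_vals[token_pos, slot].unsqueeze(-1) * expert(x[token_pos])

        # Auxiliary loss: penalize the dot product of the average routing
        # probability and the fraction of tokens whose top choice is each
        # expert, pushing the router toward uniform expert usage.
        importance = probs.mean(dim=0)                                        # (E,)
        load = F.one_hot(expert_idx[:, 0], probs.size(-1)).float().mean(dim=0)
        aux_loss = probs.size(-1) * (importance * load).sum()
        return out, aux_loss


# Example usage: route 128 tokens through 8 experts, 2 per token.
moe = SparseMoEFFN(d_model=512, d_ff=2048, num_experts=8, k=2)
y, balance_loss = moe(torch.randn(128, 512))
```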
Machine Translation Architectures
With the original Transformer architecture (Vaswani et al., 2017), Ott et al. (2018) found that training with reduced precision and large batches could improve translation performance. Some models achieve better translation performance by using a larger-scale Transformer. Liu et al. (2020a) deepened the encoder and decoder of the Transformer by properly initializing the model. DeepNet (Wang et al., 2022) scaled Transformers up to 1,000 layers by introducing a new normalization function. However, these methods require a great amount of computational cost. Some models make changes to the self-attention module. Peng et al. (2020) proposed the MAE model, in which reallocating attention heads yields better translation performance because useless attention heads are pruned. However, their method is difficult to scale up for further improvement of the results because it