2.2 NAS on Transformers
Recent work by Liu et al. [2022] applies RankNAS [Hu et al., 2021] to cosFormer [Qin et al., 2022] and standard
Transformers [Vaswani et al., 2017]. They search the hyperparameters of a cosFormer network and compare the results
with the same method applied to the standard Transformer. Tsai et al. [2020] search the hyperparameter
space of a BERT [Devlin et al., 2018] model architecture heterogeneously to find an optimal efficient network. These
methods search hyperparameters, whereas we fix the hyperparameters and search over the space of attention mechanisms.
3 Method
3.1 DARTS Style NAS
We train a supernetwork that contains attention heads of each attention type. This is analogous to the edges in a DARTS
cell containing all possible operations for that edge. We also use ‘fixed α’ with masked validation accuracy as the
metric for the strength of each edge, as detailed in Wang et al. [2021]. Standard DARTS assigns a weight to each
edge in the computation graph and applies a softmax to these weights at the output of each cell (see Figure 2, left, in Appendix C). Since
we use a ‘fixed α’ approach, we do not need to do this and simply average our edges without learnable
weights. This allows us to train the supernetwork until convergence and then select the best attention mechanism (or prune out the
worst). Note that this removes the need for the bi-level optimisation required in the original DARTS
paradigm, which uses validation performance to train the edge weights. Our approach is illustrated in Figure 1. The
DARTS supernetwork cell and the standard Transformer architecture are given in Appendix C in Figure 2, to show how
our architecture relates to them.
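For concreteness, a minimal PyTorch-style sketch of one supernetwork edge under the ‘fixed α’ scheme is given below; the class and attribute names are illustrative rather than taken from any particular implementation.

```python
import torch
import torch.nn as nn


class MixedOp(nn.Module):
    """One supernetwork edge holding every candidate operation.

    Standard DARTS mixes the candidates with a softmax over learnable
    architecture weights (alpha); with 'fixed alpha' the outputs are
    simply averaged, so no edge weights are trained.
    """

    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)

    def forward(self, x):
        outputs = torch.stack([op(x) for op in self.candidates], dim=0)
        # Plain average: no learnable edge weights, no bi-level optimisation.
        return outputs.mean(dim=0)
```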
3.2 Architecture and Search Space
Our Transformer encoder supernetwork consists of an embedding layer, an attention block supernetwork, a feed-forward
network (FFN), and a linear classifier layer. In a standard Transformer, each attention block consists of: Q, K, and V
linear projections from the feature dimension to (number of heads × head dimension), scaled dot-product attention on each head,
concatenation of the heads, and a final dense projection back to the feature dimension [Vaswani et al., 2017]. In theory,
we want to search over the space of alternatives to the scaled dot-product attention operation independently for each
head. This would replace scaled dot-product attention in a standard Transformer with an average of candidate attention
mechanisms in a DARTS-like paradigm. The averages from each head would then be concatenated and linearly
projected back to the feature dimension as usual.
However, some attention mechanisms, such as Reformer [Kitaev et al., 2020], modify the linear projections or
concatenation operations. Because of this, we instead search over candidate multi-head attention blocks, each of which
implements its own projections, attention, and concatenation operations. This is illustrated in Figure 4 in Appendix C.
When the candidate blocks are single-head and there are H blocks per candidate attention mechanism, the computation
graph after the architecture search is complete is equivalent to the one obtained by searching over the attention mechanisms
themselves with H heads. Searching at the block level with single-head blocks allows each attention mechanism to learn its
own linear projections, whereas a search over just the attention mechanisms would share these within each head.
The disadvantage is increased memory and computation during the search, since heads no longer share their
linear projections.
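The block-level search can be sketched in PyTorch as below; scaled dot-product attention is shown only as a stand-in for the candidate mechanisms, and the recombination of head slots (summing blocks that each project back to the model dimension) is a simplification of the scheme described above.

```python
import torch
import torch.nn as nn


def scaled_dot_product(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v


class SingleHeadAttentionBlock(nn.Module):
    """A candidate block: its own Q/K/V projections, an attention
    mechanism, and a projection back to the model dimension."""

    def __init__(self, d_model, d_head, attention_fn=scaled_dot_product):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_head)
        self.out = nn.Linear(d_head, d_model)
        # A candidate mechanism would substitute e.g. an efficient attention here.
        self.attention_fn = attention_fn

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return self.out(self.attention_fn(q, k, v))


class BlockSearchLayer(nn.Module):
    """Supernetwork attention layer searched at the block level: for each
    of the H head slots, the outputs of one block per candidate mechanism
    are averaged ('fixed alpha' mixing), and the head slots are summed."""

    def __init__(self, d_model, d_head, num_heads, candidate_fns):
        super().__init__()
        self.head_slots = nn.ModuleList(
            nn.ModuleList(
                SingleHeadAttentionBlock(d_model, d_head, fn) for fn in candidate_fns
            )
            for _ in range(num_heads)
        )

    def forward(self, x):
        out = torch.zeros_like(x)
        for slot in self.head_slots:
            out = out + torch.stack([blk(x) for blk in slot], dim=0).mean(dim=0)
        return out
```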
3.3 Experiments
3.3.1 Finding Optimal Homogeneous Attention
For the first experiment we train a single-layer network with a block for each attention mechanism. These blocks
are each initialized with the desired number of final heads for that task. After training we perform a single masked
validation accuracy trial and pick the highest-scoring mechanism as a good candidate for homogeneous use in a full
Transformer model on that task. This paradigm is summarized as Algorithm 1 in Appendix D.
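A minimal sketch of this selection step is given below; `set_active` and `evaluate` are hypothetical helpers that mask out all but one candidate mechanism and compute validation accuracy, respectively, and the exact procedure is the one given in Algorithm 1.

```python
import torch


@torch.no_grad()
def select_best_mechanism(supernet, candidate_names, val_loader, evaluate):
    """Single masked-validation trial: enable one candidate mechanism at
    a time, score it on the validation set, and return the best one."""
    scores = {}
    for name in candidate_names:
        supernet.set_active(name)  # hypothetical: mask every other candidate's blocks
        scores[name] = evaluate(supernet, val_loader)  # validation accuracy under the mask
    return max(scores, key=scores.get), scores
```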
3.3.2 Finding Optimal Heterogeneous Attention
In this paradigm we try to learn an optimal single layer with a fixed number of heads and stack it, analogous to searching
for an optimal cell in DARTS.
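As one possible reading of this step, the sketch below keeps the highest-scoring single-head blocks, which may be of different attention types, to form a single heterogeneous layer and then stacks copies of it; the selection rule and the container class are assumptions made for illustration.

```python
import copy

import torch.nn as nn


class HeterogeneousAttentionLayer(nn.Module):
    """Sums the retained single-head blocks, which may implement
    different attention mechanisms."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        return sum(blk(x) for blk in self.blocks)


def build_and_stack(blocks, scores, num_heads, num_layers):
    """Keep the `num_heads` best-scoring blocks and stack copies of the
    resulting layer, analogous to stacking the best DARTS cell."""
    ranked = [blk for _, blk in sorted(zip(scores, blocks), key=lambda p: p[0], reverse=True)]
    layer = HeterogeneousAttentionLayer(ranked[:num_heads])
    # FFN, normalisation, and classifier layers are omitted for brevity.
    return nn.Sequential(*(copy.deepcopy(layer) for _ in range(num_layers)))
```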