and wider model. First, it is more latency-friendly on commodity hardware. Second, wider models have fewer parameters. Third, the outputs of wider models are more interpretable.
To summarise, we make the following contributions in this paper:
• We demonstrate that wider and shallower models can typically equal, and sometimes beat, the accuracy of deeper models when there is no pretraining of weights or embeddings. Across all four tasks, the average accuracy of the vanilla Transformer increases by 0.4% when going from the standard deep models to our single-layer wide models.
• We show that our results are consistent across a variety of attention mechanisms and input sequence lengths, and thus that there is a general design equivalence between increasing the depth of a Transformer model and increasing its width. Averaged across all non-vanilla attention types and tasks, accuracy increases by 0.3% from the deepest to the widest models.
• We show that widening the models while fixing the attention computation size results in fewer parameters overall and faster inference (a rough parameter-count sketch follows this list). We show that wider models are on average 1.4× smaller and have 3.1× faster inference latency on a CPU and 1.9× faster on a GPU, compared to deep models.
• We demonstrate how single-layer networks can yield more interpretable predictions by inspecting the attention weights of each head in the single layer.
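To make the parameter argument concrete, the following is a minimal sketch of how the counts compare. The dimensions, the assumption that the feed-forward size stays fixed, and the exact widening scheme are illustrative choices, not the configurations used in our experiments.

def encoder_params(d_model, d_attn, d_ffn, n_layers):
    """Rough parameter count for a stack of Transformer encoder layers.

    d_attn is the total attention feature dimension across all heads;
    the Q, K, V and output projections map between d_model and d_attn.
    Biases and layer-norm parameters are ignored for brevity.
    """
    attn = 4 * d_model * d_attn   # Q, K, V and output projection matrices
    ffn = 2 * d_model * d_ffn     # the two feed-forward matrices
    return n_layers * (attn + ffn)

d_model, d_ffn = 256, 1024  # hypothetical dimensions, for illustration only

# Deep baseline: 6 layers, attention dimension equal to d_model.
deep = encoder_params(d_model, d_attn=d_model, d_ffn=d_ffn, n_layers=6)

# Wide single layer: attention dimension scaled 6x so the total attention
# computation matches the deep stack, but the feed-forward block (and the
# other per-layer parameters) appear only once.
wide = encoder_params(d_model, d_attn=6 * d_model, d_ffn=d_ffn, n_layers=1)

print(f"deep: {deep:,}  wide: {wide:,}  ratio: {deep / wide:.2f}x")

Under these assumed dimensions the wide model comes out roughly 2.3× smaller; the 1.4× figure quoted above is the average measured over our actual model configurations.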
2 Related Work
2.1 Wider Networks
Zagoruyko & Komodakis [2016] and Wu et al. [2019] respectively show and investigate how making ResNet CNNs [He et al., 2016] wider and shallower can improve their performance. Transformers also use residual connections, so this helps motivate our investigation.
Xue et al. [2022b] find that wider layers in Transformers (in both the attention and sequence features) can improve performance on vision and natural language tasks. However, they use a mixture-of-experts layer instead of the typical feed-forward network at the output of the Transformer encoder, because a much larger sequence feature dimension would greatly increase the number of parameters in the feed-forward layer. As we only increase the attention feature dimension, we do not face this issue.
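A back-of-the-envelope comparison makes the distinction clear. Writing d_model for the sequence feature dimension, d_ffn for the feed-forward hidden dimension, and d_attn for the total attention feature dimension, and assuming the conventional choice d_ffn = 4 * d_model purely for illustration:

    feed-forward parameters   ≈ 2 * d_model * d_ffn = 8 * d_model^2   (quadratic in the sequence feature width)
    attention projections     ≈ 4 * d_model * d_attn                  (linear in the attention feature width)

So doubling the sequence feature dimension roughly quadruples the feed-forward parameters, whereas doubling only the attention feature dimension merely doubles the projection parameters.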
Xue et al. [2022a] investigate masked autoencoder training and how it can reduce the problem of over-smoothing in deep Transformer networks, where token embeddings converge to be similar at deeper layers. They explore the optimal configuration of the Transformer when pretraining it with masked autoencoding and then fine-tuning it on specific tasks. They find that for vision Transformers it is better to pretrain with a deep autoencoder rather than a wide one. As we use neither pretraining nor pretrained embeddings, their results are not directly comparable to our own.
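Over-smoothing is usually quantified with some measure of similarity between token representations. The snippet below is one minimal, hypothetical way to measure it, not the metric used by Xue et al. [2022a].

import torch

def mean_token_similarity(hidden):
    """Mean pairwise cosine similarity between token embeddings.

    hidden: tensor of shape (seq_len, d_model) taken from one encoder layer.
    Values approaching 1.0 indicate over-smoothing: the tokens have
    converged to nearly identical representations.
    """
    normed = torch.nn.functional.normalize(hidden, dim=-1)
    sim = normed @ normed.T                       # (seq_len, seq_len)
    n = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()   # drop self-similarity
    return (off_diag / (n * (n - 1))).item()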
2.2 Architecture Optimization for Transformers
Much research in the Transformer field has focused on finding a more efficient attention mechanism [Beltagy et al., 2020,
Child et al., 2019, Choromanski et al., 2020, Katharopoulos et al., 2020, Kitaev et al., 2020, Liu et al., 2018, Tay et al.,
2020a, 2021, Wang et al., 2020b, Zaheer et al., 2020], as the original dot-product attention has quadratic complexity
with respect to input sequence length. Long Range Arena (LRA, Tay et al. [2020b]) compares and contrasts these
methods, noting that in the absence of pretrained embeddings and model parameters, the best attention mechanism is task
dependent; there is no single best attention mechanism. Because of this, we test different attention mechanisms on
a variety of tasks with no pretrained embeddings or model parameters. This makes clear that our findings are largely
independent of attention mechanism and task, and contextualises going wide as a general design decision.
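For reference, a single-head sketch of the vanilla scaled dot-product attention of Vaswani et al. [2017] (batching and masking omitted); the (seq_len × seq_len) score matrix is the source of the quadratic time and memory cost that the variants above try to avoid.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Vanilla single-head attention: q, k, v have shape (seq_len, d_head)."""
    d_head = q.shape[-1]
    scores = q @ k.T / d_head ** 0.5   # (seq_len, seq_len): quadratic in seq_len
    return F.softmax(scores, dim=-1) @ v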
The application of Neural Architecture Search (NAS) to Transformer models has also been explored. Both the Neural
Architecture Transformer (NAT) [Guo et al., 2019] and Hardware-aware Transformers [Wang et al., 2020a] use NAS to
search for models that are more hardware efficient. So et al. [2019] use evolutionary search to optimize components
and connections inside an encoder or decoder layer. Liu et al. [2022] apply RankNAS [Hu et al., 2021] to cosFormer
[Qin et al., 2022] and standard Transformers [Vaswani et al., 2017], performing a hyperparameter search on a
cosFormer network and comparing the results to the same method applied to the standard Transformer. Tsai et al. [2020]
search the hyperparameter space of a BERT [Devlin et al., 2018] architecture heterogeneously to find an optimally
efficient network. None of these works varies or searches over the number of layers in the Transformer encoder, which
is a key factor in what we investigate.