1 Introduction
Recent years have witnessed a major convergence of model architectures across language, vision, speech, and multimodal domains. Specifically, starting from natural language processing, Transformers (Vaswani et al., 2017) have become the de facto standard for various areas, including computer vision (Dosovitskiy et al., 2021), speech (Zhang et al., 2020b), and multimodal modeling (Kim et al., 2021; Wang et al., 2022b). Transformers fully leverage the parallelism of GPU hardware and large-scale data. It is appealing that we can use the same network architecture for a broad range of applications, so that pretrained models can be seamlessly reused with shared implementations and hardware optimizations. Moreover, general-purpose modeling is important for multimodal models, as different modalities can be jointly encoded and fused by one model.
However, despite sharing the name “Transformers”, the architectures used for different tasks differ significantly in their implementations. Figure 1 summarizes the architectures of state-of-the-art models that are widely used in various communities. For instance, some models (e.g., GPT and ViT) adopt Pre-LayerNorm (Pre-LN) Transformers, while others use Post-LayerNorm (Post-LN) variants (e.g., BERT and machine translation models) for better performance. Rather than directly reusing the same architecture, we need to compare the two Transformer variants on the specific task or modality to determine the backbone, which is inefficient for model development. More importantly, for multimodal models, the optimal Transformer variant usually differs across input modalities. In BEiT-3 (Wang et al., 2022b) vision-language pretraining, for example, Post-LN is sub-optimal for vision encoding while Pre-LN is sub-optimal for the language part. The true convergence of multimodal pretraining requires a unified architecture that performs well across tasks and modalities. In addition, a pain point of Transformer architectures is training stability, especially for large-scale models. We usually need considerable effort to tune hyperparameters or babysit training processes.
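To make the Pre-LN versus Post-LN distinction concrete, the following sketch contrasts where the two variants place LayerNorm around a sublayer's residual connection. It is a minimal PyTorch illustration, not the reference code of any of the cited models; class names and the `sublayer` argument are ours.

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN: LayerNorm is applied after the residual addition (e.g., BERT)."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer          # self-attention or feed-forward module
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    """Pre-LN: LayerNorm is applied to the sublayer input, inside the residual branch (e.g., GPT, ViT)."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```

Post-LN normalizes the residual stream itself, which tends to help final performance but can destabilize deep models, whereas Pre-LN keeps the residual path identity-like and trains more stably at the cost of accuracy on some tasks.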
As a result, we call for developing Foundation Transformers for true general-purpose modeling.
First, the desired architecture should serve as a go-to backbone for various tasks and modalities, so that we can use the same architecture without trial and error. This general-purpose design principle also greatly supports the development of multimodal foundation models, as one unified Transformer can be used for various modalities without performance degradation. Second, the architecture should provide guaranteed training stability. This property significantly mitigates the difficulty of large-scale pretraining of foundation models.
In this work, we introduce MAGNETO as an implementation of Foundation Transformers that fulfills the above goals. Specifically, we introduce Sub-LayerNorm (Sub-LN), which adds an extra LayerNorm to each sublayer (i.e., the multi-head self-attention and the feed-forward network). Moreover, MAGNETO uses a novel initialization method with a theoretical guarantee of improved training stability, which allows the models to be scaled up without pain. We evaluate MAGNETO on a wide range of tasks and modalities, namely, masked language modeling (i.e., BERT), causal language modeling (i.e., GPT), machine translation, masked image modeling (i.e., BEiT), speech recognition, and vision-language pretraining (i.e., BEiT-3). Experimental results show that MAGNETO significantly outperforms the de facto Transformer variants on downstream tasks. In addition, MAGNETO is more stable in terms of optimization, which allows larger learning rates to improve results without training divergence.
2 TL;DR for Practitioners
Figure 1 illustrates the overview of the MAGNETO architecture. There are two key improvements in terms of modeling. First, compared to the Pre-LN variant, Sub-LN introduces another LayerNorm inside each sublayer (i.e., the multi-head self-attention and the feed-forward network): one before the input projection, and the other before the output projection. Second, we use the initialization derived theoretically in DeepNet (Wang et al., 2022a), which fundamentally improves training stability and allows the model to be scaled up to massive sizes without pain.
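As a rough illustration of where the two LayerNorms sit, the sketch below renders a Sub-LN feed-forward sublayer in minimal PyTorch. Module names and the choice of activation are ours rather than taken from the reference code; the self-attention sublayer is treated analogously, with the extra LayerNorm placed before its output projection.

```python
import torch.nn as nn

class SubLNFeedForward(nn.Module):
    """Feed-forward sublayer with Sub-LN: one LayerNorm before the input
    projection and an extra LayerNorm before the output projection."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)          # before the input projection (as in Pre-LN)
        self.fc_in = nn.Linear(dim, hidden_dim)
        self.activation = nn.GELU()
        self.norm_out = nn.LayerNorm(hidden_dim)  # extra LayerNorm before the output projection
        self.fc_out = nn.Linear(hidden_dim, dim)

    def forward(self, x):
        # residual connection wraps the whole sublayer
        h = self.fc_in(self.norm_in(x))
        h = self.fc_out(self.norm_out(self.activation(h)))
        return x + h
```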
Figure 2 presents the implementation of MAGNETO. It requires only a few lines of code changes on top of the vanilla Transformer architecture. Notably, following the derivation from DeepNet, the weights of the query and key projections are not scaled during initialization.
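The sketch below is a hypothetical helper that only illustrates which projections receive the scaled initialization and which are left untouched. The module names (q_proj, k_proj) and the Xavier-then-scale formulation are assumptions for exposition; the scaling factor gamma is derived from the model depth in the paper and is simply passed in here.

```python
import torch.nn as nn

def init_magneto_style(layer, gamma):
    """Initialize a Transformer layer's linear projections, scaling all of them
    by gamma except the query and key projections (left unscaled, per DeepNet)."""
    for name, sub in layer.named_modules():
        if isinstance(sub, nn.Linear):
            nn.init.xavier_normal_(sub.weight)
            if sub.bias is not None:
                nn.init.zeros_(sub.bias)
            if "q_proj" in name or "k_proj" in name:
                continue  # query/key projections keep the standard initialization
            sub.weight.data.mul_(gamma)  # value/output projections and FFN weights are scaled
```

In practice, gamma would be computed once from the number of encoder and decoder layers following the DeepNet-style derivation, and the helper applied to every Transformer layer before training.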