neither of which is satisfactory for Transformer-based models. First, these techniques may be too general [48, 6, 50, 33, 28] to exploit the specifics of Transformer-based models, such as the multi-headed attention mechanism used in Transformers [63], or optimization opportunities available in specific layers such as LayerNorm [4]. For example, although checkpointing [8, 28] can significantly enlarge the batch size, it also incurs high overhead (e.g., a 30% performance degradation observed in prior work [8]). Second, the prior works that are specific target other types of models or layers, with ideas that do not carry over to Transformers. For example, Gist and In-Place ABN deal with CNNs [26, 53].
In our work, we demonstrate that low-overhead memory footprint reduction can translate into a net improvement in throughput. In addition, unlike prior works that do not leverage the specifics of Transformer-based models, we propose a new approach tailored to them, called Tempo. Tempo includes three new techniques: (i) In-place GELU, (ii) In-place LayerNorm, and (iii) Sub-Layer Dropout Recomputation. In-place GELU and In-place LayerNorm both use alternative derivations of the backward passes of these layers. These derivations allow some activations that are normally retained during the forward pass (to be used later in the backward pass) to be discarded, leading to a more memory-efficient implementation. Sub-Layer Dropout Recomputation discards activations within the memory-intensive attention mechanism during the forward pass and recomputes them during the backward pass, without recomputing unnecessary tensors. By reducing the total memory footprint of the models during training, Tempo enables larger batch sizes and thus higher training throughput. To our knowledge, this is the first work to explore memory footprint optimizations specifically for Transformer-based layers that demonstrates not just a footprint reduction, but an actual increase in throughput from the resulting memory savings. Tempo reduces the memory footprint of training Transformer-based models by targeting a major part of the total footprint: the activation memory [74], i.e., the feature maps saved during the forward pass that are required for backpropagation [54]. All of the proposed techniques provide a large memory footprint reduction with very low throughput degradation (as low as 1%). Our results show up to a 2× increase in batch size for BERT_LARGE pre-training at a sequence length of 512 on modern GPUs, while increasing the training throughput by up to 16%.
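To make the recompute-instead-of-store pattern behind Sub-Layer Dropout Recomputation concrete, the sketch below wraps the scores/softmax/dropout core of attention in PyTorch's generic checkpoint utility. This is only a minimal illustration of the general idea under assumed tensor shapes; it is not Tempo's implementation, and Tempo's In-place GELU and In-place LayerNorm rely on alternative backward derivations that are not shown here.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class RecomputedAttentionCore(nn.Module):
    # Illustrative sketch: the [B, A, S, S] scores, softmax probabilities, and
    # dropout output are freed after the forward pass and recomputed from
    # Q, K, V during the backward pass. checkpoint() also restores the RNG
    # state, so the recomputed dropout mask matches the original one.
    def __init__(self, dropout_p: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_p)

    def _core(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        probs = self.dropout(torch.softmax(scores, dim=-1))   # [B, A, S, S]
        return torch.matmul(probs, v)                         # [B, A, S, head_dim]

    def forward(self, q, k, v):
        # Only q, k, v and the output are kept for backward; the large
        # [B, A, S, S] intermediates are rebuilt on demand.
        return checkpoint(self._core, q, k, v, use_reentrant=False)
```

Here q, k, and v are assumed to already be split into heads, i.e., of shape [B, A, S, head_dim]. Generic checkpointing of this kind trades recomputation time for memory; Tempo instead aims to avoid recomputing tensors that are not actually needed.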
2 Background and Motivation
2.1 Memory Footprint of BERT
BERT [12] is a popular natural language processing model that is based on the Transformer architecture [63]. The model has been successfully applied to a variety of tasks such as question answering (SQuAD [51]), paraphrasing (MRPC [13]), natural language inference (MNLI [68]), and others [57, 72] through a two-step training process. The training process entails first training on a general unlabelled data set (pre-training) [12]. The faster second step (fine-tuning) takes the parameter weights produced by pre-training and further trains them on a downstream task, such as question answering [51] or sentiment analysis [57], through the addition of a specialized output layer [12].
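As an illustration of this setup (not part of Tempo), the sketch below adds a hypothetical sentiment-classification head on top of a pre-trained encoder using the Hugging Face transformers library; the class name and number of labels are placeholders.

```python
import torch.nn as nn
from transformers import BertModel

class BertForSentiment(nn.Module):
    # Hypothetical fine-tuning model: a pre-trained encoder plus a small
    # task-specific output layer trained on the downstream task.
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(out.pooler_output)  # logits over sentiment labels
```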
The BERT architecture allows for multiple configurations depending on the model hyperparameters selected, some of which are derived from the original Transformer paper [63]; these include the hidden size (H), sequence length (S), number of attention heads (A), and number of layers (L).
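For concreteness, the two standard configurations from the BERT paper [12] are summarized below; the sequence length S is chosen at training time (e.g., 128 or 512 during pre-training) rather than fixed by the architecture.

```python
# Standard BERT configurations [12]: number of layers (L), hidden size (H),
# and number of attention heads (A).
BERT_BASE  = dict(L=12, H=768,  A=12)   # ~110M parameters
BERT_LARGE = dict(L=24, H=1024, A=16)   # ~340M parameters
```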
In the context of this work, we highlight the relevant parts of the model and their activation memory footprint with respect to these hyperparameters, referring to the annotated points in Figure 1.
① At this point, where attention [63] is computed, the size of each feature map grows as O(S²); a variety of techniques and models have been explored in the literature to deal with this problem [61]. Additionally, at this point we store three feature maps of size [B×A×S²]. Calculations based on Figure 1 with the BERT_BASE parameters show that, at a sequence length of 512, these three feature maps account for 56% of the encoder layer activation memory (see the back-of-envelope calculation after point ③).
② At these two points, we store the inputs to the two LayerNorm layers, each of size [B×S×H].
③ Here, a GELU [21] layer is used as the activation function following the preceding fully-connected layer, whose output is of size [B×S×4H]. The activation memory retained for this function accounts for almost 17% of the total layer activation memory of BERT_BASE at a sequence length of 128.
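As a back-of-envelope check of the feature-map sizes at points ①–③ (assuming BERT_BASE dimensions, a single sample B = 1, and half-precision activations at 2 bytes per element; the exact percentages quoted above depend on the full per-layer accounting in Figure 1):

```python
BYTES = 2                      # assumed fp16/bf16 activations
B, A, H = 1, 12, 768           # BERT_BASE: 12 heads, hidden size 768

S = 512
attn_maps = 3 * B * A * S * S * BYTES   # ① three [B, A, S, S] maps: 18 MiB/layer
ln_inputs = 2 * B * S * H * BYTES       # ② two [B, S, H] LayerNorm inputs: 1.5 MiB/layer

S = 128
gelu_input = B * S * 4 * H * BYTES      # ③ one [B, S, 4H] GELU input: 768 KiB/layer
```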