Tempo: Accelerating Transformer-Based Model
Training through Memory Footprint Reduction
Muralidhar Andoorveedu1, Zhanda Zhu2,3, Bojian Zheng1,3, Gennady Pekhimenko1,3
1University of Toronto, Toronto, Canada
2Shanghai Jiao Tong University, Shanghai, China
3Vector Institute, Toronto, Canada
{andoorve, zhanda, bojian, pekhimenko}@cs.toronto.edu
Abstract
Training deep learning models can be computationally expensive. Prior works have
shown that increasing the batch size can potentially lead to better overall throughput.
However, the batch size is frequently limited by the accelerator memory capacity
due to the activations/feature maps stored for the training backward pass, as larger
batch sizes require larger feature maps to be stored. Transformer-based models,
which have recently seen a surge in popularity due to their good performance
and applicability to a variety of tasks, have a similar problem. To remedy this
issue, we propose Tempo, a new approach to efficiently use accelerator (e.g., GPU)
memory resources for training Transformer-based models. Our approach provides
drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing
the memory usage and ultimately leading to more efficient training. We implement
Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT-Large pre-training task. We demonstrate that Tempo enables up to 2× higher batch sizes and 16% higher training throughput over the state-of-the-art baseline. We also evaluate Tempo on GPT2 and RoBERTa models, showing 19% and 26% speedup over the baseline.
1 Introduction
Transformer-based models such as BERT [12] and GPT-2 [49] have found success in numerous general natural language processing tasks including question answering [51], paraphrasing [13], natural language inference [68], and even areas outside language tasks such as image recognition [14].
However, training such models can be highly expensive in terms of time, monetary resources, and carbon footprint [24, 60]. For instance, the pre-training of BERT-Large takes 4 days to complete on 16 Cloud TPUs (64 TPU chips total) [12], which costs about $10,000 [56]. Training a more recent Transformer-based model, GPT-3, has an even more astonishing price tag: $12 million [66]. Hence, even a small decrease in the end-to-end training time of Transformer-based models matters.
Although there has been significant progress made in accelerating Transformers using specialized hardware (e.g., Google TPUs [30], NVIDIA Tensor Cores [39]) in the past few years, a fundamental issue with Transformer-based models is that they are limited by the memory capacity of hardware accelerators. For example, even a batch size of 1 does not fit into a modern GPU with 12GB of memory when training BERT with sequence length 512 [15]. Reducing the memory footprint [48, 8, 52] is a viable option to allow larger-batch training, leading to better hardware utilization and ultimately improved training throughput [73].
Many existing approaches to memory footprint reduction (e.g., offloading [52, 65, 48], checkpointing [8, 73, 33, 28], and data compression/encoding [26, 6]) either have high computational overhead or do not apply to Transformer-based models directly. Prior approaches fall into two main categories,
neither of which is satisfactory for the Transformer-based model case. First, these techniques may be too general [48, 6, 50, 33, 28] to exploit the specifics of Transformer-based models, such as the multi-headed attention mechanism used in Transformers [63], or optimization opportunities available in specific layers such as the LayerNorm [4] layer. For example, although checkpointing [8, 28] can significantly enlarge the batch size, it also brings high overhead (e.g., 30% performance degradation observed in some prior works [8]). Second, prior works that are layer-specific focus on other types of models and layers, with ideas that do not apply to Transformers. For example, Gist [26] and In-Place ABN [53] target CNNs.
In our work, we demonstrate that low-overhead memory footprint reduction can lead to a positive improvement in throughput. In addition, unlike prior works that do not leverage the specifics of Transformer-based models, we propose a new approach specifically tailored to them, called Tempo. This approach includes three new techniques: (i) In-place GELU, (ii) In-place LayerNorm, and (iii) Sub-Layer Dropout Recomputation. In-place GELU and In-place LayerNorm both use alternative derivations for the backward passes of these layers. These derivations allow some activations that are normally retained during the forward pass (to be later used in the backward pass) to be discarded, leading to a more memory-efficient implementation. Sub-Layer Dropout Recomputation discards activations within the high-memory-footprint attention mechanism during the forward pass, then recomputes them during the backward pass without recomputing unnecessary extra tensors. Tempo is able to increase training throughput with larger batch sizes by reducing the total memory footprint of the models during training. To the best of our knowledge, this is the first work to explore memory footprint optimizations specifically for Transformer-based layers that shows not just a footprint reduction, but an actual increase in throughput using the extra memory savings. Tempo reduces the memory footprint of training Transformer-based models by targeting a major part of the total footprint: the activation memory [74] (the feature maps saved during the forward pass of the model that are required for backpropagation [54]). All the proposed techniques provide a large memory footprint reduction with very low throughput degradation (as low as 1%). Our results show up to a 2× improvement in batch size for BERT-Large pre-training at a sequence length of 512 on modern GPUs, while increasing training throughput by up to 16%.
2 Background and Motivation
2.1 Memory Footprint of BERT
BERT [12] is a popular natural language processing model that is based on the Transformer architecture [63]. The model has been successfully applied to a variety of tasks such as question answering (SQuAD [51]), paraphrasing (MRPC [13]), natural language inference (MNLI [68]), and others [57, 72] through a two-step training process. The first step trains on a general unlabelled data set (pre-training) [12]. The faster second step (fine-tuning) takes the parameter weights produced by pre-training and further trains on a downstream task such as question answering [51] or sentiment analysis [57], which it accomplishes through the addition of a specialized output layer [12].
The BERT architecture allows for multiple different configurations depending on the selected model hyperparameters, some being derived from the original Transformer paper; these include the hidden layer size (H), sequence length (S), number of attention heads (A), and number of layers (L).
In the context of this work, we point out some of the relevant parts of the model and their activation memory footprint with respect to these hyperparameters, referring to Figure 1.
(1) At this point, where attention [63] is calculated, we observe that the size of each of the feature maps grows as O(S^2); a variety of techniques and models have been explored in the literature to deal with this problem [61]. Additionally, at this point we store three feature maps of size [B × A × S^2]. Calculations based on Figure 1 at the BERT-Base parameters show that at a sequence length of 512 these three feature maps account for 56% of the encoder layer activation memory.
(2) At these two points, we store the inputs to the two LayerNorm layers, each of size [B × S × H].
(3) Here a GELU [21] layer is used as the activation function for the preceding fully-connected layer, whose output is of size [B × S × 4H]. The stored activation for this function accounts for almost 17% of the total layer activation memory of BERT-Base at a sequence length of 128.
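To make the scaling concrete, the short calculation below (our own illustrative sketch, assuming fp32 activations and the standard BERT-Base settings H = 768, A = 12; the helper name mib is ours) computes the sizes of the tensors named in points (1) to (3) for a batch of 32. It covers only these tensors, so it illustrates the O(S^2) behaviour rather than reproducing the exact 56% and 17% breakdowns from Figure 1.

BYTES_FP32 = 4

def mib(*dims):
    # Size in MiB of an fp32 tensor with the given dimensions.
    n = 1
    for d in dims:
        n *= d
    return n * BYTES_FP32 / 2**20

B, S, H, A = 32, 512, 768, 12                 # batch, sequence length, hidden size, heads

attention_maps   = 3 * mib(B, A, S, S)        # point (1): three [B x A x S^2] maps
layernorm_inputs = 2 * mib(B, S, H)           # point (2): two [B x S x H] LayerNorm inputs
gelu_input       = mib(B, S, 4 * H)           # point (3): one [B x S x 4H] GELU input

print(f"attention maps:   {attention_maps:.0f} MiB")    # 1152 MiB at S = 512
print(f"LayerNorm inputs: {layernorm_inputs:.0f} MiB")  # 96 MiB
print(f"GELU input:       {gelu_input:.0f} MiB")        # 192 MiB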
Figure 1: A diagram of a single Transformer encoder [63] layer used in BERT [12], based on the Huggingface implementation of BERT [69]. As in the BERT paper, A represents the number of attention heads and H represents the hidden size. We represent the batch size by B and the sequence length by S. Sizes of intermediate tensors (both retained activations and unretained intermediates) are annotated.
2.2 Why Activation Memory Matters
As noted in previous works [52, 26, 48, 73, 6], there are multiple benefits to reducing the memory footprint of models. First, it allows for larger models, which can positively affect the model's performance on downstream tasks [12]. Second, memory footprint reduction can allow for a larger batch size. This, in turn, could lead to better utilization of the GPU hardware [17], increasing the overall throughput [73]. In order to verify this possibility for Transformer-based models, we conduct our own experiments using Huggingface's BERT implementation [69] to train BERT-Large on the MRPC [13] fine-tuning task. Figure 2 shows the throughput on this task for sequence lengths of 128 and 512. From the figure, we conclude that there is a steady improvement in throughput with batch size when the sequence length is 128. This is also the case when the sequence length is 512; however, in this situation the trend ends more abruptly as the memory consumption of the model exceeds the GPU memory capacity, showing a clear opportunity to take advantage of memory footprint reduction.
Figure 2: Plots of throughput (sequences/s) vs. batch size for BERT-Large [12] fine-tuning on the MRPC [13] task at sequence lengths 128 and 512 on four 2080Ti [40] GPUs. The maximum batch sizes are 16 and 2, respectively.
We note that previous works on Transformer-based models show that although the model parameters contribute to the memory footprint, the main memory capacity consumer during training is actually the activation feature maps [74, 28, 8, 48, 26, 33, 6]. In addition, the majority of this activation memory is used in the BERT Transformer encoder layers. Profiling the Huggingface BERT-Base implementation [69] on the MRPC [13] fine-tuning task at a batch size of 32 and sequence length of 128 shows that 66% of the total memory is taken up by these encoder activations. More details are shown in Appendix A.
2.3 Key Prior Works
There are three major prior techniques for reducing the training memory footprint of deep learning models. The first of these is Checkpointing [8, 33, 28, 73]. This technique involves discarding certain feature maps in the forward pass while retaining others. Later, in the backward pass, the discarded feature maps can be recomputed from the retained feature maps and used in the computation of the gradients. The second technique is Offloading [48, 52, 65]. In this case, the main idea is to take feature maps that would otherwise be stored in GPU memory and instead offload them to CPU memory; these techniques can also pre-fetch tensors from CPU memory in anticipation of their use. Offloading suffers from a dependence on system variables such as the communication channel bandwidth [52, 48], and it requires extensive engineering effort to avoid high overhead [6]. Finally, Compression/encoding can be divided into two categories, lossless and lossy [26, 6]; the fundamental idea is to compress, or otherwise reduce the space taken up by, feature maps in the forward pass, and then decompress them for use in the backward pass.
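For concreteness, the snippet below illustrates the first technique using PyTorch's built-in torch.utils.checkpoint utility. This is a generic sketch of checkpointing, not Tempo's mechanism; the feed-forward block and tensor shapes are our own illustrative choices.

import torch
from torch.utils.checkpoint import checkpoint

# A Transformer-style feed-forward block. When wrapped with checkpoint(), its
# internal activations (e.g., the 4H-wide GELU input/output) are not kept for
# the backward pass; they are recomputed from the block's input instead.
block = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 768),
)

x = torch.randn(8, 128, 768, requires_grad=True)   # [batch, sequence, hidden]
y = checkpoint(block, x, use_reentrant=False)      # forward pass, inner activations discarded
y.sum().backward()                                 # block's forward is re-run here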
These techniques are largely orthogonal to one another, as was shown in prior works where both offloading and checkpointing are used simultaneously [48, 65]. We expand on these techniques in Appendix C.
2.4 Why Tempo?
Although the techniques in the previous section show good performance on a variety of models, they suffer from several issues. Checkpointing's scope is often too broad to consider certain layer-specific optimizations and alternative derivations that could provide lower overhead [53]. Furthermore, its overhead can be high (as much as 30%) [8]. Offloading can be system-dependent and requires significant engineering effort, while compression can be lossy or not applicable to the Transformer case. Hence, there is a clear need for a deeper look at activation memory optimizations for Transformer-based neural networks in particular. To the best of our knowledge, our work is the first to explore such optimizations tuned to improving the throughput of Transformer-based models. Table 1 shows a summary comparison of Tempo and various other techniques, highlighting the major points that differentiate our technique from prior work.
Feature                      Capuchin   Checkmate   ActNN   Gist   Tempo
Layer-Specific                  ✗           ✗          ✗      ✓      ✓
Transformer-Specific            ✗           ✗          ✗      ✗      ✓
Lossless                        ✓           ✓          ✗      ✓¹     ✓²
Drop-In Layer Replacement       ✗           ✗          ✓      ✓      ✓
Online                          ✓           ✗          ✓      ✓      ✓

Table 1: Comparison between Tempo and Capuchin [48], Checkmate [28], ActNN [6], and Gist [26].
3 Tempo: Key Ideas
We now present the major ideas behind the design of Tempo: (1) In-place GELU, (2) In-place LayerNorm, and (3) Sub-Layer Dropout Recomputation. The common theme behind all of these ideas is to compute the backward pass as normal while using less storage to do so. To this end, In-place GELU and In-place LayerNorm compute the output of each layer in-place, using the output activation rather than the input to compute the gradient. Sub-Layer Dropout Recomputation also discards the output and, through a closer look at the structure of the Dropout layer, is able to recompute it during the backward pass without excessive recomputation. We strongly suggest reading Appendix E for the implementation details. In that appendix we also describe an additional softmax optimization that further reduces memory [18].
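As a concrete illustration of the sub-layer idea, the sketch below (our own, assuming standard inverted dropout; Tempo's actual kernels are described in Appendix E) fuses the attention-probability dropout with the subsequent matmul against V. It stashes only the softmax probabilities, the boolean dropout mask, and V, and rebuilds the dropped-out [B, A, S, S] map during the backward pass instead of storing it.

import torch

class DropoutRecomputeMatmul(torch.autograd.Function):
    # Computes context = dropout(probs) @ v while saving only probs, the boolean
    # dropout mask, and v. The dropped-out [B, A, S, S] tensor is discarded in
    # the forward pass and recomputed in the backward pass.

    @staticmethod
    def forward(ctx, probs, v, p):
        mask = torch.rand_like(probs) >= p      # boolean (8-bit) keep-mask
        dropped = probs * mask / (1.0 - p)      # inverted dropout
        ctx.save_for_backward(probs, mask, v)
        ctx.p = p
        return dropped @ v

    @staticmethod
    def backward(ctx, grad_out):
        probs, mask, v = ctx.saved_tensors
        dropped = probs * mask / (1.0 - ctx.p)  # recomputed, never stored
        grad_dropped = grad_out @ v.transpose(-1, -2)
        grad_probs = grad_dropped * mask / (1.0 - ctx.p)
        grad_v = dropped.transpose(-1, -2) @ grad_out
        return grad_probs, grad_v, None         # no gradient for the dropout rate

# Usage inside an attention block (shapes: probs [B, A, S, S], v [B, A, S, H/A]):
# context = DropoutRecomputeMatmul.apply(attn_probs, value, 0.1)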
1 Some of the Gist [26] optimizations are lossy.
2 Accuracy of our lossy optimization is tunable, offering a flexible tradeoff between the accuracy and the hardware cost.
3.1 In-place GELU
The GELU layer is used as an activation function for the feed-forward section of the BERT layer (point 3 in Figure 1) [12]. A plot of this function is shown in Figure 3a. Referring to the baseline in Figure 3b, note that both X and Y are stored for the backward pass: Y is needed for the downstream fully connected layer, while X is stored for the GELU layer itself [46]. Prior work has demonstrated that certain activation functions such as ReLU may be computed in-place [26] without affecting the calculation of the backward pass. If we were able to compute the GELU function in-place, potentially by recovering the input from the output on the backward pass, we could save the storage required for X. However, this is impossible to do directly. A key observation about the GELU function is that it is not bijective; hence there is no function that can compute the input from the output without additional information.
However, we observe that the GELU function is continuous and has only one extremum, a minimum at x ≈ −0.75179, as can be seen in Figure 3a. Notably, this implies that just one additional piece of information, namely which side of the minimum the input originates from, allows us to compute the inverse of the GELU. This is because on each side of the minimum the function is one-to-one, and hence the input is recoverable from the output in each section. Based on this key observation, we can discard the input and simply retain the output of the GELU, along with the additional information on whether the input is greater than or equal to the value at which the minimum occurs. Figure 3b illustrates the difference between our method and the baseline.
In order to execute this efficiently on a real system, we note that the original derivative in terms of the input can be composed with the function inverse to create a composite kernel. This kernel consists of a polynomial approximation of this composite function; the approximation is necessary since GELU is transcendental, and therefore the inverse cannot be expressed in terms of elementary functions [58]. Further details are discussed in Appendix E.
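To make the mechanism concrete, the following is a minimal PyTorch sketch of the idea (our own, not Tempo's fused kernel): it stores only the GELU output and a boolean side-of-minimum mask, and on the backward pass recovers the input numerically by bisection as a stand-in for the polynomial approximation of the composite function described above. The constant X_MIN and all helper names are ours.

import math
import torch

SQRT_2 = math.sqrt(2.0)
X_MIN = -0.75179                     # location of GELU's single minimum

def gelu(x):
    # Exact (erf-based) GELU.
    return 0.5 * x * (1.0 + torch.erf(x / SQRT_2))

def gelu_grad(x):
    # d/dx GELU(x) = Phi(x) + x * phi(x), with Phi/phi the standard normal CDF/PDF.
    phi = torch.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * (1.0 + torch.erf(x / SQRT_2))
    return Phi + x * phi

def invert_gelu(y, right_of_min, bound=10.0, iters=60):
    # Recover x from y = GELU(x) by bisection on the monotonic branch selected by
    # the mask (True: x >= X_MIN, increasing; False: x < X_MIN, decreasing).
    # Assumes the original inputs lie in [-bound, bound].
    low = torch.where(right_of_min, torch.full_like(y, X_MIN), torch.full_like(y, -bound))
    high = torch.where(right_of_min, torch.full_like(y, bound), torch.full_like(y, X_MIN))
    for _ in range(iters):
        mid = 0.5 * (low + high)
        f = gelu(mid)
        too_small = torch.where(right_of_min, f < y, f > y)   # is mid left of the true x?
        low = torch.where(too_small, mid, low)
        high = torch.where(too_small, high, mid)
    return 0.5 * (low + high)

class InplaceGELUSketch(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = gelu(x)
        # Keep only the output and the side-of-minimum mask; X itself is discarded.
        ctx.save_for_backward(y, x >= X_MIN)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        y, right_of_min = ctx.saved_tensors
        x = invert_gelu(y, right_of_min)   # stand-in for the fused polynomial kernel
        return grad_out * gelu_grad(x)

# Usage: y = InplaceGELUSketch.apply(x)

Since Y is retained for the downstream fully connected layer in any case, the only storage added relative to the baseline's Y is the boolean mask.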
Figure 3(a): A plot of the GELU [21] function near the origin, with the minimum point marked.
Figure 3(b): Saved feature maps for the baseline and for Tempo. Note that our method saves only an 8-bit mask³ denoting whether the input lies above or below the point at which the minimum occurs, instead of the full 32-bit input feature map.
3.2 In-place LayerNorm
The LayerNorm layer is used at multiple points in the Transformer encoder layer [63], which we denote by point 2 in Figure 1. Usually, the gradient computation of LayerNorm relies on the gradient input from the next layer, as well as the input feature map, which is stashed for this computation [46]. Similar to GELU, we are able to derive an expression for the gradient of the LayerNorm layer as a function of its output. In this context, the output of LayerNorm must be stashed anyway to compute the gradient of the subsequent fully connected layer. Using this approach, the memory footprint overhead of LayerNorm is just the intermediate mean and variance computed in the forward pass. The full derivation, which extends the treatment of BatchNorm in [53], is presented in Appendix E.
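As a sketch of this output-based form (our rendering of the standard LayerNorm backward rewritten in terms of the output, following the In-Place ABN-style treatment of BatchNorm [53]; Tempo's exact derivation is in Appendix E), let $y_i = \gamma_i \hat{x}_i + \beta_i$ over a hidden dimension of size $H$, let $g_i = \partial L / \partial y_i$ be the incoming gradient, and let $\sigma$ be the stored standard deviation. The normalized input $\hat{x}_i = (y_i - \beta_i) / \gamma_i$ is then recoverable from the output, and

\[
\frac{\partial L}{\partial x_i} \;=\; \frac{1}{\sigma}\left[\gamma_i g_i \;-\; \frac{1}{H}\sum_{j=1}^{H}\gamma_j g_j \;-\; \hat{x}_i \cdot \frac{1}{H}\sum_{j=1}^{H}\gamma_j g_j \hat{x}_j\right],
\]

which references only the output, the affine parameters, and the saved normalization statistics, never the input feature map $x$.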
Comparison with Checkpointing: Note that although In-place GELU requires slightly more memory than checkpointing this layer (which would store only X and recompute Y from X during the backward pass), the checkpointing alternative incurs additional overhead due to that recomputation.
3 PyTorch boolean masks use 8 bits per value [46]. Masks could also be implemented manually as 1-bit, but this brings extra overhead due to unpacking and packing bit tensors.