neither of which is satisfactory for Transformer-based models. First, these techniques may be too general [48, 6, 50, 33, 28] to exploit the specifics of Transformer-based models, such as the multi-headed attention mechanism used in Transformers [63], or optimization opportunities available in specific layers such as LayerNorm [4]. For example, although checkpointing [8, 28] can significantly enlarge the batch size, it also incurs high overhead (e.g., a 30% performance degradation observed in prior work [8]). Second, the prior works that are specific target other types of models or layers, with ideas that do not carry over to Transformers. For example, Gist and In-Place ABN deal with CNNs [26, 53].
In our work, we demonstrate that low-overhead memory footprint reduction can translate into a net improvement in throughput. In addition, unlike prior works that do not leverage the specifics of Transformer-based models, we propose a new approach tailored to them, called Tempo. Tempo includes three new techniques: (i) In-place GELU, (ii) In-place LayerNorm, and (iii) Sub-Layer Dropout Recomputation. In-place GELU and In-place LayerNorm both use alternative derivations of the backward passes of these layers. These derivations allow some activations that are normally retained during the forward pass (to be used later in the backward pass) to be discarded, leading to a more memory-efficient implementation. Sub-Layer Dropout Recomputation discards activations within the memory-intensive attention mechanism during the forward pass and recomputes them during the backward pass, without recomputing unnecessary tensors. By reducing the total memory footprint of the models during training, Tempo enables larger batch sizes and thus higher training throughput. To our knowledge, this is the first work to explore memory footprint optimizations specifically for Transformer-based layers that demonstrates not just a footprint reduction, but an actual increase in throughput from the resulting memory savings. Tempo reduces the memory footprint of training Transformer-based models by targeting a major part of the total footprint: the activation memory [74], i.e., the feature maps saved during the forward pass that are required for backpropagation [54]. All of the proposed techniques provide a large memory footprint reduction with very low throughput degradation (as low as 1%). Our results show up to a 2× increase in batch size for BERT_LARGE pre-training at a sequence length of 512 on modern GPUs, while increasing the training throughput by up to 16%.
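To make the recompute-instead-of-store pattern behind Sub-Layer Dropout Recomputation concrete, the sketch below wraps the scores/softmax/dropout core of attention in PyTorch's generic checkpoint utility. This is only a minimal illustration of the general idea under assumed tensor shapes; it is not Tempo's implementation, and Tempo's In-place GELU and In-place LayerNorm rely on alternative backward derivations that are not shown here.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class RecomputedAttentionCore(nn.Module):
    # Illustrative sketch: the [B, A, S, S] scores, softmax probabilities, and
    # dropout output are freed after the forward pass and recomputed from
    # Q, K, V during the backward pass. checkpoint() also restores the RNG
    # state, so the recomputed dropout mask matches the original one.
    def __init__(self, dropout_p: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_p)

    def _core(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        probs = self.dropout(torch.softmax(scores, dim=-1))   # [B, A, S, S]
        return torch.matmul(probs, v)                         # [B, A, S, head_dim]

    def forward(self, q, k, v):
        # Only q, k, v and the output are kept for backward; the large
        # [B, A, S, S] intermediates are rebuilt on demand.
        return checkpoint(self._core, q, k, v, use_reentrant=False)
```

Here q, k, and v are assumed to already be split into heads, i.e., of shape [B, A, S, head_dim]. Generic checkpointing of this kind trades recomputation time for memory; Tempo instead aims to avoid recomputing tensors that are not actually needed.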
2 Background and Motivation
2.1 Memory Footprint of BERT
BERT [12] is a popular natural language processing model that is based on the Transformer architecture [63]. The model has been successfully applied to a variety of tasks such as question answering (SQuAD [51]), paraphrasing (MRPC [13]), natural language inference (MNLI [68]), and others [57, 72] through a two-step training process. The training process entails first training on a general unlabelled data set (pre-training) [12]. The faster second step (fine-tuning) takes the parameter weights produced by pre-training and further trains them on a downstream task, such as question answering [51] or sentiment analysis [57], through the addition of a specialized output layer [12].
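As an illustration of this setup (not part of Tempo), the sketch below adds a hypothetical sentiment-classification head on top of a pre-trained encoder using the Hugging Face transformers library; the class name and number of labels are placeholders.

```python
import torch.nn as nn
from transformers import BertModel

class BertForSentiment(nn.Module):
    # Hypothetical fine-tuning model: a pre-trained encoder plus a small
    # task-specific output layer trained on the downstream task.
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(out.pooler_output)  # logits over sentiment labels
```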
The BERT architecture allows for multiple configurations depending on the model hyperparameters selected, some of which are derived from the original Transformer paper [63]; these include the hidden size (H), sequence length (S), number of attention heads (A), and number of layers (L).
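For concreteness, the two standard configurations from the BERT paper [12] are summarized below; the sequence length S is chosen at training time (e.g., 128 or 512 during pre-training) rather than fixed by the architecture.

```python
# Standard BERT configurations [12]: number of layers (L), hidden size (H),
# and number of attention heads (A).
BERT_BASE  = dict(L=12, H=768,  A=12)   # ~110M parameters
BERT_LARGE = dict(L=24, H=1024, A=16)   # ~340M parameters
```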
In the context of this work, we highlight the relevant parts of the model and their activation memory footprint with respect to these hyperparameters, referring to the annotated points in Figure 1.
① At this point, where attention [63] is computed, the size of each feature map grows as O(S²); a variety of techniques and models have been explored in the literature to deal with this problem [61]. Additionally, at this point we store three feature maps of size [B×A×S²]. Calculations based on Figure 1 with the BERT_BASE parameters show that, at a sequence length of 512, these three feature maps account for 56% of the encoder layer activation memory (see the back-of-envelope calculation after point ③).
② At these two points, we store the inputs to the two LayerNorm layers, each of size [B×S×H].
③ Here, a GELU [21] layer is used as the activation function following the preceding fully-connected layer, whose output is of size [B×S×4H]. The activation memory retained for this function accounts for almost 17% of the total layer activation memory of BERT_BASE at a sequence length of 128.
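As a back-of-envelope check of the feature-map sizes at points ①–③ (assuming BERT_BASE dimensions, a single sample B = 1, and half-precision activations at 2 bytes per element; the exact percentages quoted above depend on the full per-layer accounting in Figure 1):

```python
BYTES = 2                      # assumed fp16/bf16 activations
B, A, H = 1, 12, 768           # BERT_BASE: 12 heads, hidden size 768

S = 512
attn_maps = 3 * B * A * S * S * BYTES   # ① three [B, A, S, S] maps: 18 MiB/layer
ln_inputs = 2 * B * S * H * BYTES       # ② two [B, S, H] LayerNorm inputs: 1.5 MiB/layer

S = 128
gelu_input = B * S * 4 * H * BYTES      # ③ one [B, S, 4H] GELU input: 768 KiB/layer
```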