
be split into multiple heads. The Q and K tensors are first multiplied (1st GEMM) to compute the dot product of each query against all keys. This dot product is then scaled by the square root of the hidden dimension d_k and passed through a softmax function to calculate the weights corresponding to the value tensor. These weights are then multiplied against the tensor V (2nd GEMM), and the outputs of all heads are concatenated before going through another linear layer. Expressing self-attention as a mathematical formula,
we have:
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) \times V \quad (1) \]
Whereas the formula of multi-head attention is:
\[ \mathrm{Multihead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h), \text{ where } \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i). \]
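For concreteness, the following PyTorch sketch mirrors Eq. (1) and the multi-head formula above. It is a minimal illustration only: the [batch, heads, seq_len, head_dim] layout and the helper names are our own assumptions, and the final output linear layer after concatenation is omitted.

```python
import math
import torch

def attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # 1st GEMM: QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)                         # softmax over the key dimension
    return torch.matmul(weights, v)                                 # 2nd GEMM: weights x V

def multi_head_attention(q, k, v, num_heads):
    # q, k, v: [batch, seq_len, hidden_dim], hidden_dim = num_heads * head_dim
    b, s, h = q.shape
    split = lambda t: t.view(b, s, num_heads, h // num_heads).transpose(1, 2)
    heads = attention(split(q), split(k), split(v))       # head_i = Attention(Q_i, K_i, V_i)
    return heads.transpose(1, 2).reshape(b, s, h)         # Concat(head_1, ..., head_h)

q = k = v = torch.randn(2, 8, 64)                         # batch=2, seq_len=8, hidden_dim=64
print(multi_head_attention(q, k, v, num_heads=4).shape)   # torch.Size([2, 8, 64])
```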
B. Related works on DL acceleration
Performance is a crucial aspect in the real-world deployment
of software systems, attracting significant attention across
various applications [20]–[22], including DL frameworks. The
conventional DL frameworks, such as PyTorch, TensorFlow, TVM, and TensorRT, are designed explicitly for fixed-length
input tensors. When dealing with NLP problems with variable-
length input, all sequences are padded to the maximal length,
which leads to significant wasted calculations on zero tokens.
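As a rough illustration of this overhead (a hypothetical batch of our own, not a measurement from any cited framework), padding every sequence to the batch maximum wastes more than half of the token-level work here:

```python
import torch

# Hypothetical batch of variable-length sequences (token counts per sequence).
seq_lens = torch.tensor([12, 37, 9, 64, 21, 5, 48, 30])
max_len = int(seq_lens.max())

# Fixed-length frameworks pad every sequence to max_len with zero tokens.
padded_tokens = len(seq_lens) * max_len
valid_tokens = int(seq_lens.sum())
wasted = 1.0 - valid_tokens / padded_tokens
print(f"valid tokens: {valid_tokens}/{padded_tokens}, wasted compute ~ {wasted:.0%}")
# With an average length of about 28 out of 64, roughly 56% of the token-level work
# (and an even larger share of the quadratic attention work) is spent on zero padding.
```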
A few DL frameworks, such as Tencent TurboTransformer
[10] and NVIDIA FasterTransformer [23], employ explicit
designs for variable-length inputs. TurboTransformer designs
run-time algorithms to group and pad sequences with similar
lengths to minimize the padding overhead. TurboTransformer
also uses a run-time memory scheduling strategy to improve
end-to-end performance. Kernel-level optimizations are of the
same significance as algorithmic optimizations. NVIDIA’s
FasterTransformer uses vendor-specific libraries such as Ten-
sorRT and cuBLAS [24] as its back-end, which provide
optimized implementations of various operations at the kernel
level.
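Returning to the run-time grouping idea above, a simplified sketch of grouping sequences with similar lengths might look as follows (our own illustration; the fixed group size and plain length sort are assumptions, not TurboTransformer's actual scheduling algorithm):

```python
def group_by_length(seq_lens, group_size):
    """Group sequences of similar length so each group is padded only to its own maximum."""
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    # Padding to the per-group maximum instead of the global maximum
    # shrinks the number of zero tokens each group carries.
    return [(group, max(seq_lens[i] for i in group)) for group in groups]

seq_lens = [12, 37, 9, 64, 21, 5, 48, 30]
for indices, pad_len in group_by_length(seq_lens, group_size=4):
    print(f"sequences {indices} padded to length {pad_len}")
# sequences [5, 2, 0, 4] padded to length 21
# sequences [7, 1, 6, 3] padded to length 64
```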
Other end-to-end DL frameworks have also presented op-
timizations for BERT-like transformers, such as E.T. [25]
and DeepSpeed-Inference [26]. E.T. introduces a novel MHA
architecture for NVIDIA Volta GPUs and includes pruning
designs for end-to-end transformer models. In contrast, Byte-
Transformer targets unpruned models and is optimized for
NVIDIA Ampere GPUs. DeepSpeed-Inference is optimized
for large distributed models on multiple GPUs, while Byte-
Transformer currently focuses on lighter single-GPU models.
In addition to end-to-end performance acceleration, the
research community has also made focused efforts to improve
a key algorithm of the transformer, multi-head attention.
PyTorch provides a standard implementation of MHA [27].
NVIDIA TensorRT utilizes a fused MHA for short sequences
with lengths up to 512, as described in [28]. To handle
longer sequences, FlashAttention was proposed by Stanford
researchers in [29]. FlashAttention assigns the workload of a
whole attention unit to a single threadblock (CTA). However,
this approach can result in underutilization on wide GPUs
when there are not enough attention units assigned. Our fused
MHA, on the other hand, provides high performance for both short and long variable-length sequences, without performance degradation in small-batch scenarios.
TABLE I. Summarizing state-of-the-art transformers.
                     variable-len    kernel    fused    kernel
                     support         tuning    MHA      fusion
TensorFlow XLA       no              yes       no       no
PyTorch JIT          no              yes       no       no
FasterTransformer    yes             yes       ≤512     no
TurboTransformer     yes             yes       no       partially
ByteTransformer      yes             yes       yes      yes
Table I surveys state-of-the-art transformers. TensorFlow
and PyTorch provide tuned kernels but require padding
for variable-length inputs. NVIDIA FasterTransformer and
Tencent TurboTransformer, although providing support for
variable-length inputs, do not perform comprehensive kernel
fusion or explicit optimization for the hot-spot algorithm
MHA for arbitrary sequence lengths. In addition, TurboTransformer only fuses part of the fusible operations in the transformer model, such as layernorm and activation, denoted as partial kernel fusion in the table. Our ByteTransformer, in contrast, starts with systematic profiling to locate bottleneck algorithms and precisely tunes a series of kernels, including the key algorithm MHA. We also propose a padding-free algorithm that completely removes redundant calculations for variable-length inputs from the entire transformer.
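The general idea behind such padding-free processing can be sketched as follows (a minimal PyTorch illustration under our own assumptions; ByteTransformer implements this with fused CUDA kernels, not these tensor operations): a prefix sum over the padding mask gives each valid token its position in a packed buffer, intermediate layers then operate only on the packed tokens, and results are scattered back whenever a padded layout is needed.

```python
import torch

def pack_valid_tokens(x, mask):
    """x: [batch, max_len, hidden]; mask: [batch, max_len], 1 for valid tokens, 0 for padding."""
    flat_mask = mask.reshape(-1).bool()
    offsets = torch.cumsum(mask.reshape(-1), dim=0) - 1    # prefix sum -> packed position of each valid token
    packed = x.reshape(-1, x.size(-1))[flat_mask]          # [num_valid_tokens, hidden], no zero tokens
    return packed, flat_mask, offsets

def unpack_tokens(packed, flat_mask, batch, max_len):
    out = packed.new_zeros(batch * max_len, packed.size(-1))
    out[flat_mask] = packed                                # scatter packed tokens back to the padded layout
    return out.view(batch, max_len, -1)

x = torch.randn(2, 4, 8)                                   # batch=2, max_len=4, hidden=8
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])          # actual lengths: 3 and 2
packed, flat_mask, _ = pack_valid_tokens(x, mask)
print(packed.shape)                                        # torch.Size([5, 8])
print(unpack_tokens(packed, flat_mask, 2, 4).shape)        # torch.Size([2, 4, 8])
```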
III. DESIGNS AND OPTIMIZATIONS
In this section, we present our algorithmic and kernel-level
optimizations to improve the end-to-end performance of the BERT transformer under variable-length inputs.
A. Math expression of BERT transformer encoder
Figure 2(a) illustrates the architecture of the transformer
encoder. The input tensor is first processed through the BERT
pipeline, where it is multiplied by a built-in attribute matrix
to perform the positional encoding of Q, K, and V. This operation
can be implemented using three separate GEMM operations
or in batch mode. Realizing that the attribute matrices corresponding to Q, K, and V all have the same shape (hidden dim × hidden dim), we pack them into contiguous memory space and launch a single batched GEMM kernel that calculates Q, K, and V together to reduce kernel launch overhead at runtime. Bias
matrices for Q, K, and V are then added to the encoded tensor,
which is passed through the self-attention module. In addition
to the multi-head attention module, the BERT transformer
encoder includes projection, a feed-forward network, and layer
normalization. The encoder pipeline can be represented as
a series of mathematical operations, including six GEMMs
(shown in light purple) and other memory-bound operations
(shown in light blue).
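As a concrete reference for this pipeline, the following simplified PyTorch sketch walks through one encoder layer in the same order, marking the six GEMMs and the memory-bound operations in comments. It is our own illustration: the random weights, 4x feed-forward expansion, GELU activation, and post-layernorm placement follow standard BERT conventions rather than ByteTransformer's fused CUDA kernels.

```python
import math
import torch
import torch.nn.functional as F

batch, seq_len, hidden, heads = 8, 64, 768, 12
head_dim, ffn_dim = hidden // heads, 4 * hidden
x = torch.randn(batch, seq_len, hidden)

# Parameters of one encoder layer (random values for illustration).
w_qkv = torch.randn(3, hidden, hidden) * 0.02   # packed Q/K/V attribute matrices
b_qkv = torch.zeros(3, hidden)
w_out, b_out = torch.randn(hidden, hidden) * 0.02, torch.zeros(hidden)
w_ffn1, b_ffn1 = torch.randn(hidden, ffn_dim) * 0.02, torch.zeros(ffn_dim)
w_ffn2, b_ffn2 = torch.randn(ffn_dim, hidden) * 0.02, torch.zeros(hidden)

# GEMM 1 (batched): Q, K, and V computed with one packed multiplication, then memory-bound bias-add.
qkv = torch.einsum('bsh,thd->tbsd', x, w_qkv) + b_qkv[:, None, None, :]
q, k, v = (t.view(batch, seq_len, heads, head_dim).transpose(1, 2) for t in qkv)

# GEMMs 2-3: multi-head attention (Eq. 1); the softmax in between is memory-bound.
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
attn = (torch.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(batch, seq_len, hidden)

# GEMM 4: projection, followed by memory-bound bias-add, residual add, and layer normalization.
x = F.layer_norm(x + attn @ w_out + b_out, (hidden,))

# GEMMs 5-6: feed-forward network with the memory-bound GELU activation in between.
ffn = F.gelu(x @ w_ffn1 + b_ffn1) @ w_ffn2 + b_ffn2
out = F.layer_norm(x + ffn, (hidden,))
print(out.shape)   # torch.Size([8, 64, 768])
```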