ByteTransformer: A High-Performance Transformer
Boosted for Variable-Length Inputs
Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang,
Zizhong Chen, Xin Liu,§ Yibo Zhu
University of California, Riverside
ByteDance Ltd.
NVIDIA Corporation
§Correspondence to liuxin.ai@bytedance.com
These authors contributed equally to this work.
Abstract—Transformers have become keystone models in nat-
ural language processing over the past decade. They have
achieved great popularity in deep learning applications, but the
increasing sizes of the parameter spaces required by transformer
models generate a commensurate need to accelerate perfor-
mance. Natural language processing problems are also routinely
faced with variable-length sequences, as word counts commonly
vary among sentences. Existing deep learning frameworks pad
variable-length sequences to a maximal length, which adds
significant memory and computational overhead. In this paper,
we present ByteTransformer, a high-performance transformer
boosted for variable-length inputs. We propose a padding-free
algorithm that liberates the entire transformer from redundant
computations on zero padded tokens. In addition to algorithmic-
level optimization, we provide architecture-aware optimizations
for transformer functional modules, especially the performance-
critical algorithm Multi-Head Attention (MHA). Experimental
results on an NVIDIA A100 GPU with variable-length sequence
inputs validate that our fused MHA outperforms PyTorch by
6.13x. The end-to-end performance of ByteTransformer for a for-
ward BERT transformer surpasses state-of-the-art transformer
frameworks, such as PyTorch JIT, TensorFlow XLA, Tencent
TurboTransformer, Microsoft DeepSpeed-Inference and NVIDIA
FasterTransformer, by 87%, 131%, 138%, 74% and 55%,
respectively. We also demonstrate the general applicability of
our optimization methods to other BERT-like models, including
ALBERT, DistilBERT, and DeBERTa.
Index Terms—Transformer, BERT, Multi-head Attention,
MHA, Natural Language Processing, NVIDIA GPU, CUTLASS
I. INTRODUCTION
The transformer model [1] is a proven effective architecture
widely used in a variety of deep learning (DL) applications,
such as language modeling [2], [3], neural machine translation
[1], [4] and recommendation systems [5], [6]. The last decade
has witnessed rapid developments in natural language process-
ing (NLP) pre-training models based on the transformer model,
such as Seq2seq [1], GPT-2 [7] and XLNET [3], which have
also greatly accelerated the progress of NLP. Of all the pre-
training models based on transformers, Bidirectional Encoder
Representations from Transformers (BERT), proposed in 2018
[2], is arguably the most seminal, inspiring a series of subsequent
works and outperforming reference models on a dozen NLP tasks
at the time of creation. We have made ByteTransformer open-source
and available at a public GitHub repository:
https://github.com/bytedance/ByteTransformer.
BERT-like models consume an increasingly large parameter
space and correspondingly more computational resources.
When BERT was first proposed, a large model required 340
million parameters [8], whereas a full GPT-3 model now
requires 170 billion parameters [9]. The base BERT model
requires 6.9 billion floating-point operations to run inference
on a 40-word sentence, and this number increases to 20 billion
when translating a 20-word sentence using a base Seq2Seq model
[10]. The size of the parameter space and the computational
demands increase the cost of the training and inference for
BERT-like models, which requires the attention of the DL
community in order to accelerate these models.
To exploit hardware efficiency, DL frameworks adopt a
batching strategy, where multiple input sequences are executed
concurrently as one batch. Since batched execution requires the
shapes of the batched tasks to be identical, DL frameworks presume
fixed-length inputs when designing the software [11]–[14].
However, this assumption cannot always hold, because trans-
former models are often faced with variable-length input
problems [8], [10]. In order to deploy models with variable-
length inputs directly to conventional frameworks that support
only fixed-length models, a straightforward solution is to pad
all sequences with zeros to the maximal sequence length.
However, this immediately brings in redundant computations
on wasted padded tokens. These padded zeros also introduce
significant memory overhead that can hinder a large trans-
former model from being efficiently deployed.
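To make this padding overhead concrete, the short Python sketch below (our illustration, not part of ByteTransformer; the function name pad_batch is hypothetical) zero-pads a batch of variable-length sequences to the batch maximum, as conventional fixed-length frameworks require, and reports the fraction of tokens that are pure padding:

```python
# Illustrative sketch: zero-padding a batch of variable-length token
# sequences to the batch maximum and measuring the wasted fraction.
import numpy as np

def pad_batch(seqs, hidden_dim):
    """Pad each [len_i, hidden_dim] sequence with zeros up to the batch max."""
    max_len = max(s.shape[0] for s in seqs)
    batch = np.zeros((len(seqs), max_len, hidden_dim), dtype=np.float32)
    for i, s in enumerate(seqs):
        batch[i, :s.shape[0], :] = s
    return batch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden_dim = 768
    lengths = rng.integers(low=5, high=256, size=16)   # variable sequence lengths
    seqs = [rng.standard_normal((l, hidden_dim)).astype(np.float32) for l in lengths]
    padded = pad_batch(seqs, hidden_dim)
    valid = int(lengths.sum())
    total = padded.shape[0] * padded.shape[1]
    print(f"padded tokens: {total - valid} / {total} "
          f"({100.0 * (total - valid) / total:.1f}% wasted)")
```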
Existing popular DL frameworks, such as Google Tensor-
Flow with XLA [15], [16], Meta PyTorch with JIT [17], and
OctoML TVM [18], leverage the domain-specific just-in-time
compilation technique to boost performance. Another widely-
adopted strategy to generate low-level performance optimiza-
tion is delicate manual tuning: NVIDIA TensorRT [19], a DL
runtime, falls into this category. Yet all of these frameworks
require the input sequence lengths to be identical to exploit
the speedup of batch processing. To lift the restriction on
fixed sequence lengths, Tencent [10] and Baidu [8] provide
explicit support for models with variable sequence lengths.
They group sequences with similar lengths before launching
batched kernels to minimize the padding overhead. However,
this proactive grouping approach still introduces irremovable
padding overhead when grouping and padding sequences with
similar yet different lengths.
In contrast to training processes that can be computed
offline, the inference stage of a serving system must be
processed online with low latency, which imposes high perfor-
mance requirements on DL frameworks. A highly efficient DL
inference framework for NLP models requires delicate kernel-
level optimizations and explicit end-to-end designs to avoid
wasted computations on zero tokens when handling variable-
length inputs. However, existing DL frameworks do not meet
these expectations. In order to remedy this deficit, we present
ByteTransformer, a highly efficient transformer framework
optimized for variable-length inputs in NLP problems. We not
only design an algorithm that frees the entire transformer of
padding when dealing with variable-length sequences, but also
provide a set of hand-tuned fused GPU kernels to minimize
the cost of accessing GPU global memory. More specifically,
our contributions include:
We design and develop ByteTransformer, a high-
performance GPU-accelerated transformer optimized for
variable-length inputs. ByteTransformer has been de-
ployed to serve world-class applications including TikTok
and Douyin of ByteDance.
We propose a padding-free algorithm that packs the input
tensor with variable-length sequences and calculates the
positioning offset vector for all transformer operations to
index, which keeps the whole transformer pipeline free
from padding and calculations on zero tokens (see the
sketch after this list).
We propose a fused Multi-Head Attention (MHA) that
alleviates the memory overhead of the intermediate attention
matrix in MHA, which is quadratic in the sequence length,
without introducing redundant calculations due to padding
for variable-length inputs. Part of our fused MHA has
been deployed in the production code base of NVIDIA
CUTLASS.
We hand-tune the memory footprints of layer normalization,
bias addition, and activation to squeeze out the final
performance of the system.
We benchmark the performance of ByteTransformer on
an NVIDIA A100 GPU for forward pass of BERT-like
transformers, including BERT, ALBERT, DistilBERT,
and DeBERTa. Experimental results demonstrate our
fused MHA outperforms standard PyTorch attention by
6.13x. Regarding the end-to-end performance of the standard
BERT transformer, ByteTransformer surpasses PyTorch,
TensorFlow, Tencent TurboTransformer, Microsoft Deep-
Speed and NVIDIA FasterTransformer by 87%, 131%,
138%, 74%, and 55%, respectively.
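To make the padding-free idea concrete, here is a minimal sketch of the packing step under the description above: valid tokens are stored contiguously and a prefix-sum offset vector records where each sequence starts, so downstream operations can index individual sequences without ever materializing zero tokens. The function names and data layout are our own illustration, not ByteTransformer's CUDA implementation:

```python
# Minimal sketch of padding-free packing: concatenate valid tokens and
# build a prefix-sum offset vector for per-sequence indexing.
import numpy as np

def pack_variable_length(seqs):
    """seqs: list of [len_i, hidden] arrays.
    Returns (packed [sum(len_i), hidden], offsets [batch + 1])."""
    lengths = np.array([s.shape[0] for s in seqs], dtype=np.int64)
    offsets = np.zeros(len(seqs) + 1, dtype=np.int64)
    np.cumsum(lengths, out=offsets[1:])       # offsets[i] = start of sequence i
    packed = np.concatenate(seqs, axis=0)     # no zero tokens are stored
    return packed, offsets

def unpack(packed, offsets):
    """Recover the per-sequence view from the packed tensor."""
    return [packed[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seqs = [rng.standard_normal((l, 8)).astype(np.float32) for l in (3, 7, 2)]
    packed, offsets = pack_variable_length(seqs)
    print(packed.shape, offsets)              # (12, 8) [ 0  3 10 12]
    assert all(np.allclose(a, b) for a, b in zip(seqs, unpack(packed, offsets)))
```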
The rest of the paper is organized as follows: we introduce
background and related works in Section II, and then detail our
systematic optimization approach in Section III. Evaluation
results are given in Section IV. We conclude our paper and
present future work in Section V.
II. BACKGROUND AND RELATED WORKS
We provide an overview of the transformer model, including
its encoder-decoder architecture and multi-head attention layer.
We also survey related works on DL framework acceleration.
A. The transformer architecture
Fig. 1: The transformer architecture. [1]
Figure 1 shows the encoder-decoder model architecture of
the transformer. It consists of stacks of multiple encoder and
decoder layers. In an encoder layer, there is a multi-head
attention layer followed by a feed-forward network (FFN)
layer. A layer normalization (layernorm) operation is applied
after both MHA and FFN. In a decoder layer, there are two
sets of consecutive MHA layers and one FFN layer, and each
operation is normalized with a layernorm. The FFN is used
to improve the capacity of the model. In practice, the FFN is
implemented with GEMMs that multiply the tensor by weight
matrices of a larger inner dimension. Here we skip the embedding
descriptions in the figure and refer the interested reader to [1]
for details.
Although we show both encoder and decoder modules for
this transformer, a BERT transformer model only contains the
encoder section [2]. In this paper, we present optimizations
for BERT-like transformer models, which can be extended to
other transformers containing decoder sections.
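As a compact restatement of the encoder layer just described, the PyTorch sketch below stacks MHA and an FFN, each followed by a residual addition (part of the architecture in [1], though not spelled out above) and layer normalization. The hidden sizes, activation, and module names are illustrative choices on our part, not ByteTransformer's fused kernels:

```python
# A minimal post-layernorm encoder layer mirroring the description above:
# MHA, then FFN, each followed by add & layernorm. Sizes are illustrative.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, hidden=768, heads=12, ffn_mult=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(hidden)
        # FFN: two GEMMs that expand the hidden dimension and project it back.
        self.ffn = nn.Sequential(
            nn.Linear(hidden, ffn_mult * hidden),
            nn.GELU(),
            nn.Linear(ffn_mult * hidden, hidden),
        )
        self.ln2 = nn.LayerNorm(hidden)

    def forward(self, x, key_padding_mask=None):
        attn_out, _ = self.mha(x, x, x, key_padding_mask=key_padding_mask)
        x = self.ln1(x + attn_out)        # layernorm after MHA
        x = self.ln2(x + self.ffn(x))     # layernorm after FFN
        return x

x = torch.randn(2, 40, 768)               # [batch, seq_len, hidden]
print(EncoderLayer()(x).shape)             # torch.Size([2, 40, 768])
```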
Self-attention is a key module of the transformer architec-
ture. Conceptually, self-attention computes the significance of
each position of the input sequence, with the information from
other positions considered. A self-attention receives three input
tensors: query (Q), key (K), and value (V). Self-attention can
be split into multiple heads. The Q and K tensors are first
multiplied (1st GEMM) to compute the dot product of the
query against all keys. This dot product is then scaled by
the square root of the hidden dimension, $\sqrt{d_k}$, and passed
through a softmax function to calculate the weights corresponding
to the value tensor. These weights are multiplied against the
tensor V (2nd GEMM), and the output of each head is concatenated
before going through another linear projection layer. Expressing
self-attention as a mathematical formula, we have:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) \times V \quad (1)$$

The formula of multi-head attention is $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)$, where $\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$.
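For reference, the unfused computation of Eq. (1) together with the multi-head split and concatenation can be written as the PyTorch sketch below. ByteTransformer implements the same math with hand-tuned fused CUDA kernels; this version simply materializes the quadratic seq_len x seq_len score matrix and is only meant to illustrate the algorithm (shapes are arbitrary):

```python
# Reference (unfused) multi-head attention following Eq. (1).
import math
import torch

def attention(q, k, v):
    """q, k, v: [batch, heads, seq_len, d_k] -> [batch, heads, seq_len, d_k]."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # 1st GEMM + scaling
    weights = torch.softmax(scores, dim=-1)             # [.., seq_len, seq_len]
    return weights @ v                                   # 2nd GEMM

def multi_head_attention(q, k, v, num_heads):
    """q, k, v: [batch, seq_len, hidden]; hidden must be divisible by num_heads."""
    b, s, hidden = q.shape
    d_k = hidden // num_heads
    split = lambda t: t.view(b, s, num_heads, d_k).transpose(1, 2)  # per-head view
    out = attention(split(q), split(k), split(v))
    return out.transpose(1, 2).reshape(b, s, hidden)     # concatenate heads

q = k = v = torch.randn(2, 40, 768)
print(multi_head_attention(q, k, v, num_heads=12).shape)  # torch.Size([2, 40, 768])
```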
B. Related works on DL acceleration
Performance is a crucial aspect in the real-world deployment
of software systems, attracting significant attention across
various applications [20]–[22], including DL frameworks. The
conventional DL frameworks, such as PyTorch, TensorFlow,
TVM, and TensorRT, are designed explicitly for fixed-length
input tensors. When dealing with NLP problems with variable-
length input, all sequences are padded to the maximal length,
which leads to significant wasted calculations on zero tokens.
A few DL frameworks, such as Tencent TurboTransformer
[10] and NVIDIA FasterTransformer [23], employ explicit
designs for variable-length inputs. TurboTransformer designs
run-time algorithms to group and pad sequences with similar
lengths to minimize the padding overhead. TurboTransformer
also uses a run-time memory scheduling strategy to improve
end-to-end performance. Kernel-level optimizations are of the
same significance as algorithmic optimizations. NVIDIA's
FasterTransformer uses vendor-specific libraries such as Ten-
sorRT and cuBLAS [24] as its back-end, which provide
optimized implementations of various operations at the kernel
level.
Other end-to-end DL frameworks have also presented op-
timizations for BERT-like transformers, such as E.T. [25]
and DeepSpeed-Inference [26]. E.T. introduces a novel MHA
architecture for NVIDIA Volta GPUs and includes pruning
designs for end-to-end transformer models. In contrast, Byte-
Transformer targets unpruned models and is optimized for
NVIDIA Ampere GPUs. DeepSpeed-Inference is optimized
for large distributed models on multiple GPUs, while Byte-
Transformer currently focuses on lighter single-GPU models.
In addition to end-to-end performance acceleration, the
research community has also made focused efforts to improve
a key algorithm of the transformer, multi-head attention.
PyTorch provides a standard implementation of MHA [27].
NVIDIA TensorRT utilizes a fused MHA for short sequences
with lengths up to 512, as described in [28]. To handle
longer sequences, FlashAttention was proposed by Stanford
researchers in [29]. FlashAttention assigns the workload of a
whole attention unit to a single threadblock (CTA). However,
this approach can result in underutilization on wide GPUs
when there are not enough attention units assigned. Our fused
MHA, on the other hand, provides high performance for both
short and long variable-length sequences without leading to
performance degradation in small-batch scenarios.
TABLE I. Summarizing state-of-the-art transformers.

                    variable-len   kernel   fused           kernel
                    support        tuning   MHA             fusion
TensorFlow XLA      no             yes      no              no
PyTorch JIT         no             yes      no              no
FasterTransformer   yes            yes      seq len <= 512  no
TurboTransformer    yes            yes      no              partially
ByteTransformer     yes            yes      yes             yes
Table I surveys state-of-the-art transformers. TensorFlow
and PyTorch provide tuned kernels but require padding
for variable-length inputs. NVIDIA FasterTransformer and
Tencent TurboTransformer, although providing support for
variable-length inputs, do not perform comprehensive kernel
fusion or explicit optimization of the hot-spot MHA algorithm
for arbitrary sequence lengths. In addition, TurboTransformer
only fuses part of the fusible operations in the transformer
model, such as layernorm and activation, which is marked as
'partially' under kernel fusion in the table. Our ByteTransformer,
in contrast, starts with systematic profiling to locate bottleneck
algorithms and then precisely tunes a series of kernels, including
the key MHA algorithm. We also propose a padding-free algorithm
that completely removes redundant calculations on variable-length
inputs from the entire transformer.
III. DESIGNS AND OPTIMIZATIONS
In this section, we present our algorithmic and kernel-level
optimizations to improve the end-to-end performance of the
BERT transformer under variable-length inputs.
A. Math expression of BERT transformer encoder
Figure 2(a) illustrates the architecture of the transformer
encoder. The input tensor is first processed through the BERT
pipeline, where it is multiplied by a built-in attribute matrix
to perform Q, K, and V positioning encoding. This operation
can be implemented using three separate GEMM operations
or in batch mode. Realizing that the attribute matrices
corresponding to Q, K, and V all have the same shape (hidden
dim x hidden dim), we pack them into contiguous memory space
and launch a single batched GEMM kernel that calculates Q, K,
and V to reduce the kernel launch overhead at runtime. Bias
matrices for Q, K, and V are then added to the encoded tensor,
which is passed through the self-attention module. In addition
to the multi-head attention module, the BERT transformer
encoder includes projection, feed forward network, and layer
normalization. The encoder pipeline can be represented as
a series of mathematical operations, including six GEMMs
(shown in light purple) and other memory-bound operations
(shown in light blue).
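The packed Q/K/V projection described above can be sketched as follows. This is a PyTorch illustration of the idea, not ByteTransformer's CUDA kernel, and the tensor names are our own: the three hidden-dim x hidden-dim attribute matrices are stored contiguously and one batched matrix multiply produces Q, K, and V in a single launch.

```python
# Sketch of the packed Q/K/V encoding: three (hidden x hidden) attribute
# matrices stored contiguously, computed with one batched GEMM instead of three.
import torch

batch, seq_len, hidden = 8, 64, 768
x = torch.randn(batch, seq_len, hidden)

w_qkv = torch.randn(3, hidden, hidden)     # packed [W_Q, W_K, W_V]
b_qkv = torch.randn(3, hidden)             # packed biases

# One batched GEMM: (batch*seq_len, hidden) broadcast against (3, hidden, hidden)
# yields (3, batch*seq_len, hidden) in a single kernel launch.
x_flat = x.reshape(batch * seq_len, hidden)
qkv = torch.matmul(x_flat, w_qkv) + b_qkv.unsqueeze(1)
q, k, v = (t.reshape(batch, seq_len, hidden) for t in qkv)

# Equivalent to three separate GEMMs:
assert torch.allclose(q, x @ w_qkv[0] + b_qkv[0], atol=1e-3)
```

In ByteTransformer this packing reduces three kernel launches to a single batched GEMM launch at runtime, as described above.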