
be split into multiple heads. The Q and K tensors are first multiplied (1st GEMM) to compute the dot product of each query against all keys. This dot product is then scaled by the square root of the hidden dimension d_k and passed through a softmax function to calculate the weights corresponding to the value tensor. These weights are then multiplied against the tensor V (2nd GEMM), and the outputs of all heads are concatenated before going through another linear layer. Expressing self-attention as a mathematical formula,
we have:
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) \times V \quad (1) \]
Whereas the formula of multi-head attention is:
\[ \mathrm{Multihead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h), \text{ where } \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i). \]
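For concreteness, the following PyTorch sketch mirrors Eq. (1) and the multi-head formula above. It is a minimal illustration only: the [batch, heads, seq_len, head_dim] layout and the helper names are our own assumptions, and the final output linear layer after concatenation is omitted.

```python
import math
import torch

def attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # 1st GEMM: QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)                         # softmax over the key dimension
    return torch.matmul(weights, v)                                 # 2nd GEMM: weights x V

def multi_head_attention(q, k, v, num_heads):
    # q, k, v: [batch, seq_len, hidden_dim], hidden_dim = num_heads * head_dim
    b, s, h = q.shape
    split = lambda t: t.view(b, s, num_heads, h // num_heads).transpose(1, 2)
    heads = attention(split(q), split(k), split(v))       # head_i = Attention(Q_i, K_i, V_i)
    return heads.transpose(1, 2).reshape(b, s, h)         # Concat(head_1, ..., head_h)

q = k = v = torch.randn(2, 8, 64)                         # batch=2, seq_len=8, hidden_dim=64
print(multi_head_attention(q, k, v, num_heads=4).shape)   # torch.Size([2, 8, 64])
```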
B. Related works on DL acceleration
Performance is a crucial aspect in the real-world deployment
of software systems, attracting significant attention across
various applications [20]–[22], including DL frameworks. The
conventional DL frameworks, such as PyTorch, TensorFlow, TVM, and TensorRT, are designed explicitly for fixed-length
input tensors. When dealing with NLP problems with variable-
length input, all sequences are padded to the maximal length,
which leads to significant wasted calculations on zero tokens.
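As a rough illustration of this overhead (a hypothetical batch of our own, not a measurement from any cited framework), padding every sequence to the batch maximum wastes more than half of the token-level work here:

```python
import torch

# Hypothetical batch of variable-length sequences (token counts per sequence).
seq_lens = torch.tensor([12, 37, 9, 64, 21, 5, 48, 30])
max_len = int(seq_lens.max())

# Fixed-length frameworks pad every sequence to max_len with zero tokens.
padded_tokens = len(seq_lens) * max_len
valid_tokens = int(seq_lens.sum())
wasted = 1.0 - valid_tokens / padded_tokens
print(f"valid tokens: {valid_tokens}/{padded_tokens}, wasted compute ~ {wasted:.0%}")
# With an average length of about 28 out of 64, roughly 56% of the token-level work
# (and an even larger share of the quadratic attention work) is spent on zero padding.
```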
A few DL frameworks, such as Tencent TurboTransformer
[10] and NVIDIA FasterTransformer [23], employ explicit
designs for variable-length inputs. TurboTransformer designs
run-time algorithms to group and pad sequences with similar
lengths to minimize the padding overhead. TurboTransformer
also uses a run-time memory scheduling strategy to improve
end-to-end performance. Kernel-level optimizations are of the
same significance as algorithmic optimizations. NVIDIA’s
FasterTransformer uses vendor-specific libraries such as Ten-
sorRT and cuBLAS [24] as its back-end, which provide
optimized implementations of various operations at the kernel
level.
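Returning to the run-time grouping idea above, a simplified sketch of grouping sequences with similar lengths might look as follows (our own illustration; the fixed group size and plain length sort are assumptions, not TurboTransformer's actual scheduling algorithm):

```python
def group_by_length(seq_lens, group_size):
    """Group sequences of similar length so each group is padded only to its own maximum."""
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    # Padding to the per-group maximum instead of the global maximum
    # shrinks the number of zero tokens each group carries.
    return [(group, max(seq_lens[i] for i in group)) for group in groups]

seq_lens = [12, 37, 9, 64, 21, 5, 48, 30]
for indices, pad_len in group_by_length(seq_lens, group_size=4):
    print(f"sequences {indices} padded to length {pad_len}")
# sequences [5, 2, 0, 4] padded to length 21
# sequences [7, 1, 6, 3] padded to length 64
```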
Other end-to-end DL frameworks have also presented op-
timizations for BERT-like transformers, such as E.T. [25]
and DeepSpeed-Inference [26]. E.T. introduces a novel MHA
architecture for NVIDIA Volta GPUs and includes pruning
designs for end-to-end transformer models. In contrast, Byte-
Transformer targets unpruned models and is optimized for
NVIDIA Ampere GPUs. DeepSpeed-Inference is optimized
for large distributed models on multiple GPUs, while Byte-
Transformer currently focuses on lighter single-GPU models.
In addition to end-to-end performance acceleration, the
research community has also made focused efforts to improve
a key algorithm of the transformer, multi-head attention.
PyTorch provides a standard implementation of MHA [27].
NVIDIA TensorRT utilizes a fused MHA for short sequences
with lengths up to 512, as described in [28]. To handle
longer sequences, FlashAttention was proposed by Stanford
researchers in [29]. FlashAttention assigns the workload of a
whole attention unit to a single threadblock (CTA). However,
this approach can result in underutilization on wide GPUs
when there are not enough attention units assigned. Our fused
MHA, on the other hand, provides high performance for both short and long variable-length sequences, without performance degradation in small-batch scenarios.
TABLE I. Summarizing state-of-the-art transformers.
                     variable-len    kernel    fused    kernel
                     support         tuning    MHA      fusion
TensorFlow XLA       no              yes       no       no
PyTorch JIT          no              yes       no       no
FasterTransformer    yes             yes       ≤512     no
TurboTransformer     yes             yes       no       partially
ByteTransformer      yes             yes       yes      yes
Table I surveys state-of-the-art transformers. TensorFlow
and PyTorch provide tuned kernels but require padding
for variable-length inputs. NVIDIA FasterTransformer and
Tencent TurboTransformer, although providing support for
variable-length inputs, do not perform comprehensive kernel
fusion or explicit optimization for the hot-spot algorithm
MHA for arbitrary sequence lengths. In addition, TurboTransformer only fuses part of the fusible operations in the transformer model, such as layernorm and activation, denoted as partial kernel fusion in the table. Our ByteTransformer, in contrast, starts with systematic profiling to locate bottleneck algorithms and precisely tunes a series of kernels, including the key algorithm MHA. We also propose a padding-free algorithm that completely removes redundant calculations for variable-length inputs from the entire transformer.
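The general idea behind such padding-free processing can be sketched as follows (a minimal PyTorch illustration under our own assumptions; ByteTransformer implements this with fused CUDA kernels, not these tensor operations): a prefix sum over the padding mask gives each valid token its position in a packed buffer, intermediate layers then operate only on the packed tokens, and results are scattered back whenever a padded layout is needed.

```python
import torch

def pack_valid_tokens(x, mask):
    """x: [batch, max_len, hidden]; mask: [batch, max_len], 1 for valid tokens, 0 for padding."""
    flat_mask = mask.reshape(-1).bool()
    offsets = torch.cumsum(mask.reshape(-1), dim=0) - 1    # prefix sum -> packed position of each valid token
    packed = x.reshape(-1, x.size(-1))[flat_mask]          # [num_valid_tokens, hidden], no zero tokens
    return packed, flat_mask, offsets

def unpack_tokens(packed, flat_mask, batch, max_len):
    out = packed.new_zeros(batch * max_len, packed.size(-1))
    out[flat_mask] = packed                                # scatter packed tokens back to the padded layout
    return out.view(batch, max_len, -1)

x = torch.randn(2, 4, 8)                                   # batch=2, max_len=4, hidden=8
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])          # actual lengths: 3 and 2
packed, flat_mask, _ = pack_valid_tokens(x, mask)
print(packed.shape)                                        # torch.Size([5, 8])
print(unpack_tokens(packed, flat_mask, 2, 4).shape)        # torch.Size([2, 4, 8])
```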
III. DESIGNS AND OPTIMIZATIONS
In this section, we present our algorithmic and kernel-level
optimizations to improve the end-to-end performance of the BERT transformer under variable-length inputs.
A. Math expression of BERT transformer encoder
Figure 2(a) illustrates the architecture of the transformer
encoder. The input tensor is first processed through the BERT
pipeline, where it is multiplied by a built-in attribute matrix
to perform the positional encoding of Q, K, and V. This operation
can be implemented using three separate GEMM operations
or in batch mode. Realizing that the attribute matrices corresponding to Q, K, and V all have the same shape (hidden dim × hidden dim), we pack them into contiguous memory space and launch a single batched GEMM kernel that calculates Q, K, and V together to reduce kernel launch overhead at runtime. Bias
matrices for Q, K, and V are then added to the encoded tensor,
which is passed through the self-attention module. In addition
to the multi-head attention module, the BERT transformer
encoder includes projection, a feed-forward network, and layer
normalization. The encoder pipeline can be represented as
a series of mathematical operations, including six GEMMs
(shown in light purple) and other memory-bound operations
(shown in light blue).
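As a concrete reference for this pipeline, the following simplified PyTorch sketch walks through one encoder layer in the same order, marking the six GEMMs and the memory-bound operations in comments. It is our own illustration: the random weights, 4x feed-forward expansion, GELU activation, and post-layernorm placement follow standard BERT conventions rather than ByteTransformer's fused CUDA kernels.

```python
import math
import torch
import torch.nn.functional as F

batch, seq_len, hidden, heads = 8, 64, 768, 12
head_dim, ffn_dim = hidden // heads, 4 * hidden
x = torch.randn(batch, seq_len, hidden)

# Parameters of one encoder layer (random values for illustration).
w_qkv = torch.randn(3, hidden, hidden) * 0.02   # packed Q/K/V attribute matrices
b_qkv = torch.zeros(3, hidden)
w_out, b_out = torch.randn(hidden, hidden) * 0.02, torch.zeros(hidden)
w_ffn1, b_ffn1 = torch.randn(hidden, ffn_dim) * 0.02, torch.zeros(ffn_dim)
w_ffn2, b_ffn2 = torch.randn(ffn_dim, hidden) * 0.02, torch.zeros(hidden)

# GEMM 1 (batched): Q, K, and V computed with one packed multiplication, then memory-bound bias-add.
qkv = torch.einsum('bsh,thd->tbsd', x, w_qkv) + b_qkv[:, None, None, :]
q, k, v = (t.view(batch, seq_len, heads, head_dim).transpose(1, 2) for t in qkv)

# GEMMs 2-3: multi-head attention (Eq. 1); the softmax in between is memory-bound.
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
attn = (torch.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(batch, seq_len, hidden)

# GEMM 4: projection, followed by memory-bound bias-add, residual add, and layer normalization.
x = F.layer_norm(x + attn @ w_out + b_out, (hidden,))

# GEMMs 5-6: feed-forward network with the memory-bound GELU activation in between.
ffn = F.gelu(x @ w_ffn1 + b_ffn1) @ w_ffn2 + b_ffn2
out = F.layer_norm(x + ffn, (hidden,))
print(out.shape)   # torch.Size([8, 64, 768])
```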