TPU-MLIR: A Compiler For TPU Using MLIR
Pengchao Hu Man Lu Lei Wang Guoyue Jiang
{pengchao.hu,man.lu,lei.wang,guoyue.jiang}@sophgo.com
Sophgo Inc.
Abstract
Multi-level intermediate representations (MLIR) show great promise for reducing the cost of building domain-specific compilers by providing a reusable and extensible compiler infrastructure. This work presents TPU-MLIR, an end-to-end compiler based on MLIR that deploys pre-trained neural network (NN) models to a custom ASIC called a Tensor Processing Unit (TPU). TPU-MLIR defines two new dialects to implement its functionality: 1. a Tensor Operation (TOP) dialect that encodes the deep learning graph semantics and is independent of the deep learning framework, and 2. a TPU kernel dialect that provides a standard kernel computation on TPU. An NN model is translated to the TOP dialect and then lowered to the TPU dialect for different TPUs according to the chip's configuration. We demonstrate how to use the MLIR pass pipeline to organize and perform optimizations on TPU to generate machine code. The paper also presents a verification procedure to ensure the correctness of each transformation stage.
1. Introduction
The development of deep learning (DL) has profoundly impacted various scientific fields, including speech recognition, computer vision, and natural language processing. To facilitate the training of deep learning models, industry and academia have developed many frameworks, such as Caffe, TensorFlow, PyTorch, MXNet, and PaddlePaddle, which boost deep learning in many areas. However, each framework has its own proprietary graph representation, which creates a great deal of deployment work, since we need to support many DL model formats.

At the same time, matrix multiplication and high-dimensional tensor convolution dominate the computation in DL, which motivates chip architects to design customized DL accelerators that achieve high performance at low energy. Although the GPU is still the leading hardware for training DL models and all the DL frameworks have contributed much work to support this general-purpose hardware, the GPU is not an ideal fit for the inference domain of DL. The GPU is designed for gaming, graphics rendering, scientific computation, and much more, and is not tailored for DL only. Thus, many DL accelerators, such as Google TPU, Apple Bionic, Graphcore IPU, and SOPHGO TPU, are more energy efficient than GPUs and benefit many of these emerging DL applications.
In addition, the DL community has turned to domain-specific compilers to address the drawbacks of DL libraries and to alleviate the burden of manually optimizing DL models for each DL hardware target. DL compilers take the model described in a DL framework as input and generate efficient code for various DL hardware as output. The transformation from a model definition to a specific code implementation is highly optimized with respect to the model specification and the hardware architecture. Several popular DL compilers, such as TVM, Tensor Comprehensions, and XLA, have been proposed by industry and academia. Specifically, they incorporate DL-oriented optimizations such as layer and operator fusion, which enable highly efficient code generation.
Herein, we provide TPU-MLIR, an open-source DL compiler for TPU. In particular, we chose the Open Neural Network Exchange (ONNX) [1] as the DL format to represent our compiler's input model and use Multi-Level Intermediate Representation (MLIR) [7], a modern open-source compiler infrastructure for multi-level intermediate representation, to design the TPU-MLIR¹ compiler.

In this work, we introduce our compiler by
• presenting the overall design and architecture of the compiler,
• introducing two new dialects: a TOP dialect that encodes the deep learning graph semantics independent of the deep learning framework, and a TPU dialect that provides a common, device-dependent lowering point for all TOP dialect operations,
• detailing each compilation stage, such as converting NN models to the device-independent TOP dialect and then converting TOP to TPU for various chips and types,
• defining WeightOp for weight operations and storing weight data in a NumPy npz file, and
• providing an InferenceInterface for TOP and TPU to ensure correct conversions.

¹https://github.com/sophgo/tpu-mlir
We organize the remainder of the paper as follows. In Sec. 2, we briefly discuss MLIR and ONNX, on which our compiler is based, as well as the calibration process, which tailors computation for TPU. In Sec. 3, we introduce our compiler's design principles and architecture and discuss the TOP and TPU dialects. We also discuss using inference to ensure correctness in each conversion stage. Finally, we conclude our paper and discuss future work in Sec. 4.
2. Background
2.1. MLIR
MLIR, which is highly reusable and extensible, is a novel approach to constructing new domain-specific compilers. Its open ecosystem is the most significant difference from LLVM. MLIR standardizes Static Single Assignment (SSA)-based IR data structures, allowing one to express a range of concepts as first-class operations. Operations can represent many different levels of abstraction and computation, from dataflow graphs to target-specific instructions and even hardware circuitry. They take and produce zero or more values, called operands and results, respectively. A value represents data at runtime and is associated with a type known at compile time, whereas types model compile-time information about values. Complementary to this, attributes attach compile-time information to operations. The operation, attribute, and type systems are open and extensible. Custom types, operations, and attributes are logically grouped into dialects. A dialect is one of the most fundamental aspects of MLIR, enabling the infrastructure to implement a stack of reusable abstractions. Each abstraction encodes and preserves transformation validity preconditions directly in its IR, reducing the complexity and cost of analysis passes. The MLIR IR has a recursive structure in which operations contain a list of regions, regions contain a list of blocks, and blocks in turn contain a list of operations.
In particular, MLIR features operation, attribute, and type interfaces that provide a generic way of interacting with the IR. Interfaces allow transformations and analyses to work with abstract properties rather than fixed lists of supported concepts. Interfaces can be implemented separately from operations and mixed in using MLIR's registration mechanism, thus fully separating IR concepts from transformations. Furthermore, transformations can be written as compositions of orthogonal, localized "match and rewrite" primitives. These are often decomposed further into rewriting rules when applied within a dialect and lowering rules when converting from a higher-level dialect to a lower-level dialect. Throughout the compilation, separate dialects can co-exist to form a hybrid program representation. The ability to progressively lower dialects to the target hardware during the compilation process has made MLIR an excellent compiler infrastructure for domain-specific languages.
This article relies on several MLIR dialects and
types, briefly described below.
2.1.1 Ranked Tensor Type
Values with tensor type represent aggregate N-dimensional homogeneous data characterized by an element type and a fixed rank with a list of dimensions². Each dimension can be a static non-negative integer constant or be dynamically determined (marked by ?). This abstracted runtime representation carries both the tensor data values and information about the tensor shape, but the compiler has not decided on its representation in memory. Tensor values are immutable and subject to def-use SSA semantics [9]. Operations on tensors are often free of side effects, and operations always create new tensors with a value. The textual format of the tensor type is tensor<d1xd2x...xdNxdtype>, where d1, d2, ..., dN are integers or the symbol ? representing the dimensions of a tensor, and dtype is the type of the elements in the tensor, e.g., f32 for float32. A tensor can be unranked when its shape is unknown. MLIR uses tensor<*xdtype> to represent unranked tensor types.

²https://mlir.llvm.org/docs/Dialects/Builtin/#rankedtensortype
2.1.2 Quantization Dialect
The quantization dialect³ provides a family of quantized types and type-conversion operations. "Quantization" here refers to the conversion of floating-point computations to corresponding variants expressed in integer math for inference, as supported by low-bit-depth inference engines such as various accelerator hardware and many DSPs. Three types are defined in the quantization dialect: UniformQuantizedType, UniformQuantizedPerAxisType, and CalibratedQuantizedType. UniformQuantizedType and UniformQuantizedPerAxisType represent the mapping between expressed values (e.g., a floating-point computer type) and storage values (typically of an integral computer type), expressing the affine transformation from uniformly spaced points to the real number line. The relationship is realValue = scale × (quantizedValue − zeroPoint), and it is discussed in more detail in Section 2.3. CalibratedQuantizedType holds the range given by the min and max values of the histogram data of a tensor and is used for recording the tensor's statistics. UniformQuantizedPerAxisType applies the affine transformation individually to each index along a specific axis of a tensor type, whereas UniformQuantizedType applies the affine transformation to every value within the target type. The type conversions defined in the quantization dialect provide three operations for converting between types based on a QuantizedType and its expressed and storage sub-types: quant.qcast converts from an expressed type to a QuantizedType, quant.dcast converts from a QuantizedType to its expressed type, and quant.scast converts between a QuantizedType and its storage type.

³https://mlir.llvm.org/docs/Dialects/QuantDialect
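To make these type concepts concrete, the NumPy sketch below mimics what the three quantized types encode; it is an illustration only, not part of the compiler, and the helper names and the sample scale and zero-point values are ours.

```python
import numpy as np

# UniformQuantizedType: one (scale, zero_point) pair for the whole tensor.
# The expressed value is recovered as realValue = scale * (quantizedValue - zeroPoint).
def dequantize_per_tensor(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

# UniformQuantizedPerAxisType: one (scale, zero_point) pair per slice along an axis,
# e.g., per output channel of a convolution weight.
def dequantize_per_axis(q, scales, zero_points, axis=0):
    shape = [1] * q.ndim
    shape[axis] = -1
    s = np.asarray(scales, dtype=np.float32).reshape(shape)
    zp = np.asarray(zero_points, dtype=np.float32).reshape(shape)
    return s * (q.astype(np.float32) - zp)

# CalibratedQuantizedType: only records the observed [min, max] range of a tensor.
activations = np.random.randn(1000).astype(np.float32)
calibrated_range = (float(activations.min()), float(activations.max()))
print(calibrated_range)

q = np.array([[-128, 0, 127], [10, 20, 30]], dtype=np.int8)
print(dequantize_per_tensor(q, scale=0.02, zero_point=0))
print(dequantize_per_axis(q, scales=[0.02, 0.05], zero_points=[0, 0], axis=0))
```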
2.2. ONNX
ONNX is an open-source, framework-independent format widely used for exchanging computation graph models, covering both deep learning and traditional machine learning. It was accepted as a graduate project in Linux Foundation AI and is maintained by open-source communities. ONNX defines an extensible computation graph model, operators, and standard data types for deep learning, and it provides a set of specifications to convert a model into a basic ONNX format and another to get the model back from this ONNX form. It is an ideal tool for framework interoperability, especially when deploying a model to specific hardware [5].

ONNX reduces the friction of moving trained DL models among AI frameworks and platforms. ONNX uses the Protocol Buffers language for its syntax and provides rich documentation and tools to formalize each operation's semantics and verify its correctness.
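As a brief illustration of how such a graph can be consumed by a converter front end, the following sketch uses the Python onnx package to load a model, verify it against the specification, and walk its operators and weights; the file name model.onnx is a placeholder, and this is not the TPU-MLIR front end itself.

```python
import onnx

# Load a serialized ONNX model (Protocol Buffers format); "model.onnx" is a placeholder path.
model = onnx.load("model.onnx")

# Verify that the graph conforms to the ONNX specification.
onnx.checker.check_model(model)

# Walk the computation graph: each node is a framework-independent operator.
for node in model.graph.node:
    print(node.op_type, list(node.input), "->", list(node.output))

# Initializers hold the trained weights as named tensors.
for init in model.graph.initializer:
    print(init.name, list(init.dims))
```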
2.3. Quantization
Quantization is a promising technique for reducing deep learning models' memory footprint, inference latency, and power consumption. It replaces high-cost floating-point (typically F32) computation with low-cost fixed-point numbers [4] (e.g., INT8/INT16) or lower-precision floating-point (e.g., BF16/F16). Because most current DL models are heavily over-parameterized and robust to extreme discretization, there is much opportunity for reducing numerical precision without impacting the model's accuracy, which provides an ample search space for tuning. Although many quantization methods have emerged, there is not a single well-posed or well-conditioned problem being solved [3]. Instead, one is interested in some error metric (based on classification quality, data similarity, etc.) to guide the quantization process. However, due to the over-parameterization, it is possible to have a high error between a quantized model and the original model while still attaining excellent generalization performance. Finally, different layers in a neural network have a different impact on the loss function, which motivates a mixed-precision quantization approach.
2.3.1 Uniform Quantization
The quantization process is a function mapping real values r to some numeric values. A quantization function such as

quant(r) = round(r / s) + zp    (1)

where quant is the quantization operator, r is a real-valued input (activation or weight), s is a floating-point scaling factor, and zp is an integer zero point, is known as uniform quantization, as the resulting quantized values are evenly spaced.
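As a minimal sketch of Equation 1, the function below quantizes a real-valued array given a scale s and zero point zp; the clamp to the INT8 range is an implementation detail added here, and the example values are illustrative only.

```python
import numpy as np

def quant(r, s, zp):
    """Uniform quantization (Eq. 1): quant(r) = round(r / s) + zp, clamped to INT8."""
    q = np.round(r / s) + zp
    return np.clip(q, -128, 127).astype(np.int8)

r = np.array([-0.32, -0.11, 0.0, 0.17, 0.30], dtype=np.float32)
q = quant(r, s=0.1, zp=0)
print(q)                            # integer codes, e.g. [-3 -1  0  2  3]
print(q.astype(np.float32) * 0.1)   # dequantized values lie on an evenly spaced grid
```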
2.3.2 Symmetric and Asymmetric Quantization

A crucial factor in uniform quantization is the choice of the scaling factor s in Equation 1. This scaling factor, also known as the resolution, divides a given range of real values r into a number of partitions, s = (β − α) / (2^b − 1), where [α, β] denotes the clipping range with which we clip the real values, and b is the quantization bit width [4][6]. Therefore, one should determine the clipping range [α, β] before computing the scaling factor. If the clipping range is symmetric, i.e., α = −β, we get symmetric quantization; otherwise, we get asymmetric quantization. The asymmetric quantization method often results in a tighter clipping range than symmetric quantization, which is especially important when the dynamic range of the tensor is imbalanced, e.g., the result of ReLU always has non-negative values.
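The following sketch, using the same notation, derives the scaling factor from a clipping range [α, β] with b = 8 bits, first symmetrically and then asymmetrically; for a ReLU output, whose values are all non-negative, the asymmetric choice yields roughly half the scale, i.e., a tighter grid. The helper names and the zero-point formula are ours, added for illustration.

```python
import numpy as np

def symmetric_params(x, b=8):
    """Symmetric: clip to [-beta, beta] with beta = max(|x|); the zero point is 0."""
    beta = float(np.abs(x).max())
    alpha = -beta
    s = (beta - alpha) / (2**b - 1)
    return s, 0

def asymmetric_params(x, b=8):
    """Asymmetric: clip to the observed [min, max]; the zero point shifts the grid."""
    alpha, beta = float(x.min()), float(x.max())
    s = (beta - alpha) / (2**b - 1)
    zp = int(round(-alpha / s))
    return s, zp

relu_out = np.maximum(np.random.randn(10000).astype(np.float32), 0.0)  # non-negative
print(symmetric_params(relu_out))   # larger scale: half of the integer range is wasted
print(asymmetric_params(relu_out))  # roughly half the scale -> tighter clipping range
```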
2.3.3 Calibration
The process of choosing the clipping range is called "calibration." One popular method for pre-calculating it is to run a series of inferences on some sample data and then obtain the distribution of each tensor in the graph. Using the min/max of the signal for both symmetric and asymmetric quantization is typical in most