TPU-MLIR: A Compiler For TPU Using MLIR
Pengchao Hu Man Lu Lei Wang Guoyue Jiang
{pengchao.hu,man.lu,lei.wang,guoyue.jiang}@sophgo.com
Sophgo Inc.
Abstract
Multi-level intermediate representations (MLIR) show great promise for reducing the cost of building domain-specific compilers by providing a reusable and extensible compiler infrastructure. This work presents TPU-MLIR, an end-to-end compiler based on MLIR that deploys pre-trained neural network (NN) models to a custom ASIC called a Tensor Processing Unit (TPU). TPU-MLIR defines two new dialects to implement its functionality: 1. a Tensor Operation (TOP) dialect that encodes the deep learning graph semantics and is independent of the deep learning framework, and 2. a TPU kernel dialect that provides a standard kernel computation on TPU. An NN model is translated to the TOP dialect and then lowered to the TPU dialect for different TPUs according to the chip's configuration. We demonstrate how to use the MLIR pass pipeline to organize and perform optimizations on TPU to generate machine code. The paper also presents a verification procedure to ensure the correctness of each transformation stage.
1. Introduction
The development of deep learning (DL) has profoundly impacted various scientific fields, including speech recognition, computer vision, and natural language processing. To facilitate the training of deep learning models, industry and academia have developed many frameworks, such as Caffe, TensorFlow, PyTorch, MXNet, and PaddlePaddle, which boost deep learning in many areas. However, each framework has its own proprietary graph representation, which creates a great deal of deployment work, since we need to support many DL model formats.

At the same time, matrix multiplication and high-dimensional tensor convolution dominate the computation in DL, which motivates chip architects to design customized DL accelerators that achieve high performance at low energy. Although the GPU is still the leading hardware for training DL models and all the DL frameworks have contributed much work to support this general-purpose hardware, the GPU is not an ideal fit for the inference domain of DL. The GPU is designed for gaming, graphics rendering, scientific computation, and much more, and is not tailored for DL only. Thus, many DL accelerators, such as Google TPU, Apple Bionic, Graphcore IPU, and SOPHGO TPU, are more energy efficient than GPUs and benefit many of these emerging DL applications.
In addition, the DL community has turned to domain-specific compilers to address the drawbacks of DL libraries and to alleviate the burden of manually optimizing DL models for each DL hardware target. DL compilers take the model described in a DL framework as input and generate efficient code for various DL hardware as output. The transformation from a model definition to a specific code implementation is highly optimized with respect to the model specification and the hardware architecture. Several popular DL compilers, such as TVM, Tensor Comprehensions, and XLA, have been proposed by industry and academia. Specifically, they incorporate DL-oriented optimizations such as layer and operator fusion, which enable highly efficient code generation.
Herein, we provide TPU-MLIR, an open-source DL compiler for TPU. In particular, we chose the Open Neural Network Exchange (ONNX) [1] as the DL format to represent our compiler's input model and use Multi-Level Intermediate Representation (MLIR) [7], a modern open-source compiler infrastructure for multi-level intermediate representation, to design the TPU-MLIR¹ compiler.

In this work, we introduce our compiler by
• presenting the overall design and architecture of the compiler,
• introducing two new dialects: a TOP dialect that encodes the deep learning graph semantics independent of the deep learning framework, and a TPU dialect that provides a common, device-dependent lowering point for all TOP dialect operations,
• detailing each compilation stage, such as converting NN models to the device-independent TOP dialect and then converting TOP to TPU for various chips and types,
• defining WeightOp for weight operations and storing weight data in a NumPy npz file, and
• providing an InferenceInterface for TOP and TPU to ensure correct conversions.

¹https://github.com/sophgo/tpu-mlir
We organize the remainder of the paper as follows. In Sec. 2, we briefly discuss MLIR and ONNX, on which our compiler is based, as well as the calibration process, which tailors computation for TPU. In Sec. 3, we introduce our compiler's design principles and architecture and discuss the TOP and TPU dialects. We also discuss using inference to ensure correctness in each conversion stage. Finally, we conclude our paper and discuss future work in Sec. 4.
2. Background
2.1. MLIR
MLIR, which is highly reusable and extensible, is a novel approach to constructing new domain-specific compilers. Its open ecosystem is the most significant difference from LLVM. MLIR standardizes Static Single Assignment (SSA)-based IR data structures, allowing one to express a range of concepts as first-class operations. Operations can represent many different levels of abstraction and computation, from dataflow graphs to target-specific instructions and even hardware circuitry. They take and produce zero or more values, called operands and results, respectively. A value represents data at runtime and is associated with a type known at compile time, whereas types model compile-time information about values. Complementary to this, attributes attach compile-time information to operations. The operation, attribute, and type systems are open and extensible. Custom types, operations, and attributes are logically grouped into dialects. A dialect is one of the most fundamental aspects of MLIR, enabling the infrastructure to implement a stack of reusable abstractions. Each abstraction encodes and preserves transformation validity preconditions directly in its IR, reducing the complexity and cost of analysis passes. The MLIR IR has a recursive structure in which operations contain a list of regions, regions contain a list of blocks, and blocks in turn contain a list of operations.
In particular, MLIR features operation, attribute, and type interfaces that provide a generic way of interacting with the IR. Interfaces allow transformations and analyses to work with abstract properties rather than fixed lists of supported concepts. Interfaces can be implemented separately from operations and mixed in using MLIR's registration mechanism, thus fully separating IR concepts from transformations. Furthermore, transformations can be written as compositions of orthogonal, localized "match and rewrite" primitives. These are often decomposed further into rewriting rules when applied within a dialect and lowering rules when converting from a higher-level dialect to a lower-level dialect. Throughout the compilation, separate dialects can co-exist to form a hybrid program representation. The ability to progressively lower dialects to the target hardware during the compilation process has made MLIR an excellent compiler infrastructure for domain-specific languages.
This article relies on several MLIR dialects and
types, briefly described below.
2.1.1 Ranked Tensor Type
Values with tensor type represent aggregate N-dimensional homogeneous data characterized by an element type and a fixed rank with a list of dimensions². Each dimension can be a static non-negative integer constant or be dynamically determined (marked by ?). This abstracted runtime representation carries both the tensor data values and information about the tensor shape, but the compiler has not decided on its representation in memory. Tensor values are immutable and subject to def-use SSA semantics [9]. Operations on tensors are often free of side effects, and operations always create new tensors with a value. The textual format of the tensor type is tensor<d1xd2x...xdNxdtype>, where d1, d2, ..., dN are integers or the symbol ? representing the dimensions of a tensor, and dtype is the type of the elements in the tensor, e.g., f32 for float32. A tensor can be unranked when its shape is unknown. MLIR uses tensor<*xdtype> to represent unranked tensor types.

²https://mlir.llvm.org/docs/Dialects/Builtin/#rankedtensortype
2.1.2 Quantization Dialect
The quantization dialect³ provides a family of quantized types and type-conversion operations. "Quantization" here refers to the conversion of floating-point computations to corresponding variants expressed in integer math for inference, as supported by low-bit-depth inference engines such as various accelerator hardware and many DSPs. Three types are defined in the quantization dialect: UniformQuantizedType, UniformQuantizedPerAxisType, and CalibratedQuantizedType. UniformQuantizedType and UniformQuantizedPerAxisType represent the mapping between expressed values (e.g., a floating-point computer type) and storage values (typically of an integral computer type), expressing the affine transformation from uniformly spaced points to the real number line. The relationship is realValue = scale × (quantizedValue − zeroPoint), and it is discussed in more detail in Section 2.3. CalibratedQuantizedType holds the range given by the min and max values of the histogram data of a tensor and is used for recording the tensor's statistics. UniformQuantizedPerAxisType applies the affine transformation individually to each index along a specific axis of a tensor type, whereas UniformQuantizedType applies the affine transformation to every value within the target type. The type conversions defined in the quantization dialect provide three operations for converting between types based on a QuantizedType and its expressed and storage sub-types: quant.qcast converts from an expressed type to a QuantizedType, quant.dcast converts from a QuantizedType to its expressed type, and quant.scast converts between a QuantizedType and its storage type.

³https://mlir.llvm.org/docs/Dialects/QuantDialect
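To make these type concepts concrete, the NumPy sketch below mimics what the three quantized types encode; it is an illustration only, not part of the compiler, and the helper names and the sample scale and zero-point values are ours.

```python
import numpy as np

# UniformQuantizedType: one (scale, zero_point) pair for the whole tensor.
# The expressed value is recovered as realValue = scale * (quantizedValue - zeroPoint).
def dequantize_per_tensor(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

# UniformQuantizedPerAxisType: one (scale, zero_point) pair per slice along an axis,
# e.g., per output channel of a convolution weight.
def dequantize_per_axis(q, scales, zero_points, axis=0):
    shape = [1] * q.ndim
    shape[axis] = -1
    s = np.asarray(scales, dtype=np.float32).reshape(shape)
    zp = np.asarray(zero_points, dtype=np.float32).reshape(shape)
    return s * (q.astype(np.float32) - zp)

# CalibratedQuantizedType: only records the observed [min, max] range of a tensor.
activations = np.random.randn(1000).astype(np.float32)
calibrated_range = (float(activations.min()), float(activations.max()))
print(calibrated_range)

q = np.array([[-128, 0, 127], [10, 20, 30]], dtype=np.int8)
print(dequantize_per_tensor(q, scale=0.02, zero_point=0))
print(dequantize_per_axis(q, scales=[0.02, 0.05], zero_points=[0, 0], axis=0))
```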
2.2. ONNX
ONNX is an open-source, framework-independent format widely used for exchanging computation graph models, covering both deep learning and traditional machine learning. It was accepted as a graduate project in Linux Foundation AI and is maintained by open-source communities. ONNX defines an extensible computation graph model, operators, and standard data types for deep learning, and it provides a set of specifications to convert a model into a basic ONNX format and another to get the model back from this ONNX form. It is an ideal tool for framework interoperability, especially when deploying a model to specific hardware [5].

ONNX reduces the friction of moving trained DL models among AI frameworks and platforms. ONNX uses the Protocol Buffers language for its syntax and provides rich documentation and tools to formalize each operation's semantics and verify its correctness.
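As a brief illustration of how such a graph can be consumed by a converter front end, the following sketch uses the Python onnx package to load a model, verify it against the specification, and walk its operators and weights; the file name model.onnx is a placeholder, and this is not the TPU-MLIR front end itself.

```python
import onnx

# Load a serialized ONNX model (Protocol Buffers format); "model.onnx" is a placeholder path.
model = onnx.load("model.onnx")

# Verify that the graph conforms to the ONNX specification.
onnx.checker.check_model(model)

# Walk the computation graph: each node is a framework-independent operator.
for node in model.graph.node:
    print(node.op_type, list(node.input), "->", list(node.output))

# Initializers hold the trained weights as named tensors.
for init in model.graph.initializer:
    print(init.name, list(init.dims))
```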
2.3. Quantization
Quantization is a promising technique for reducing deep learning models' memory footprint, inference latency, and power consumption. It replaces high-cost floating-point (typically F32) computation with low-cost fixed-point numbers [4] (e.g., INT8/INT16) or lower-precision floating-point (e.g., BF16/F16). Because most current DL models are heavily over-parameterized and robust to extreme discretization, there is much opportunity for reducing numerical precision without impacting the model's accuracy, which provides an ample search space for tuning. Although many quantization methods have emerged, there is not a single well-posed or well-conditioned problem being solved [3]. Instead, one is interested in some error metric (based on classification quality, data similarity, etc.) to guide the quantization process. However, due to the over-parameterization, it is possible to have a high error between a quantized model and the original model while still attaining excellent generalization performance. Finally, different layers in a neural network have a different impact on the loss function, which motivates a mixed-precision quantization approach.
2.3.1 Uniform Quantization
The quantization process is a function mapping real values r to some numeric values. A quantization function such as

quant(r) = round(r / s) + zp    (1)

where quant is the quantization operator, r is a real-valued input (activation or weight), s is a floating-point scaling factor, and zp is an integer zero point, is known as uniform quantization, as the resulting quantized values are evenly spaced.
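As a minimal sketch of Equation 1, the function below quantizes a real-valued array given a scale s and zero point zp; the clamp to the INT8 range is an implementation detail added here, and the example values are illustrative only.

```python
import numpy as np

def quant(r, s, zp):
    """Uniform quantization (Eq. 1): quant(r) = round(r / s) + zp, clamped to INT8."""
    q = np.round(r / s) + zp
    return np.clip(q, -128, 127).astype(np.int8)

r = np.array([-0.32, -0.11, 0.0, 0.17, 0.30], dtype=np.float32)
q = quant(r, s=0.1, zp=0)
print(q)                            # integer codes, e.g. [-3 -1  0  2  3]
print(q.astype(np.float32) * 0.1)   # dequantized values lie on an evenly spaced grid
```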
2.3.2 Symmetric and Asymmetric Quantization

A crucial factor in uniform quantization is the choice of the scaling factor s in Equation 1. This scaling factor, also known as the resolution, divides a given range of real values r into a number of partitions, s = (β − α) / (2^b − 1), where [α, β] denotes the clipping range with which we clip the real values, and b is the quantization bit width [4][6]. Therefore, one should determine the clipping range [α, β] before computing the scaling factor. If the clipping range is symmetric, i.e., α = −β, we get symmetric quantization; otherwise, we get asymmetric quantization. The asymmetric quantization method often results in a tighter clipping range than symmetric quantization, which is especially important when the dynamic range of the tensor is imbalanced, e.g., the result of ReLU always has non-negative values.
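The following sketch, using the same notation, derives the scaling factor from a clipping range [α, β] with b = 8 bits, first symmetrically and then asymmetrically; for a ReLU output, whose values are all non-negative, the asymmetric choice yields roughly half the scale, i.e., a tighter grid. The helper names and the zero-point formula are ours, added for illustration.

```python
import numpy as np

def symmetric_params(x, b=8):
    """Symmetric: clip to [-beta, beta] with beta = max(|x|); the zero point is 0."""
    beta = float(np.abs(x).max())
    alpha = -beta
    s = (beta - alpha) / (2**b - 1)
    return s, 0

def asymmetric_params(x, b=8):
    """Asymmetric: clip to the observed [min, max]; the zero point shifts the grid."""
    alpha, beta = float(x.min()), float(x.max())
    s = (beta - alpha) / (2**b - 1)
    zp = int(round(-alpha / s))
    return s, zp

relu_out = np.maximum(np.random.randn(10000).astype(np.float32), 0.0)  # non-negative
print(symmetric_params(relu_out))   # larger scale: half of the integer range is wasted
print(asymmetric_params(relu_out))  # roughly half the scale -> tighter clipping range
```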
2.3.3 Calibration
The process of choosing the clipping range is called "calibration." One popular method for pre-calculating it is to run a series of inferences on some sample data and then obtain the distribution of each tensor in the graph. Using the min/max of the signal for both symmetric and asymmetric quantization is typical in most