
• detailing each compile stage, such as converting
NN models to the device-independent TOP dialect
and then converting TOP to TPU for various chips
and data types,
• defining WeightOp to represent weight operations
and store weight data in NumPy npz files, and
• providing InferenceInterface for TOP and TPU to
verify that each conversion is correct.
We organize the remainder of the paper as follows.
In Sec. 2, we briefly discuss MLIR and ONNX, on
which our compiler is based, as well as calibration,
which tailors computation for the TPU. In Sec. 3, we
introduce our compiler's design principles and archi-
tecture and discuss the TOP and TPU dialects. We
also discuss how inference is used to ensure correctness
at each conversion stage. Finally, we conclude and
discuss future work in Sec. 4.
2. Background
2.1. MLIR
MLIR is a novel, highly reusable and extensible ap-
proach for constructing domain-specific compilers. Its
open ecosystem is the most significant difference from
LLVM. MLIR standardizes Static Single Assignment
(SSA)-based IR data structures, allowing one to
express a range of concepts as first-class
operations. Operations can represent many different
levels of abstraction and computations, from dataflow
graphs to target-specific instructions and even hard-
ware circuitry. They take and produce zero or more
values, called operands and results, respectively. A
value represents data at runtime and is associated with
a type known at compile-time, whereas types model
compile-time information about values. Complemen-
tary to this, attributes attach compile-time informa-
tion to operations. Operations, attributes, and types
are open and extensible. Custom types,
operations, and attributes are logically grouped into
dialects. A dialect is one of the most fundamental as-
pects of MLIR that enables the infrastructure to imple-
ment a stack of reusable abstractions. Each abstraction
encodes and preserves transformation validity precon-
ditions directly in its IR, reducing the complexity and
cost of analysis passes. The MLIR IR has a recursive
structure where operations contain a list of regions, and
regions contain a list of blocks, which in turn, contain
a list of operations.
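As a minimal illustration of this nesting, consider the following sketch, which uses only upstream builtin, func, and arith dialects (it is illustrative and not taken from the compiler described here):

```mlir
// A module is an operation holding one region with one block.
module {
  // func.func is itself an operation; its region contains a
  // block that in turn holds a list of operations.
  func.func @scale(%arg0: tensor<2x4xf32>) -> tensor<2x4xf32> {
    %cst = arith.constant dense<2.0> : tensor<2x4xf32>
    // arith.mulf takes two operands and produces one result,
    // each an SSA value with a compile-time tensor type.
    %0 = arith.mulf %arg0, %cst : tensor<2x4xf32>
    return %0 : tensor<2x4xf32>
  }
}
```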
In particular, MLIR features operation, attribute,
and type interfaces, which provide a generic way of in-
teracting with the IR. Interfaces allow transformations
and analyses to work with abstract properties rather
than fixed lists of supported concepts. Interfaces can
be implemented separately from operations and mixed
in using MLIR’s registration mechanism, thus fully sep-
arating IR concepts from transformations. Further-
more, transformations can be written as compositions
of orthogonal, localized "match and rewrite" primitives.
These are often decomposed further into rewriting rules
when applied within a dialect and lowering rules when
converting from a higher-level dialect to a lower-level
dialect. Throughout the compilation, separate dialects
can co-exist to form a hybrid program representation.
The ability to progressively lower dialects to the tar-
get hardware during the compilation process has made
MLIR an excellent compiler infrastructure for domain-
specific languages.
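To sketch such co-existence with upstream dialects (the function below is illustrative and not from the compiler described in this paper), a single block can hold operations from several dialects while lowering proceeds:

```mlir
func.func @mixed(%x: tensor<4xf32>) -> tensor<4xf32> {
  // math and arith dialect operations co-exist in one block;
  // a later conversion pass may lower math.sqrt toward the
  // target while leaving the arith operations untouched.
  %0 = math.sqrt %x : tensor<4xf32>
  %1 = arith.addf %0, %x : tensor<4xf32>
  return %1 : tensor<4xf32>
}
```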
This article relies on several MLIR dialects and
types, briefly described below.
2.1.1 Ranked Tensor Type
Values with tensor type represent aggregate N-
dimensional homogeneous data described by an element
type and a fixed rank with a list of dimensions². Each
dimension can be a static non-negative integer con-
stant or dynamically determined (marked by ?).
This abstracted runtime representation carries both
the tensor data values and information about the ten-
sor shape, but the compiler has not decided on its rep-
resentation in memory. Tensor values are immutable
and subject to def-use SSA semantics[9]. Operations on
tensors are often free of side effects, and operations al-
ways create new tensor values. The textual format
of a ranked tensor type is tensor<d1xd2x...xdNxdtype>,
where d1, d2, ..., dN are integers or the symbol ?
representing the dimensions of the tensor, and dtype
is the type of its elements, e.g., f32 for float32. A
tensor can be unranked when its shape is unknown;
MLIR uses tensor<*xdtype> to represent unranked
tensor types.
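For concreteness, a few ranked and unranked tensor types in MLIR's textual notation (illustrative type fragments, not a complete module):

```mlir
tensor<2x3xf32>   // rank 2, static shape 2x3, float32 elements
tensor<?x4xf32>   // rank 2, first dimension dynamic
tensor<*xf32>     // unranked: element type known, shape unknown
```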
2.1.2 Quantization Dialect
Quantization dialect³ provides a family of quantized
types and type-conversion operations. Here, "quanti-
zation" refers to converting floating-point computa-
tions to corresponding variants expressed in integer
math for inference, as supported by low-bit-depth in-
ference engines such as various hardware accelerators
and many DSPs. There are three types de-
fined in quantization dialect: UniformQuantizedType,
UniformQuantizedPerAxisType, and CalibratedQuan-
tizedType. The UniformQuantizedType and Unifor-
² https://mlir.llvm.org/docs/Dialects/Builtin/#rankedtensortype
³ https://mlir.llvm.org/docs/Dialects/QuantDialect