
folding are used to identify mergeable nodes in graph IRs af-
ter precomputing statically-determinable components. Graph
IRs specify high-level inputs and outputs of each operator,
but do not restrict how each operator is implemented.
Backend: Low-Level IRs and Optimizations.
Hardware-
specific low-level IRs are generated from graph IRs. Instead
of translating graph IRs directly into standard IRs like LLVM
IR [55], low-level IRs are employed as an intermediary step
for customized optimizations using prior knowledge of DL
models and hardware characteristics. Graph IR operators can
be converted into low-level linear algebra operators [85]. For
example, a fully connected (FC) operator can be represented
as matrix multiplication followed by addition. Such repre-
sentations alleviate the hurdles of directly supporting many
high-level operators on each hardware target. Instead, trans-
lation to a new hardware target only needs the support of
low-level linear algebra operators. Low-level IRs are usually
memory related. Hence, optimizations at this step can include
hardware intrinsic mapping, memory allocation, loop-related
optimizations, and parallelization [17,22,84,110].
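As a minimal illustration of this lowering (a NumPy sketch of the semantics, not any compiler's actual low-level IR), the FC example above reduces to a matrix multiplication followed by an element-wise addition:

```python
import numpy as np

def fully_connected(x, weight, bias):
    """Lowered view of an FC operator: matrix multiplication, then addition.

    x:      (batch, in_features)        input activations
    weight: (in_features, out_features) learned weights
    bias:   (out_features,)             learned bias
    """
    return x @ weight + bias  # matmul followed by element-wise add

# Toy usage: a batch of 2 inputs with 4 features mapped to 3 outputs.
x = np.random.randn(2, 4).astype(np.float32)
w = np.random.randn(4, 3).astype(np.float32)
b = np.random.randn(3).astype(np.float32)
print(fully_connected(x, w, b).shape)  # (2, 3)
```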
Backend: Scheduling and Tuning.
Policies mapping an op-
erator to low-level code are called schedules. A compiler
backend often searches a vast combinatorial scheduling space
for optimal parameter settings like loop unrolling factors.
Halide [84] introduces a scheduling language with manual and
automated schedule optimization primitives. Recent works
explore automated scheduling and tuning to enhance opti-
mization [12,22,23,70,97,113,114]. These methods alleviate
manual efforts to decide schedules and optimal parameters.
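The following sketch illustrates what a schedule looks like, using TVM's (legacy) tensor-expression Python API; the operator, split factor, and names are illustrative choices, not tuned values:

```python
# A sketch of a "schedule": the same computation, with loop structure
# decided by split/unroll primitives (factor 8 is an arbitrary example).
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A", dtype="float32")
B = te.placeholder((n,), name="B", dtype="float32")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)                        # default: one plain loop
outer, inner = s[C].split(C.op.axis[0], factor=8)   # tile the loop
s[C].unroll(inner)                                   # unroll the inner loop

# Print the lowered loop nest to see how the schedule reshaped the code.
print(tvm.lower(s, [A, B, C], simple_mode=True))
```

An auto-scheduler or tuner essentially searches over such choices (split factors, unrolling, vectorization, parallelization) instead of relying on hand-written schedules.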
Backend: Code Gen.
Low-level IRs are compiled to gener-
ate code for different hardware targets like CPUs and GPUs.
When generating machine code, a DNN operator (or sev-
eral fused operators) is typically compiled into an individual
assembly function. Low-level IRs can be converted into ma-
ture tool-chains IRs like LLVM or CUDA IR [73] to explore
hardware-specific optimizations. For instance, Glow [85] can
perform fine-grained loop-oriented optimizations in LLVM
IR. DL compilers like TVM and Glow compile optimized
IR code into standalone executables. DL compilers like NNFusion [64] and XLA [95], in contrast, statically link DNN executables with kernel libraries. Decompiling executables statically linked with kernel libraries is much easier: such executables contain many wrappers around kernel-library routines.
These wrappers (e.g., a trampoline to the Conv implementa-
tion in kernel libraries) can be used to infer DNN models. This
work mainly focuses on decompiling “self-contained” exe-
cutables emitted by TVM and Glow, given their importance
and difficulty. For completeness, we demonstrate decompiling
NNFusion-emitted executables in Sec. 4.4.
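For concreteness, the sketch below shows how a "self-contained" artifact of the kind BTD targets can be produced with TVM's Relay Python API; the toy dense-plus-ReLU graph and file name are our own illustrative choices, not taken from BTD's pipeline:

```python
import numpy as np
import tvm
from tvm import relay

# Toy graph IR: dense (FC) -> bias_add -> ReLU.
x = relay.var("x", shape=(1, 64), dtype="float32")
w = relay.var("w", shape=(10, 64), dtype="float32")
b = relay.var("b", shape=(10,), dtype="float32")
out = relay.nn.relu(relay.nn.bias_add(relay.nn.dense(x, w), b))
mod = tvm.IRModule.from_expr(relay.Function([x, w, b], out))

# Weights and biases become constants baked into the compiled artifact.
params = {
    "w": tvm.nd.array(np.random.randn(10, 64).astype("float32")),
    "b": tvm.nd.array(np.random.randn(10).astype("float32")),
}

# Compile for an x86 CPU via LLVM and export a standalone shared object.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
lib.export_library("model.so")
# Inspecting model.so (e.g., with nm) typically shows one function per
# (fused) operator, which is the view a decompiler like BTD starts from.
```

The emitted object carries no symbols pointing into external kernel libraries, which is precisely what makes such executables harder to decompile than NNFusion- or XLA-style outputs.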
Real-World Significance of DL Compilers.
DL compilers
offer systematic optimization to improve DNN model adop-
tion. Though many DNN models to date are deployed using DL frameworks like TensorFlow, DL compilers represent a growing trend that cannot be disregarded. Suppliers of edge devices and low-power processors are incorporating DL compilers into their applications to reap the benefits of DNN models [48,74,75,83]. Cloud service providers like Amazon and Google incorporate DL compilers into their DL services to boost
performance [14,101]. Amazon uses DL compilers to compile
DNN models on Intel x86 CPUs [49,61]. Facebook deploys
Glow-compiled DNN models on Intel CPUs [69]. Overall, DL
compilers are increasingly vital to boost DL on Intel CPUs,
embedded devices, and other heterogeneous hardware back-
ends. We design BTD, a decompiler for Intel x86 DNN exe-
cutables. We show how BTD can accelerate common DNN
attacks (Appendix D) and migrate DNN executables to GPUs
(Sec. 8). Sec. 8 explains why BTD does not decompile executables on GPUs/accelerators: GPU/accelerator platforms lack disassembler and dynamic instrumentation infrastructures, and DL compiler support for GPU platforms is immature (e.g., it cannot generate standalone executables).
3 Decompiling DNN Executables
Definition.
BTD decompiles DNN executables to recover their high-level specifications. The full specifications include: ① DNN operators (e.g., ReLU, Pooling, and Conv) and their topological connectivity, ② dimensions of each DNN operator, such as #channels in Conv, and ③ parameters of each DNN operator, such as weights and biases, which are important configurations learned during model training. Sec. 4 details BTD’s processes to recover each component.
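A recovered specification can be pictured as a structure like the following (a hypothetical, simplified representation for illustration only; it is not BTD's actual output format, and all field names are ours):

```python
# Hypothetical, simplified view of the three recovered components for one
# Conv operator in a decompiled model; field names are illustrative only.
recovered_conv = {
    # (1) operator type and topological connectivity
    "op": "Conv2D",
    "inputs": ["input_image"],
    "outputs": ["conv1_out"],          # consumed by the next operator, e.g., ReLU
    # (2) dimensions of the operator
    "dims": {"in_channels": 3, "out_channels": 64,
             "kernel_size": (3, 3), "stride": (1, 1), "padding": (1, 1)},
    # (3) parameters learned during training, recovered from the executable
    "params": {"weights": "float32[64,3,3,3]", "bias": "float32[64]"},
}
```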
Query-Based Model Extraction.
Given a (remote) DNN model with obscure specifications, adversaries can continuously feed inputs x to the model and collect its prediction outputs y. This way, adversaries can gradually assemble a training dataset (x, y) to train a local model [79,96].
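In pseudocode, this approach looks roughly as follows (a sketch only; remote_predict stands for the adversary's black-box query channel to the victim model, and the surrogate is any guessed local architecture exposing a fit method, both hypothetical placeholders):

```python
import numpy as np

def extract_by_queries(remote_predict, surrogate, num_queries=10_000,
                       input_shape=(3, 224, 224)):
    """Assemble a transfer set (x, y) by querying the victim, then train a surrogate.

    remote_predict: black-box oracle returning the victim's prediction for an input
                    (placeholder for the adversary's query channel).
    surrogate:      a local model exposing fit(X, Y); its architecture must be guessed.
    """
    xs, ys = [], []
    for _ in range(num_queries):
        x = np.random.rand(*input_shape).astype(np.float32)  # ideally resembles real inputs
        y = remote_predict(x)                                  # collect the victim's prediction
        xs.append(x)
        ys.append(y)
    surrogate.fit(np.stack(xs), np.stack(ys))                  # local retraining step
    return surrogate
```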
This approach may have the following challenges: 1) for a DNN executable without prior knowledge of its functionality, it is unclear how to prepare inputs x aligned with its normal inputs; 2) even if the functionality is known, it may still be challenging to prepare a non-trivial collection of x for models trained on private data (e.g., medical images); 3) local retraining may require rich hardware and is costly; and 4) existing query-based model extraction generally requires prior knowledge of model architectures and dimensions [79]. In contrast, BTD only requires a valid input. For instance, a meaningless image is sufficient to decompile executables of CV models.
Also, according to the notation in Definition, local retraining assumes ① + ② as prior knowledge, whereas BTD fully recovers ① + ② + ③ from DNN executables.
Model Extraction via Side Channels.
Architectural-level hints (e.g., side channels) leaked during model inference can be used for model extraction [30,45,46,104,105,116]. These works primarily recover the high-level model architecture, which corresponds to ① or ① + ② in the notation of Definition. In contrast, BTD statically recovers ① and then dynamically recovers ② + ③ from DNN executables (but coverage is not an issue; see Sec. 4.2 for clarification). Sec. 9 further compares BTD with prior model extraction works.