GRANITE: A Graph Neural Network Model for
Basic Block Throughput Estimation
Ondřej Sýkora    Phitchaya Mangpo Phothilimthana    Charith Mendis    Amir Yazdanbakhsh
Google Research    Google Research, Brain Team    University of Illinois at Urbana-Champaign
ondrasej@google.com, mangpo@google.com, charithm@illinois.edu, ayazdan@google.com
Abstract
Analytical hardware performance models yield swift es-
timation of desired hardware performance metrics. However,
developing these analytical models for modern processors
with sophisticated microarchitectures is an extremely la-
borious task and requires a firm understanding of target
microarchitecture’s internal structure. In this paper, we
introduce GRANITE¹, a new machine learning model that
estimates the throughput of basic blocks across different
microarchitectures. GRANITE uses a graph representation
of basic blocks that captures both structural and data
dependencies between instructions. This representation is
processed using a graph neural network that takes advantage
of the relational information captured in the graph and
learns a rich neural representation of the basic block that
allows more precise throughput estimation. Our results
establish a new state-of-the-art for basic block performance
estimation with an average test error of 6.9% across a wide
range of basic blocks and microarchitectures for the x86-64
target. Compared to recent work, this reduces the error by
1.7% while improving training and inference throughput by
approximately 3.0×. In addition, we propose the use of multi-
task learning with independent multi-layer feed forward de-
coder networks. Our results show that this technique further
improves the precision of all learned models while significantly
reducing per-microarchitecture training costs. We perform
an extensive set of ablation studies and comparisons with
prior work, and distill a set of methods for achieving high
accuracy in basic block performance estimation.
1. Introduction
A basic block is a sequence of instructions with neither
incoming nor outgoing branches. Basic blocks are natural
input objects to many code optimization algorithms because
the instructions of a basic block can be modified, as long as
the invariants at the beginning and at the end of the basic
block are preserved. See Table 1 for an example basic block.
Accurate and fast performance estimation of basic blocks
is often crucial at the various stages of compilation and
software optimization [17] because real hardware measure-
ments are expensive to collect and tedious to obtain. For
example, various performance estimation methods are used
for inlining [8], register allocation [1], fusing [9], hardware-
software co-design [10–13], and critical path analysis [14]. To
1. GRANITE: A GRAph Neural network model for basIc block Throughput Estimation.
provide a fast performance estimation, hand-tuned analytical
models [15–18], tailored for one or a few microarchitectures,
have been developed. However, these analytical models often
lack generality across different processors and require domain
expertise and thorough knowledge of the internal organization
of microarchitectural components, which is generally obscured
by hardware companies. Even
with sufficient domain knowledge, developing a complete
and thorough analytical model for modern processors is
an error-prone and work-intensive task. In addition, due
to the increasing complexity of modern microarchitectures,
these analytical models may overlook some corner cases in
performance estimation and underperform in generalizing
the estimation to these cases. As such, using the analytical
models can mislead the optimization algorithm and yield
sub-optimal solutions.
Learned models for throughput estimation.
To address
the aforementioned challenges, a handful of works have
delegated the task of performance estimation to machine
learning [8, 11, 12]. For basic block throughput estimation specifically,
Ithemal [19] uses a machine learning model based on a
sequential Long Short-Term Memory (LSTM) network to learn a
representation of basic blocks followed by a linear transfor-
mation to predict the throughput values. While Ithemal [19]
delivered a notable accuracy improvement across multiple
x86-64 microarchitectures compared to analytical models
of the time, it represents a basic block as a sequence of
instructions without any additional information about its
structure. We argue that adding information such as data
dependencies could contribute to the inductive bias of a model
and enable it to reason about code with higher
accuracy.
Graph-based representation learning of basic blocks.
Data and control flow in basic blocks can be naturally
expressed using a graph [20]. This paper sets out to use
graph neural networks on this representation of code to learn
an expressive representation of basic blocks. The proposed
representation learning method, dubbed GRANITE, does not
commit to any feature engineering of the input basic blocks.
Compared to prior work, we believe that leveraging a graph
representation is a more natural and intuitive approach to
represent basic blocks, better capturing the dependencies and
interactions between instructions.

TABLE 1: An example basic block in x86-64 assembly from the BHive
dataset [22].

0: CMP R15D, 1
1: SBB EAX, EAX
2: AND EAX, 0x8
3: TEST ECX, ECX
4: MOV DWORD PTR [RBP - 3], EAX
5: MOV EAX, 1
6: CMOVG EAX, ECX
7: CMP EDX, EAX

© 2022 IEEE. Personal use of this material is permitted. Permission
from IEEE must be obtained for all other uses, in any current or future
media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component
of this work in other works.

arXiv:2210.03894v2 [cs.LG] 11 Oct 2022
GRANITE outperforms Ithemal [19] in terms of accuracy
and establishes new state-of-the-art results on x86-64 basic
block throughput estimation, achieving a nearly 1.7% lower
MAPE across multiple x86-64 microarchitectures compared to
the most recent prior work [19]. We argue that using a graph
representation of basic blocks is a key contributing factor in
achieving higher prediction accuracy, which is a direct
consequence of better generalization to unseen basic blocks.
Multi-task throughput estimation model.
While different
microarchitectures differ in performance, they often also share
many similarities, both because of how their designs evolved
and because instruction set semantics are
microarchitecture-independent. Multi-task learning [21] is
a technique that uses a collection of related tasks to train the
same model. By exploiting the relatedness of the tasks, the
model learns a better internal representation of the problem
domain, which often leads to improved performance on the
individual tasks. To the best of our knowledge, existing work [19]
did not take full advantage of these similarities and focused
on developing or training a separate model for each target
microarchitecture. We argue that the similarities between
microarchitectures can be exploited to achieve faster training
and learning richer representations of code. To this end, we
propose a multi-headed, task-dedicated representation learning
scheme where the graph network is shared by all microarchitectures
and each head is trained for a different microarchitecture.
We evaluated a multi-task model against models that
were trained only for a single microarchitecture. Our results
demonstrate that it is feasible to learn a shared representation
of basic blocks that supports performance predictions for
all target microarchitectures. The computational costs of
training a model supporting multiple microarchitectures are
only marginally higher than the cost of training a single
single-task model. In addition, we found that employing
multi-task learning further reduces the prediction errors on
all microarchitectures compared to training exclusively on
data from a single microarchitecture.
2. Motivation and Background
2.1. Manual Tuning of Simulator Parameters
Recent work [18] proposes an analytical model to predict
the throughput of basic blocks for Intel microarchitectures
with sufficient accuracy (error < 1%). To develop this analytical
model, the authors performed a detailed study of the un-
derlying Intel microarchitectures and manually tuned the
microarchitecture-specific parameters of their simulator to
match the ground-truth values. The suggested analytical
model establishes stronger baselines for learned throughput
estimation across a limited set of microarchitectures and
provides interpretable insights about the underlying bottle-
necks of the target microarchitecture. However, hand-tuned
analytical models generally suffer from: (1) a lack of
generality across a wide range of unseen microarchitectures,
(2) the tedious task of maintaining such an analytical model
after each generation of microarchitectures, and finally
(3) the demand for expert knowledge about the details of the
underlying microarchitecture. On the other hand, learned
models (such as our work and Ithemal [19]) marginally
trade off prediction accuracy and interpretability of the results
for generality across a wide range of microarchitectures and
for eliminating the need for expert knowledge in the development
process.
process. In summary, analytical and learned models have
different objectives and could be beneficial in downstream
tasks with different objectives.
2.2. Learned Model for Throughput Estimation
Ithemal [19], the most recent learned model for basic
block throughput estimation, formulates throughput estimation
as a regression problem with the objective of minimizing
the mean absolute percentage error between
ground-truth data (obtained from hardware measurements)
and the output of the learned model. It employs a two-level
LSTM [23] network that generates an embedding vector
for each input basic block. The objective of the first level
LSTM network is to generate an embedding vector for each
instruction of the input basic block. The second level uses
the instruction embedding vectors to compute an embedding
vector for the whole basic block.
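The mean absolute percentage error objective used to train this regression model can be sketched in a few lines (a minimal illustration of the metric itself, not Ithemal's training code):

```python
def mape(predicted, actual):
    """Mean absolute percentage error between predicted and measured
    throughputs (two equal-length sequences of positive floats)."""
    assert len(predicted) == len(actual) and len(actual) > 0
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)

# Predictions within ~10% of the measured values; MAPE is roughly 0.067.
error = mape([1.1, 2.0, 2.7], [1.0, 2.0, 3.0])
```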
As input, Ithemal receives a sequence of instructions
for each basic block (e.g., “SBB EAX, EAX”, as illustrated in
Table 1). When presenting instructions to the model, Ithemal
tokenizes each assembly instruction into (1) instruction
mnemonic, (2) input operands, and (3) output operands. For
example, “SBB EAX, EBX” is tokenized as “SBB | <S> |
EAX | EBX | <D> | EAX | <E>” tokens, where “<S>”, “<D>”,
and “<E>” are special tokens that separate the three groups
of tokens. Each token is mapped to a learned embedding
vector (each embedding vector is a real-valued vector of a
fixed size), and these vectors are fed to the first-level LSTM
network. Finally, the generated instruction embeddings pass
through the second LSTM layer to obtain an embedding
vector per basic block. The generated basic block embedding
is then passed to a decoder network to obtain an estimation of
the basic block throughput. In the Ithemal model, the decoder
is a dot product of the basic block embedding vector with a
vector of learned weights.
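The tokenization scheme above can be sketched as follows. The separator tokens are those from the paper; the helper name and the assumption that each instruction arrives as a mnemonic with explicit source and destination operand lists are ours:

```python
def tokenize(mnemonic, sources, destinations):
    """Tokenize one instruction Ithemal-style: mnemonic, then source
    operands after <S>, destination operands after <D>, ending with <E>."""
    return [mnemonic, "<S>", *sources, "<D>", *destinations, "<E>"]

# "SBB EAX, EBX" reads EAX and EBX and writes EAX:
tokens = tokenize("SBB", ["EAX", "EBX"], ["EAX"])
# tokens == ["SBB", "<S>", "EAX", "EBX", "<D>", "EAX", "<E>"]
```

Each of these tokens is then looked up in the learned embedding table before being fed to the first-level LSTM.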
While Ithemal demonstrates a promising path forward
for performance estimation of basic blocks, its input data
format presents the instructions to the model linearly, as
they are laid out in memory, and it relies on the model and
the training process to discover dependencies between the
instructions on its own. Since these dependencies are well
defined and easy to extract using existing tools, we suggest
including them in the basic block representation explicitly, to
guide the computation of the model. This work uses graphs
as a natural and intuitive way to represent basic blocks and
the underlying dependencies, expecting that a graph neural
network model will be able to benefit from the additional
information and produce more precise throughput estimates.
2.3. Graph Neural Network
The family of graph neural networks (GNNs) [24–28]
has been shown to be effective in a diverse range of applications
and domains [29–32]. Generally, GNNs yield promising
results in applications with highly structured inputs where
the relationships between elements of the input can be easily
expressed using a graph. The main objective of a GNN
is to learn to map the information structured as a graph
into an embedding space (a vector representation). In a
nutshell, the learning process of a GNN model consists of
propagating information between graph nodes and edges
via multiple message passing iterations, followed by an
aggregation step. At each message passing iteration, the
node and edge embeddings are updated according to received
messages from their neighbors in the graph. The final learned
embeddings are then employed in downstream tasks such as
regression, classification, and ranking.
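One message-passing round can be illustrated with a toy example on plain Python lists. This is not the GRANITE network; the mean-aggregation update rule and the `adjacency` structure are illustrative assumptions standing in for the learned update functions:

```python
def message_passing_step(embeddings, adjacency):
    """One round of message passing: each node's new embedding is the
    average of its own embedding and its neighbors' embeddings."""
    new_embeddings = {}
    for node, vec in embeddings.items():
        msgs = [embeddings[n] for n in adjacency.get(node, [])]
        stacked = [vec] + msgs
        new_embeddings[node] = [
            sum(v[i] for v in stacked) / len(stacked) for i in range(len(vec))
        ]
    return new_embeddings

# Two connected nodes move toward a shared representation.
emb = {"a": [0.0, 0.0], "b": [2.0, 2.0]}
adj = {"a": ["b"], "b": ["a"]}
emb = message_passing_step(emb, adj)  # {"a": [1.0, 1.0], "b": [1.0, 1.0]}
```

Repeating such steps lets information flow across multi-hop paths in the graph before the final embeddings are aggregated for the downstream task.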
3. GRANITE Model Architecture
The GRANITE model is composed of the following
building blocks:
Graph encoding of basic blocks. The first step in
GRANITE is to transform basic blocks into a graph
representation and convert the instructions and basic block
dependencies into node and edge labels according to their
types. The constructed graph representation of basic blocks
is used as input to the graph neural network.

Graph neural network. Next, GRANITE uses a GNN
model with the objective to learn an expressive embedding
for each basic block. As part of the training process, the
GNN model iteratively exchanges relevant information
between basic block elements with the objective of
computing the embedding vectors.

Decoder network. Each instruction embedding vector
passes through an additional decoder network with
non-linearity that computes the contribution of the
instruction to the basic block throughput. GRANITE
predicts the final throughput value for each basic block by
adding all individual instructions’ contributions to the
overall throughput.

Multi-task decoder network. The multi-task version of
GRANITE uses a multi-task decoder that predicts the
throughput values across multiple microarchitectures
simultaneously. Other parts of the model are shared across
all target microarchitectures. Intuitively, the task of the
shared parts is to learn an internal representation of basic
block structure, while the decoder networks are responsible
for throughput estimation.
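The multi-task decoding scheme, where per-instruction contributions are summed into a block throughput for each microarchitecture, can be sketched as follows. The fixed linear heads below are simplified stand-ins for the learned feed-forward decoder networks, and the microarchitecture names are only examples:

```python
def predict_throughputs(instruction_embeddings, decoders):
    """Sum per-instruction contributions for each microarchitecture.

    decoders: dict mapping a microarchitecture name to a per-instruction
    decoder function (a stand-in for the learned decoder heads)."""
    return {
        uarch: sum(decode(e) for e in instruction_embeddings)
        for uarch, decode in decoders.items()
    }

# Stand-in linear heads instead of learned multi-layer networks.
heads = {
    "uarch_a": lambda e: 0.5 * e[0] + 0.25 * e[1],
    "uarch_b": lambda e: 0.4 * e[0] + 0.30 * e[1],
}
preds = predict_throughputs([[2.0, 4.0], [1.0, 0.0]], heads)
```

Note that only the heads differ per target; the instruction embeddings they consume come from the shared graph network.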
3.1. Graph Encoding of Basic Blocks
We model each basic block as a dependency graph
inspired by [20], but using a more compact format. The
GRANITE graph is designed to capture the semantic relation-
ships between instructions as well as the type and category of
instructions and registers. The nodes of the graph consist of
a set of instruction and value nodes (e.g. values in registers,
immediate values, etc.), whereas the edges indicate data and
structural dependencies between the instructions and values
represented by the nodes. Figure 1 shows an example basic
block in the GRANITE encoding.
Each node of the graph corresponds to an element of
the assembly language, similar to one token in the Ithemal
model [19]. Broadly, we can categorize node types into two
groups: instruction nodes that represent instructions, and
value nodes that model the input and output values passed
between instructions. Table 2 summarizes the node types
in GRANITE graph representation and assembly language
tokens that can be associated with them. We represent each
assembly instruction by a unique instruction mnemonic node.
Infrequently, an assembly instruction may have prefixes
that modify its behavior, such as “LOCK” or “REP”.
We represent each prefix by a separate graph node that
is connected to the instruction mnemonic node by an edge.
Each instruction node is connected to zero or more value
nodes representing the instruction operands. The operands
are values stored in registers or memory, immediate values,
and results of address computation. Each value node has zero
or one incoming edge from the instruction mnemonic node of
the instruction that produces it (no incoming edge means that
the value is not produced by an instruction of the block), and
zero or more outgoing edges to instructions that consume the
value. These edges represent the data dependencies between
instructions. The token associated with a value node is the
name of a register if the value is stored in a register, or
a special token if the value is stored in memory, it is an
immediate value, or it is the result of an address computation.
Note that the nodes represent a value in a storage location,
not the storage location itself, and the graph may contain
multiple value nodes with a given register name if multiple
instructions in the block write to this register. For example, in
Figure 1, register “RAX” is a destination operand of the “MOV”
instruction and is used as a source operand to calculate the
memory address for the “ADD” instruction. In the same example,
there are two different “Memory” nodes; one is used as
an input operand, the other as an output operand. Since the
value written by the “ADD” instruction may be different from
the value it reads, they are represented as two distinct nodes.
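The register data-dependency edges described above can be extracted with a single pass over the instruction list. This is a simplified sketch under our own assumptions: each instruction is given with explicit read and write register sets, and memory operands, immediates, flags, and prefixes (all of which GRANITE also models as nodes) are ignored:

```python
def build_dependency_edges(instructions):
    """instructions: list of (mnemonic, reads, writes) tuples.
    Returns edges (producer_index, consumer_index) for register values
    produced by one instruction and consumed by a later one."""
    edges = []
    last_writer = {}  # register name -> index of the last producing instruction
    for i, (_, reads, writes) in enumerate(instructions):
        for reg in sorted(reads):
            if reg in last_writer:
                edges.append((last_writer[reg], i))
        for reg in writes:
            last_writer[reg] = i  # conceptually, a fresh value node for reg
    return edges

# First three instructions of Table 1 (flags ignored in this sketch):
block = [
    ("CMP", {"R15D"}, set()),
    ("SBB", {"EAX"}, {"EAX"}),
    ("AND", {"EAX"}, {"EAX"}),
]
edges = build_dependency_edges(block)  # [(1, 2)]: AND consumes the EAX value SBB produced
```

The `last_writer` map is what makes multiple value nodes per register fall out naturally: each write starts a new value, so later reads connect to the most recent producer only.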