GRANITE: A Graph Neural Network Model for
Basic Block Throughput Estimation
Ondřej Sýkora    Phitchaya Mangpo Phothilimthana    Charith Mendis    Amir Yazdanbakhsh
Google Research    Google Research, Brain Team    University of Illinois at Urbana-Champaign
ondrasej@google.com, mangpo@google.com, charithm@illinois.edu, ayazdan@google.com
Abstract
Analytical hardware performance models yield swift es-
timation of desired hardware performance metrics. However,
developing these analytical models for modern processors
with sophisticated microarchitectures is an extremely la-
borious task and requires a firm understanding of target
microarchitecture’s internal structure. In this paper, we
introduce GRANITE¹, a new machine learning model that
estimates the throughput of basic blocks across different
microarchitectures. GRANITE uses a graph representation
of basic blocks that captures both structural and data
dependencies between instructions. This representation is
processed using a graph neural network that takes advantage
of the relational information captured in the graph and
learns a rich neural representation of the basic block that
allows more precise throughput estimation. Our results
establish a new state-of-the-art for basic block performance
estimation with an average test error of 6.9% across a wide
range of basic blocks and microarchitectures for the x86-64
target. Compared to recent work, this reduces the error by
1.7% while improving training and inference throughput by
approximately 3.0×. In addition, we propose the use of multi-
task learning with independent multi-layer feed forward de-
coder networks. Our results show that this technique further
improves the precision of all learned models while significantly
reducing per-microarchitecture training costs. We perform
an extensive set of ablation studies and comparisons with
prior work, and distill a set of methods for achieving high
accuracy in basic block performance estimation.
1. Introduction
A basic block is a sequence of instructions with neither
incoming nor outgoing branches. Basic blocks are natural
input objects to many code optimization algorithms because
the instructions of a basic block can be modified, as long as
the invariants at the beginning and at the end of the basic
block are preserved. See Table 1 for an example basic block.
Accurate and fast performance estimation of basic blocks
is often crucial at the various stages of compilation and
software optimization [17] because real hardware measure-
ments are expensive to collect and tedious to obtain. For
example, various performance estimation methods are used
for inlining [8], register allocation [1], fusing [9], hardware-
software co-design [10–13], and critical path analysis [14]. To
1. GRANITE: A GRAph Neural network model for basIc block Throughput Estimation.
provide a fast performance estimation, hand-tuned analytical
models [15–18], tailored for one or a few microarchitectures,
have been developed. However, these analytical models often
lack generality across different processors and require domain
expertise and thorough knowledge of the internal organization
of microarchitectural components, which is generally obscured
by hardware companies. Even
with sufficient domain knowledge, developing a complete
and thorough analytical model for modern processors is
an error-prone and work-intensive task. In addition, due
to the increasing complexity of modern microarchitectures,
these analytical models may overlook some corner cases in
performance estimation and underperform in generalizing
the estimation to these cases. As such, using the analytical
models can mislead the optimization algorithm and yield
sub-optimal solutions.
Learned models for throughput estimation.
To address
the aforementioned challenges, a handful of works have
delegated the task of performance estimation to machine
learning [8, 11, 12]. For basic block throughput estimation specifically,
Ithemal [19] uses a machine learning model based on a
sequential Long Short-Term Memory (LSTM) network to learn a
representation of basic blocks followed by a linear transfor-
mation to predict the throughput values. While Ithemal [19]
delivered a notable accuracy improvement across multiple
x86-64 microarchitectures compared to analytical models
of the time, it represents a basic block as a sequence of
instructions without any additional information about its
structure. We argue that adding information such as data
dependencies could contribute to the inductive bias of a model
and enable it to reason about code with higher
accuracy.
Graph-based representation learning of basic blocks.
Data and control flow in basic blocks can be naturally
expressed using a graph [20]. This paper sets out to use
graph neural networks on this representation of code to learn
an expressive representation of basic blocks. The proposed
representation learning method, dubbed GRANITE, does not
commit to any feature engineering of the input basic blocks.
Compared to prior work, we believe that leveraging a graph
representation is a more natural and intuitive approach to
represent basic blocks, better capturing the dependencies and
interactions between instructions.

TABLE 1: An example basic block in x86-64 assembly from the BHive
dataset [22].

0: CMP R15D, 1
1: SBB EAX, EAX
2: AND EAX, 0x8
3: TEST ECX, ECX
4: MOV DWORD PTR [RBP - 3], EAX
5: MOV EAX, 1
6: CMOVG EAX, ECX
7: CMP EDX, EAX

© 2022 IEEE. Personal use of this material is permitted. Permission
from IEEE must be obtained for all other uses, in any current or future
media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component
of this work in other works.

arXiv:2210.03894v2 [cs.LG] 11 Oct 2022
GRANITE outperforms Ithemal [19] in terms of accuracy
and establishes new state-of-the-art results on x86-64 basic
block throughput estimation, achieving a nearly 1.7% lower
MAPE across multiple x86-64 microarchitectures compared to
the most recent prior work [19]. We argue that using a graph
representation of basic blocks is a key contributing factor in
achieving higher prediction accuracy, which is a direct
consequence of better generalization to unseen basic blocks.
Multi-task throughput estimation model.
While different
microarchitectures differ in performance, they often also share
many similarities, both because of how their designs evolved
and because instruction set semantics are
microarchitecture-independent. Multi-task learning [21] is
a technique that uses a collection of related tasks to train the
same model. By exploiting the relatedness of the tasks, the
model learns a better internal representation of the problem
domain, which often leads to improved performance on the
individual tasks. To the best of our knowledge, existing work [19]
did not take full advantage of these similarities and focused
on developing or training a separate model for each target
microarchitecture. We argue that the similarities between
microarchitectures can be exploited to achieve faster training
and learning richer representations of code. To this end, we
propose a multi-headed, task-dedicated representation learning
scheme where the graph network is shared by all microarchitectures
and each head is trained for a different microarchitecture.
We evaluated a multi-task model against models that
were trained only for a single microarchitecture. Our results
demonstrate that it is feasible to learn a shared representation
of basic blocks that supports performance predictions for
all target microarchitectures. The computational costs of
training a model supporting multiple microarchitectures are
only marginally higher than the cost of training a single
single-task model. In addition, we found that employing
multi-task learning further reduces the prediction errors on
all microarchitectures compared to training exclusively on
data from a single microarchitecture.
2. Motivation and Background
2.1. Manual Tuning of Simulator Parameters
Recent work [18] proposes an analytical model to predict
the throughput of basic blocks for Intel microarchitectures
with sufficient accuracy (error < 1%). To develop this analytical
model, the authors performed a detailed study of the un-
derlying Intel microarchitectures and manually tuned the
microarchitecture-specific parameters of their simulator to
match the ground-truth values. The suggested analytical
model establishes stronger baselines for learned throughput
estimation across a limited set of microarchitectures and
provides interpretable insights about the underlying bottle-
necks of the target microarchitecture. However, hand-tuned
analytical models generally suffer from: (1) a lack of
generality across a wide range of unseen microarchitectures,
(2) the tedious task of maintaining such an analytical model
after each generation of microarchitectures, and finally
(3) the demand for expert knowledge about the details of the
underlying microarchitecture. On the other hand, learned
models (such as our work and Ithemal [19]) marginally
trade off prediction accuracy and interpretability of the results
for generality across a wide range of microarchitectures and
for eliminating the need for expert knowledge in the development
process.
process. In summary, analytical and learned models have
different objectives and could be beneficial in downstream
tasks with different objectives.
2.2. Learned Model for Throughput Estimation
Ithemal [19], the most recent learned model for basic
block throughput estimation, formulates throughput estimation
as a regression problem with the objective of minimizing
the mean absolute percentage error between
ground-truth data (obtained from hardware measurements)
and the output of the learned model. It employs a two-level
LSTM [23] network that generates an embedding vector
for each input basic block. The objective of the first level
LSTM network is to generate an embedding vector for each
instruction of the input basic block. The second level uses
the instruction embedding vectors to compute an embedding
vector for the whole basic block.
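The mean absolute percentage error objective used to train this regression model can be sketched in a few lines (a minimal illustration of the metric itself, not Ithemal's training code):

```python
def mape(predicted, actual):
    """Mean absolute percentage error between predicted and measured
    throughputs (two equal-length sequences of positive floats)."""
    assert len(predicted) == len(actual) and len(actual) > 0
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)

# Predictions within ~10% of the measured values; MAPE is roughly 0.067.
error = mape([1.1, 2.0, 2.7], [1.0, 2.0, 3.0])
```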
As input, Ithemal receives a sequence of instructions
for each basic block (e.g., “SBB EAX, EAX”, as illustrated in
Table 1). When presenting instructions to the model, Ithemal
tokenizes each assembly instruction into (1) instruction
mnemonic, (2) input operands, and (3) output operands. For
example, “SBB EAX, EBX” is tokenized as “SBB | <S> |
EAX | EBX | <D> | EAX | <E>” tokens, where “<S>”, “<D>”,
and “<E>” are special tokens that separate the three groups
of tokens. Each token is mapped to a learned embedding
vector (each embedding vector is a real-valued vector of a
fixed size), and these vectors are fed to the first-level LSTM
network. Finally, the generated instruction embeddings pass
through the second LSTM layer to obtain an embedding
vector per basic block. The generated basic block embedding
is then passed to a decoder network to obtain an estimation of
the basic block throughput. In the Ithemal model, the decoder
is a dot product of the basic block embedding vector with a
vector of learned weights.
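The tokenization scheme above can be sketched as follows. The separator tokens are those from the paper; the helper name and the assumption that each instruction arrives as a mnemonic with explicit source and destination operand lists are ours:

```python
def tokenize(mnemonic, sources, destinations):
    """Tokenize one instruction Ithemal-style: mnemonic, then source
    operands after <S>, destination operands after <D>, ending with <E>."""
    return [mnemonic, "<S>", *sources, "<D>", *destinations, "<E>"]

# "SBB EAX, EBX" reads EAX and EBX and writes EAX:
tokens = tokenize("SBB", ["EAX", "EBX"], ["EAX"])
# tokens == ["SBB", "<S>", "EAX", "EBX", "<D>", "EAX", "<E>"]
```

Each of these tokens is then looked up in the learned embedding table before being fed to the first-level LSTM.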
While Ithemal demonstrates a promising path forward
for performance estimation of basic blocks, its input data
format presents the instructions to the model linearly, as
they are laid out in memory, and it relies on the model and
the training process to discover dependencies between the
instructions on its own. Since these dependencies are well
defined and easy to extract using existing tools, we suggest
including them in the basic block representation explicitly, to
guide the computation of the model. This work uses graphs
as a natural and intuitive way to represent basic blocks and
the underlying dependencies, expecting that a graph neural
network model will be able to benefit from the additional
information and produce more precise throughput estimates.
2.3. Graph Neural Network
The family of graph neural networks (GNNs) [24–28]
has been shown to be effective in a diverse range of applications
and domains [29–32]. Generally, GNNs yield promising
results in applications with highly structured inputs where
the relationships between elements of the input can be easily
expressed using a graph. The main objective of a GNN
is to learn to map the information structured as a graph
into an embedding space (a vector representation). In a
nutshell, the learning process of a GNN model consists of
propagating information between graph nodes and edges
via multiple message passing iterations, followed by an
aggregation step. At each message passing iteration, the
node and edge embeddings are updated according to received
messages from their neighbors in the graph. The final learned
embeddings are then employed in downstream tasks such as
regression, classification, and ranking.
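One message-passing round can be illustrated with a toy example on plain Python lists. This is not the GRANITE network; the mean-aggregation update rule and the `adjacency` structure are illustrative assumptions standing in for the learned update functions:

```python
def message_passing_step(embeddings, adjacency):
    """One round of message passing: each node's new embedding is the
    average of its own embedding and its neighbors' embeddings."""
    new_embeddings = {}
    for node, vec in embeddings.items():
        msgs = [embeddings[n] for n in adjacency.get(node, [])]
        stacked = [vec] + msgs
        new_embeddings[node] = [
            sum(v[i] for v in stacked) / len(stacked) for i in range(len(vec))
        ]
    return new_embeddings

# Two connected nodes move toward a shared representation.
emb = {"a": [0.0, 0.0], "b": [2.0, 2.0]}
adj = {"a": ["b"], "b": ["a"]}
emb = message_passing_step(emb, adj)  # {"a": [1.0, 1.0], "b": [1.0, 1.0]}
```

Repeating such steps lets information flow across multi-hop paths in the graph before the final embeddings are aggregated for the downstream task.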
3. GRANITE Model Architecture
The GRANITE model is composed of the following
building blocks:
Graph encoding of basic blocks. The first step in
GRANITE is to transform basic blocks into a graph
representation and convert the instructions and basic block
dependencies into node and edge labels according to their
types. The constructed graph representation of basic blocks
is used as input to the graph neural network.

Graph neural network. Next, GRANITE uses a GNN
model with the objective to learn an expressive embedding
for each basic block. As part of the training process, the
GNN model iteratively exchanges relevant information
between basic block elements with the objective of
computing the embedding vectors.

Decoder network. Each instruction embedding vector
passes through an additional decoder network with
non-linearity that computes the contribution of the
instruction to the basic block throughput. GRANITE
predicts the final throughput value for each basic block by
adding all individual instructions’ contributions to the
overall throughput.

Multi-task decoder network. The multi-task version of
GRANITE uses a multi-task decoder that predicts the
throughput values across multiple microarchitectures
simultaneously. Other parts of the model are shared across
all target microarchitectures. Intuitively, the task of the
shared parts is to learn an internal representation of basic
block structure, while the decoder networks are responsible
for throughput estimation.
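The multi-task decoding scheme, where per-instruction contributions are summed into a block throughput for each microarchitecture, can be sketched as follows. The fixed linear heads below are simplified stand-ins for the learned feed-forward decoder networks, and the microarchitecture names are only examples:

```python
def predict_throughputs(instruction_embeddings, decoders):
    """Sum per-instruction contributions for each microarchitecture.

    decoders: dict mapping a microarchitecture name to a per-instruction
    decoder function (a stand-in for the learned decoder heads)."""
    return {
        uarch: sum(decode(e) for e in instruction_embeddings)
        for uarch, decode in decoders.items()
    }

# Stand-in linear heads instead of learned multi-layer networks.
heads = {
    "uarch_a": lambda e: 0.5 * e[0] + 0.25 * e[1],
    "uarch_b": lambda e: 0.4 * e[0] + 0.30 * e[1],
}
preds = predict_throughputs([[2.0, 4.0], [1.0, 0.0]], heads)
```

Note that only the heads differ per target; the instruction embeddings they consume come from the shared graph network.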
3.1. Graph Encoding of Basic Blocks
We model each basic block as a dependency graph
inspired by [20], but using a more compact format. The
GRANITE graph is designed to capture the semantic relation-
ships between instructions as well as the type and category of
instructions and registers. The nodes of the graph consist of
a set of instruction and value nodes (e.g. values in registers,
immediate values, etc.), whereas the edges indicate data and
structural dependencies between the instructions and values
represented by the nodes. Figure 1 shows an example basic
block in the GRANITE encoding.
Each node of the graph corresponds to an element of
the assembly language, similar to one token in the Ithemal
model [19]. Broadly, we can categorize node types into two
groups: instruction nodes that represent instructions, and
value nodes that model the input and output values passed
between instructions. Table 2 summarizes the node types
in GRANITE graph representation and assembly language
tokens that can be associated with them. We represent each
assembly instruction by a unique instruction mnemonic node.
Infrequently, an assembly instruction may have prefixes
that modify its behavior, such as “LOCK” or “REP”.
We represent each prefix by a separate graph node that
is connected to the instruction mnemonic node by an edge.
Each instruction node is connected to zero or more value
nodes representing the instruction operands. The operands
are values stored in registers or memory, immediate values,
and results of address computation. Each value node has zero
or one incoming edge from the instruction mnemonic node of
the instruction that produces it (no incoming edge means that
the value is not produced by an instruction of the block), and
zero or more outgoing edges to instructions that consume the
value. These edges represent the data dependencies between
instructions. The token associated with a value node is the
name of a register if the value is stored in a register, or
a special token if the value is stored in memory, it is an
immediate value, or it is the result of an address computation.
Note that the nodes represent a value in a storage location,
not the storage location itself, and the graph may contain
multiple value nodes with a given register name if multiple
instructions in the block write to this register. For example, in
Figure 1, register “RAX” is a destination operand of the “MOV”
instruction and is used as a source operand to calculate the
memory address for the “ADD” instruction. In the same example,
there are two different “Memory” nodes; one is used as
an input operand, the other as an output operand. Since the
value written by the “ADD” instruction may be different from
the value it reads, they are represented as two distinct nodes.
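The register data-dependency edges described above can be extracted with a single pass over the instruction list. This is a simplified sketch under our own assumptions: each instruction is given with explicit read and write register sets, and memory operands, immediates, flags, and prefixes (all of which GRANITE also models as nodes) are ignored:

```python
def build_dependency_edges(instructions):
    """instructions: list of (mnemonic, reads, writes) tuples.
    Returns edges (producer_index, consumer_index) for register values
    produced by one instruction and consumed by a later one."""
    edges = []
    last_writer = {}  # register name -> index of the last producing instruction
    for i, (_, reads, writes) in enumerate(instructions):
        for reg in sorted(reads):
            if reg in last_writer:
                edges.append((last_writer[reg], i))
        for reg in writes:
            last_writer[reg] = i  # conceptually, a fresh value node for reg
    return edges

# First three instructions of Table 1 (flags ignored in this sketch):
block = [
    ("CMP", {"R15D"}, set()),
    ("SBB", {"EAX"}, {"EAX"}),
    ("AND", {"EAX"}, {"EAX"}),
]
edges = build_dependency_edges(block)  # [(1, 2)]: AND consumes the EAX value SBB produced
```

The `last_writer` map is what makes multiple value nodes per register fall out naturally: each write starts a new value, so later reads connect to the most recent producer only.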