Towards Performance Portable Programming for
Distributed Heterogeneous Systems
Polykarpos Thomadakis
Department of Computer Science
Old Dominion University
Norfolk, Virginia
pthom001@odu.edu
Nikos Chrisochoides
Department of Computer Science
Old Dominion University
Norfolk, Virginia
nikos@cs.odu.edu
Abstract—Hardware heterogeneity is here to stay for high-
performance computing. Large-scale systems are currently
equipped with multiple GPU accelerators per compute node
and are expected to incorporate more specialized hardware in
the future. This shift in the computing ecosystem offers many
opportunities for performance improvement; however, it also
increases the complexity of programming for such architectures.
This work introduces a runtime framework that enables effortless
programming for heterogeneous systems while efficiently utilizing
hardware resources. The framework is integrated within a
distributed and scalable runtime system to facilitate performance
portability across heterogeneous nodes. Along with the design,
this paper describes the implementation and optimizations per-
formed, achieving up to 300% improvement in a shared memory
benchmark and up to 10 times in distributed device commu-
nication. Preliminary results indicate that our software incurs
low overhead and achieves 40% improvement in a distributed
Jacobi proxy application while hiding the idiosyncrasies of the
hardware.
Index Terms—Asynchronous task-based runtime, heteroge-
neous systems, GPU computing, performance portability, runtime
framework
I. INTRODUCTION
The recent slowdown in Moore’s Law is leading to large-
scale disruptions in the computing ecosystem. Users and
vendors are transitioning from utilizing computing nodes of
relatively homogeneous CPU architectures to systems led by
multiple GPU devices per node. This trend is expected to con-
tinue in the foreseeable future, incorporating many more types
of heterogeneous devices, including FPGAs, System-on-Chips
(SoCs), and specialized hardware for artificial intelligence [1].
The new computing ecosystem sets the basis for significant improvements in performance, energy efficiency, reliability, and security; thus, high-performance computing (HPC) systems are being adapted to perform well on both traditional and modern workloads.
Exploiting extreme heterogeneity requires the development
of new techniques and abstractions that handle the increased
complexity in productivity, portability and performance. The
new techniques should allow users to express their applica-
tions’ workflow in a uniform way, hiding the idiosyncrasies
of the underlying architecture while implicitly handling per-
formance portability by optimizing scheduling, load balancing
and data transfers between the heterogeneous devices of a
node. Utilizing and orchestrating data movement and work
on multiple such nodes further increases the complexity of
developing heterogeneity-aware applications. Thus, the run-
time system should also facilitate seamless use of distributed
heterogeneous nodes, by providing abstractions for data and
workload that are independent of the underlying hardware.
In this paper, we present a novel runtime system that enables
seamless, efficient and performance portable development of
distributed heterogeneous applications. First, we introduce a
heterogeneous tasking framework that aims to optimize the
parallel execution of heterogeneous tasks on a single node.
The tasking framework provides a programming model that
automatically leverages heterogeneous devices. In contrast to
other systems, our framework does not require the applica-
tion to pick a device where a task should run; instead, the
application only picks a device type and the framework is
responsible for scheduling the task on the optimal device. Second,
we extend a homogeneous distributed system, namely the
Parallel Runtime Environment for Multicomputer Applications
(PREMA) [2], to handle heterogeneous devices by integrating
it with the heterogeneous tasking framework. Along with the
design and implementation of the final product, we present
optimizations that contributed to achieving high performance.
The evaluation results with microbenchmarks and a proxy
application show that our system incurs low overhead with
scalable performance.
The major contributions of this paper are as follows.
• A stand-alone tasking framework offering performance portability over multiple heterogeneous devices.
• Integration of the aforementioned tasking framework into a distributed runtime to leverage complex distributed heterogeneous computing systems.
• A series of performance optimizations that achieve significant improvements.
• Performance analysis on a distributed proxy application.
II. BACKGROUND
The Parallel Runtime Environment for Multicomputer Ap-
plications (PREMA) [2] is a runtime framework, designed
to handle the needs of applications targeting extreme-scale
homogeneous computing platforms. It manages the burden of
latency-hiding, shared and distributed memory scheduling/load
balancing, and provides a global address space that drastically
decreases the complexity of developing extreme-scale applica-
tions. PREMA consists of three software layers that provide
different features according to the principle of separation of
concerns.
The first layer, the Data Movement and Control
Substrate (DMCS), offers an asynchronous message-driven
execution model where messages are associated with a
task/function (referred to as handler in this context) that is
invoked implicitly on the receiver upon their arrival, similar to
Active Messages [3]. Communication and handler invocations
are by default asynchronous and one-sided, requiring no explicit involvement of the receiver. The Mobile Object Layer (MOL) extends DMCS with the introduction of mobile objects. A mobile object is a location-independent container
that is provided by the runtime system to capture coarse-
grained application data. Mobile objects can be targeted for
remote handler executions uniformly and independently of
their location. Adapting to a programming model where the
workflow is expressed as interactions (i.e., handler invocations)
between mobile objects leads to a more naturally asynchronous
design for the application and allows PREMA to better handle
latencies while exposing a uniform programming interface that
hides the structure of the underlying platform. Mobile objects
are also used to provide applications with implicit distributed
load balancing through the Implicit Load Balancing (ILB)
layer, but this feature is not utilized in the context of this work.
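To make the message-driven model more concrete, the sketch below shows the general shape of a handler invocation on a mobile object in C++. The names (mobile_object, make_mobile_object, remote_invoke, smooth_handler) are illustrative assumptions, not PREMA's actual API; minimal stand-in definitions are included so the sketch is self-contained.

#include <utility>

// Minimal stand-ins so the sketch compiles; in PREMA the runtime provides the
// mobile-object container and the one-sided, message-driven invocation.
template <typename T> struct mobile_object { T data; };
template <typename T> mobile_object<T> make_mobile_object() { return {}; }
template <typename T, typename Fn, typename... Args>
void remote_invoke(mobile_object<T>& obj, Fn handler, Args&&... args) {
    handler(obj.data, std::forward<Args>(args)...);  // emulated locally here
}

struct JacobiBlock { /* coarse-grained application data */ };

// A handler is invoked implicitly on whichever process currently hosts the
// target mobile object when the message arrives (Active Messages style).
void smooth_handler(JacobiBlock& block, int iteration) {
    // ... update the block for the given iteration ...
    (void)block; (void)iteration;
}

int main() {
    // Wrap coarse-grained application data in a location-independent container.
    auto block = make_mobile_object<JacobiBlock>();

    // Asynchronous, one-sided invocation: no receive is posted by the target;
    // the runtime delivers the message wherever `block` currently resides.
    remote_invoke(block, smooth_handler, /*iteration=*/0);
}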
In order to provide more flexibility for the runtime system
to overlap latencies, an application is encouraged to perform
over-decomposition of its data. In over-decomposition, the data
domain of an application running on a platform of P processing elements is decomposed into N chunks, where N ≫ P.
Decomposing the data domain into many more pieces than
the available processing elements gives the runtime system
more options on scheduling computational tasks and filling the
idle time stemming from data movement and communication
operations.
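As a concrete illustration of the idea, the following C++ sketch (a hypothetical helper, not part of PREMA's interface) splits a one-dimensional domain into N = factor × P chunks so that N ≫ P:

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split a 1-D domain of `total_elems` elements into N = factor * P chunks,
// where P is the number of processing elements and factor > 1 controls the
// degree of over-decomposition. Each returned [begin, end) range would be
// handed to the runtime as one schedulable chunk of work/data.
std::vector<std::pair<std::size_t, std::size_t>>
over_decompose(std::size_t total_elems, std::size_t P, std::size_t factor) {
    const std::size_t N = P * factor;                 // N >> P when factor is large
    const std::size_t chunk = (total_elems + N - 1) / N;
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < total_elems; begin += chunk)
        ranges.emplace_back(begin, std::min(begin + chunk, total_elems));
    return ranges;
}

With, for example, eight chunks per processing element, the scheduler has enough independent work to overlap the communication of one chunk with computation on another.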
III. RELATED WORK
Several systems have been adapted to efficiently utilize
GPUs in their workflow, while new ones have emerged in
an attempt to create new standards for their use. Systems like
Charm++ [4], HPX [5] and X10 [6] have introduced new in-
terfaces to provide support for GPUs. However, these systems
let users explicitly handle concerns such as requesting memory transfers, managing device platforms, task allocations, and work queues in order to optimize performance. In contrast, the proposed
work provides a uniform abstraction for heterogeneous tasks
and data, and implicitly handles scheduling, load balancing
and latency overlapping independently of the target device
backend. StarPU [7], OmpSs [8] and ParSec [9] offer dif-
ferent high-level approaches to efficiently utilize distributed
heterogeneous systems. However, their programming model
is mostly suitable for applications whose workflows follow a
regular pattern and can be inferred mostly statically. PREMA,
on the other hand, adopts a dynamic, message-driven program-
ming model that is more suitable for irregular applications.
SYCL and DPC++/oneAPI [10], as well as the newest
version of OpenMP, are recent attempts to provide performance
portable interfaces in modern C++ that can target hetero-
geneous devices. However, users still need to handle load
balancing, scheduling and work queues for multi-device sys-
tems and need to combine them with another runtime solution
that targets distributed nodes. In fact, such systems could be
implemented with our heterogeneous tasking framework as an interoperability backend.
IV. DESIGN AND IMPLEMENTATION
A. Programming Model
The programming model of the heterogeneous tasking
framework builds upon two simple abstractions: the het-
erogeneous objects (hetero objects) and heterogeneous tasks
(hetero tasks). A hetero object uniformly represents a user-
defined data object residing on one or more computing devices
of a heterogeneous compute node (e.g., CPUs, GPUs, FPGAs).
Applications treat such objects as opaque containers for data
without being aware of their physical location. A hetero task
encapsulates a non-preemptive computing kernel that runs
to completion and implements a medium-grained parallel
computation. Like hetero objects, hetero tasks are defined and
handled by the application in a uniform way, independent of
the device they will execute on.
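The following C++ sketch illustrates how these two abstractions might appear to an application. All names (hetero_object, make_hetero_object, hetero_task, make_task, device_type, submit) are assumptions made for illustration, with minimal stand-ins so the sketch is self-contained; they are not the framework's actual API.

#include <cstddef>

enum class device_type { cpu, gpu, any };

// Minimal stand-ins so the sketch compiles; the real runtime provides these.
template <typename T> struct hetero_object { /* runtime-managed device copies */ };
template <typename T> hetero_object<T> make_hetero_object(std::size_t) { return {}; }
struct hetero_task { void set_device(device_type) {} };
template <typename Fn, typename... Args>
hetero_task make_task(Fn, Args&&...) { return {}; }
void submit(const hetero_task&) {}

int main() {
    const std::size_t n = 1 << 20;

    // A hetero object: an opaque container whose physical copies (host, GPU,
    // ...) are placed and kept coherent by the runtime, not the application.
    hetero_object<double> x = make_hetero_object<double>(n);

    // A hetero task: a non-preemptive, medium-grained kernel. The application
    // names only a device *type*; the runtime picks the concrete device.
    hetero_task scale = make_task(
        [](double* data, std::size_t len) {
            for (std::size_t i = 0; i < len; ++i) data[i] *= 2.0;
        },
        x, n);
    scale.set_device(device_type::gpu);  // "some GPU", not a specific one
    submit(scale);                       // scheduling and transfers are implicit
}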
1) Heterogeneous Objects: Handling copies of the same
data on different heterogeneous devices can lead to application code that is error-prone and difficult to understand and maintain. In general, applications need to handle data transfers between devices, use the correct pointer for the respective device, and keep track of copy coherence. A hetero object
is an abstraction that automatically handles such concerns,
maintaining the different copies of the same data in a single
reference. The underlying system handles hetero objects to
guarantee that the most recent version of the data will be
available at the target device at the time it is needed. For example, accessing an object originally located
on the host from a GPU would automatically trigger the
transfer of the underlying data from the host to the respective
device. In the same manner, accessing the same object from a
different device would trigger a transfer from the GPU to that
device. The runtime system guarantees data coherence among
computing devices, keeping track of up-to-date or stale copies
and handling them appropriately.
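One way to realize this bookkeeping (shown here only as a minimal, generic C++ sketch of the idea; it does not reproduce the framework's implementation) is to keep a validity flag per device copy, refreshing stale copies on reads and invalidating all other copies on writes:

#include <unordered_map>

// Minimal sketch of per-device coherence bookkeeping for one hetero object.
// Device ids, the copy representation, and the transfer call are placeholders
// (e.g., transfer() could wrap an asynchronous host/device or peer copy).
struct coherence_state {
    std::unordered_map<int, bool> valid;  // device id -> copy is up to date

    // Reading on `dev`: if its copy is stale, refresh it from any valid copy.
    void acquire_for_read(int dev) {
        if (!valid[dev]) {
            if (int src = find_valid_copy(); src >= 0)
                transfer(src, dev);       // placeholder for an async data copy
            valid[dev] = true;
        }
    }

    // Writing on `dev`: refresh that copy, then mark all other copies stale.
    void acquire_for_write(int dev) {
        acquire_for_read(dev);
        for (auto& [d, v] : valid) v = (d == dev);
    }

    int find_valid_copy() const {
        for (const auto& [d, v] : valid)
            if (v) return d;
        return -1;                        // no copy exists yet
    }
    void transfer(int /*src*/, int /*dst*/) { /* device-specific copy here */ }
};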
The actual data captured by a hetero object should mainly
be accessed and modified through hetero tasks for optimal
performance. However, the application can also explicitly re-
quest access to the underlying data on the host after specifying
the type of access requested, in order to maintain coherence.
This method will trigger (if needed) an asynchronous transfer
from the device with the most recent version of the data and
immediately return a future. The future can then be used to
query the status of the transfer and provide access to the raw
data. In this state, the data of the hetero object are guaranteed