Towards Performance Portable Programming for
Distributed Heterogeneous Systems
Polykarpos Thomadakis
Department of Computer Science
Old Dominion University
Norfolk, Virginia
pthom001@odu.edu
Nikos Chrisochoides
Department of Computer Science
Old Dominion University
Norfolk, Virginia
nikos@cs.odu.edu
Abstract—Hardware heterogeneity is here to stay for high-
performance computing. Large-scale systems are currently
equipped with multiple GPU accelerators per compute node
and are expected to incorporate more specialized hardware in
the future. This shift in the computing ecosystem offers many
opportunities for performance improvement; however, it also
increases the complexity of programming for such architectures.
This work introduces a runtime framework that enables effortless
programming for heterogeneous systems while efficiently utilizing
hardware resources. The framework is integrated within a
distributed and scalable runtime system to facilitate performance
portability across heterogeneous nodes. Along with the design,
this paper describes the implementation and optimizations per-
formed, achieving up to 300% improvement in a shared memory
benchmark and up to 10 times in distributed device commu-
nication. Preliminary results indicate that our software incurs
low overhead and achieves 40% improvement in a distributed
Jacobi proxy application while hiding the idiosyncrasies of the
hardware.
Index Terms—Asynchronous task-based runtime, heteroge-
neous systems, GPU computing, performance portability, runtime
framework
I. INTRODUCTION
The recent slowdown in Moore’s Law is leading to large-
scale disruptions in the computing ecosystem. Users and
vendors are transitioning from utilizing computing nodes of
relatively homogeneous CPU architectures to systems led by
multiple GPU devices per node. This trend is expected to con-
tinue in the foreseeable future, incorporating many more types
of heterogeneous devices, including FPGAs, System-on-Chips
(SoCs), and specialized hardware for artificial intelligence [1].
The new computing ecosystem sets the basis for significant improvements in performance, energy efficiency, reliability, and security; thus, high-performance computing (HPC) systems are being adapted to perform well on both traditional and modern workloads.
Exploiting extreme heterogeneity requires the development
of new techniques and abstractions that handle the increased
complexity in productivity, portability and performance. The
new techniques should allow users to express their applica-
tions’ workflow in a uniform way, hiding the idiosyncrasies
of the underlying architecture while implicitly handling per-
formance portability by optimizing scheduling, load balancing
and data transfers between the heterogeneous devices of a
node. Utilizing and orchestrating data movement and work
on multiple such nodes further increases the complexity of
developing heterogeneity-aware applications. Thus, the run-
time system should also facilitate seamless use of distributed
heterogeneous nodes, by providing abstractions for data and
workload that are independent of the underlying hardware.
In this paper, we present a novel runtime system that enables
seamless, efficient and performance portable development of
distributed heterogeneous applications. First, we introduce a
heterogeneous tasking framework that aims to optimize the
parallel execution of heterogeneous tasks on a single node.
The tasking framework provides a programming model that
automatically leverages heterogeneous devices. In contrast to
other systems, our framework does not require the applica-
tion to pick a device where a task should run; instead, the
application only picks a device type and the framework is
responsible for scheduling the task on the optimal device. Second,
we extend a homogeneous distributed system, namely the
Parallel Runtime Environment for Multicomputer Applications
(PREMA) [2], to handle heterogeneous devices by integrating
it with the heterogeneous tasking framework. Along with the
design and implementation of the final product, we present
optimizations that contributed to achieving high performance.
The evaluation results with microbenchmarks and a proxy
application show that our system incurs low overhead with
scalable performance.
The major contributions of this paper are as follows.
• A stand-alone tasking framework offering performance portability over multiple heterogeneous devices.
• Integration of the aforementioned tasking framework into a distributed runtime to leverage complex distributed heterogeneous computing systems.
• A series of performance optimizations that achieve significant improvements.
• Performance analysis on a distributed proxy application.
II. BACKGROUND
The Parallel Runtime Environment for Multicomputer Ap-
plications (PREMA) [2] is a runtime framework, designed
to handle the needs of applications targeting extreme-scale
homogeneous computing platforms. It manages the burden of
latency-hiding, shared and distributed memory scheduling/load
balancing, and provides a global address space that drastically
decreases the complexity of developing extreme-scale applica-
tions. PREMA consists of three software layers that provide
different features according to the principle of separation of
concerns.
The first layer, the Data Movement and Control
Substrate (DMCS), offers an asynchronous message-driven
execution model where messages are associated with a
task/function (referred to as handler in this context) that is
invoked implicitly on the receiver upon their arrival, similar to
Active Messages [3]. Communication and handler invocations
are by default asynchronous and one-sided, requiring no explicit involvement of the receiver. The Mobile Object Layer (MOL) extends DMCS with the introduction of mobile objects. A mobile object is a location-independent container
that is provided by the runtime system to capture coarse-
grained application data. Mobile objects can be targeted for
remote handler executions uniformly and independently of
their location. Adapting to a programming model where the
workflow is expressed as interactions (i.e., handler invocations)
between mobile objects leads to a more naturally asynchronous
design for the application and allows PREMA to better handle
latencies while exposing a uniform programming interface that
hides the structure of the underlying platform. Mobile objects
are also used to provide applications with implicit distributed
load balancing through the Implicit Load Balancing (ILB)
layer, but this feature is not utilized in the context of this work.
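To make the message-driven model more concrete, the sketch below shows the general shape of a handler invocation on a mobile object in C++. The names (mobile_object, make_mobile_object, remote_invoke, smooth_handler) are illustrative assumptions, not PREMA's actual API; minimal stand-in definitions are included so the sketch is self-contained.

#include <utility>

// Minimal stand-ins so the sketch compiles; in PREMA the runtime provides the
// mobile-object container and the one-sided, message-driven invocation.
template <typename T> struct mobile_object { T data; };
template <typename T> mobile_object<T> make_mobile_object() { return {}; }
template <typename T, typename Fn, typename... Args>
void remote_invoke(mobile_object<T>& obj, Fn handler, Args&&... args) {
    handler(obj.data, std::forward<Args>(args)...);  // emulated locally here
}

struct JacobiBlock { /* coarse-grained application data */ };

// A handler is invoked implicitly on whichever process currently hosts the
// target mobile object when the message arrives (Active Messages style).
void smooth_handler(JacobiBlock& block, int iteration) {
    // ... update the block for the given iteration ...
    (void)block; (void)iteration;
}

int main() {
    // Wrap coarse-grained application data in a location-independent container.
    auto block = make_mobile_object<JacobiBlock>();

    // Asynchronous, one-sided invocation: no receive is posted by the target;
    // the runtime delivers the message wherever `block` currently resides.
    remote_invoke(block, smooth_handler, /*iteration=*/0);
}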
In order to provide more flexibility for the runtime system
to overlap latencies, an application is encouraged to perform
over-decomposition of its data. In over-decomposition, the data
domain of an application running on a platform of P processing elements is decomposed into N chunks, where N ≫ P.
Decomposing the data domain into many more pieces than
the available processing elements gives the runtime system
more options on scheduling computational tasks and filling the
idle time stemming from data movement and communication
operations.
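As a concrete illustration of the idea, the following C++ sketch (a hypothetical helper, not part of PREMA's interface) splits a one-dimensional domain into N = factor × P chunks so that N ≫ P:

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split a 1-D domain of `total_elems` elements into N = factor * P chunks,
// where P is the number of processing elements and factor > 1 controls the
// degree of over-decomposition. Each returned [begin, end) range would be
// handed to the runtime as one schedulable chunk of work/data.
std::vector<std::pair<std::size_t, std::size_t>>
over_decompose(std::size_t total_elems, std::size_t P, std::size_t factor) {
    const std::size_t N = P * factor;                 // N >> P when factor is large
    const std::size_t chunk = (total_elems + N - 1) / N;
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < total_elems; begin += chunk)
        ranges.emplace_back(begin, std::min(begin + chunk, total_elems));
    return ranges;
}

With, for example, eight chunks per processing element, the scheduler has enough independent work to overlap the communication of one chunk with computation on another.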
III. RELATED WORK
Several systems have been adapted to efficiently utilize
GPUs in their workflow, while new ones have emerged in
an attempt to create new standards for their use. Systems like
Charm++ [4], HPX [5] and X10 [6] have introduced new in-
terfaces to provide support for GPUs. However, these systems
let users explicitly handle concerns such as requesting memory transfers, managing device platforms, task allocations, and work queues in order to optimize performance. In contrast, the proposed
work provides a uniform abstraction for heterogeneous tasks
and data, and implicitly handles scheduling, load balancing
and latency overlapping independently of the target device
backend. StarPU [7], OmpSs [8] and ParSec [9] offer dif-
ferent high-level approaches to efficiently utilize distributed
heterogeneous systems. However, their programming model
is mostly suitable for applications whose workflows follow a
regular pattern and can be inferred mostly statically. PREMA,
on the other hand, adopts a dynamic, message-driven program-
ming model that is more suitable for irregular applications.
SYCL and DPC++/oneAPI [10], as well as the newest
version of OpenMP, are recent attempts to provide performance
portable interfaces in modern C++ that can target hetero-
geneous devices. However, users still need to handle load
balancing, scheduling and work queues for multi-device sys-
tems and need to combine them with another runtime solution
that targets distributed nodes. In fact, such systems could be
implemented with our heterogeneous tasking framework as an interoperability backend.
IV. DESIGN AND IMPLEMENTATION
A. Programming Model
The programming model of the heterogeneous tasking
framework builds upon two simple abstractions: the het-
erogeneous objects (hetero objects) and heterogeneous tasks
(hetero tasks). A hetero object uniformly represents a user-
defined data object residing on one or more computing devices
of a heterogeneous compute node (e.g., CPUs, GPUs, FPGAs).
Applications treat such objects as opaque containers for data
without being aware of their physical location. A hetero task
encapsulates a non-preemptive computing kernel that runs
to completion and implements a medium-grained parallel
computation. Like hetero objects, hetero tasks are defined and
handled by the application in a uniform way, independent of
the device they will execute on.
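The following C++ sketch illustrates how these two abstractions might appear to an application. All names (hetero_object, make_hetero_object, hetero_task, make_task, device_type, submit) are assumptions made for illustration, with minimal stand-ins so the sketch is self-contained; they are not the framework's actual API.

#include <cstddef>

enum class device_type { cpu, gpu, any };

// Minimal stand-ins so the sketch compiles; the real runtime provides these.
template <typename T> struct hetero_object { /* runtime-managed device copies */ };
template <typename T> hetero_object<T> make_hetero_object(std::size_t) { return {}; }
struct hetero_task { void set_device(device_type) {} };
template <typename Fn, typename... Args>
hetero_task make_task(Fn, Args&&...) { return {}; }
void submit(const hetero_task&) {}

int main() {
    const std::size_t n = 1 << 20;

    // A hetero object: an opaque container whose physical copies (host, GPU,
    // ...) are placed and kept coherent by the runtime, not the application.
    hetero_object<double> x = make_hetero_object<double>(n);

    // A hetero task: a non-preemptive, medium-grained kernel. The application
    // names only a device *type*; the runtime picks the concrete device.
    hetero_task scale = make_task(
        [](double* data, std::size_t len) {
            for (std::size_t i = 0; i < len; ++i) data[i] *= 2.0;
        },
        x, n);
    scale.set_device(device_type::gpu);  // "some GPU", not a specific one
    submit(scale);                       // scheduling and transfers are implicit
}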
1) Heterogeneous Objects: Handling copies of the same
data on different heterogeneous devices can lead to application code that is error-prone and difficult to understand and maintain. In general, applications need to handle data transfers between devices, use the correct pointer for the respective device, and keep track of copy coherence. A hetero object
is an abstraction that automatically handles such concerns,
maintaining the different copies of the same data in a single
reference. The underlying system handles hetero objects to
guarantee that the most recent version of the data will be
available at the target device at the time it is needed. For example, accessing an object originally located
on the host from a GPU would automatically trigger the
transfer of the underlying data from the host to the respective
device. In the same manner, accessing the same object from a
different device would trigger a transfer from the GPU to that
device. The runtime system guarantees data coherence among
computing devices, keeping track of up-to-date or stale copies
and handling them appropriately.
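One way to realize this bookkeeping (shown here only as a minimal, generic C++ sketch of the idea; it does not reproduce the framework's implementation) is to keep a validity flag per device copy, refreshing stale copies on reads and invalidating all other copies on writes:

#include <unordered_map>

// Minimal sketch of per-device coherence bookkeeping for one hetero object.
// Device ids, the copy representation, and the transfer call are placeholders
// (e.g., transfer() could wrap an asynchronous host/device or peer copy).
struct coherence_state {
    std::unordered_map<int, bool> valid;  // device id -> copy is up to date

    // Reading on `dev`: if its copy is stale, refresh it from any valid copy.
    void acquire_for_read(int dev) {
        if (!valid[dev]) {
            if (int src = find_valid_copy(); src >= 0)
                transfer(src, dev);       // placeholder for an async data copy
            valid[dev] = true;
        }
    }

    // Writing on `dev`: refresh that copy, then mark all other copies stale.
    void acquire_for_write(int dev) {
        acquire_for_read(dev);
        for (auto& [d, v] : valid) v = (d == dev);
    }

    int find_valid_copy() const {
        for (const auto& [d, v] : valid)
            if (v) return d;
        return -1;                        // no copy exists yet
    }
    void transfer(int /*src*/, int /*dst*/) { /* device-specific copy here */ }
};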
The actual data captured by a hetero object should mainly
be accessed and modified through hetero tasks for optimal
performance. However, the application can also explicitly re-
quest access to the underlying data on the host after specifying
the type of access requested, in order to maintain coherence.
This method will trigger (if needed) an asynchronous transfer
from the device with the most recent version of the data and
immediately return a future. The future can then be used to
query the status of the transfer and provide access to the raw
data. In this state, the data of the hetero object are guaranteed