
balancing, and provides a global address space that drastically
decreases the complexity of developing extreme-scale applica-
tions. PREMA consists of three software layers that provide
different features according to the principle of separation of
concerns.
The first layer, the Data Movement and Control
Substrate (DMCS), offers an asynchronous message-driven
execution model where messages are associated with a
task/function (referred to as a handler in this context) that is
invoked implicitly on the receiver upon message arrival, similar to
Active Messages [3]. Communication and handler invocations
are asynchronous and one-sided by default, requiring no
explicit involvement of the receiver. The Mobile Object Layer
(MOL) extends DMCS with the introduction of mobile
objects. A mobile object is a location-independent container
that is provided by the runtime system to capture coarse-
grained application data. Mobile objects can be targeted for
remote handler executions uniformly and independently of
their location. Adapting to a programming model where the
workflow is expressed as interactions (i.e., handler invocations)
between mobile objects leads to a more naturally asynchronous
design for the application and allows PREMA to better handle
latencies while exposing a uniform programming interface that
hides the structure of the underlying platform. Mobile objects
are also used to provide applications with implicit distributed
load balancing through the Implicit Load Balancing (ILB)
layer, but this feature is not utilized in this work.
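As an illustration of this style, the following self-contained C++
sketch mimics a one-sided, asynchronous handler invocation on a chunk
of application data using std::async; the names and structure are purely
illustrative and do not correspond to PREMA's actual interface.

#include <future>
#include <vector>
#include <cstdio>

// Stand-in for a mobile object's coarse-grained payload.
struct Chunk { std::vector<double> values; };

// "Handler": runs when a message targeting the chunk arrives; the caller
// does not wait for it and the owner of the chunk posts no explicit receive.
void smooth(Chunk& c) {
    for (std::size_t i = 1; i + 1 < c.values.size(); ++i)
        c.values[i] = 0.5 * (c.values[i - 1] + c.values[i + 1]);
}

int main() {
    Chunk chunk{std::vector<double>(16, 1.0)};
    // Asynchronous, one-sided "invocation" of the handler on the chunk.
    auto done = std::async(std::launch::async, smooth, std::ref(chunk));
    done.wait();   // quiesce outstanding handlers before reading results
    std::printf("chunk.values[1] = %f\n", chunk.values[1]);
    return 0;
}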
In order to provide more flexibility for the runtime system
to overlap latencies, an application is encouraged to perform
over-decomposition of its data. In over-decomposition, the data
domain of an application running on a platform of P processing
elements is decomposed into N chunks, where N ≫ P.
Decomposing the data domain into many more pieces than
the available processing elements gives the runtime system
more options for scheduling computational tasks and filling the
idle time stemming from data movement and communication
operations.
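For instance, the following sketch shows how a one-dimensional data
domain might be over-decomposed so that N is much larger than P; it is
illustrative only, and the factor of 16 chunks per processing element is
an assumption rather than a PREMA default.

#include <cstdio>
#include <utility>
#include <vector>

int main() {
    const std::size_t domain_size = 1 << 20;  // total elements in the data domain
    const int P = 64;                         // available processing elements (assumed)
    const int chunks_per_pe = 16;             // over-decomposition factor (tunable)
    const int N = P * chunks_per_pe;          // number of chunks, N >> P

    // [begin, end) index range of each chunk; the runtime is free to schedule
    // any ready chunk while others wait on data movement or communication.
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    for (int c = 0; c < N; ++c)
        chunks.emplace_back(c * domain_size / N, (c + 1) * domain_size / N);

    std::printf("%d chunks of ~%zu elements for %d processing elements\n",
                N, domain_size / static_cast<std::size_t>(N), P);
    return 0;
}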
III. RELATED WORK
Several systems have been adapted to efficiently utilize
GPUs in their workflow, while new ones have emerged in
an attempt to create new standards for their use. Systems like
Charm++ [4], HPX [5] and X10 [6] have introduced new in-
terfaces to provide support for GPUs. However, these systems
let users explicitly handle concerns such as requesting memory
transfers, managing device platforms, allocating tasks, and maintaining
work queues to optimize performance. In contrast, the proposed
work provides a uniform abstraction for heterogeneous tasks
and data, and implicitly handles scheduling, load balancing
and latency overlapping independently of the target device
backend. StarPU [7], OmpSs [8] and ParSec [9] offer dif-
ferent high-level approaches to efficiently utilize distributed
heterogeneous systems. However, their programming model
is best suited for applications whose workflow follows a
regular pattern and can be inferred largely statically. PREMA,
on the other hand, adopts a dynamic, message-driven program-
ming model that is more suitable for irregular applications.
SYCL and DPC++/oneAPI [10], as well as the newest
version of OpenMP, are recent attempts to provide performance-
portable interfaces in modern C++ that can target hetero-
geneous devices. However, users still need to handle load
balancing, scheduling, and work queues on multi-device sys-
tems, and must combine these interfaces with another runtime solution
that targets distributed nodes. In fact, such systems could be
implemented with our heterogeneous tasking framework as an
interoperability backend.
IV. DESIGN AND IMPLEMENTATION
A. Programming Model
The programming model of the heterogeneous tasking
framework builds upon two simple abstractions: het-
erogeneous objects (hetero objects) and heterogeneous tasks
(hetero tasks). A hetero object uniformly represents a user-
defined data object residing on one or more computing devices
of a heterogeneous compute node (e.g., CPUs, GPUs, FPGAs).
Applications treat such objects as opaque containers for data
without being aware of their physical location. A hetero task
encapsulates a non-preemptive computing kernel that runs
to completion and implements a medium-grained parallel
computation. Like hetero objects, hetero tasks are defined and
handled by the application in a uniform way, independent of
the device they will execute on.
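The following toy sketch illustrates the shape of these two abstractions;
all identifiers are hypothetical and the host stands in for any device
backend, so it should not be read as PREMA's actual API.

#include <cstdio>
#include <memory>
#include <vector>

// Opaque, location-independent container: the application never sees a
// device pointer or the data's physical location.
template <typename T>
class hetero_object {
public:
    explicit hetero_object(std::vector<T> host_data)
        : data_(std::make_shared<std::vector<T>>(std::move(host_data))) {}
    std::vector<T>& runtime_view() { return *data_; }  // only the runtime touches this
private:
    std::shared_ptr<std::vector<T>> data_;
};

// A hetero task: a non-preemptive, medium-grained kernel that runs to
// completion on whichever device the runtime selects (the host, here).
void scale_task(hetero_object<float>& obj, float factor) {
    for (auto& x : obj.runtime_view()) x *= factor;
}

int main() {
    hetero_object<float> obj(std::vector<float>(8, 2.0f));
    scale_task(obj, 3.0f);  // device selection, transfers, scheduling are implicit
    std::printf("first element: %f\n", obj.runtime_view()[0]);
    return 0;
}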
1) Heterogeneous Objects: Handling copies of the same
data on different heterogeneous devices can lead to application
code that is error-prone and difficult to understand and maintain.
In general, applications need to manage data transfers
between devices, use the correct pointer for each
device, and keep track of the copies' coherence. A hetero object
is an abstraction that automatically handles such concerns,
maintaining the different copies of the same data in a single
reference. The underlying system handles hetero objects to
guarantee that the most recent version of the data will be
available on the target device by the time it is
needed. For example, accessing an object originally located
on the host from a GPU would automatically trigger the
transfer of the underlying data from the host to the respective
device. In the same manner, accessing the same object from a
different device would trigger a transfer from the GPU to that
device. The runtime system guarantees data coherence among
computing devices, keeping track of up-to-date or stale copies
and handling them appropriately.
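The mechanism can be pictured with the following simplified sketch, in
which the object merely records which device owns the most recent copy
and copies data over on access from elsewhere; it illustrates the idea
only and is not PREMA's implementation.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Simplified coherence tracker: one valid owner, copies refreshed on access.
class coherent_object {
public:
    explicit coherent_object(std::vector<double> host_data)
        : copies_{{"host", std::move(host_data)}}, valid_("host") {}

    // Accessing the data from `device` refreshes its copy from the current
    // owner if needed (standing in for a host<->device transfer), then marks
    // `device` as the holder of the most recent version.
    std::vector<double>& access(const std::string& device) {
        if (device != valid_) {
            copies_[device] = copies_[valid_];
            std::printf("transfer %s -> %s\n", valid_.c_str(), device.c_str());
        }
        valid_ = device;
        return copies_[device];
    }

private:
    std::map<std::string, std::vector<double>> copies_;
    std::string valid_;  // device holding the up-to-date copy
};

int main() {
    coherent_object obj(std::vector<double>(4, 1.0));
    obj.access("gpu0")[0] = 5.0;  // triggers host -> gpu0 transfer
    obj.access("host");           // triggers gpu0 -> host transfer
    return 0;
}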
The actual data captured by a hetero object should mainly
be accessed and modified through hetero tasks for optimal
performance. However, the application can also explicitly request
access to the underlying data on the host, specifying
the type of access requested so that the runtime can maintain coherence.
This method will trigger (if needed) an asynchronous transfer
from the device with the most recent version of the data and
immediately return a future. The future can then be used to
query the status of the transfer and provide access to the raw
data. In this state, the data of the hetero object are guaranteed