
data width conversion to be performed at either end of the higher
clocked domain. In high-level synthesis (HLS) development flows
in particular, multi-pumping is either not supported altogether or
severely limited in scope, resulting in this optimization rarely being
exploited for FPGA development in practice.
In this work, we show how the multi-pumping optimization can
be automatically applied using data movement analysis. By captur-
ing all data movement to and from computational subdomains, we
can identify if they can be multi-pumped, introduce the new clock
domain, and insert the necessary domain crossing logic. We demon-
strate automatic multi-pumping on several applications compiled
from a Python frontend implementation to FPGA architectures on
a Xilinx accelerator board. We show how resource usage of critical
FPGA resources, such as DSPs and BRAM, is reduced by 50% when
critical subdomains are double-pumped, and how this can be used
in relevant applications to increase the overall performance of the
design by exploiting the resources freed up by multi-pumping.
In particular, the main contributions of this paper are:
• A novel view of the multi-pumping optimization as temporal
vectorization.
• Automatic application of the multi-pumping optimization
to the broader scope of computational subdomains, rather
than individual components.
• The ability for software developers to exploit the multi-pumping
optimization in high-level code by providing automatic
HLS and RTL integration.
• A demonstration of the benefits of the multi-pumping optimization,
as performance increase or resource reduction, on four
different use cases.
2 MULTI-PUMPING
Programming FPGAs with HLS revolves around designing deep
hardware pipelines that exploit the spatial parallelism offered by the
device. Optimizing compilers and performance engineers leverage
classical high-performance computing and FPGA-oriented transformations
to achieve this goal [8]. Resource utilization is a metric
that must be considered when optimizing code for FPGAs, as space
consumption can be one of the critical factors limiting the performance
of large-scale FPGA designs.
Traditionally, resource-sharing techniques have been used to
reduce area consumption, but these usually come at the expense of
degraded circuit performance. Multi-pumping aims to overcome
the limitations of other solutions by exploiting the capability of
the hardware fabric to run different components at different clock
rates [4, 5, 13]. FPGA designs created with modern HLS tools typically
run at 200-350 MHz, while other FPGA components, such as
DSPs or on-chip memory, can be clocked at a higher frequency. For
example, the DSP48 block of a Xilinx Alveo U280 can be clocked up
to 891 MHz [17], almost three times higher than the usual design
frequency achieved by HLS. While reaching such high frequencies
is infeasible in practice (due to routing and timing closure requirements),
it is clear that internal components are not fully exploited in high-level
FPGA designs.
[Figure 2 waveform panels: signals 𝑐𝑙𝑘0, 𝑐𝑙𝑘1, 𝑥, 𝑦, 𝑧; compute units C0 and C1; "Issue x", "Issue y", and "Pack z" stages in the multi-pumped variants.]
Figure 2: Waveforms depicting original implementation and
multi-pumping approaches for vector addition, with 𝑀 = 2, 𝑣 = 2.
2.1 Exploiting multiple clock domains
FPGA designs usually have a single clock domain, where the entire
design shares the same clock signal. To apply multi-pumping, we
need two clock domains: a slower one for components such as the
readers and writers to external memory, and a faster one for the
internal compute components.
Consider the case of a 𝑉-way vectorized vector addition 𝑧 = 𝑥 + 𝑦,
where 𝑉 elements of 𝑥 and 𝑦 are read every tick of the clock 𝑐𝑙𝑘0.
To process the entire vector, the internal components 𝐶, each adding
together a single element of 𝑥 and 𝑦, have to be replicated 𝑉 times.
Waveform ➀ in Figure 2 describes this behavior for 𝑉 = 2. On every
clock cycle, the circuit can compute two output results.
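The single-clock baseline can be modeled in a few lines of Python. This is a behavioral sketch only, not HLS code: the function name `vectorized_add` and the cycle counter are illustrative assumptions, and each loop iteration stands for one tick of 𝑐𝑙𝑘0.

```python
def vectorized_add(x, y, V):
    """Model a V-way vectorized adder: one loop iteration is one clk0 tick."""
    assert len(x) == len(y) and len(x) % V == 0
    z = []
    cycles = 0
    for base in range(0, len(x), V):
        # V replicated adders work side by side within the same clock cycle.
        z.extend(x[base + lane] + y[base + lane] for lane in range(V))
        cycles += 1
    return z, cycles


z, cycles = vectorized_add(list(range(8)), list(range(8)), V=2)
print(z, cycles)
```

With eight input elements and 𝑉 = 2, the model needs four ticks of 𝑐𝑙𝑘0, i.e. two results per cycle from two adders.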
Let us assume that 𝐶 can be clocked at a frequency that is 𝑀 times
higher than the frequency of 𝑐𝑙𝑘0. The multi-pumping optimization
can be applied in two different ways, each affecting either the external
or the internal data paths of the design, relative to the compute block
being optimized. In the first approach, the widths of the internal data
paths remain unchanged while the external widths are widened by the
factor 𝑀. The internal computing part is driven by a clock signal 𝑐𝑙𝑘1,
running 𝑀 times faster than 𝑐𝑙𝑘0. This scenario is depicted in
waveform ➁ of Figure 2, assuming 𝑀 = 2. Data entering the
multi-pumped domain must be converted from one wide vector of size 𝑀𝑉
to 𝑀 narrow vectors of size 𝑉, and the inverse when leaving the
multi-pumped domain (issuers and packers in Figure 2). The resulting
design increases throughput by a factor of 𝑀 at the same resource
consumption as the original implementation. In the example, the circuit
computes four output elements per clock cycle of 𝑐𝑙𝑘0.
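The first approach can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the paper's implementation: the names `issue`, `pack`, and `multi_pumped_add` are hypothetical, the outer loop stands for ticks of 𝑐𝑙𝑘0, and the inner loop stands for the 𝑀 ticks of 𝑐𝑙𝑘1 that fit into each of them.

```python
def issue(wide, M, V):
    """Split one wide vector of M*V elements into M narrow vectors of V elements."""
    assert len(wide) == M * V
    return [wide[i * V:(i + 1) * V] for i in range(M)]


def pack(narrow_vectors):
    """Concatenate M narrow vectors back into one wide vector."""
    return [elem for vec in narrow_vectors for elem in vec]


def multi_pumped_add(x, y, M, V):
    """Throughput-oriented model: V adders driven by clk1 = M * clk0."""
    width = M * V
    assert len(x) == len(y) and len(x) % width == 0
    z, clk0_cycles, clk1_cycles = [], 0, 0
    for base in range(0, len(x), width):
        # One clk0 tick: a wide vector of M*V elements arrives on each input.
        xs = issue(x[base:base + width], M, V)
        ys = issue(y[base:base + width], M, V)
        partial = []
        for xv, yv in zip(xs, ys):
            # One clk1 tick: the same V adders are reused for each narrow vector.
            partial.append([a + b for a, b in zip(xv, yv)])
            clk1_cycles += 1
        z.extend(pack(partial))
        clk0_cycles += 1
    return z, clk0_cycles, clk1_cycles


z, clk0, clk1 = multi_pumped_add(list(range(8)), list(range(8)), M=2, V=2)
print(z, clk0, clk1)
```

For eight input elements with 𝑀 = 2 and 𝑉 = 2, the model finishes in two ticks of 𝑐𝑙𝑘0 (four ticks of 𝑐𝑙𝑘1), which corresponds to four output elements per 𝑐𝑙𝑘0 cycle while still using only 𝑉 = 2 adders.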
A second approach would be to divide the width of the paths
internally in the compute blocks by the factor 𝑀, while the widths
of the external paths remain unchanged (waveform ➂ in Figure 2).
The internal compute part runs according to 𝑐𝑙𝑘1, but we no longer