Temporal Vectorization: A Compiler Approach to Automatic
Multi-Pumping
Carl-Johannes Johnsen
University of Copenhagen
Department of Computer Science
Copenhagen, Denmark
carl-johannes@di.ku.dk
Tiziano De Matteis
ETH Zurich
Department of Computer Science
Zurich, Switzerland
tdematt@inf.ethz.ch
Tal Ben-Nun
ETH Zurich
Department of Computer Science
Zurich, Switzerland
talbn@inf.ethz.ch
Johannes de Fine Licht
ETH Zurich
Department of Computer Science
Zurich, Switzerland
denelicht@inf.ethz.ch
Torsten Hoefler
ETH Zurich
Department of Computer Science
Zurich, Switzerland
htor@inf.ethz.ch
ABSTRACT
The multi-pumping resource sharing technique can overcome the
limitations commonly found in single-clocked FPGA designs by
allowing hardware components to operate at a higher clock fre-
quency than the surrounding system. However, this optimization
cannot be expressed at high levels of abstraction, such as HLS,
requiring the use of hand-optimized RTL. In this paper we show
how to leverage multiple clock domains for computational subdo-
mains on reconfigurable devices through data movement analysis
on high-level programs. We offer a novel view on multi-pumping as
a compiler optimization — a superclass of traditional vectorization.
As multiple data elements are fed and consumed, the computations
are packed temporally rather than spatially. The optimization is
applied automatically using an intermediate representation that
maps high-level code to HLS. Internally, the optimization injects
modules into the generated designs, incorporating RTL for fine-
grained control over the clock domains. We obtain a reduction of
resource consumption by up to 50% on critical components and 23%
on average. For scalable designs, this can enable further parallelism,
increasing overall performance.
1 INTRODUCTION
Designing application-specific hardware is fundamentally resource-
constrained. The performance of a circuit implementing a parallel
computation is the product of the degree of parallelism and the
frequency at which it is clocked. We can improve the performance
by either increasing the circuit’s frequency, or by introducing addi-
tional parallelism if opportunities exist in the application. Viewed
dierently, if we x the performance of a circuit, we can reduce the
resource usage by proportionally increasing the frequency. This
resource reduction can either be used to increase the parallelism of
the computation, or to implement other operations.
In addition to power and thermal limits, increasing the frequency
of a design is not always possible due to timing constraints. How-
ever, not all subdomains need to run at the same frequency. Some
subdomains might not be critical for performance or contribute sig-
nicantly to overall resource utilization, and thus do not bottleneck
throughput at lower frequencies. For subdomains that are primar-
ily concerned with moving data, such as I/O interfaces accessing
DRAM, PCIe, or networking, we can get away with increasing the
width of the data path rather than increasing the frequency.

[Figure 1: Temporal Vectorization. Traditional vectorization covers computations where each output depends on one input (example: vec_add) or on a combination of inputs (examples: shuffle, hadd, reduce); temporal vectorization additionally covers computations where outputs may depend on a combination of inputs and outputs (examples: fold, stencil).]
We can thus increase the frequency of only the resource and/or
performance-critical subdomains, feeding more data per cycle from
the slower clock domain. This is referred to as “multi-pumping” [4].
Computational components are usually densely connected by short
paths, while the data paths connecting them stretch across long
distances on the chip. From a data movement perspective, multi-
pumping can be seen as a form of “temporal vectorization”: the long
data paths leading to and/or from the computation are widened, but
the densely connected logic performing the computation itself is left
unchanged, while conceptually being “vectorized” across multiple
clock cycles. As a result, the densely connected short paths now
only have to meet timing at a higher frequency locally, while the
long data paths do not need additional buffering of the signal that
would otherwise be required at higher frequencies.
Vectorization is traditionally applied on computations where
each output depends only on one input (such as in the case of
vector addition), or when the output depends on a combination
of inputs (such as in horizontal addition or reductions). Temporal
vectorization relaxes these requirements: it must still be possible
to parallelize the memory source/destination, but it does not im-
pose any requirements on the computation – the computation does
not even need to be analyzable, and dependencies between opera-
tions are allowed without any additional handling. For this reason,
temporal vectorization can be seen as a superclass of traditional
vectorization (see Figure 1).
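To make the distinction concrete, the following minimal Python sketch (our own illustration, not code from the paper; the function names are ours) contrasts a classically vectorizable loop with a fold whose loop-carried dependency only temporal vectorization accommodates:

import numpy as np

def vec_add(x, y):
    # Each output element depends on exactly one pair of inputs:
    # classically (spatially) vectorizable.
    return x + y

def running_sum(x):
    # Each output depends on the previous output (a fold): not spatially
    # vectorizable as written, but temporally vectorizable, since only the
    # reads of x and the writes of z need to be widened; the dependency
    # chain inside the loop body is left untouched.
    z = np.empty_like(x)
    acc = 0
    for i in range(len(x)):
        acc = acc + x[i]
        z[i] = acc
    return z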
Introducing multi-pumping to a design is an invasive procedure
that requires significant effort: clock domain crossing and
data width conversion must be performed at either end of the higher-
clocked domain. In high-level synthesis (HLS) development flows
in particular, multi-pumping is either not supported at all or is
severely limited in scope, resulting in this optimization rarely being
exploited for FPGA development in practice.
In this work, we show how the multi-pumping optimization can
be automatically applied using data movement analysis. By captur-
ing all data movement to and from computational subdomains, we
can identify if they can be multi-pumped, introduce the new clock
domain, and insert the necessary domain crossing logic. We demon-
strate automatic multi-pumping on several applications compiled
from a Python frontend implementation to FPGA architectures on
a Xilinx accelerator board. We show how the usage of critical FPGA
resources, such as DSPs and BRAM, is reduced by 50% when
critical subdomains are double-pumped, and how this can be used
in relevant applications to increase the overall performance of the
design by exploiting the resources freed up by multi-pumping.
In particular, the main contributions of this paper are:
• A novel view of the multi-pumping optimization as temporal vectorization.
• Automatic application of the multi-pumping optimization to the broader scope of computational subdomains, rather than individual components.
• The ability for software developers to exploit the multi-pumping optimization in high-level code by providing automatic HLS and RTL integration.
• Demonstrating the benefits of the multi-pumping optimization in performance increase or resource reduction on four different use cases.
2 MULTI-PUMPING
Programming FPGAs with HLS revolves around designing deep hardware pipelines, exploiting the spatial parallelism offered by the device. Optimizing compilers and performance engineers leverage classical high-performance computing and FPGA-oriented transformations to achieve this goal [8]. Resource utilization is a metric that must be considered when optimizing code for FPGAs, as space consumption can be one of the critical factors limiting the performance of large-scale FPGA designs.
Traditionally, resource-sharing techniques have been used to reduce area consumption, but these usually come at the expense of degraded circuit performance. Multi-pumping aims to overcome the limitations of other solutions by exploiting the capability of the hardware fabric to run different components at different clock rates [4, 5, 13]. FPGA designs created with modern HLS tools typically run at 200-350 MHz, while other FPGA components, such as DSPs or on-chip memory, can be clocked at a higher frequency. For example, the DSP48 block of a Xilinx Alveo U280 can be clocked up to 891 MHz [17], almost three times higher than the usual design frequency achieved by HLS. While reaching such high frequencies is infeasible (due to routing and timing closure requirements), it is clear that internal components are not fully exploited in high-level FPGA designs.
[Figure 2: Waveforms depicting the original implementation and the multi-pumping approaches for vector addition, with 𝑀 = 2, 𝑉 = 2. Waveform 1 shows the baseline, waveforms 2 and 3 the two multi-pumping approaches, with issue and pack stages converting between the clk0 and clk1 domains.]
2.1 Exploiting multiple clock domains
FPGA designs usually have a single clock domain, where the entire
design shares the same clock signal. To apply multi-pumping, we
need two clock domains: one for the slowly clocked components, such as readers and writers to external memory, and one clocked at a higher rate for the internal compute components.
Consider the case of a 𝑉-way vectorized vector addition 𝑧 = 𝑥 + 𝑦, where 𝑉 elements of 𝑥 and 𝑦 are read every tick of the clock 𝑐𝑙𝑘0. To process the entire vector, the internal components 𝐶, each adding together a single element of 𝑥 and 𝑦, have to be replicated 𝑉 times. Waveform 1 in Figure 2 shows this behavior for 𝑉 = 2. On every clock cycle, the circuit can compute two output results.
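A minimal Python behavioral model of this baseline may help fix the terms (our own sketch, with illustrative names; one outer loop iteration stands in for one 𝑐𝑙𝑘0 cycle):

import numpy as np

def baseline_vec_add(x, y, V=2):
    # V-way vectorized adder: every clk0 cycle, V elements of x and y enter
    # and V replicated adders C produce V outputs.
    assert len(x) == len(y) and len(x) % V == 0
    z = np.empty_like(x)
    for i in range(0, len(x), V):        # one iteration == one clk0 cycle
        z[i:i+V] = x[i:i+V] + y[i:i+V]   # V additions in parallel (replicated C)
    return z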
Let us assume that 𝐶 can be clocked at a frequency that is 𝑀 times higher than the frequency of 𝑐𝑙𝑘0. The multi-pumping optimization can be applied in two different ways, each affecting either the external or the internal data paths of the design, relative to the compute block being optimized. In the first approach, the widths of the internal data paths remain unchanged while the external widths are widened by the factor 𝑀. The internal computing part is driven by a clock signal 𝑐𝑙𝑘1, clocked 𝑀 times higher than 𝑐𝑙𝑘0. This scenario is depicted in waveform 2 of Figure 2, assuming 𝑀 = 2. Data entering the multi-pumped domain must be converted from one wide vector of size 𝑀𝑉 to 𝑀 narrow vectors of size 𝑉, and the inverse conversion is performed when leaving the multi-pumped domain (the issuers and packers in Figure 2). The resulting design obtains increased throughput by a factor 𝑀, at the same resource consumption as the original implementation. In the example, the circuit computes four output elements per cycle of 𝑐𝑙𝑘0.
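The width conversion of this first approach can be sketched behaviorally in Python (again our own model with illustrative names: the inner loop stands in for the 𝑀 𝑐𝑙𝑘1 sub-cycles, while real designs realize the issuers and packers as clock-domain-crossing logic in RTL):

import numpy as np

def double_pumped_vec_add(x, y, M=2, V=2):
    # External data path widened to M*V elements per clk0 cycle; the compute
    # block keeps V adders and runs at clk1 = M * clk0.
    W = M * V
    assert len(x) == len(y) and len(x) % W == 0
    z = np.empty_like(x)
    for i in range(0, len(x), W):                  # one clk0 cycle: a wide vector of M*V elements enters
        for m in range(M):                         # M clk1 sub-cycles per clk0 cycle
            lo, hi = i + m * V, i + (m + 1) * V    # issuer: select the m-th narrow vector of size V
            z[lo:hi] = x[lo:hi] + y[lo:hi]         # packer: place the V results into the wide output
    return z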
A second approach would be to divide the widths of the paths internal to the compute blocks by the factor 𝑀, while the widths of the external paths remain unchanged (waveform 3 in Figure 2). The internal compute part runs according to 𝑐𝑙𝑘1, but we no longer