Temporal Vectorization: A Compiler Approach to Automatic
Multi-Pumping
Carl-Johannes Johnsen
University of Copenhagen
Department of Computer Science
Copenhagen, Denmark
carl-johannes@di.ku.dk
Tiziano De Matteis
ETH Zurich
Department of Computer Science
Zurich, Switzerland
tdematt@inf.ethz.ch
Tal Ben-Nun
ETH Zurich
Department of Computer Science
Zurich, Switzerland
talbn@inf.ethz.ch
Johannes de Fine Licht
ETH Zurich
Department of Computer Science
Zurich, Switzerland
denelicht@inf.ethz.ch
Torsten Hoefler
ETH Zurich
Department of Computer Science
Zurich, Switzerland
htor@inf.ethz.ch
ABSTRACT
The multi-pumping resource sharing technique can overcome the
limitations commonly found in single-clocked FPGA designs by
allowing hardware components to operate at a higher clock fre-
quency than the surrounding system. However, this optimization
cannot be expressed at high levels of abstraction, such as HLS,
requiring the use of hand-optimized RTL. In this paper we show
how to leverage multiple clock domains for computational subdo-
mains on reconfigurable devices through data movement analysis
on high-level programs. We offer a novel view on multi-pumping as
a compiler optimization — a superclass of traditional vectorization.
As multiple data elements are fed and consumed, the computations
are packed temporally rather than spatially. The optimization is
applied automatically using an intermediate representation that
maps high-level code to HLS. Internally, the optimization injects
modules into the generated designs, incorporating RTL for fine-
grained control over the clock domains. We obtain a reduction of
resource consumption by up to 50% on critical components and 23%
on average. For scalable designs, this can enable further parallelism,
increasing overall performance.
1 INTRODUCTION
Designing application-specific hardware is fundamentally resource-
constrained. The performance of a circuit implementing a parallel
computation is the product of the degree of parallelism and the
frequency at which it is clocked. We can improve the performance
by either increasing the circuit’s frequency, or by introducing addi-
tional parallelism if opportunities exist in the application. Viewed
dierently, if we x the performance of a circuit, we can reduce the
resource usage by proportionally increasing the frequency. This
resource reduction can either be used to increase the parallelism of
the computation, or to implement other operations.
In addition to power and thermal limits, increasing the frequency
of a design is not always possible due to timing constraints. How-
ever, not all subdomains need to run at the same frequency. Some
subdomains might not be critical for performance or contribute sig-
nicantly to overall resource utilization, and thus do not bottleneck
throughput at lower frequencies. For subdomains that are primar-
ily concerned with moving data, such as I/O interfaces accessing
DRAM, PCIe, or networking, we can get away with increasing the
width of the data path rather than increasing the frequency.

[Figure 1: Temporal Vectorization. Traditional vectorization covers computations where each output depends on one input (example: vec_add) or on a combination of inputs (examples: shuffle, hadd, reduce); temporal vectorization additionally covers computations where outputs may depend on a combination of inputs and outputs (examples: fold, stencil).]
We can thus increase the frequency of only the resource and/or
performance-critical subdomains, feeding more data per cycle from
the slower clock domain. This is referred to as “multi-pumping” [4].
Computational components are usually densely connected by short
paths, while the data paths connecting them stretch across long
distances on the chip. From a data movement perspective, multi-
pumping can be seen as a form of “temporal vectorization”: the long
data paths leading to and/or from the computation are widened, but
the densely connected logic performing the computation itself is left
unchanged, while conceptually being “vectorized” across multiple
clock cycles. As a result, the densely connected short paths now
only have to meet timing at a higher frequency locally, while the
long data paths do not need additional buffering of the signal that
would otherwise be required at higher frequencies.
Vectorization is traditionally applied on computations where
each output depends only on one input (such as in the case of
vector addition), or when the output depends on a combination
of inputs (such as in horizontal addition or reductions). Temporal
vectorization relaxes these requirements: it must still be possible
to parallelize the memory source/destination, but it does not im-
pose any requirements on the computation – the computation does
not even need to be analyzable, and dependencies between opera-
tions are allowed without any additional handling. For this reason,
temporal vectorization can be seen as a superclass of traditional
vectorization (see Figure 1).
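To make the distinction concrete, the following minimal Python sketch (our own illustration, not code from the paper; the function names are ours) contrasts a classically vectorizable loop with a fold whose loop-carried dependency only temporal vectorization accommodates:

import numpy as np

def vec_add(x, y):
    # Each output element depends on exactly one pair of inputs:
    # classically (spatially) vectorizable.
    return x + y

def running_sum(x):
    # Each output depends on the previous output (a fold): not spatially
    # vectorizable as written, but temporally vectorizable, since only the
    # reads of x and the writes of z need to be widened; the dependency
    # chain inside the loop body is left untouched.
    z = np.empty_like(x)
    acc = 0
    for i in range(len(x)):
        acc = acc + x[i]
        z[i] = acc
    return z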
Introducing multi-pumping to a design is an invasive procedure
that requires significant effort: clock domain crossing and
data width conversion must be performed at either end of the higher-
clocked domain. In high-level synthesis (HLS) development flows
in particular, multi-pumping is either not supported at all or is
severely limited in scope, resulting in this optimization rarely being
exploited for FPGA development in practice.
In this work, we show how the multi-pumping optimization can
be automatically applied using data movement analysis. By captur-
ing all data movement to and from computational subdomains, we
can identify if they can be multi-pumped, introduce the new clock
domain, and insert the necessary domain crossing logic. We demon-
strate automatic multi-pumping on several applications compiled
from a Python frontend implementation to FPGA architectures on
a Xilinx accelerator board. We show how the usage of critical FPGA
resources, such as DSPs and BRAM, is reduced by 50% when
critical subdomains are double-pumped, and how this can be used
in relevant applications to increase the overall performance of the
design by exploiting the resources freed up by multi-pumping.
In particular, the main contributions of this paper are:
• A novel view of the multi-pumping optimization as temporal vectorization.
• Automatic application of the multi-pumping optimization to the broader scope of computational subdomains, rather than individual components.
• The ability for software developers to exploit the multi-pumping optimization in high-level code by providing automatic HLS and RTL integration.
• Demonstrating the benefits of the multi-pumping optimization in performance increase or resource reduction on four different use cases.
2 MULTI-PUMPING
Programming FPGAs with HLS revolves around designing deep hardware pipelines, exploiting the spatial parallelism offered by the device. Optimizing compilers and performance engineers leverage classical high-performance computing and FPGA-oriented transformations to achieve this goal [8]. Resource utilization is a metric that must be considered when optimizing code for FPGAs, as space consumption can be one of the critical factors limiting the performance of large-scale FPGA designs.
Traditionally, resource-sharing techniques have been used to reduce area consumption, but these usually come at the expense of degraded circuit performance. Multi-pumping aims to overcome the limitations of other solutions by exploiting the capability of the hardware fabric to run different components at different clock rates [4, 5, 13]. FPGA designs created with modern HLS tools typically run at 200-350 MHz, while other FPGA components, such as DSPs or on-chip memory, can be clocked at a higher frequency. For example, the DSP48 block of a Xilinx Alveo U280 can be clocked up to 891 MHz [17], almost three times higher than the usual design frequency achieved by HLS. While reaching such high frequencies is infeasible (due to routing and timing closure requirements), it is clear that internal components are not fully exploited in high-level FPGA designs.
[Figure 2: Waveforms depicting the original implementation and the multi-pumping approaches for vector addition, with 𝑀 = 2, 𝑉 = 2. Waveform 1 shows the baseline, waveforms 2 and 3 the two multi-pumping approaches, with issue and pack stages converting between the clk0 and clk1 domains.]
2.1 Exploiting multiple clock domains
FPGA designs usually have a single clock domain, where the entire
design shares the same clock signal. To apply multi-pumping, we
need two clock domains: one for the slowly clocked components, such as readers and writers to external memory, and one clocked at a higher rate for the internal compute components.
Consider the case of a 𝑉-way vectorized vector addition 𝑧 = 𝑥 + 𝑦, where 𝑉 elements of 𝑥 and 𝑦 are read every tick of the clock 𝑐𝑙𝑘0. To process the entire vector, the internal components 𝐶, each adding together a single element of 𝑥 and 𝑦, have to be replicated 𝑉 times. Waveform 1 in Figure 2 shows this behavior for 𝑉 = 2. On every clock cycle, the circuit can compute two output results.
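A minimal Python behavioral model of this baseline may help fix the terms (our own sketch, with illustrative names; one outer loop iteration stands in for one 𝑐𝑙𝑘0 cycle):

import numpy as np

def baseline_vec_add(x, y, V=2):
    # V-way vectorized adder: every clk0 cycle, V elements of x and y enter
    # and V replicated adders C produce V outputs.
    assert len(x) == len(y) and len(x) % V == 0
    z = np.empty_like(x)
    for i in range(0, len(x), V):        # one iteration == one clk0 cycle
        z[i:i+V] = x[i:i+V] + y[i:i+V]   # V additions in parallel (replicated C)
    return z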
Let us assume that 𝐶 can be clocked at a frequency that is 𝑀 times higher than the frequency of 𝑐𝑙𝑘0. The multi-pumping optimization can be applied in two different ways, each affecting either the external or the internal data paths of the design, relative to the compute block being optimized. In the first approach, the widths of the internal data paths remain unchanged while the external widths are widened by the factor 𝑀. The internal computing part is driven by a clock signal 𝑐𝑙𝑘1, clocked 𝑀 times higher than 𝑐𝑙𝑘0. This scenario is depicted in waveform 2 of Figure 2, assuming 𝑀 = 2. Data entering the multi-pumped domain must be converted from one wide vector of size 𝑀𝑉 to 𝑀 narrow vectors of size 𝑉, and the inverse conversion is performed when leaving the multi-pumped domain (the issuers and packers in Figure 2). The resulting design obtains increased throughput by a factor 𝑀, at the same resource consumption as the original implementation. In the example, the circuit computes four output elements per cycle of 𝑐𝑙𝑘0.
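The width conversion of this first approach can be sketched behaviorally in Python (again our own model with illustrative names: the inner loop stands in for the 𝑀 𝑐𝑙𝑘1 sub-cycles, while real designs realize the issuers and packers as clock-domain-crossing logic in RTL):

import numpy as np

def double_pumped_vec_add(x, y, M=2, V=2):
    # External data path widened to M*V elements per clk0 cycle; the compute
    # block keeps V adders and runs at clk1 = M * clk0.
    W = M * V
    assert len(x) == len(y) and len(x) % W == 0
    z = np.empty_like(x)
    for i in range(0, len(x), W):                  # one clk0 cycle: a wide vector of M*V elements enters
        for m in range(M):                         # M clk1 sub-cycles per clk0 cycle
            lo, hi = i + m * V, i + (m + 1) * V    # issuer: select the m-th narrow vector of size V
            z[lo:hi] = x[lo:hi] + y[lo:hi]         # packer: place the V results into the wide output
    return z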
A second approach would be to divide the widths of the paths internal to the compute blocks by the factor 𝑀, while the widths of the external paths remain unchanged (waveform 3 in Figure 2). The internal compute part runs according to 𝑐𝑙𝑘1, but we no longer