TensorFlow as a DSL for stencil-based
computation on the Cerebras Wafer Scale Engine
Nick Brown1, Brandon Echols2, Justs Zarins1, and Tobias Grosser3
1EPCC, University of Edinburgh, Bayes Centre, Edinburgh, UK
2Lawrence Livermore National Laboratory, Livermore, California, USA
3School of Informatics, University of Edinburgh, Informatics Forum, Edinburgh, UK
Abstract. The Cerebras Wafer Scale Engine (WSE) is an accelerator
that combines hundreds of thousands of AI-cores onto a single chip.
Whilst this technology has been designed for machine learning work-
loads, the significant amount of available raw compute means that it is
also a very interesting potential target for accelerating traditional HPC
computational codes. Many of these algorithms are stencil-based, where
update operations involve contributions from neighbouring elements, and
in this paper we explore the suitability of this technology for such codes
from the perspective of an early adopter of the technology, compared
to CPUs and GPUs. Using TensorFlow as the interface, we explore the
performance and demonstrate that, whilst there is still work to be done
around exposing the programming interface to users, performance of the
WSE is impressive as it outperforms four V100 GPUs by two and a half
times and two Intel Xeon Platinum CPUs by around 114 times in our
experiments. There is significant potential therefore for this technology
to play an important role in accelerating HPC codes on future exascale
supercomputers.
1 Introduction
Scientists and engineers are forever demanding the ability to model larger sys-
tems at reduced time to solution. This ambition is driving the HPC community
towards exascale, and given the popularity of accelerators in current generation
supercomputers it is safe to assume that they will form a major component of
future exascale machines. Whilst GPUs have become dominant in HPC, an im-
portant question is the role that other more novel technologies might also play
in increasing the capabilities of scientific simulation software. One such technol-
ogy is Cerebras’ Wafer Scale Engine (WSE) which is an accelerator containing
hundreds of thousands of relatively simple AI cores. Whilst the major target
for Cerebras to this point has been accelerating machine learning workloads,
as the cores are optimised for processing sparse tensor operations they are
capable of executing general purpose workloads, and furthermore this is combined
with massive on-chip memory bandwidth and interconnect performance.
Put simply, the WSE has significant potential for accelerating traditional HPC
computational kernels in addition to machine learning models.
arXiv:2210.04795v1 [cs.DC] 26 Aug 2022
There are currently a handful of Cerebras machines which are publicly avail-
able, making testing and exploration of the architecture difficult. Furthermore,
the software stack is optimised for machine learning workloads, and whilst Cere-
bras are making impressive progress in this regard, for instance the recent an-
nouncement of their SDK [5], at the time of writing machine interaction is com-
monly driven via high level machine learning tools. It is currently a very exciting
time for the WSE, with Cerebras making numerous advances in both their soft-
ware and future hardware offering. Consequently, whilst the technology is still
in a relatively early state, at this stage understanding its overall suitability for
HPC workloads compared with other hardware is worthwhile, especially as the
Cerebras offering is set to mature and grow in coming years.
In this paper we explore the suitability of the Cerebras WSE for accelerating
stencil-based computational algorithms. Section 2 introduces the background to
this work by describing the WSE in more detail and how one interacts with the
machine, along with other related work on the WSE. In Section 3 we explore how
one must currently program the architecture for computational workloads and
then, by running on a Cerebras CS-1, in Section 4 use a stencil-based benchmark
to compare the performance properties of the WSE against four V100 GPUs and
two 18-core Intel Xeon Platinum CPUs, before concluding in Section 5.
2 Background and related work
The Cerebras WSE has been used by various organisations, including large global
corporations, for accelerating machine learning. Already there have been numer-
ous notable successes from running AI models on the WSE including new drug
discovery [2], advancing treatments for fighting cancer [3], and helping to tackle
the COVID-19 pandemic [6]. The benefits of accelerating machine learning
workloads have been well proven; however, there are far fewer studies concerned with
using the WSE to run more traditional computational tasks.
One such study was undertaken in [4] where the authors ported the BiCGSTAB
solver, a Krylov Subspace method for solving systems of linear equations, and
also a simple CFD benchmark onto the Cerebras CS-1. Whilst their raw results
were impressive, the authors used Cerebras’ low level interface for this work,
programming each individual core separately and manually configuring the on-
chip network. This required a very deep understanding of the architecture, and
furthermore as the work was undertaken in part by Cerebras employees they
had access to this proprietary tooling which is not publicly available to users.
In this work we focus on stencil-based algorithms because of their suitability
for mapping to the WSE architecture and TensorFlow programming interface
(see Section 3). When calculating the value of a grid cell, stencils represent
a fixed pattern of contributions from neighbouring elements. Such algorithms
most commonly operate iteratively: at each iteration the value held in each
grid cell is updated based upon some weighted contribution of values held in
neighbouring cells. This
form of algorithm is widespread in scientific computing and hence represents the
underlying computational pattern in use by a large number of HPC codes.
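As an illustrative sketch (not code from the paper), the kind of iterative update described above can be written in a few lines of Python/NumPy for a five-point Jacobi-style stencil; the function name and equal weighting of 0.25 per neighbour are assumptions chosen for simplicity:

```python
import numpy as np

def jacobi_step(grid):
    """One iteration of a five-point stencil: each interior cell becomes
    the average of its four von Neumann neighbours. Boundary cells are
    left unchanged (a fixed Dirichlet-style boundary)."""
    new = grid.copy()
    new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1]   # north neighbour
                              + grid[2:, 1:-1]  # south neighbour
                              + grid[1:-1, :-2] # west neighbour
                              + grid[1:-1, 2:]) # east neighbour
    return new
```

The slicing formulation updates every interior cell at once from the previous iteration's values, which is the data-parallel structure that makes such kernels a natural fit for wide architectures.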
2.1 Cerebras Wafer Scale Engine
The Cerebras Wafer Scale Engine (WSE) is a MIMD accelerator and on the
CS-1, the hardware used for this work, there are approximately 350,000
processing cores running concurrently, each able to execute different instructions on
different data elements. The WSE provides more flexibility than a GPU, where
groups of cores must operate in lock-step within a warp. At the physical
level the WSE is composed of a wafer containing
84 dies, with each die comprising 4539 individual tiles. Each tile holds a single
processing element, which is a computational core, a router, and 48KB of SRAM
memory. In total there is approximately 18GB of SRAM memory on the CS-1
but this is distributed on a processing element by processing element basis. Each
computational core supports operations on 16-bit integers, and both 16-bit and
32-bit floating point numbers, with the IEEE floating point standard supported
for both floating point bit sizes and additionally Cerebras’s own CB16. Each
core provides 4-way SIMD for 16-bit floating point addition, multiplication, and
fused multiply accumulate (FMAC) operations, 2-way SIMD for mixed preci-
sion (16-bit multiplications and 32-bit additions), and one operation per cycle is
possible for 32-bit arithmetic.
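As a back-of-the-envelope illustration (not from the paper; the text does not state a clock frequency, so throughput is expressed per cycle rather than per second), the peak arithmetic rates implied by these figures can be tallied as follows:

```python
# Figures taken from the text; everything else is simple arithmetic.
CORES = 350_000        # approximate processing element count on the CS-1
FP16_SIMD_WIDTH = 4    # 4-way SIMD for fp16 FMAC operations
FLOPS_PER_FMAC = 2     # a fused multiply-accumulate counts as two FLOPs

# Peak fp16 FLOPs per clock cycle across the whole wafer
fp16_flops_per_cycle = CORES * FP16_SIMD_WIDTH * FLOPS_PER_FMAC

# fp32: one operation per core per cycle
fp32_ops_per_cycle = CORES * 1

print(fp16_flops_per_cycle)  # 2,800,000 fp16 FLOPs per cycle
print(fp32_ops_per_cycle)    # 350,000 fp32 ops per cycle
```

Even without a clock frequency, the 8x gap per cycle between peak fp16 FMAC throughput and fp32 arithmetic shows why precision choice matters so much on this architecture.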
The WSE is designed to accelerate computation involved in model training
and inference, with numerous support functions undertaken by the host machine.
The host is connected to the WSE via twelve 100 GbE network connections,
and undertakes activities including model compilation, input data preprocessing,
streaming input and output model data, and managing the overall model train-
ing. The Cerebras machine used for this work is a CS-1 hosted by EPCC and
connected to a host Superdome Flex Server (containing twenty-four Intel Xeon
Platinum 8260 CPUs, with each CPU containing 24 physical cores and a total
of 17TB RAM).
2.2 Programming the Wafer Scale Engine
In [4] the authors programmed their kernels for the CS-1 using a bespoke low
level interface, however this is proprietary and not exposed to users. Cerebras
have recently announced the availability of their SDK [5] for general purpose
programming of the WSE and whilst this is a very important step in widening
the workloads that can be executed on the architecture, it requires programmers
to invest significant time gaining the expertise needed to write optimal
code for the WSE. Consequently, in this work we use the TensorFlow API,
which abstracts the tricky and low level details of decomposing the workload
into tasks, mapping these to cores, and determining the appropriate routing
strategy. Hence, whilst our objective is to focus on stencil-based rather
than machine learning codes, encoding our algorithm via TensorFlow enables us
to undertake performance explorations for this workload, to understand whether
it is worthwhile investing the time in using the Cerebras SDK, and to port
such algorithms to the WSE more quickly for such evaluations.
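To make the mapping concrete: a stencil update is exactly a 2D convolution with fixed (non-trainable) kernel weights, which is the kind of operation TensorFlow expresses natively (e.g. via its convolution primitives). The sketch below uses plain NumPy to stay self-contained; the kernel values and function name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# A five-point stencil expressed as a fixed 3x3 convolution kernel.
# Handing such a frozen kernel to a framework convolution op is one
# plausible way to encode a stencil in TensorFlow.
KERNEL = np.array([[0.0,  0.25, 0.0],
                   [0.25, 0.0,  0.25],
                   [0.0,  0.25, 0.0]])

def conv_stencil(grid, kernel=KERNEL):
    """Apply the kernel at every interior point ('valid'-style interior,
    boundaries left at zero). Equivalent to one stencil iteration."""
    out = np.zeros_like(grid)
    for i in range(1, grid.shape[0] - 1):
        for j in range(1, grid.shape[1] - 1):
            out[i, j] = np.sum(kernel * grid[i-1:i+2, j-1:j+2])
    return out
```

Because the kernel weights are fixed rather than learned, the "training" machinery is unused: the framework is serving purely as a DSL for describing the data movement and per-cell arithmetic of the stencil.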