TensorFlow as a DSL for stencil-based
computation on the Cerebras Wafer Scale Engine
Nick Brown1, Brandon Echols2, Justs Zarins1, and Tobias Grosser3
1EPCC, University of Edinburgh, Bayes Centre, Edinburgh, UK
2Lawrence Livermore National Laboratory, Livermore, California, USA
3School of Informatics, University of Edinburgh, Informatics Forum, Edinburgh, UK
Abstract. The Cerebras Wafer Scale Engine (WSE) is an accelerator
that combines hundreds of thousands of AI-cores onto a single chip.
Whilst this technology has been designed for machine learning work-
loads, the significant amount of available raw compute means that it is
also a very interesting potential target for accelerating traditional HPC
computational codes. Many of these algorithms are stencil-based, where
update operations involve contributions from neighbouring elements, and
in this paper we explore the suitability of this technology for such codes
from the perspective of an early adopter of the technology, compared
to CPUs and GPUs. Using TensorFlow as the interface, we explore the
performance and demonstrate that, whilst there is still work to be done
around exposing the programming interface to users, performance of the
WSE is impressive as it outperforms four V100 GPUs by two and a half
times and two Intel Xeon Platinum CPUs by around 114 times in our
experiments. There is significant potential therefore for this technology
to play an important role in accelerating HPC codes on future exascale
supercomputers.
1 Introduction
Scientists and engineers are forever demanding the ability to model larger sys-
tems at reduced time to solution. This ambition is driving the HPC community
towards exascale, and given the popularity of accelerators in current generation
supercomputers it is safe to assume that they will form a major component of
future exascale machines. Whilst GPUs have become dominant in HPC, an im-
portant question is the role that other more novel technologies might also play
in increasing the capabilities of scientific simulation software. One such technol-
ogy is Cerebras’ Wafer Scale Engine (WSE) which is an accelerator containing
hundreds of thousands of relatively simple AI cores. Whilst the major target
for Cerebras to this point has been accelerating machine learning workloads,
as the cores are optimised for processing sparse tensor operations they are
capable of executing general purpose workloads, and furthermore this is combined
with massive on-chip memory bandwidth and interconnect performance.
Put simply, the WSE has significant potential for accelerating traditional HPC
computational kernels in addition to machine learning models.
arXiv:2210.04795v1 [cs.DC] 26 Aug 2022
There are currently a handful of Cerebras machines which are publicly avail-
able, making testing and exploration of the architecture difficult. Furthermore,
the software stack is optimised for machine learning workloads, and whilst Cere-
bras are making impressive progress in this regard, for instance the recent an-
nouncement of their SDK [5], at the time of writing machine interaction is com-
monly driven via high level machine learning tools. It is currently a very exciting
time for the WSE, with Cerebras making numerous advances in both their soft-
ware and future hardware offering. Consequently, whilst the technology is still
in a relatively early state, at this stage understanding its overall suitability for
HPC workloads compared with other hardware is worthwhile, especially as the
Cerebras offering is set to mature and grow in coming years.
In this paper we explore the suitability of the Cerebras WSE for accelerating
stencil-based computational algorithms. Section 2 introduces the background to
this work by describing the WSE in more detail and how one interacts with the
machine, along with other related work on the WSE. In Section 3 we explore how
one must currently program the architecture for computational workloads and
then, by running on a Cerebras CS-1, in Section 4 use a stencil-based benchmark
to compare the performance properties of the WSE against four V100 GPUs and
two 18-core Intel Xeon Platinum CPUs, before concluding in Section 5.
2 Background and related work
The Cerebras WSE has been used by various organisations, including large global
corporations, for accelerating machine learning. Already there have been numer-
ous notable successes from running AI models on the WSE including new drug
discovery [2], advancing treatments for fighting cancer [3], and helping to tackle
the COVID-19 pandemic [6]. The benefits of accelerating machine learning
workloads have been well proven; however, there are far fewer studies concerned with
using the WSE to run more traditional computational tasks.
One such study was undertaken in [4] where the authors ported the BiCGSTAB
solver, a Krylov Subspace method for solving systems of linear equations, and
also a simple CFD benchmark onto the Cerebras CS-1. Whilst their raw results
were impressive, the authors used Cerebras’ low level interface for this work,
programming each individual core separately and manually configuring the on-
chip network. This required a very deep understanding of the architecture, and
furthermore as the work was undertaken in part by Cerebras employees they
had access to this proprietary tooling which is not publicly available to users.
In this work we focus on stencil-based algorithms because of their suitability
for mapping to the WSE architecture and TensorFlow programming interface
(see Section 3). When calculating the value of a grid cell, stencils represent
a fixed pattern of contributions from neighbouring elements. Such algorithms
most commonly operate iteratively: at each iteration the value held in each
grid cell is updated based upon some weighted contribution of values held in
neighbouring cells. This
form of algorithm is widespread in scientific computing and hence represents the
underlying computational pattern in use by a large number of HPC codes.
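As an illustrative sketch (not code from the paper), the kind of iterative update described above can be written in a few lines of Python/NumPy for a five-point Jacobi-style stencil; the function name and equal weighting of 0.25 per neighbour are assumptions chosen for simplicity:

```python
import numpy as np

def jacobi_step(grid):
    """One iteration of a five-point stencil: each interior cell becomes
    the average of its four von Neumann neighbours. Boundary cells are
    left unchanged (a fixed Dirichlet-style boundary)."""
    new = grid.copy()
    new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1]   # north neighbour
                              + grid[2:, 1:-1]  # south neighbour
                              + grid[1:-1, :-2] # west neighbour
                              + grid[1:-1, 2:]) # east neighbour
    return new
```

The slicing formulation updates every interior cell at once from the previous iteration's values, which is the data-parallel structure that makes such kernels a natural fit for wide architectures.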
2.1 Cerebras Wafer Scale Engine
The Cerebras Wafer Scale Engine (WSE) is a MIMD accelerator and on the
CS-1, the hardware used for this work, there are approximately 350,000
processing cores running concurrently, each able to execute different instructions on
different data elements. The WSE provides more flexibility than a GPU, where
groups of cores must operate in lock-step within a warp. At the physical
level the WSE is composed of a wafer containing
84 dies, with each die comprising 4539 individual tiles. Each tile holds a single
processing element, which is a computational core, a router, and 48KB of SRAM
memory. In total there is approximately 18GB of SRAM memory on the CS-1
but this is distributed on a processing element by processing element basis. Each
computational core supports operations on 16-bit integers, and both 16-bit and
32-bit floating point numbers, with the IEEE floating point standard supported
for both floating point bit sizes and additionally Cerebras’s own CB16. Each
core provides 4-way SIMD for 16-bit floating point addition, multiplication, and
fused multiply accumulate (FMAC) operations, 2-way SIMD for mixed preci-
sion (16-bit multiplications and 32-bit additions), and one operation per cycle is
possible for 32-bit arithmetic.
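As a back-of-the-envelope illustration (not from the paper; the text does not state a clock frequency, so throughput is expressed per cycle rather than per second), the peak arithmetic rates implied by these figures can be tallied as follows:

```python
# Figures taken from the text; everything else is simple arithmetic.
CORES = 350_000        # approximate processing element count on the CS-1
FP16_SIMD_WIDTH = 4    # 4-way SIMD for fp16 FMAC operations
FLOPS_PER_FMAC = 2     # a fused multiply-accumulate counts as two FLOPs

# Peak fp16 FLOPs per clock cycle across the whole wafer
fp16_flops_per_cycle = CORES * FP16_SIMD_WIDTH * FLOPS_PER_FMAC

# fp32: one operation per core per cycle
fp32_ops_per_cycle = CORES * 1

print(fp16_flops_per_cycle)  # 2,800,000 fp16 FLOPs per cycle
print(fp32_ops_per_cycle)    # 350,000 fp32 ops per cycle
```

Even without a clock frequency, the 8x gap per cycle between peak fp16 FMAC throughput and fp32 arithmetic shows why precision choice matters so much on this architecture.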
The WSE is designed to accelerate computation involved in model training
and inference, with numerous support functions undertaken by the host machine.
The host is connected to the WSE via twelve 100 GbE network connections,
and undertakes activities including model compilation, input data preprocessing,
streaming input and output model data, and managing the overall model train-
ing. The Cerebras machine used for this work is a CS-1 hosted by EPCC and
connected to a host Superdome Flex Server (containing twenty-four Intel Xeon
Platinum 8260 CPUs, with each CPU containing 24 physical cores and a total
of 17TB RAM).
2.2 Programming the Wafer Scale Engine
In [4] the authors programmed their kernels for the CS-1 using a bespoke low
level interface, however this is proprietary and not exposed to users. Cerebras
have recently announced the availability of their SDK [5] for general purpose
programming of the WSE and whilst this is a very important step in widening
the workloads that can be executed on the architecture, it requires programmers
to invest significant time gaining the expertise needed to write optimal
code for the WSE. Consequently, in this work we use the TensorFlow API,
which abstracts the tricky and low level details of decomposing the workload
into tasks, mapping these to cores, and determining the appropriate routing
strategy. Hence, whilst our objective is to focus on stencil-based rather
than machine learning codes, encoding our algorithm via TensorFlow enables us
to undertake performance explorations for this workload, to understand whether
it is worthwhile investing the time in using the Cerebras SDK, and to port
such algorithms to the WSE more quickly for such evaluations.
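To make the mapping concrete: a stencil update is exactly a 2D convolution with fixed (non-trainable) kernel weights, which is the kind of operation TensorFlow expresses natively (e.g. via its convolution primitives). The sketch below uses plain NumPy to stay self-contained; the kernel values and function name are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# A five-point stencil expressed as a fixed 3x3 convolution kernel.
# Handing such a frozen kernel to a framework convolution op is one
# plausible way to encode a stencil in TensorFlow.
KERNEL = np.array([[0.0,  0.25, 0.0],
                   [0.25, 0.0,  0.25],
                   [0.0,  0.25, 0.0]])

def conv_stencil(grid, kernel=KERNEL):
    """Apply the kernel at every interior point ('valid'-style interior,
    boundaries left at zero). Equivalent to one stencil iteration."""
    out = np.zeros_like(grid)
    for i in range(1, grid.shape[0] - 1):
        for j in range(1, grid.shape[1] - 1):
            out[i, j] = np.sum(kernel * grid[i-1:i+2, j-1:j+2])
    return out
```

Because the kernel weights are fixed rather than learned, the "training" machinery is unused: the framework is serving purely as a DSL for describing the data movement and per-cell arithmetic of the stencil.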