AI and ML Accelerator Survey and Trends
Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner
MIT Lincoln Laboratory Supercomputing Center
Lexington, MA, USA
{reuther,pmichaleas,michael.jones,vijayg,sid,kepner}@ll.mit.edu
Abstract—This paper updates the survey of AI accelerators and processors from the past three years. It collects and summarizes the commercial accelerators that have been publicly announced with peak performance and power consumption numbers. These performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are again discussed and analyzed. Two new trends plots based on accelerator release dates are included in this year's paper, along with additional trends for some neuromorphic, photonic, and memristor-based inference accelerators.
Index Terms—Machine learning, GPU, TPU, dataflow, accelerator, embedded inference, computational performance
I. INTRODUCTION
Just as last year, the pace of new announcements, releases, and deployments of artificial intelligence (AI) and machine learning (ML) accelerators from startups and established technology companies has been modest. This is not unreasonable: many companies that have released an accelerator report having spent three or four years researching, analyzing, designing, verifying, and validating their accelerator design trade-offs and building the software stack to program the accelerator. Companies that have released subsequent versions of their accelerators report shorter development cycles, though still at least two or three years. The focus of these accelerators continues to be on accelerating deep neural network (DNN) models, with an application space that spans from very low power embedded voice recognition and image classification to data center scale training. The competition for defining markets and application areas continues as part of a much larger industrial and technological shift in modern computing toward machine learning solutions.
AI ecosystems bring together components from embedded computing (edge computing), traditional high performance computing (HPC), and high performance data analysis (HPDA) that must work together to effectively provide capabilities for use by decision makers, warfighters, and analysts [1]. Figure 1 captures an architectural overview of such end-to-end AI solutions and their components. On the left side of Figure 1, structured and unstructured data sources provide different views of entities and/or phenomenology.
This material is based upon work supported by the Assistant Secretary
of Defense for Research and Engineering under Air Force Contract No.
FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations
expressed in this material are those of the author(s) and do not necessarily
reflect the views of the Assistant Secretary of Defense for Research and
Engineering.
Fig. 1: Canonical AI architecture consists of sensors, data conditioning, algorithms, modern computing, robust AI, human-machine teaming, and users (missions). Each step is critical in developing end-to-end AI applications and systems.
These raw data products are fed into a data conditioning step
in which they are fused, aggregated, structured, accumulated,
and converted into information. The information generated by
the data conditioning step feeds into a host of supervised
and unsupervised algorithms such as neural networks, which
extract patterns, predict new events, fill in missing data, or
look for similarities across datasets, thereby converting the
input information to actionable knowledge. This actionable
knowledge is then passed to human beings for decision-making in the human-machine teaming phase, which provides users with useful and relevant insight, turning knowledge into actionable intelligence.
Underpinning this system are modern computing systems.
Moore's law trends have ended [2], as have a number of related laws and trends including Dennard scaling (power density), clock frequency, core counts, instructions per clock cycle, and instructions per Joule (Koomey's law) [3]. Taking a page
from the system-on-chip (SoC) trends first seen in automotive
applications, robotics, and smartphones, advancements and
innovations are still progressing by developing and integrating
accelerators for often-used operational kernels, methods, or
functions. These accelerators are designed with a different
balance between performance and functional flexibility. This
includes an explosion of innovation in deep machine learning
processors and accelerators [4]–[8]. In this series of survey
papers, we explore the relative benefits of these technologies
since they are of particular importance to applying AI to
domains under significant constraints such as size, weight, and
power, both in embedded applications and in data centers.
This paper is an update to IEEE-HPEC papers from the past three years [9]–[11]. As in past years, this paper continues last year's focus on accelerators and processors that are geared toward deep neural networks (DNNs) and convolutional neural networks (CNNs), as they are quite computationally intense [12]. This survey focuses on accelerators and processors for inference for a variety of reasons, including that defense and national security AI/ML edge applications rely heavily on inference. We consider all of the numerical precision types that an accelerator supports, but for most accelerators the best inference performance is in int8 or fp16/bf16 (IEEE 16-bit floating point or Google's 16-bit brain float).
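To make the precision trade-off concrete, here is a minimal Python sketch (an illustration added for this discussion, not part of the survey data): bf16 keeps fp32's 8-bit exponent but only 7 mantissa bits, so it covers fp32's dynamic range at reduced precision, while fp16's 5-bit exponent overflows much sooner.

```python
import numpy as np

# fp16 (IEEE half):   1 sign + 5 exponent + 10 mantissa bits -> max value ~65504
# bf16 (brain float): 1 sign + 8 exponent +  7 mantissa bits -> fp32-like range
def to_bf16(x):
    """Approximate fp32 -> bfloat16 by truncating to the upper 16 bits."""
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    return ((bits >> 16) << 16).view(np.float32)  # keep sign, exponent, 7 mantissa bits

x = np.float32(1e5)        # representable in bf16, but overflows fp16
print(np.float16(x))       # inf      -- exceeds fp16's ~65504 maximum
print(to_bf16(x))          # [99840.] -- coarse mantissa, but still in range
```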
There are many surveys [13]–[24] and other papers that
cover various aspects of AI accelerators. For instance, the first
paper in this multi-year survey included the peak performance
of FPGAs for certain AI models; however, several of the
aforementioned surveys cover FPGAs in depth so they are
no longer included in this survey. This multi-year survey
effort and this paper focus on gathering a comprehensive list
of AI accelerators with their computational capability, power
efficiency, and ultimately the computational effectiveness of
utilizing accelerators in embedded and data center applications. Along with this focus, this paper mainly compares
neural network accelerators that are useful for government
and industrial sensor and data processing applications. A few
accelerators and processors that were included in previous
years’ papers have been left out of this year’s survey. They
have been dropped because they have been surpassed by
new accelerators from the same company, they are no longer
offered, or they are no longer relevant to the topic.
II. SURVEY OF PROCESSORS
Many recent advances in AI can be at least partly credited to advances in computing hardware [6], [7], [25], [26], enabling computationally heavy machine-learning algorithms and in particular DNNs. This survey gathers performance and power information from publicly available materials including research papers, technical trade press, company benchmarks, etc. While there are ways to access information from companies and startups (including those in their silent period), this information is intentionally left out of this survey; such data will be included in this survey when it becomes publicly available. The key metrics of this public data are plotted in Figure 2, which graphs recent processor capabilities (as of July 2022), mapping peak performance vs. power consumption. The dash-dotted box depicts the very dense region that is zoomed in and plotted in Figure 3.
The x-axis indicates peak power, and the y-axis indicates peak giga-operations per second (GOps/s), both on a logarithmic scale. The computational precision of the processing capability is depicted by the geometric shape used; the computational precision spans from analog and single-bit int1 to four-byte int32 and from two-byte fp16 to eight-byte fp64. Precision entries that show two types denote the precision of the multiplication operations on the left and the precision of the accumulate/addition operations on the right (for example, fp16.32 corresponds to fp16 for multiplication and fp32 for accumulate/add). The form factor is depicted by color, which shows the package for which peak power is reported: blue corresponds to a single chip; orange corresponds to a card; and green corresponds to an entire system (single-node desktop and server systems). This survey is limited to single-motherboard, single-memory-space systems. Finally, hollow geometric markers denote peak performance for inference-only accelerators, while solid geometric markers denote peak performance for accelerators that are designed to perform both training and inference.
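To illustrate how such a plot is encoded, the following minimal matplotlib sketch uses invented placeholder entries (not data points from the survey): marker shape stands in for precision, color for form factor, and hollow versus filled markers for inference-only versus training-and-inference.

```python
import matplotlib.pyplot as plt

# Placeholder entries: (label, peak power in W, peak GOps/s, marker, color, filled)
entries = [
    ("chip-int8-inference", 2.0,   4e3, "s", "tab:blue",   False),
    ("card-fp16-training",  300.0, 3e5, "o", "tab:orange", True),
    ("system-fp64-training", 6.5e3, 2e7, "^", "tab:green", True),
]

fig, ax = plt.subplots()
for label, watts, gops, marker, color, filled in entries:
    # Hollow markers (facecolors="none") mark inference-only accelerators.
    ax.scatter(watts, gops, marker=marker, s=80,
               facecolors=color if filled else "none", edgecolors=color)
    ax.annotate(label, (watts, gops), textcoords="offset points", xytext=(5, 5))

ax.set_xscale("log")   # both axes logarithmic, as in Figure 2
ax.set_yscale("log")
ax.set_xlabel("Peak Power (W)")
ax.set_ylabel("Peak Performance (GOps/s)")
plt.show()
```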
The survey begins with the same scatter plot that we have
compiled for the past three years. As we did last year, to save
space, we have summarized some of the important metadata
of the accelerators, cards, and systems in Table I, including
the label used in Figure 2 for each of the points on the
graph; many of the points were brought forward from last
year’s plot, and some details of those entries are in [9].
There are several additions, which we will cover below. In Table I, most of the columns and entries are self-explanatory. However, there are two Technology entries that may
not be: dataflow and PIM. Dataflow processors are custom-
designed processors for neural network inference and training.
Since neural network training and inference computations can
be entirely deterministically laid out, they are amenable to
dataflow processing in which computations, memory accesses,
and inter-ALU communication actions are explicitly/statically
programmed or “placed-and-routed” onto the computational
hardware. Processor in memory (PIM) accelerators integrate
processing elements with memory technology. Among such
PIM accelerators are those based on an analog computing
technology that augments flash memory circuits with in-place
analog multiply-add capabilities. Please refer to the references
for the Mythic and Gyrfalcon accelerators for more details on
this innovative technology.
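As a rough intuition for the analog in-memory approach, consider the toy numerical sketch below (an illustration we add here, not a model of the Mythic or Gyrfalcon hardware): weights sit in flash cells as conductances, input activations are applied as voltages, the cell currents sum on a shared line so an entire dot product happens in place, and the finite resolution of the output ADC is what bounds precision. The full-scale range and 8-bit resolution below are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, 128)   # weights stored as cell conductances
x = rng.uniform(-1, 1, 128)   # input activations applied as voltages

# Currents from all cells sum on a shared bit line: an in-place dot product.
analog_sum = np.dot(w, x)

# The analog result must pass through a finite-resolution ADC; here an
# assumed 8-bit quantizer over an assumed full-scale range of +/- 16.
full_scale, levels = 16.0, 2**8
step = 2 * full_scale / levels
digitized = np.clip(np.round(analog_sum / step) * step, -full_scale, full_scale)

print(f"exact: {analog_sum:.4f}  quantized: {digitized:.4f}")
```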
Finally, a reasonable categorization of accelerators follows their intended application, and the five categories are shown as ellipses on the graph, which roughly correspond to performance and power consumption: Very Low Power for speech processing, very small sensors, etc.; Embedded for cameras, small UAVs and robots, etc.; Autonomous for driver assist services, autonomous driving, and autonomous robots; Data Center Chips and Cards; and Data Center Systems.
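For readers reproducing this categorization, a simple sketch that bins an accelerator by its peak power draw follows; the thresholds here are illustrative approximations of the ellipse boundaries, not values stated in the survey.

```python
def power_category(peak_watts: float) -> str:
    """Illustrative power-based binning; boundary values are approximate."""
    if peak_watts < 1:
        return "Very Low Power"      # speech processing, very small sensors
    if peak_watts < 20:
        return "Embedded"            # cameras, small UAVs and robots
    if peak_watts < 100:
        return "Autonomous"          # driver assist, autonomous driving/robots
    if peak_watts < 1000:
        return "Data Center Chips and Cards"
    return "Data Center Systems"

print(power_category(0.3))   # Very Low Power
print(power_category(75))    # Autonomous
```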
For most of the accelerators, the descriptions and commentaries have not changed since last year, so please refer to the last two years' papers for those descriptions and commentaries. There are, however, several new releases that were not covered by past papers, which are covered here.
• Axelera, a Dutch embedded systems startup, reported the results of an embedded test chip that they have produced [35]. They claim both digital and analog design capabilities, and this test chip was made to test the extent of their digital design capabilities. They expect to add analog (probably flash) design elements in upcoming efforts.
• Maxim Integrated has released a system-on-chip (SoC) for ultra low power applications called the MAX78000 [74]–[76], which includes an ARM CPU core, a RISC-V CPU core, and an AI accelerator.