A Near-Sensor Processing Accelerator for
Approximate Local Binary Pattern Networks
Shaahin Angizi, Mehrdad Morsali, Sepehr Tabrizchi, and Arman Roohi
Abstract—In this work, a high-speed and energy-efficient comparator-based Near-Sensor Local Binary Pattern accelerator
architecture (NS-LBP) is proposed to execute a novel local binary pattern deep neural network. First, inspired by recent LBP networks,
we design an approximate, hardware-oriented, and multiply-accumulate (MAC)-free network named Ap-LBP for efficient feature
extraction, further reducing the computation complexity. Then, we develop NS-LBP as a processing-in-SRAM unit and a parallel
in-memory LBP algorithm to process images near the sensor in a cache, remarkably reducing the power consumption of data
transmission to an off-chip processor. Our circuit-to-application co-simulation results on MNIST and SVHN data-sets demonstrate
minor accuracy degradation compared to baseline CNN and LBP-network models, while NS-LBP achieves a 1.25 GHz clock frequency and an energy efficiency of 37.4 TOPS/W. NS-LBP reduces energy consumption by 2.2× and execution time by 4× compared to the best recent LBP-based networks.
Index Terms—Processing-in-memory, accelerator, near-sensor processing, SRAM.
1 INTRODUCTION
INTERNET of Things (IoT) nodes consist of sensory systems, which enable massive data collection from the environment and people for processing by on-/off-chip processors (~10^18 ops). In most cases, large portions of the captured sensory data are redundant and unstructured. Data conversion and transmission of large volumes of raw data to a back-end processor impose high energy consumption, high latency, and low-speed feature extraction at the edge [1]. To overcome these
issues, computing architectures will need to shift from a
cloud-centric approach to a thing-centric (data-centric) ap-
proach, where the IoT node processes the sensed data. This
paves the way for a new smart sensor processing architecture [2], [3], in which the pixels' digital output is processed near the sensor by leveraging an on-chip processor. Unless a
Processing-in-Memory (PIM) mechanism is exploited [4],
[5] in this method, the von-Neumann computing model
with separate memory and processing blocks connecting via
buses imposes long memory access latency, limited memory
bandwidth, and energy-hungry data transfer restricting the
edge device’s efficiency and working hours [1], [6]. The
main idea of PIM is to incorporate logic computation within
memory units to process data internally.
From the computation perspective, numerous artifi-
cial intelligence applications require intensive multiply-
accumulate (MAC) operations, which contribute to over
90% of various deep Convolutional Neural Networks
(CNN) operations [5]. Various processing-in-SRAM plat-
forms have been developed in recent literature [5], [7]–[10].
This work is supported in part by the National Science Foundation under Grant No. 2228028, 2216772, and 2216773. S. Angizi and M. Morsali are with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ, USA. E-mail: shaahin.angizi@njit.edu. A. Roohi and S. Tabrizchi are with the School of Computing, University of Nebraska–Lincoln, Lincoln, NE, USA. E-mail: aroohi@unl.edu.

Compute cache [7] supports simple bit-parallel operations (logical and copy) that do not require interaction between bit-lines.
bit-lines. Neural Cache [5] presents an 8T transposable
SRAM bit-cell and supports bit-serial in-cache MAC opera-
tion. Nevertheless, this design suffers from a very slow clock frequency and a large cell and Sense Amplifier (SA) area overhead. In [11], a new approach to improve the performance
of the Neural Cache has been presented based on 6T SRAM,
enabling faster multiplication and addition with a large
SA overhead. While the presented designs show acceptable
performance over various image data-sets by reducing the
number of operations, i.e., MACs, using shallower models,
quantization, pruning, etc., they are essentially developed
to execute the existing CNN algorithms that lead to a gap
between meets and needs. We believe such a discrepancy
can be avoided by co-developing an intrinsically-low computa-
tion network and an efficient PIM platform on the sensor side.
Regarding the model reduction of CNNs, Local Binary Pattern (LBP)-based implementations have attracted wide attention for edge devices while achieving comparable inference accuracy [12]–[14]. More interestingly, the number of convolution operations is drastically reduced owing to the sparsity of the kernels and their conversion to simpler operations such as addition/subtraction [15] and comparison [16].
In this work, inspired by recent LBP networks, (1) we
first develop a novel approximate, hardware-oriented, and
MAC-free neural network named Ap-LBP in Section 3 to
reduce computation complexity and memory access by dis-
regarding the least significant pixels to perform efficient
feature extraction. The Ap-LBP is leveraged on the sen-
sor side to simplify LBP layers before even mapping the
data into a near-sensor memory; (2) we design NS-LBP as a comparator-based processing-in-SRAM architecture, in conjunction with a parallel in-memory LBP algorithm, in Section 4, which remarkably reduces the power consumption as well as the latency of data transmission to a back-end
processor; (3) In Section 5, we propose a correlated data par-
titioning and hardware mapping methodology to process
the network locally; and (4) We extensively evaluate NS-LBP performance, energy efficiency, and inference accuracy trade-offs compared to recent designs with a bottom-up evaluation framework in Section 6.

arXiv:2210.06698v1 [cs.AR] 13 Oct 2022
2 BACKGROUND & MOTIVATION
2.1 Near-Sensor & In-Sensor Processing
Systematic integration of computing and sensor arrays has been widely studied to eliminate off-chip data transmission and reduce Analog-to-Digital Converter (ADC) bandwidth, either by combining a CMOS image sensor and processors on one chip, known as Processing-Near-Sensor (PNS) [2], [3], [17]–[23], or by integrating pixels and a computation unit, so-called Processing-In-Sensor (PIS) [24]–[32]. However, since raising the throughput and computation load on resource-limited IoT devices increases the temperature, power consumption, and noise, leading to accuracy degradation [24], [33], the computational capabilities of PNS/PIS platforms have been limited to less complex applications [1], [34]–[36].
This includes particular feature extraction tasks, e.g., Haar-
like image filtering [34] and blurring [3]. Various powerful
processing-in-SRAM (in-cache computing) accelerators have
been developed in recent literature that can be employed as
a PNS unit [5], [7]–[11], [37]–[40]. Compute cache [7] sup-
ports simple bit-parallel operations (logical and copy) that
do not require interaction between bit-lines. XNOR-SRAM
[10] accelerates ternary-XNOR-and-accumulate operations
in binary/ternary Deep Neural Networks (DNNs) without
row-by-row data access. C3SRAM [9] leverages capacitive-
coupling computing to perform XNOR-and-accumulate op-
erations for binary DNNs. However, both XNOR-SRAM and
C3SRAM impose huge overhead over the traditional SRAM
array by directly modifying the bit-cells. Neural Cache [5]
presents an 8T transposable SRAM bit-cell and supports bit-
serial in-cache MAC operation. Nevertheless, this design suffers from a very slow clock frequency and a large cell and SA area overhead. In [11], a new approach to improve the
performance of the Neural Cache has been presented based
on 6T SRAM, enabling faster multiplication and addition
with a large SA overhead. In the PIS domain, a CMOS image sensor with dual-mode delta-sigma ADCs is designed in [41] to process the first convolutional layer of Binarized-Weight Neural Networks (BWNNs). RedEye [42] executes the convolution operation using charge-sharing tunable capacitors.
Although this design shows an energy reduction compared to a CPU/GPU by sacrificing accuracy, achieving high-accuracy computation increases the required energy per frame dramatically, by 100×. Macsen [24] processes the first convolutional layer of BWNNs with a correlated double sampling procedure, achieving 1,000 fps in computation mode; however, it suffers from a large area overhead and high power consumption. There are three main bottlenecks in IoT imaging systems that this paper aims to solve: (1) data access and movement consume most of the power (>90% [24], [37], [43]) in conventional image sensors; (2) computation imposes a large area overhead and power consumption in more recent PNS/PIS units and requires extra memory for intermediate data storage; and (3) these systems are hardwired, so their performance is intrinsically limited to one specific type of algorithm or application
domain, which means that such accelerators cannot keep pace with rapidly evolving software algorithms.

Figure 1: (a) Standard LBP encoding with a 3×3 descriptor size (example encoding: LBP(C) = 00011110 = 30); (b) the structure of Ap-LBP, with N Ap-LBP blocks.
2.2 LBP-based Networks
An LBP kernel is a computationally efficient feature descriptor that scans through the entire image, like a convolutional layer in a CNN. The LBP descriptor is formed by serially comparing the intensity of the surrounding pixels with that of the central pixel, referred to as the pivot, in the selected image patch. Neighbors with higher (lower) intensities are assigned a binary value of '1' ('0'), and finally the bit stream is sequentially read and mapped to a decimal number that is the feature value assigned to the central pixel, as
shown in Fig. 1(a). The LBP encoding operation of the central pixel C(x_c, y_c) can be mathematically described as LBP(C) = Σ_{n=0}^{d²−2} cmp(i_n, i_c) × 2^n [12], where d is the dimension of the LBP descriptor, and i_n and i_c represent the intensities of the n-th neighboring pixel and the central pixel, respectively; thus, cmp(i_n, i_c) = 1 when i_n ≥ i_c, and 0 otherwise. The LBP operation can be simulated using a ReLU layer and the difference between pixel values. As part of the
training process, (i) the sparse binarized difference filters are
fixed, (ii) the one-by-one convolution kernels function as a
channel fusion mechanism, and (iii) the parameters in the
batch normalization layer (batch norm) are learned.
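The encoding described above can be sketched as a short Python function. The clockwise neighbor ordering used here is one common convention chosen purely for illustration; a particular implementation may read the bit stream in a different order.

```python
import numpy as np

def lbp_encode(patch):
    """Standard LBP encoding of a square patch: compare each neighbor
    against the central (pivot) pixel and pack the comparison bits
    into a decimal feature value."""
    d = patch.shape[0]                 # descriptor dimension (3 for a 3x3 patch)
    ic = patch[d // 2, d // 2]         # intensity of the central (pivot) pixel
    # Neighbors read clockwise around the pivot (an illustrative ordering).
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2],
                 patch[1, 2], patch[2, 2], patch[2, 1],
                 patch[2, 0], patch[1, 0]]
    # cmp(i_n, i_c) = 1 when i_n >= i_c, weighted by 2^n
    return sum(int(i_n >= ic) << n for n, i_n in enumerate(neighbors))
```

For a 3×3 patch this yields a feature value between 0 and 255; note that only comparisons and bit shifts are involved, with no multiplications.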
The Local Binary Pattern Network (LBPNet) [44] and the Local Binary Convolutional Neural Network (LBCNN) [15] are two recent LBP networks in which the convolutions are approximated by local binary additions/subtractions and local binary comparisons, respectively. It should be noted that LBPNet and LBCNN are quite different, despite the similarity of their names, as illustrated in Fig. 2. In the
LBCNN, batch norm layers are still heavily utilized; these perform the linear transform in floating-point arithmetic. Moreover, since the size and computation of 2D batch norm layers scale linearly with the size of the feature maps, the model complexity increases dramatically. Therefore, the use of LBCNNs in resource-constrained edge devices, such as sensors, remains challenging and impractical. LBPNets, on
the other hand, directly learn the sparse and discrete LBP kernels, which are typically as small as a few KB. By using LBPNet, the computation of dot products and sliding windows for convolution can be avoided. Instead, the input is sampled and compared, and the results of the comparisons are stored at determined locations. Therefore, LBPNet holds only trained patterns of sampling locations and performs no MAC operations (it is convolution-free), making it a hardware-friendly model that is suitable for edge computing.