A Near-Sensor Processing Accelerator for
Approximate Local Binary Pattern Networks
Shaahin Angizi, Mehrdad Morsali, Sepehr Tabrizchi, and Arman Roohi
Abstract—In this work, a high-speed and energy-efficient comparator-based Near-Sensor Local Binary Pattern accelerator
architecture (NS-LBP) is proposed to execute a novel local binary pattern deep neural network. First, inspired by recent LBP networks,
we design an approximate, hardware-oriented, and multiply-accumulate (MAC)-free network named Ap-LBP for efficient feature
extraction, further reducing the computation complexity. Then, we develop NS-LBP as a processing-in-SRAM unit and a parallel
in-memory LBP algorithm to process images near the sensor in a cache, remarkably reducing the power consumption of data
transmission to an off-chip processor. Our circuit-to-application co-simulation results on MNIST and SVHN data-sets demonstrate
minor accuracy degradation compared to baseline CNN and LBP-network models, while NS-LBP achieves a 1.25 GHz clock frequency and an energy efficiency of 37.4 TOPS/W. NS-LBP reduces energy consumption by 2.2× and execution time by 4× compared to the best recent LBP-based networks.
Index Terms—Processing-in-memory, accelerator, near-sensor processing, SRAM.
1 INTRODUCTION
INTERNET of Things (IoT) nodes consist of sensory systems, which enable massive data collection from the environment and people for processing by on-/off-chip processors (~10^18 ops). In most cases, large portions of the captured sensory data are redundant and unstructured. Data conversion and transmission of large volumes of raw data to a back-end processor impose high energy consumption, high latency, and low-speed feature extraction at the edge [1]. To overcome these
issues, computing architectures will need to shift from a
cloud-centric approach to a thing-centric (data-centric) ap-
proach, where the IoT node processes the sensed data. This
paves the way for a new smart sensor processing architecture [2], [3], in which the pixels' digital output is processed near the sensor by leveraging an on-chip processor. Unless a
Processing-in-Memory (PIM) mechanism is exploited [4],
[5] in this method, the von-Neumann computing model
with separate memory and processing blocks connecting via
buses imposes long memory access latency, limited memory
bandwidth, and energy-hungry data transfer restricting the
edge device’s efficiency and working hours [1], [6]. The
main idea of PIM is to incorporate logic computation within
memory units to process data internally.
From the computation perspective, numerous artifi-
cial intelligence applications require intensive multiply-
accumulate (MAC) operations, which contribute to over
90% of various deep Convolutional Neural Networks
(CNN) operations [5]. Various processing-in-SRAM plat-
forms have been developed in recent literature [5], [7]–[10].
This work is supported in part by the National Science Foundation under Grant No. 2228028, 2216772, and 2216773. S. Angizi and M. Morsali are with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ, USA. E-mail: shaahin.angizi@njit.edu. A. Roohi and S. Tabrizchi are with the School of Computing, University of Nebraska–Lincoln, Lincoln, NE, USA. E-mail: aroohi@unl.edu.

Compute cache [7] supports simple bit-parallel operations (logical and copy) that do not require interaction between bit-lines.
bit-lines. Neural Cache [5] presents an 8T transposable
SRAM bit-cell and supports bit-serial in-cache MAC opera-
tion. Nevertheless, this design suffers from a very slow clock frequency and a large cell and Sense Amplifier (SA) area overhead. In [11], a new approach to improve the performance
of the Neural Cache has been presented based on 6T SRAM,
enabling faster multiplication and addition with a large
SA overhead. While the presented designs show acceptable
performance over various image data-sets by reducing the
number of operations, i.e., MACs, using shallower models,
quantization, pruning, etc., they are essentially developed
to execute the existing CNN algorithms that lead to a gap
between meets and needs. We believe such a discrepancy
can be avoided by co-developing an intrinsically-low computa-
tion network and an efficient PIM platform on the sensor side.
Regarding the model reduction of CNNs, Local Binary Pattern (LBP)-based implementations have attracted wide attention for edge devices while achieving comparable inference accuracy [12]–[14]. More interestingly, the number of convolution operations is drastically reduced owing to the sparsity of the kernels and their conversion to simpler operations such as addition/subtraction [15] and comparison [16].
In this work, inspired by recent LBP networks, (1) we
first develop a novel approximate, hardware-oriented, and
MAC-free neural network named Ap-LBP in Section 3 to
reduce computation complexity and memory access by dis-
regarding the least significant pixels to perform efficient
feature extraction. The Ap-LBP is leveraged on the sen-
sor side to simplify LBP layers before even mapping the
data into a near-sensor memory; (2) we design NS-LBP as a comparator-based processing-in-SRAM architecture, in conjunction with a parallel in-memory LBP algorithm, in Section 4, which remarkably reduces the power consumption as well as the latency of data transmission to a back-end
processor; (3) In Section 5, we propose a correlated data par-
titioning and hardware mapping methodology to process
the network locally; and (4) We extensively evaluate NS-LBP performance, energy efficiency, and inference accuracy trade-offs compared to recent designs with a bottom-up evaluation framework in Section 6.

arXiv:2210.06698v1 [cs.AR] 13 Oct 2022
2 BACKGROUND & MOTIVATION
2.1 Near-Sensor & In-Sensor Processing
Systematic integration of computing and sensor arrays has been widely studied to eliminate off-chip data transmission and reduce Analog-to-Digital Converter (ADC) bandwidth, either by combining a CMOS image sensor and processors on one chip, known as Processing-Near-Sensor (PNS) [2], [3], [17]–[23], or by integrating pixels and a computation unit, so-called Processing-In-Sensor (PIS) [24]–[32]. However, since raising the throughput and computation load on resource-limited IoT devices increases the temperature, power consumption, and noise, leading to accuracy degradation [24], [33], the computational capabilities of PNS/PIS platforms have been limited to less complex applications [1], [34]–[36].
This includes particular feature extraction tasks, e.g., Haar-
like image filtering [34] and blurring [3]. Various powerful
processing-in-SRAM (in-cache computing) accelerators have
been developed in recent literature that can be employed as
a PNS unit [5], [7]–[11], [37]–[40]. Compute cache [7] sup-
ports simple bit-parallel operations (logical and copy) that
do not require interaction between bit-lines. XNOR-SRAM
[10] accelerates ternary-XNOR-and-accumulate operations
in binary/ternary Deep Neural Networks (DNNs) without
row-by-row data access. C3SRAM [9] leverages capacitive-
coupling computing to perform XNOR-and-accumulate op-
erations for binary DNNs. However, both XNOR-SRAM and
C3SRAM impose huge overhead over the traditional SRAM
array by directly modifying the bit-cells. Neural Cache [5]
presents an 8T transposable SRAM bit-cell and supports bit-
serial in-cache MAC operation. Nevertheless, this design suffers from a very slow clock frequency and a large cell and SA area overhead. In [11], a new approach to improve the
performance of the Neural Cache has been presented based
on 6T SRAM, enabling faster multiplication and addition
with a large SA overhead. In the PIS domain, a CMOS image sensor with dual-mode delta-sigma ADCs is designed in [41] to process the first convolutional layer of Binarized-Weight Neural Networks (BWNNs). RedEye [42] executes the convolution operation using charge-sharing tunable capacitors.
Although this design shows an energy reduction compared to a CPU/GPU by sacrificing accuracy, achieving high-accuracy computation increases the required energy per frame dramatically, by 100×. Macsen [24] processes the first convolutional layer of BWNNs with a correlated double sampling procedure, achieving 1,000 fps in computation mode; however, it suffers from a large area overhead and high power consumption. There are three main bottlenecks in IoT imaging systems that this paper aims to solve: (1) data access and movement consume most of the power (>90% [24], [37], [43]) in conventional image sensors; (2) computation imposes a large area overhead and power consumption in more recent PNS/PIS units and requires extra memory for intermediate data storage; and (3) these systems are hardwired, so their performance is intrinsically limited to one specific type of algorithm or application
domain, which means that such accelerators cannot keep pace with rapidly evolving software algorithms.

Figure 1: (a) Standard LBP encoding with a 3×3 descriptor size (example encoding: LBP(C) = 00011110 = 30); (b) the structure of Ap-LBP, with N Ap-LBP blocks.
2.2 LBP-based Networks
An LBP kernel is a computationally efficient feature descriptor that scans through the entire image, like a convolutional layer in a CNN. The LBP descriptor is formed by serially comparing the intensity of the surrounding pixels with that of the central pixel, referred to as the pivot, in the selected image patch. Neighbors with higher (lower) intensities are assigned a binary value of '1' ('0'), and finally the bit stream is sequentially read and mapped to a decimal number that is the feature value assigned to the central pixel, as
shown in Fig. 1(a). The LBP encoding operation of the central pixel C(x_c, y_c) can be mathematically described as LBP(C) = Σ_{n=0}^{d²−2} cmp(i_n, i_c) × 2^n [12], where d is the dimension of the LBP descriptor, and i_n and i_c represent the intensities of the n-th neighboring pixel and the central pixel, respectively; thus, cmp(i_n, i_c) = 1 when i_n ≥ i_c, and 0 otherwise. The LBP operation can be simulated using a ReLU layer and the difference between pixel values. As part of the
training process, (i) the sparse binarized difference filters are
fixed, (ii) the one-by-one convolution kernels function as a
channel fusion mechanism, and (iii) the parameters in the
batch normalization layer (batch norm) are learned.
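The encoding described above can be sketched as a short Python function. The clockwise neighbor ordering used here is one common convention chosen purely for illustration; a particular implementation may read the bit stream in a different order.

```python
import numpy as np

def lbp_encode(patch):
    """Standard LBP encoding of a square patch: compare each neighbor
    against the central (pivot) pixel and pack the comparison bits
    into a decimal feature value."""
    d = patch.shape[0]                 # descriptor dimension (3 for a 3x3 patch)
    ic = patch[d // 2, d // 2]         # intensity of the central (pivot) pixel
    # Neighbors read clockwise around the pivot (an illustrative ordering).
    neighbors = [patch[0, 0], patch[0, 1], patch[0, 2],
                 patch[1, 2], patch[2, 2], patch[2, 1],
                 patch[2, 0], patch[1, 0]]
    # cmp(i_n, i_c) = 1 when i_n >= i_c, weighted by 2^n
    return sum(int(i_n >= ic) << n for n, i_n in enumerate(neighbors))
```

For a 3×3 patch this yields a feature value between 0 and 255; note that only comparisons and bit shifts are involved, with no multiplications.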
The Local Binary Pattern Network (LBPNet) [44] and the Local Binary Convolutional Neural Network (LBCNN) [15] are two recent LBP networks in which the convolutions are approximated by local binary additions/subtractions and local binary comparisons, respectively. It should be noted that LBPNet and LBCNN are quite different, despite the similarity of their names, as illustrated in Fig. 2. In the
LBCNN, batch norm layers are still heavily utilized; these perform the linear transform in floating-point arithmetic. Moreover, since the size and computation of 2D batch norm layers scale linearly with the size of the feature maps, the model complexity increases dramatically. Therefore, the use of LBCNNs in resource-constrained edge devices, such as sensors, remains challenging and impractical. LBPNets, on
the other hand, directly learn the sparse and discrete LBP kernels, which are typically as small as a few KB. By using LBPNet, the computation of dot products and sliding windows for convolution can be avoided. Instead, the input is sampled and compared, and the results of the comparisons are stored at determined locations. Therefore, LBPNet holds only trained patterns of sampling locations and performs no MAC operations (it is convolution-free), making it a hardware-friendly model that is suitable for edge computing.