
LBP performance, energy efficiency, and inference accuracy
trade-off compared to recent designs with a bottom-up
evaluation framework in Section 6.
2 BACKGROUND & MOTIVATION
2.1 Near-Sensor & In-Sensor Processing
Systematic integration of computing and sensor arrays has been widely studied as a way to eliminate off-chip data transmission and reduce Analog-to-Digital Converter (ADC) bandwidth, either by combining a CMOS image sensor and processors on one chip, known as Processing-Near-Sensor (PNS) [2], [3], [17]–[23], or by integrating pixels and computation units, so-called Processing-In-Sensor (PIS) [24]–[32]. However, raising the throughput and computation load on resource-limited IoT devices increases temperature and power consumption as well as noise, which degrade accuracy [24], [33]; consequently, the computational capabilities of PNS/PIS platforms have been limited to less complex applications [1], [34]–[36], such as particular feature extraction tasks, e.g., Haar-like image filtering [34] and blurring [3].
Various powerful processing-in-SRAM (in-cache computing) accelerators that could serve as PNS units have been developed in the recent literature [5], [7]–[11], [37]–[40]. Compute Cache [7] supports simple bit-parallel operations (logical and copy) that do not require interaction between bit-lines. XNOR-SRAM [10] accelerates ternary XNOR-and-accumulate operations in binary/ternary Deep Neural Networks (DNNs) without row-by-row data access. C3SRAM [9] leverages capacitive-coupling computing to perform XNOR-and-accumulate operations for binary DNNs. However, both XNOR-SRAM and C3SRAM impose a huge overhead on the traditional SRAM array by directly modifying the bit-cells. Neural Cache [5] presents an 8T transposable SRAM bit-cell and supports bit-serial in-cache MAC operations; however, it requires a very slow clock frequency and incurs large cell and sense amplifier (SA) area overheads. In [11], a 6T-SRAM-based approach improves on Neural Cache, enabling faster multiplication and addition, though still with a large SA overhead.
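Concretely, the XNOR-and-accumulate primitive these macros implement reduces a binarized dot product to an XNOR followed by a population count. The following minimal Python sketch illustrates only that arithmetic identity, not the circuitry of any cited design:

```python
import numpy as np

def xnor_accumulate(w_bits, x_bits):
    """Binary dot product via XNOR-and-accumulate (illustrative sketch).

    With weights/activations bit-encoded as 0 -> -1 and 1 -> +1, the dot
    product of the +/-1 vectors equals 2 * popcount(XNOR(w, x)) - N,
    since matching bits contribute +1 and mismatching bits contribute -1.
    """
    matches = np.logical_not(np.logical_xor(w_bits, x_bits))  # bitwise XNOR
    return 2 * int(np.count_nonzero(matches)) - len(w_bits)

w = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=bool)
x = np.array([1, 1, 1, 0, 0, 1, 0, 0], dtype=bool)
# Sanity check against the explicit +/-1 dot product.
assert xnor_accumulate(w, x) == int(np.where(w, 1, -1) @ np.where(x, 1, -1))
print(xnor_accumulate(w, x))
```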
In the PIS domain, a CMOS image sensor with dual-mode delta-sigma ADCs is designed in [41] to process the first convolutional layer of Binarized-Weight Neural Networks (BWNNs). RedEye [42] executes the convolution operation using charge-sharing tunable capacitors. Although this design reduces energy compared with a CPU/GPU by sacrificing accuracy, reaching high-accuracy computation increases the required energy per frame dramatically, by 100×. Macsen [24] processes the first convolutional layer of BWNNs with a correlated double sampling procedure, achieving 1000 fps in computation mode; however, it suffers from enormous area overhead and power consumption.
This paper aims to solve three main bottlenecks in IoT imaging systems: (1) data access and movement consume most of the power (>90% [24], [37], [43]) in conventional image sensors; (2) computation imposes a large area overhead and power consumption in more recent PNS/PIS units and requires extra memory for intermediate data storage; and (3) the system is hardwired, so its performance is intrinsically limited to one specific type of algorithm or application domain, which means that such accelerators cannot keep pace with rapidly evolving software algorithms.
Figure 1: (a) Standard LBP encoding with a 3×3 descriptor size, where the eight neighbors of the pivot C are indexed 0–7 and the comparison bit stream maps to a decimal feature value (e.g., LBP(C) = 00011110 = 30), together with an approximate mapping (shifted ReLU) to the low-level fmap; (b) the structure of the Ap-LBP network, with N Ap-LBP blocks (average pooling, joint, and batch norm stages) followed by a b-MLP.
2.2 LBP-based Networks
An LBP kernel is a computationally efficient feature descriptor that scans the entire image, much like the kernel of a convolutional layer in a CNN. The LBP descriptor is formed by serially comparing the intensity of the surrounding pixels in the selected image patch with that of the central pixel, referred to as the pivot. Neighbors with higher (lower) intensities are assigned a binary value of '1' ('0'); finally, the bit stream is sequentially read and mapped to a decimal number, which is the feature value assigned to the central pixel, as shown in Fig. 1(a). The LBP encoding of a central pixel $C(x_c, y_c)$ can be mathematically described as $LBP(C) = \sum_{n=0}^{d-2} cmp(i_n, i_c) \times 2^n$ [12], where $d$ is the dimension of the LBP, and $i_n$ and $i_c$ represent the intensities of the $n$-th neighboring pixel and the central pixel, respectively; $cmp(i_n, i_c) = 1$ when $i_n \geq i_c$, and $0$ otherwise.
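To make the encoding concrete, the following minimal Python sketch implements the formula above for a single 3×3 patch; the patch values and the clockwise neighbor ordering are illustrative assumptions rather than details taken from the paper:

```python
import numpy as np

def lbp_encode(patch):
    """Encode a 3x3 patch into its decimal LBP feature value.

    Implements LBP(C) = sum_{n=0}^{d-2} cmp(i_n, i_c) * 2^n with d = 9:
    each of the 8 neighbors is compared with the central (pivot) pixel,
    and the resulting bit is weighted by 2^n. The clockwise ordering
    below is one common convention (an assumption, not the paper's).
    """
    ic = patch[1, 1]  # pivot (central pixel) intensity
    order = [(0, 0), (0, 1), (0, 2), (1, 2),
             (2, 2), (2, 1), (2, 0), (1, 0)]  # clockwise from top-left
    return sum(int(patch[r, c] >= ic) << n for n, (r, c) in enumerate(order))

# Illustrative patch with pivot intensity 100.
patch = np.array([[ 10, 120, 130],
                  [ 90, 100, 140],
                  [ 80, 110, 150]])
print(lbp_encode(patch))  # bits 0,1,1,1,1,1,0,0 -> 2+4+8+16+32 = 62
```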
The LBP operation can also be simulated using a ReLU layer applied to the differences between pixel values. As part of the training process, (i) the sparse binarized difference filters are fixed, (ii) the 1×1 convolution kernels function as a channel-fusion mechanism, and (iii) the parameters of the batch normalization layer (batch norm) are learned.
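A rough sketch of such a simulated-LBP block, under the stated split of fixed and learned components, might look as follows; the shapes, names, and plain-numpy implementation are assumptions for illustration, not the cited networks' actual code:

```python
import numpy as np

def simulated_lbp_block(x, diff_filters, fuse_weights, gamma, beta, eps=1e-5):
    """Sketch of a simulated-LBP block mirroring (i)-(iii) above.

    x:            input feature map, shape (C, H, W)
    diff_filters: fixed sparse difference filters in {-1, 0, +1},
                  shape (K, C, 3, 3), never updated during training
    fuse_weights: learnable 1x1 channel-fusion kernels, shape (C_out, K)
    gamma, beta:  learnable batch-norm parameters, shape (C_out, 1, 1)
    """
    K, C, kh, kw = diff_filters.shape
    H_out, W_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    # (i) Fixed difference convolution.
    fmap = np.empty((K, H_out, W_out))
    for k in range(K):
        for i in range(H_out):
            for j in range(W_out):
                fmap[k, i, j] = np.sum(diff_filters[k] * x[:, i:i+kh, j:j+kw])
    fmap = np.maximum(fmap, 0.0)  # ReLU on the pixel differences
    # (ii) Learnable 1x1 convolution fusing the K difference channels.
    fused = np.tensordot(fuse_weights, fmap, axes=([1], [0]))  # (C_out, H, W)
    # (iii) Batch norm: gamma and beta are the only learned parameters.
    mean = fused.mean(axis=(1, 2), keepdims=True)
    var = fused.var(axis=(1, 2), keepdims=True)
    return gamma * (fused - mean) / np.sqrt(var + eps) + beta
```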
The Local Binary Pattern Network (LBPNet) [44] and the Local Binary Convolutional Neural Network (LBCNN) [15] are two recent LBP networks in which the convolutions are approximated by local binary additions/subtractions and local binary comparisons, respectively. It should be noted that LBPNet and LBCNN are quite different despite the similarity of their names, as illustrated in Fig. 2. LBCNN still makes heavy use of batch norm layers, whose linear transforms are computed in floating point. Moreover, since the size and computation of 2D batch norm layers are linear in the size of the feature maps, model complexity increases dramatically. Therefore, using LBCNNs on resource-constrained edge devices, such as sensors, remains challenging and impractical. LBPNets, on the other hand, directly learn the sparse and discrete LBP kernels, which are typically as small as a few KB. With LBPNet, the dot-product and sliding-window computations of convolution are avoided; rather, the input is sampled and compared, and the comparison results are stored at predetermined locations. Therefore, LBPNet holds only the trained patterns of sampling locations and performs no MAC operations (it is convolution-free), making it a hardware-friendly model well suited to edge computing.
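The following minimal sketch illustrates this convolution-free sampling-and-comparison scheme; the offset list stands in for the trained patterns of sampling locations, and all names and shapes are illustrative assumptions rather than LBPNet's actual implementation:

```python
import numpy as np

def lbp_sample_compare(fmap, offsets):
    """Convolution-free LBPNet-style layer (illustrative sketch).

    For every output pixel, the input is sampled at a few trained offset
    locations, each sample is compared against the anchor pixel, and the
    comparison bits are packed into the output feature value. No
    multiply-accumulate operations are performed anywhere.
    """
    H, W = fmap.shape
    pad = max(max(abs(dy), abs(dx)) for dy, dx in offsets)
    padded = np.pad(fmap, pad, mode="edge")
    out = np.zeros((H, W), dtype=np.int32)
    for y in range(H):
        for x in range(W):
            anchor = padded[y + pad, x + pad]
            for n, (dy, dx) in enumerate(offsets):
                # Store the comparison result at its determined bit position.
                if padded[y + pad + dy, x + pad + dx] >= anchor:
                    out[y, x] |= 1 << n
    return out

# Four hypothetical sampling offsets; in LBPNet these patterns are trained.
offsets = [(-2, -1), (-1, 2), (1, -2), (2, 1)]
fmap = np.random.randint(0, 256, size=(8, 8))
print(lbp_sample_compare(fmap, offsets))
```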