

Enabling ISP-less Low-Power Computer Vision

Gourav Datta, Zeyu Liu, Zihan Yin, Linyu Sun, Akhilesh R. Jaiswal, Peter A. Beerel

University of Southern California, Los Angeles, USA

{gdatta, liuzeyu, zihanyin, linyusun, akhilesh, pabeerel}@usc.edu

Abstract

Current computer vision (CV) systems use an image signal processing (ISP) unit to convert the high-resolution raw images captured by image sensors into visually pleasing RGB images. Typically, CV models are trained on these RGB images and have yielded state-of-the-art (SOTA) performance on a wide range of complex vision tasks, such as object detection. In addition, in order to deploy these models on resource-constrained low-power devices, recent works have proposed in-sensor and in-pixel computing approaches that try to partly or fully bypass the ISP and yield significant bandwidth reduction between the image sensor and the CV processing unit by downsampling the activation maps in the initial convolutional neural network (CNN) layers. However, direct inference on the raw images degrades the test accuracy due to the difference in covariance of the raw images captured by the image sensors compared to the ISP-processed images used for training. Moreover, it is difficult to train deep CV models on raw images, because most (if not all) large-scale open-source datasets consist of RGB images. To mitigate this concern, we propose to invert the ISP pipeline, which can convert the RGB images of any dataset to their raw counterparts and enable model training on raw images. We release the raw version of the COCO dataset, a large-scale benchmark for generic high-level vision tasks. For ISP-less CV systems, training on these raw images results in a ∼7.1% increase in test accuracy on the visual wake words (VWW) dataset compared to relying on training with traditional ISP-processed RGB datasets. To further improve the accuracy of ISP-less CV models and to increase the energy and bandwidth benefits obtained by in-sensor/in-pixel computing, we propose an energy-efficient form of analog in-pixel demosaicing that may be coupled with in-pixel CNN computations. When evaluated on raw images captured by real sensors from the PASCALRAW dataset, our approach results in an 8.1% increase in mAP. Lastly, we demonstrate a further 20.5% increase in mAP by using a novel application of few-shot learning with thirty shots each for the novel PASCALRAW dataset, which comprises 3 classes.

1. Introduction

Modern high-resolution cameras generate huge amounts of visual data arranged in the form of raw Bayer color filter arrays (CFA), also known as a mosaic pattern, as shown in Fig. 1, that need to be processed for downstream CV tasks [43, 1]. An ISP unit, consisting of several pipelined processing stages, is typically used before the CV processing to convert the raw mosaiced images to their RGB counterparts [20, 42, 26, 29]. The ISP step that converts these single-channel CFA images to three-channel RGB images is called demosaicing. Historically, the ISP has proven extremely effective for computational photography applications, where the goal is to generate images that are aesthetically pleasing to the human eye [29, 8]. However, is it important for high-level CV applications, such as face detection by smart security cameras, where the sensor data is unlikely to be viewed by any human? Existing works [42, 20, 26] show that most ISP steps can be discarded with a small drop in the test accuracy for large-scale image recognition tasks. The removal of the ISP can potentially enable existing in-sensor [31, 10, 2] and in-pixel [5, 27, 12, 13, 14] computing paradigms to process CV computations, such as CNNs, partly in the sensor, and reduce the bandwidth and energy incurred in the data transfer between the sensor and the CV system. Moreover, most low-power cameras with a resolution of a few megapixels do not have an on-board ISP [3], thereby requiring the ISP to be implemented off-chip, which increases the energy consumption of the total CV system.
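To make the relation between a three-channel RGB image and a single-channel CFA image concrete, the following is a minimal sketch of the forward mosaicing direction (the opposite of demosaicing). It assumes an RGGB Bayer layout and simple per-pixel channel sampling; the function name `mosaic_rggb` and the layout choice are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def mosaic_rggb(rgb: np.ndarray) -> np.ndarray:
    """Sample an H x W x 3 RGB image into a single-channel Bayer mosaic.

    Assumes an RGGB layout (an illustrative choice; real sensors differ).
    """
    h, w, _ = rgb.shape
    raw = np.empty((h, w), dtype=rgb.dtype)
    raw[0::2, 0::2] = rgb[0::2, 0::2, 0]  # red: even rows, even columns
    raw[0::2, 1::2] = rgb[0::2, 1::2, 1]  # green: even rows, odd columns
    raw[1::2, 0::2] = rgb[1::2, 0::2, 1]  # green: odd rows, even columns
    raw[1::2, 1::2] = rgb[1::2, 1::2, 2]  # blue: odd rows, odd columns
    return raw

# Toy 4 x 4 image: each pixel keeps only one of its three color values.
rgb = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
raw = mosaic_rggb(rgb)
```

The mosaic holds one value per pixel instead of three, which is why demosaicing has to interpolate the two missing colors at every location.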

Although removing the ISP can facilitate model deployment on resource-constrained edge devices, one key challenge is that most large-scale datasets used to train CV models are ISP-processed. Since there is a large covariance shift between the raw and RGB images (see Fig. 1, where we show the histogram of the pixel intensity distributions of RGB and raw images), models that are trained on ISP-processed RGB images and then run on raw images (i.e., with the ISP removed) exhibit a significant drop in accuracy. One recent work has leveraged trainable flow-based invertible neural networks [44] to convert raw to RGB images and vice versa using open-source ISP datasets. These networks have recently yielded SOTA test performance in photographic tasks; we propose to modify them to invert the ISP pipeline and build the raw version of any large-scale ISP-processed database for high-level vision applications, such as object detection. This raw dataset can then be used to train CV models that can be efficiently deployed on low-power edge devices without any of the ISP steps, including demosaicing. To further improve the performance of these ISP-less models, we propose a novel hardware-software co-design approach, where a form of demosaicing is applied on the raw mosaiced images inside the pixel array using analog summation during the pixel read-out operation, i.e., without a dedicated ISP unit. Our models trained on this demosaiced version of the visual wake words (VWW) dataset lead to an 8.2% increase in test accuracy compared to standard training on RGB images and inference on raw images (to simulate the ISP removal and the in-pixel/in-sensor implementation). Even compared to standard RGB training and inference, our models yield 0.7% (1.6%) higher accuracy (mAP) on the VWW (COCO) dataset. Lastly, we propose a novel application of few-shot learning to improve the accuracy on real raw images captured directly by a camera (for which only a limited number of annotations are available), with our generated raw images constituting the base dataset.

Figure 1. Difference in frequency distributions of pixel intensities between mosaiced raw, demosaiced, and ISP-processed images.

arXiv:2210.05451v1 [cs.CV] 11 Oct 2022

The key contributions of our paper can be summarized as follows.

• Inspired by the energy and bandwidth benefits obtained by in-sensor computing approaches and the removal of most ISP steps in a CV pipeline, we present and release a large-scale raw image database that can be used to train accurate CV models for low-power ISP-less edge deployments. This dataset is generated by reversing the entire ISP pipeline using the recently proposed flow-based invertible neural networks and custom mosaicing. We demonstrate the utility of this dataset to train ISP-less CV models with raw images.

• To improve the accuracy obtained with raw images, we propose a low-overhead form of in-pixel demosaicing that can be implemented directly on the pixel array alongside other CV computations enabled by recent paradigms of in-pixel/in-sensor computing approaches and that also reduces the data bandwidth.

• We present a thorough evaluation of our approach with both simulated (our released dataset) and real (captured by a real camera) raw images, for a diverse range of use-cases with different memory/compute budgets.

• To improve the accuracy of real raw images, we propose a novel application of few-shot learning, with the simulated raw images having a large number of labelled classes constituting the base dataset.

2. Related Works

2.1. ISP Reversal & Removal

Since most ISP steps are irreversible and depend on the camera manufacturer's proprietary color profile [6], it is difficult to invert the ISP pipeline. To mitigate this challenge, a few recent works [25, 32, 46] proposed learning-based methods, but they result in large losses, and the recovered raw images may be significantly different from the originals captured by the camera. To reduce this loss, a more recent work [44] used a stack of $k$ invertible and bijective functions $f = f_1 \circ f_2 \circ \dots \circ f_k$ to invert the ISP pipeline. For a raw input $x$, the RGB output $y$ and the inverted raw input $x$ are computed as $y = f_1 \circ f_2 \circ \dots \circ f_k(x)$ and $x = f_k^{-1} \circ f_{k-1}^{-1} \circ \dots \circ f_1^{-1}(y)$.
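The ordering of the forward and inverse compositions can be illustrated with a small sketch. The three toy stages below are hypothetical placeholders for the learned invertible functions, chosen only because their inverses are known in closed form.

```python
import numpy as np

# Each stage is a (forward, inverse) pair standing in for f_i and f_i^{-1}.
# These toy functions are assumptions for illustration, not learned layers.
stages = [
    (lambda v: 2.0 * v, lambda v: v / 2.0),    # f_1
    (lambda v: v + 1.0, lambda v: v - 1.0),    # f_2
    (lambda v: v ** 3, lambda v: np.cbrt(v)),  # f_3 (k = 3)
]

def forward(x):
    # y = f_1(f_2(...f_k(x))): the innermost stage f_k is applied first.
    for f, _ in reversed(stages):
        x = f(x)
    return x

def inverse(y):
    # x = f_k^{-1}(...f_2^{-1}(f_1^{-1}(y))): f_1 is undone first.
    for _, f_inv in stages:
        y = f_inv(y)
    return y

x = np.array([-1.5, 0.0, 0.3, 2.0])
y = forward(x)
x_rec = inverse(y)  # recovers x up to floating-point error
```

Because every stage is bijective, the round trip is exact up to numerical precision, which is the property that distinguishes this construction from lossy learned inversions.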

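The bijective function $f_i$ is implemented through affine coupling layers [44]. In each affine coupling layer, given a $D$-dimensional input $m$ and $d < D$, the output $n$ is

$$n_{1:d} = m_{1:d} + r(m_{d+1:D}) \tag{1}$$

$$n_{d+1:D} = m_{d+1:D} \odot \exp(s(n_{1:d})) + t(n_{1:d}) \tag{2}$$

where $s$ and $t$ represent scale and translation functions from $\mathbb{R}^{d}$ to $\mathbb{R}^{D-d}$ that are realized by neural networks, $\odot$ represents the Hadamard product, and $r$ represents an arbitrary function from $\mathbb{R}^{D-d}$ to $\mathbb{R}^{d}$. The inverse step, which first recovers $m_{d+1:D}$ from the known $n_{1:d}$, is

$$m_{d+1:D} = (n_{d+1:D} - t(n_{1:d})) \odot \exp(-s(n_{1:d})) \tag{3}$$

$$m_{1:d} = n_{1:d} - r(m_{d+1:D}) \tag{4}$$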
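As a rough numerical check of the coupling layer, the sketch below implements the forward and inverse steps in NumPy. The functions `r`, `s`, and `t` are hypothetical stand-ins (fixed random linear maps, with `tanh` on `r` and `s`) for the neural networks used in [44]. The scale and translation are evaluated on the updated first half $n_{1:d}$, which is the form that Eqs. (3) and (4) invert exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 6, 2  # illustrative sizes with d < D

# Hypothetical stand-ins for the learned networks (assumptions, not from [44]).
W_r = rng.normal(size=(D - d, d))  # r: R^{D-d} -> R^d
W_s = rng.normal(size=(d, D - d))  # s: R^d -> R^{D-d}
W_t = rng.normal(size=(d, D - d))  # t: R^d -> R^{D-d}

def r(v):
    return np.tanh(v @ W_r)

def s(v):
    return np.tanh(v @ W_s)

def t(v):
    return v @ W_t

def coupling_forward(m):
    m1, m2 = m[..., :d], m[..., d:]
    n1 = m1 + r(m2)                     # Eq. (1)
    n2 = m2 * np.exp(s(n1)) + t(n1)     # Eq. (2), elementwise product
    return np.concatenate([n1, n2], axis=-1)

def coupling_inverse(n):
    n1, n2 = n[..., :d], n[..., d:]
    m2 = (n2 - t(n1)) * np.exp(-s(n1))  # Eq. (3)
    m1 = n1 - r(m2)                     # Eq. (4)
    return np.concatenate([m1, m2], axis=-1)

m = rng.normal(size=(4, D))  # batch of four D-dimensional inputs
n = coupling_forward(m)
m_rec = coupling_inverse(n)  # matches m up to floating-point error
```

Note that `r`, `s`, and `t` never need to be inverted themselves, so they can be arbitrary networks without affecting the invertibility of the layer.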

The authors then utilize the invertible $1 \times 1$ convolution proposed in [23] as a learnable permutation function to reverse the channel order for the subsequent affine coupling layer.
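A $1 \times 1$ convolution applies the same $C \times C$ matrix to the channel vector at every pixel, so it is invertible whenever that matrix is. The sketch below shows this for a small tensor; the channels-last layout, the helper name `conv1x1`, and the random orthogonal initialisation via QR are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 3  # number of channels (illustrative)

# Random orthogonal matrix from a QR decomposition, so W is guaranteed
# invertible. This initialisation is an assumption for the sketch.
W, _ = np.linalg.qr(rng.normal(size=(C, C)))

def conv1x1(x, weight):
    # x has shape (H, W, C); mix channels with the same matrix at each pixel.
    return x @ weight.T

x = rng.normal(size=(8, 8, C))
y = conv1x1(x, W)                     # forward channel mixing
x_rec = conv1x1(y, np.linalg.inv(W))  # inverse uses W^{-1}
```

A fixed channel permutation corresponds to the special case where the matrix is a permutation matrix, which is why this layer can be read as a learnable generalisation of channel reordering.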

Recent works have also investigated the role of the ISP in image classification and the impact of its removal/trimming on accuracy for energy and bandwidth benefits. For example, [20] demonstrated that removal of the whole ISP during edge inference results in a ∼8.6% loss in accuracy with MobileNets [36] on ImageNet [15], which can mostly be recovered by using just the tone-mapping stage. Another work [42] attempted to integrate the ISP and CV processing using tone mapping and feature-aware downscaling blocks that reduce both the number of bits
