Revisiting Sparse Convolutional Model for Visual
Recognition
Xili Dai^1  Mingyang Li^2*  Pengyuan Zhai^3  Shengbang Tong^4  Xingjian Gao^4
Shao-Lun Huang^2  Zhihui Zhu^5  Chong You^4  Yi Ma^{2,4}
^1 The Hong Kong University of Science and Technology (Guangzhou)
^2 Tsinghua-Berkeley Shenzhen Institute (TBSI), Tsinghua University
^3 Harvard University   ^4 University of California, Berkeley   ^5 Ohio State University
Abstract
Despite strong empirical performance for image classification, deep neural networks are often regarded as “black boxes” and are difficult to interpret. On the other hand, sparse convolutional models, which assume that a signal can be expressed by a linear combination of a few elements from a convolutional dictionary, are powerful tools for analyzing natural images with good theoretical interpretability and biological plausibility. However, such principled models have not demonstrated competitive performance when compared with empirically designed deep networks. This paper revisits sparse convolutional modeling for image classification and bridges the gap between the good empirical performance (of deep learning) and the good interpretability (of sparse convolutional models). Our method uses differentiable optimization layers that are defined from convolutional sparse coding as drop-in replacements for standard convolutional layers in conventional deep neural networks. We show that such models have equally strong empirical performance on the CIFAR-10, CIFAR-100, and ImageNet datasets when compared to conventional neural networks. By leveraging the stable recovery property of sparse modeling, we further show that such models can be made much more robust to input corruptions as well as adversarial perturbations at test time, through a simple and proper trade-off between the sparse regularization and data reconstruction terms. Source code can be found at https://github.com/Delay-Xili/SDNet.
1 Introduction
In recent years, deep learning has been a dominant approach for image classification and has significantly advanced the performance over previous shallow models. Despite the phenomenal empirical success, it has been increasingly realized, as well as criticized, that deep convolutional networks (ConvNets) are “black boxes” for which we have yet to develop a clear understanding [1]. The layer operations such as convolution, nonlinearity, and normalization are geared towards minimizing an end-to-end training loss and do not have much data-specific meaning. As such, the functionality of each intermediate layer in a trained ConvNet is mostly unclear, and the feature maps that it produces are hard to interpret. The lack of interpretability also contributes to the notorious difficulty of enhancing such learning systems for practical data, which are usually corrupted by various forms of perturbation.
This paper presents a visual recognition framework that introduces layers with explicit data modeling to tackle the shortcomings of current deep learning systems. We work under the assumption that the layer input can be represented by a few atoms from a dictionary shared by all data points.
*Equal contribution
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.12945v1 [cs.CV] 24 Oct 2022
This is the classical sparse data modeling that, as shown in the pioneering work of [2], can easily discover meaningful structures from natural image patches. Backed by its ability to learn interpretable representations and its strong theoretical guarantees [3, 4, 5, 6, 7, 8] (e.g., for handling corrupted data), sparse modeling has been used broadly in many signal and image processing applications [9]. However, the empirical performance of sparse methods has been surpassed by deep learning methods for the classification of modern image datasets.
Because of the complementary benefits of sparse modeling and deep learning, there exist many efforts that leverage sparse modeling to gain theoretical insights into ConvNets and/or to develop computational methods that further improve upon existing ConvNets. One of the pioneering works is [10], which interpreted a ConvNet as approximately solving a multi-layer convolutional sparse coding model. Based on this interpretation, the work [10] and its follow-ups [11, 12, 13, 14] presented alternative algorithms and models in order to further enhance the practical performance of such learning systems. However, there has been no empirical evidence that such systems can handle modern image datasets such as ImageNet and obtain performance comparable to deep learning. The only exception, to the best of our knowledge, is the work of [15, 16], which exhibited performance on par with (on ImageNet) or better than (on CIFAR-10) that of ResNet. However, the method in [15, 16] 1) requires a dedicated design of network architecture that may limit its applicability, 2) is computationally orders of magnitude slower to train, and 3) does not demonstrate benefits in terms of interpretability and robustness. In a nutshell, sparse modeling has yet to demonstrate the practicality that would enable its broad application.
Paper contributions.
In this paper, we revisit sparse modeling for image classification and demonstrate, through a simple design, that sparse modeling can be combined with deep learning to obtain performance on par with standard ConvNets but with better layer-wise interpretability and stability. Our method encapsulates the sparse modeling into an implicit layer [17, 18, 19] and uses it as a drop-in replacement for any convolutional layer in standard ConvNets. The layer implements the convolutional sparse coding (CSC) model of [20], in which the input signal is approximated by a sparse linear combination of atoms from a convolutional dictionary, and is referred to as a CSC-layer. Such a convolutional dictionary is treated as the parameters of the CSC-layer, which are amenable to training via back-propagation. The overall network with CSC-layers can then be trained in an end-to-end fashion from labeled data by minimizing the cross-entropy loss as usual. This paper demonstrates that such a learning framework has the following benefits:
Performance on standard datasets.
We demonstrate that our network obtains better (on CIFAR-100) or on-par (on CIFAR-10 and ImageNet) performance with similar training time compared with standard architectures such as ResNet [21]. To the best of our knowledge, this provides the first evidence of the strong empirical performance of sparse modeling for deep learning. Compared to previous sparse methods [15] that obtained similar performance, our method is orders of magnitude faster.
Robustness to input perturbations.
The stable recovery property of the sparse convolutional model equips the CSC-layers with the ability to remove perturbations in the layer input and to recover a clean sparse code. As a result, our networks with CSC-layers are more robust to perturbations in the input images compared with classical neural networks. Unlike existing approaches for obtaining robustness that require heavy data augmentation [22] or additional training techniques [23], our method is lightweight and does not require modifying the training procedure at all.
2 Related Work
Implicit layers.
The idea of trainable layers defined from implicit functions can be traced back at least to the work of [24]. Recently, there has been a revival of interest in implicit layers [17, 18, 19, 25, 26, 27, 28, 29] as an attractive alternative to explicit layers in existing neural networks. However, a majority of the works cited above define an implicit layer by a fixed-point iteration, typically motivated by existing explicit layers such as residual layers; therefore, they do not have a clear interpretation in terms of modeling the layer input. Consequently, such models do not have the ability to deal with input perturbations. The only exceptions are differentiable optimization layers [30, 17, 18, 31] that incorporate complex dependencies between hidden layers through the formulation of convex optimization. Nevertheless, most of the above works focus on differentiating through the convex optimization layers (such as disciplined parametrized programming [18]) without specializing in any particular signal model, such as the sparse models considered in this paper, nor demonstrating their performance when encapsulated in multi-layer neural networks.
Sparse prior in deep learning.
Aside from image classification, sparse modeling has been introduced to deep learning for many image processing tasks such as super-resolution [32], denoising [33], and others [34, 35, 36, 37]. These works incorporate sparse modeling by using network architectures that are motivated by (but are not the same as) an unrolled sparse coding algorithm, LISTA [38]. In sharp contrast to ours, there is no guarantee that such architectures perform a sparse encoding with respect to a particular (convolutional) dictionary at all. As a result, they lack the capability of handling input perturbations that our method provides. A notable exception is the work of [15], where each layer performs a precise sparse encoding and exhibits on-par or better performance for image classification over ResNet. However, the practical benefit of the sparse modeling in terms of robustness is not demonstrated. Moreover, [15] adopts a patch-based sparse coding model for images and carries a large computational burden.
Robustness.
It is known that modern neural networks are extremely vulnerable to small perturbations in the input data. A plethora of techniques have been proposed to address this instability issue, including stability training [23], adversarial training [39, 40, 41], data augmentation [42, 43, 22], etc. Nevertheless, these techniques either incur a computational and memory overhead, or require selecting appropriate augmentation strategies to cover all possible corruptions. With standard training only, our model can be made robust to input perturbations in test data by simply adapting the sparse modeling to account for noise. Closely related to our work are [44, 45, 46], which use sparse modeling to improve adversarial robustness. However, they either only demonstrate performance on very simple networks [45, 46] or sacrifice natural accuracy for robustness [44]. In contrast, our method is tested on realistic networks and does not affect natural accuracy.
3 Neural Networks with Sparse Modeling
In this section, we show how sparse modeling is incorporated into a deep network via a specific type of network layer that we refer to as the convolutional sparse coding (CSC) layer. We describe the CSC-layer in Sec. 3.1 and explain how we use it for deep learning in Sec. 3.2. Finally, Sec. 3.3 explains how CSC enables robust inference with corrupted test data.
Notations.
Given a single-channel image $\xi \in \mathbb{R}^{H \times W}$ represented as a matrix, we may treat it as a 2D signal defined on the discrete domain $[1, \ldots, H] \times [1, \ldots, W]$ and extended to $\mathbb{Z} \times \mathbb{Z}$ by padding zeros. Given a 2D kernel $\alpha \in \mathbb{R}^{k \times k}$, we may treat it as a 2D signal defined on the discrete domain $[-k_0, \ldots, k_0] \times [-k_0, \ldots, k_0]$ with $k = 2k_0 + 1$ and extended to $\mathbb{Z} \times \mathbb{Z}$ by padding zeros. Then, for convenience, we use “$*$” and “$\star$” to denote the convolution and correlation operators, respectively, between two 2D signals:
$$
(\alpha * \xi)[i, j] \doteq \sum_{p}\sum_{q} \xi[i - p,\, j - q] \cdot \alpha[p, q],
\qquad
(\alpha \star \xi)[i, j] \doteq \sum_{p}\sum_{q} \xi[i + p,\, j + q] \cdot \alpha[p, q].
\tag{1}
$$
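For concreteness, here is a minimal sketch of the two operators in (1), written as plain NumPy loops that mirror the definitions with zero padding. This is only an illustration of the notation (the function names are my own), not code from the paper.

```python
import numpy as np

def corr2d_zero_pad(alpha, xi):
    """(alpha ⋆ xi)[i, j] = sum_{p,q} xi[i+p, j+q] * alpha[p, q],
    with alpha indexed on [-k0, ..., k0]^2 and xi extended by zero padding."""
    H, W = xi.shape
    k = alpha.shape[0]
    k0 = (k - 1) // 2
    padded = np.pad(xi, k0)          # padded[a, b] = xi[a - k0, b - k0], zeros outside the image
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * alpha)
    return out

def conv2d_zero_pad(alpha, xi):
    """(alpha * xi): convolution equals correlation with the kernel flipped along both axes."""
    return corr2d_zero_pad(alpha[::-1, ::-1], xi)

xi = np.random.randn(8, 8)           # single-channel signal in R^{H x W}
alpha = np.random.randn(3, 3)        # kernel with k = 3, i.e. k0 = 1
print(conv2d_zero_pad(alpha, xi).shape, corr2d_zero_pad(alpha, xi).shape)   # (8, 8) (8, 8)
```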
3.1 Convolutional Sparse Coding (CSC) Layer
Sparse modeling is introduced in the form of an implicit layer of a neural network. Unlike classical
fully-connected or convolutional layers in which input-output relations are defined by an explicit
function, implicit layers are defined from implicit functions. For our case, in particular, we introduce
an implicit layer that is defined from an optimization problem involving the input to the layer as well
as a weight parameter, where the output of the layer is the solution to the optimization problem.
A generative model via sparse convolution.
Concretely, consider a multi-dimensional input signal $x \in \mathbb{R}^{M \times H \times W}$ to the layer, where $H, W$ are the spatial dimensions and $M$ is the number of channels of $x$. We assume the signal $x$ is generated by a multi-channel sparse code $z \in \mathbb{R}^{C \times H \times W}$ convolving with a multi-dimensional kernel $A \in \mathbb{R}^{M \times C \times k \times k}$, which is referred to as a convolutional dictionary. Here $C$ is the number of channels of $z$ and of the convolution kernel $A$. To be more precise, we denote $z$ as $z \doteq (\zeta_1, \ldots, \zeta_C)$, where each $\zeta_c \in \mathbb{R}^{H \times W}$ is presumably sparse, and denote the kernel $A$ as
$$
A \doteq \begin{pmatrix}
\alpha_{11} & \alpha_{12} & \alpha_{13} & \cdots & \alpha_{1C} \\
\alpha_{21} & \alpha_{22} & \alpha_{23} & \cdots & \alpha_{2C} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\alpha_{M1} & \alpha_{M2} & \alpha_{M3} & \cdots & \alpha_{MC}
\end{pmatrix} \in \mathbb{R}^{M \times C \times k \times k},
\tag{2}
$$
where each $\alpha_{ij} \in \mathbb{R}^{k \times k}$ is a kernel of size $k \times k$. Then the signal $x$ is generated via the following operator $\mathcal{A}(\cdot)$ defined by the kernel $A$:
$$
x = \mathcal{A}(z) \doteq \Big( \sum_{c=1}^{C} \alpha_{1c} \star \zeta_c,\; \ldots,\; \sum_{c=1}^{C} \alpha_{Mc} \star \zeta_c \Big) \in \mathbb{R}^{M \times H \times W}.
\tag{3}
$$
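As a minimal sketch (my own illustration with hypothetical names, not the authors' released implementation), the operator $\mathcal{A}$ in (3) can be realized with a standard deep learning convolution primitive, since PyTorch's F.conv2d computes exactly the cross-correlation "⋆" used above:

```python
import torch
import torch.nn.functional as F

M, C, k, H, W = 16, 32, 3, 28, 28   # k = 2*k0 + 1, so k0 = 1
A = torch.randn(M, C, k, k)         # convolutional dictionary A in R^{M x C x k x k}
z = torch.randn(C, H, W)            # multi-channel code z = (zeta_1, ..., zeta_C); dense here for illustration

def apply_A(A, z):
    """x = A(z): output channel m is sum_c (alpha_{mc} ⋆ zeta_c), with zero padding."""
    k0 = A.shape[-1] // 2
    return F.conv2d(z.unsqueeze(0), A, padding=k0).squeeze(0)   # shape (M, H, W)

x = apply_A(A, z)
print(x.shape)                      # torch.Size([16, 28, 28])
```

This is consistent with footnote 2 below: the correlation convention in (3) matches what modern deep learning packages implement as "convolution".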
A layer as convolutional sparse coding.
Given a multi-dimensional input signal $x \in \mathbb{R}^{M \times H \times W}$, we define that the function of “a layer” is to perform an (inverse) mapping to a preferably sparse output $z \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of output channels. Under the above sparse generative model, we can seek the optimal sparse solution $z^{\star}$ by solving the following Lasso-type optimization problem:
$$
z^{\star} = \arg\min_{z}\; \lambda \|z\|_1 + \frac{1}{2} \|x - \mathcal{A}(z)\|_2^2 \;\in\; \mathbb{R}^{C \times H \times W}.
\tag{4}
$$
Figure 1: Illustration of the operator $\mathcal{A}$ in the convolutional sparse coding model for the CSC-layer.
The optimization problem in (4) is based on the convolutional sparse coding (CSC) model [20].² Hence, we refer to the implicit layer defined by (4) as a CSC-layer. The goal of the CSC model is to reconstruct the input $x$ via $\mathcal{A}(z)$, where the feature map $z$ specifies the locations and magnitudes of the convolutional filters in $A$ to be linearly combined (see Figure 1 for an illustration). The reconstruction is not required to be exact, in order to tolerate modeling discrepancies, and the difference between $x$ and $\mathcal{A}(z)$ is penalized by its entry-wise $\ell_2$-norm (i.e., the $\ell_2$ norm of $x - \mathcal{A}(z)$ flattened into a vector). Sparse modeling is introduced by the entry-wise $\ell_1$-norm of $z$ in the objective function, which enforces $z$ to be sparse. The parameter $\lambda > 0$ controls the trade-off between the sparsity of $z$ and the magnitude of the residual $x - \mathcal{A}(z)$, and is treated as a hyper-parameter that is subject to tuning via cross-validation. As we will show in Sec. 3.3, $\lambda$ can be used to improve the performance of our model in the test phase when the input is corrupted.
Based on the input-output mapping of the CSC-layer given in (4), one may perform forward propagation by solving the associated optimization problem, and perform backward propagation by deriving the gradient of $z^{\star}$ with respect to the input $x$ and the parameter $A$. In this paper, we adopt the fast iterative shrinkage-thresholding algorithm (FISTA) [47] for the forward propagation, which also produces an unrolled network architecture that can carry out automatic differentiation for the backward propagation. We defer a discussion of the implementation details of the CSC-layer to the Appendix.
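To make the above concrete, here is a hedged sketch of a FISTA-based forward pass for (4), under my own simplifying assumptions (batched inputs, a fixed number of iterations, and a step size 1/L estimated by power iteration); the paper's actual implementation may differ and is deferred to its appendix.

```python
import torch
import torch.nn.functional as F

def apply_A(A, z, k0):
    # x = A(z): cross-correlation with the dictionary, as in (3)
    return F.conv2d(z, A, padding=k0)

def apply_A_adjoint(A, x, k0):
    # adjoint (transpose) of A, used for the gradient of the data-fidelity term
    return F.conv_transpose2d(x, A, padding=k0)

def soft_threshold(v, tau):
    # proximal operator of tau * ||.||_1
    return torch.sign(v) * torch.clamp(v.abs() - tau, min=0.0)

def estimate_lipschitz(A, code_shape, k0, n_iters=20):
    # power iteration on A^T A: L is its largest eigenvalue, i.e. ||A||_op^2
    with torch.no_grad():
        z = torch.randn(code_shape, device=A.device)
        for _ in range(n_iters):
            z = apply_A_adjoint(A, apply_A(A, z, k0), k0)
            z = z / z.norm()
        return apply_A(A, z, k0).pow(2).sum().item()

def csc_layer_forward(x, A, lam, n_iters=30):
    """Approximately solve (4): min_z lam*||z||_1 + 0.5*||x - A(z)||_2^2, via FISTA."""
    N, M, H, W = x.shape
    C, k = A.shape[1], A.shape[-1]
    k0 = k // 2
    L = estimate_lipschitz(A, (1, C, H, W), k0)
    z = torch.zeros(N, C, H, W, device=x.device)
    y, t = z.clone(), 1.0
    for _ in range(n_iters):
        grad = apply_A_adjoint(A, apply_A(A, y, k0) - x, k0)      # gradient of the smooth term at y
        z_next = soft_threshold(y - grad / L, lam / L)            # proximal gradient step
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0         # FISTA momentum schedule
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)
        z, t = z_next, t_next
    return z
```

Because every FISTA iteration is built from differentiable operations, running this loop inside a network yields the unrolled architecture mentioned above, and automatic differentiation provides gradients with respect to both the input $x$ and the dictionary $A$.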
3.2 Sparse Dictionary Learning Network Architecture and Training
Convolution layers are basic ingredients of ConvNets that appear in many common network architectures such as LeNet [48] and ResNet [49]. In this paper, we incorporate sparse modeling into a given existing/baseline network architecture by replacing some or all of its convolution layers with CSC-layers, while all other layers, such as normalization, nonlinear, and fully connected layers, are kept unchanged.
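The following is a sketch of such a drop-in replacement, with hypothetical class and parameter names and reusing the csc_layer_forward routine from the sketch in Sec. 3.1; it is not the released SDNet code.

```python
import torch
import torch.nn as nn

class CSCLayer(nn.Module):
    """Drop-in replacement for a stride-1 nn.Conv2d(in_channels, out_channels, k, padding=k//2)."""
    def __init__(self, in_channels, out_channels, kernel_size, lam=0.1, n_iters=30):
        super().__init__()
        # the convolutional dictionary A in R^{M x C x k x k} is the trainable parameter of the layer
        self.A = nn.Parameter(
            0.01 * torch.randn(in_channels, out_channels, kernel_size, kernel_size))
        self.lam, self.n_iters = lam, n_iters

    def forward(self, x):
        # assumes csc_layer_forward from the Sec. 3.1 sketch is in scope;
        # the unrolled FISTA loop is differentiable end to end
        return csc_layer_forward(x, self.A, self.lam, self.n_iters)

# Hypothetical usage: a tiny CIFAR-style classifier whose stem convolution is a CSC-layer;
# the whole model is trained as usual with the cross-entropy loss.
model = nn.Sequential(
    CSCLayer(in_channels=3, out_channels=64, kernel_size=3, lam=0.1),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)
logits = model(torch.randn(2, 3, 32, 32))   # shape (2, 10)
```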
² Typically, convolution operators “$*$” are used in the definition of the operator $\mathcal{A}$ (see (3)), rather than the correlation operators “$\star$”. We adopt the definition in (3) to be consistent with the convention of modern deep learning packages.