Revisiting Sparse Convolutional Model for Visual
Recognition
Xili Dai^1  Mingyang Li^2*  Pengyuan Zhai^3  Shengbang Tong^4  Xingjian Gao^4
Shao-Lun Huang^2  Zhihui Zhu^5  Chong You^4  Yi Ma^{2,4}
^1 The Hong Kong University of Science and Technology (Guangzhou)
^2 Tsinghua-Berkeley Shenzhen Institute (TBSI), Tsinghua University
^3 Harvard University   ^4 University of California, Berkeley   ^5 Ohio State University
Abstract
Despite strong empirical performance for image classification, deep neural networks are often regarded as “black boxes” and are difficult to interpret. On the other hand, sparse convolutional models, which assume that a signal can be expressed by a linear combination of a few elements from a convolutional dictionary, are powerful tools for analyzing natural images with good theoretical interpretability and biological plausibility. However, such principled models have not demonstrated competitive performance when compared with empirically designed deep networks. This paper revisits sparse convolutional modeling for image classification and bridges the gap between the good empirical performance (of deep learning) and the good interpretability (of sparse convolutional models). Our method uses differentiable optimization layers that are defined from convolutional sparse coding as drop-in replacements for standard convolutional layers in conventional deep neural networks. We show that such models have equally strong empirical performance on the CIFAR-10, CIFAR-100, and ImageNet datasets when compared to conventional neural networks. By leveraging the stable recovery property of sparse modeling, we further show that such models can be made much more robust to input corruptions as well as adversarial perturbations at test time, through a simple and proper trade-off between the sparse regularization and data reconstruction terms. Source code can be found at https://github.com/Delay-Xili/SDNet.
1 Introduction
In recent years, deep learning has been a dominant approach for image classification and has significantly advanced the performance over previous shallow models. Despite the phenomenal empirical success, it has been increasingly realized, as well as criticized, that deep convolutional networks (ConvNets) are “black boxes” for which we have yet to develop a clear understanding [1]. The layer operations such as convolution, nonlinearity, and normalization are geared towards minimizing an end-to-end training loss and do not have much data-specific meaning. As such, the functionality of each intermediate layer in a trained ConvNet is mostly unclear, and the feature maps that it produces are hard to interpret. The lack of interpretability also contributes to the notorious difficulty of enhancing such learning systems for practical data, which are usually corrupted by various forms of perturbation.
This paper presents a visual recognition framework that introduces layers with explicit data modeling to tackle the shortcomings of current deep learning systems. We work under the assumption that the layer input can be represented by a few atoms from a dictionary shared by all data points.
*Equal contribution
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.12945v1 [cs.CV] 24 Oct 2022
This is the classical sparse data modeling that, as shown in the pioneering work of [2], can easily discover meaningful structures from natural image patches. Backed by its ability to learn interpretable representations and its strong theoretical guarantees [3, 4, 5, 6, 7, 8] (e.g., for handling corrupted data), sparse modeling has been used broadly in many signal and image processing applications [9]. However, the empirical performance of sparse methods has been surpassed by deep learning methods for the classification of modern image datasets.
Because of the complementary benefits of sparse modeling and deep learning, there exist many efforts that leverage sparse modeling to gain theoretical insights into ConvNets and/or to develop computational methods that further improve upon existing ConvNets. One of the pioneering works is [10], which interpreted a ConvNet as approximately solving a multi-layer convolutional sparse coding model. Based on this interpretation, the work [10] and its follow-ups [11, 12, 13, 14] presented alternative algorithms and models in order to further enhance the practical performance of such learning systems. However, there has been no empirical evidence that such systems can handle modern image datasets such as ImageNet and obtain performance comparable to deep learning. The only exception, to the best of our knowledge, is the work of [15, 16], which exhibited performance on par with (on ImageNet) or better than (on CIFAR-10) that of ResNet. However, the method in [15, 16] 1) requires a dedicated design of network architecture that may limit its applicability, 2) is computationally orders of magnitude slower to train, and 3) does not demonstrate benefits in terms of interpretability and robustness. In a nutshell, sparse modeling has yet to demonstrate the practicality that would enable its broad application.
Paper contributions.
In this paper, we revisit sparse modeling for image classification and demonstrate, through a simple design, that sparse modeling can be combined with deep learning to obtain performance on par with standard ConvNets but with better layer-wise interpretability and stability. Our method encapsulates the sparse modeling into an implicit layer [17, 18, 19] and uses it as a drop-in replacement for any convolutional layer in standard ConvNets. The layer implements the convolutional sparse coding (CSC) model of [20], in which the input signal is approximated by a sparse linear combination of atoms from a convolutional dictionary, and is referred to as a CSC-layer. Such a convolutional dictionary is treated as the parameters of the CSC-layer, which are amenable to training via back-propagation. The overall network with CSC-layers can then be trained in an end-to-end fashion from labeled data by minimizing the cross-entropy loss as usual. This paper demonstrates that such a learning framework has the following benefits:
Performance on standard datasets.
We demonstrate that our network obtains better (on CIFAR-100) or on-par (on CIFAR-10 and ImageNet) performance with similar training time compared with standard architectures such as ResNet [21]. To the best of our knowledge, this provides the first evidence of the strong empirical performance of sparse modeling for deep learning. Compared to previous sparse methods [15] that obtained similar performance, our method is orders of magnitude faster.
Robustness to input perturbations.
The stable recovery property of the sparse convolutional model equips the CSC-layers with the ability to remove perturbations in the layer input and to recover a clean sparse code. As a result, our networks with CSC-layers are more robust to perturbations in the input images compared with classical neural networks. Unlike existing approaches for obtaining robustness that require heavy data augmentation [22] or additional training techniques [23], our method is lightweight and does not require modifying the training procedure at all.
2 Related Work
Implicit layers.
The idea of trainable layers defined from implicit functions can be traced back at least to the work of [24]. Recently, there has been a revival of interest in implicit layers [17, 18, 19, 25, 26, 27, 28, 29] as an attractive alternative to explicit layers in existing neural networks. However, a majority of the works cited above define an implicit layer by a fixed-point iteration, typically motivated by existing explicit layers such as residual layers; therefore, they do not have a clear interpretation in terms of modeling the layer input. Consequently, such models do not have the ability to deal with input perturbations. The only exceptions are differentiable optimization layers [30, 17, 18, 31] that incorporate complex dependencies between hidden layers through the formulation of convex optimization. Nevertheless, most of the above works focus on differentiating through the convex optimization layers (such as disciplined parametrized programming [18]) without specializing in any particular signal model, such as the sparse models considered in this paper, nor demonstrating their performance when encapsulated in multi-layer neural networks.
Sparse prior in deep learning.
Aside from image classification, sparse modeling has been introduced to deep learning for many image processing tasks such as super-resolution [32], denoising [33], and others [34, 35, 36, 37]. These works incorporate sparse modeling by using network architectures that are motivated by (but are not the same as) an unrolled sparse coding algorithm, LISTA [38]. In sharp contrast to ours, there is no guarantee that such architectures perform a sparse encoding with respect to a particular (convolutional) dictionary at all. As a result, they lack the capability of handling input perturbations that our method provides. A notable exception is the work of [15], where each layer performs a precise sparse encoding and exhibits on-par or better performance for image classification over ResNet. However, the practical benefit of the sparse modeling in terms of robustness is not demonstrated. Moreover, [15] adopts a patch-based sparse coding model for images and carries a large computational burden.
Robustness.
It is known that modern neural networks are extremely vulnerable to small perturbations in the input data. A plethora of techniques have been proposed to address this instability issue, including stability training [23], adversarial training [39, 40, 41], data augmentation [42, 43, 22], etc. Nevertheless, these techniques either incur a computational and memory overhead, or require selecting appropriate augmentation strategies to cover all possible corruptions. With standard training only, our model can be made robust to input perturbations in test data by simply adapting the sparse modeling to account for noise. Closely related to our work are [44, 45, 46], which use sparse modeling to improve adversarial robustness. However, they either only demonstrate performance on very simple networks [45, 46] or sacrifice natural accuracy for robustness [44]. In contrast, our method is tested on realistic networks and does not affect natural accuracy.
3 Neural Networks with Sparse Modeling
In this section, we show how sparse modeling is incorporated into a deep network via a specific type of network layer that we refer to as the convolutional sparse coding (CSC) layer. We describe the CSC-layer in Sec. 3.1 and explain how we use it for deep learning in Sec. 3.2. Finally, Sec. 3.3 explains how CSC enables robust inference with corrupted test data.
Notations.
Given a single-channel image $\xi \in \mathbb{R}^{H \times W}$ represented as a matrix, we may treat it as a 2D signal defined on the discrete domain $[1, \ldots, H] \times [1, \ldots, W]$ and extended to $\mathbb{Z} \times \mathbb{Z}$ by padding zeros. Given a 2D kernel $\alpha \in \mathbb{R}^{k \times k}$, we may treat it as a 2D signal defined on the discrete domain $[-k_0, \ldots, k_0] \times [-k_0, \ldots, k_0]$ with $k = 2k_0 + 1$ and extended to $\mathbb{Z} \times \mathbb{Z}$ by padding zeros. Then, for convenience, we use “$*$” and “$\star$” to denote the convolution and correlation operators, respectively, between two 2D signals:
$$
(\alpha * \xi)[i, j] \doteq \sum_{p}\sum_{q} \xi[i - p,\, j - q] \cdot \alpha[p, q],
\qquad
(\alpha \star \xi)[i, j] \doteq \sum_{p}\sum_{q} \xi[i + p,\, j + q] \cdot \alpha[p, q].
\tag{1}
$$
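For concreteness, here is a minimal sketch of the two operators in (1), written as plain NumPy loops that mirror the definitions with zero padding. This is only an illustration of the notation (the function names are my own), not code from the paper.

```python
import numpy as np

def corr2d_zero_pad(alpha, xi):
    """(alpha ⋆ xi)[i, j] = sum_{p,q} xi[i+p, j+q] * alpha[p, q],
    with alpha indexed on [-k0, ..., k0]^2 and xi extended by zero padding."""
    H, W = xi.shape
    k = alpha.shape[0]
    k0 = (k - 1) // 2
    padded = np.pad(xi, k0)          # padded[a, b] = xi[a - k0, b - k0], zeros outside the image
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * alpha)
    return out

def conv2d_zero_pad(alpha, xi):
    """(alpha * xi): convolution equals correlation with the kernel flipped along both axes."""
    return corr2d_zero_pad(alpha[::-1, ::-1], xi)

xi = np.random.randn(8, 8)           # single-channel signal in R^{H x W}
alpha = np.random.randn(3, 3)        # kernel with k = 3, i.e. k0 = 1
print(conv2d_zero_pad(alpha, xi).shape, corr2d_zero_pad(alpha, xi).shape)   # (8, 8) (8, 8)
```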
3.1 Convolutional Sparse Coding (CSC) Layer
Sparse modeling is introduced in the form of an implicit layer of a neural network. Unlike classical
fully-connected or convolutional layers in which input-output relations are defined by an explicit
function, implicit layers are defined from implicit functions. For our case, in particular, we introduce
an implicit layer that is defined from an optimization problem involving the input to the layer as well
as a weight parameter, where the output of the layer is the solution to the optimization problem.
A generative model via sparse convolution.
Concretely, consider a multi-dimensional input signal $x \in \mathbb{R}^{M \times H \times W}$ to the layer, where $H, W$ are the spatial dimensions and $M$ is the number of channels of $x$. We assume the signal $x$ is generated by a multi-channel sparse code $z \in \mathbb{R}^{C \times H \times W}$ convolving with a multi-dimensional kernel $A \in \mathbb{R}^{M \times C \times k \times k}$, which is referred to as a convolutional dictionary. Here $C$ is the number of channels of $z$ and of the convolution kernel $A$. To be more precise, we denote $z$ as $z \doteq (\zeta_1, \ldots, \zeta_C)$, where each $\zeta_c \in \mathbb{R}^{H \times W}$ is presumably sparse, and denote the kernel $A$ as
$$
A \doteq \begin{pmatrix}
\alpha_{11} & \alpha_{12} & \alpha_{13} & \cdots & \alpha_{1C} \\
\alpha_{21} & \alpha_{22} & \alpha_{23} & \cdots & \alpha_{2C} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\alpha_{M1} & \alpha_{M2} & \alpha_{M3} & \cdots & \alpha_{MC}
\end{pmatrix} \in \mathbb{R}^{M \times C \times k \times k},
\tag{2}
$$
where each $\alpha_{ij} \in \mathbb{R}^{k \times k}$ is a kernel of size $k \times k$. Then the signal $x$ is generated via the following operator $\mathcal{A}(\cdot)$ defined by the kernel $A$:
$$
x = \mathcal{A}(z) \doteq \Big( \sum_{c=1}^{C} \alpha_{1c} \star \zeta_c,\; \ldots,\; \sum_{c=1}^{C} \alpha_{Mc} \star \zeta_c \Big) \in \mathbb{R}^{M \times H \times W}.
\tag{3}
$$
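As a minimal sketch (my own illustration with hypothetical names, not the authors' released implementation), the operator $\mathcal{A}$ in (3) can be realized with a standard deep learning convolution primitive, since PyTorch's F.conv2d computes exactly the cross-correlation "⋆" used above:

```python
import torch
import torch.nn.functional as F

M, C, k, H, W = 16, 32, 3, 28, 28   # k = 2*k0 + 1, so k0 = 1
A = torch.randn(M, C, k, k)         # convolutional dictionary A in R^{M x C x k x k}
z = torch.randn(C, H, W)            # multi-channel code z = (zeta_1, ..., zeta_C); dense here for illustration

def apply_A(A, z):
    """x = A(z): output channel m is sum_c (alpha_{mc} ⋆ zeta_c), with zero padding."""
    k0 = A.shape[-1] // 2
    return F.conv2d(z.unsqueeze(0), A, padding=k0).squeeze(0)   # shape (M, H, W)

x = apply_A(A, z)
print(x.shape)                      # torch.Size([16, 28, 28])
```

This is consistent with footnote 2 below: the correlation convention in (3) matches what modern deep learning packages implement as "convolution".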
A layer as convolutional sparse coding.
Given a multi-dimensional input signal $x \in \mathbb{R}^{M \times H \times W}$, we define that the function of “a layer” is to perform an (inverse) mapping to a preferably sparse output $z \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of output channels. Under the above sparse generative model, we can seek the optimal sparse solution $z^{\star}$ by solving the following Lasso-type optimization problem:
$$
z^{\star} = \arg\min_{z}\; \lambda \|z\|_1 + \frac{1}{2} \|x - \mathcal{A}(z)\|_2^2 \;\in\; \mathbb{R}^{C \times H \times W}.
\tag{4}
$$
Figure 1: Illustration of the operator $\mathcal{A}$ in the convolutional sparse coding model for the CSC-layer.
The optimization problem in (4) is based on the convolutional sparse coding (CSC) model [20].² Hence, we refer to the implicit layer defined by (4) as a CSC-layer. The goal of the CSC model is to reconstruct the input $x$ via $\mathcal{A}(z)$, where the feature map $z$ specifies the locations and magnitudes of the convolutional filters in $A$ to be linearly combined (see Figure 1 for an illustration). The reconstruction is not required to be exact, in order to tolerate modeling discrepancies, and the difference between $x$ and $\mathcal{A}(z)$ is penalized by its entry-wise $\ell_2$-norm (i.e., the $\ell_2$ norm of $x - \mathcal{A}(z)$ flattened into a vector). Sparse modeling is introduced by the entry-wise $\ell_1$-norm of $z$ in the objective function, which enforces $z$ to be sparse. The parameter $\lambda > 0$ controls the trade-off between the sparsity of $z$ and the magnitude of the residual $x - \mathcal{A}(z)$, and is treated as a hyper-parameter that is subject to tuning via cross-validation. As we will show in Sec. 3.3, $\lambda$ can be used to improve the performance of our model in the test phase when the input is corrupted.
Based on the input-output mapping of the CSC-layer given in (4), one may perform forward propagation by solving the associated optimization problem, and perform backward propagation by deriving the gradient of $z^{\star}$ with respect to the input $x$ and the parameter $A$. In this paper, we adopt the fast iterative shrinkage-thresholding algorithm (FISTA) [47] for the forward propagation, which also produces an unrolled network architecture that can carry out automatic differentiation for the backward propagation. We defer a discussion of the implementation details of the CSC-layer to the Appendix.
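To make the above concrete, here is a hedged sketch of a FISTA-based forward pass for (4), under my own simplifying assumptions (batched inputs, a fixed number of iterations, and a step size 1/L estimated by power iteration); the paper's actual implementation may differ and is deferred to its appendix.

```python
import torch
import torch.nn.functional as F

def apply_A(A, z, k0):
    # x = A(z): cross-correlation with the dictionary, as in (3)
    return F.conv2d(z, A, padding=k0)

def apply_A_adjoint(A, x, k0):
    # adjoint (transpose) of A, used for the gradient of the data-fidelity term
    return F.conv_transpose2d(x, A, padding=k0)

def soft_threshold(v, tau):
    # proximal operator of tau * ||.||_1
    return torch.sign(v) * torch.clamp(v.abs() - tau, min=0.0)

def estimate_lipschitz(A, code_shape, k0, n_iters=20):
    # power iteration on A^T A: L is its largest eigenvalue, i.e. ||A||_op^2
    with torch.no_grad():
        z = torch.randn(code_shape, device=A.device)
        for _ in range(n_iters):
            z = apply_A_adjoint(A, apply_A(A, z, k0), k0)
            z = z / z.norm()
        return apply_A(A, z, k0).pow(2).sum().item()

def csc_layer_forward(x, A, lam, n_iters=30):
    """Approximately solve (4): min_z lam*||z||_1 + 0.5*||x - A(z)||_2^2, via FISTA."""
    N, M, H, W = x.shape
    C, k = A.shape[1], A.shape[-1]
    k0 = k // 2
    L = estimate_lipschitz(A, (1, C, H, W), k0)
    z = torch.zeros(N, C, H, W, device=x.device)
    y, t = z.clone(), 1.0
    for _ in range(n_iters):
        grad = apply_A_adjoint(A, apply_A(A, y, k0) - x, k0)      # gradient of the smooth term at y
        z_next = soft_threshold(y - grad / L, lam / L)            # proximal gradient step
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0         # FISTA momentum schedule
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)
        z, t = z_next, t_next
    return z
```

Because every FISTA iteration is built from differentiable operations, running this loop inside a network yields the unrolled architecture mentioned above, and automatic differentiation provides gradients with respect to both the input $x$ and the dictionary $A$.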
3.2 Sparse Dictionary Learning Network Architecture and Training
Convolution layers are basic ingredients of ConvNets that appear in many common network architectures such as LeNet [48] and ResNet [49]. In this paper, we incorporate sparse modeling into a given existing/baseline network architecture by replacing some or all of its convolution layers with CSC-layers, while all other layers, such as normalization, nonlinear, and fully connected layers, are kept unchanged.
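The following is a sketch of such a drop-in replacement, with hypothetical class and parameter names and reusing the csc_layer_forward routine from the sketch in Sec. 3.1; it is not the released SDNet code.

```python
import torch
import torch.nn as nn

class CSCLayer(nn.Module):
    """Drop-in replacement for a stride-1 nn.Conv2d(in_channels, out_channels, k, padding=k//2)."""
    def __init__(self, in_channels, out_channels, kernel_size, lam=0.1, n_iters=30):
        super().__init__()
        # the convolutional dictionary A in R^{M x C x k x k} is the trainable parameter of the layer
        self.A = nn.Parameter(
            0.01 * torch.randn(in_channels, out_channels, kernel_size, kernel_size))
        self.lam, self.n_iters = lam, n_iters

    def forward(self, x):
        # assumes csc_layer_forward from the Sec. 3.1 sketch is in scope;
        # the unrolled FISTA loop is differentiable end to end
        return csc_layer_forward(x, self.A, self.lam, self.n_iters)

# Hypothetical usage: a tiny CIFAR-style classifier whose stem convolution is a CSC-layer;
# the whole model is trained as usual with the cross-entropy loss.
model = nn.Sequential(
    CSCLayer(in_channels=3, out_channels=64, kernel_size=3, lam=0.1),
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)
logits = model(torch.randn(2, 3, 32, 32))   # shape (2, 10)
```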
² Typically, convolution operators “$*$” are used in the definition of the operator $\mathcal{A}$ (see (3)), rather than the correlation operators “$\star$”. We adopt the definition in (3) to be consistent with the convention of modern deep learning packages.