S4ND: Modeling Images and Videos as Multidimensional Signals
Using State Spaces
Eric Nguyen*†, Karan Goel*‡, Albert Gu*‡,
Gordon W. Downs‡, Preey Shah‡, Tri Dao‡, Stephen A. Baccus§, Christopher Ré‡
†Department of BioEngineering, Stanford University
‡Department of Computer Science, Stanford University
§Department of Neurobiology, Stanford University
{etnguyen,albertgu,gwdowns,preey,trid,baccus}@stanford.edu
{kgoel,chrismre}@cs.stanford.edu
*Equal contribution.
October 17, 2022
Abstract
Visual data such as images and videos are typically modeled as discretizations of inherently continuous, multidimensional signals. Existing continuous-signal models attempt to exploit this fact by modeling the underlying signals of visual (e.g., image) data directly. However, these models have not yet been able to achieve competitive performance on practical vision tasks such as large-scale image and video classification. Building on a recent line of work on deep state space models (SSMs), we propose S4ND, a new multidimensional SSM layer that extends the continuous-signal modeling ability of SSMs to multidimensional data including images and videos. We show that S4ND can model large-scale visual data in 1D, 2D, and 3D as continuous multidimensional signals and demonstrates strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models. On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by 1.5% when training with a 1D sequence of patches, and matches ConvNeXt when modeling images in 2D. For videos, S4ND improves on an inflated 3D ConvNeXt in activity classification on HMDB-51 by 4%. S4ND implicitly learns global, continuous convolutional kernels that are resolution invariant by construction, providing an inductive bias that enables generalization across multiple resolutions. By developing a simple bandlimiting modification to S4 to overcome aliasing, S4ND achieves strong zero-shot (unseen at training time) resolution performance, outperforming a baseline Conv2D by 40% on CIFAR-10 when trained on 8×8 and tested on 32×32 images. When trained with progressive resizing, S4ND comes within 1% of a high-resolution model while training 22% faster.
1 Introduction
Modeling visual data such as images and videos is a canonical problem in deep learning. In the last few years, many modern deep learning backbones that achieve strong performance on benchmarks like ImageNet [53] have been proposed. These backbones are diverse, and include 1D sequence models such as the Vision Transformer (ViT) [13], which treats images as sequences of patches, and 2D and 3D models that use local convolutions over images and videos (ConvNets) [15, 24, 25, 32, 37, 42, 49, 56, 58, 60, 64].
A commonality among modern vision models capable of achieving state-of-the-art (SotA) performance is that they treat visual data as discrete pixels rather than continuous signals. However, images and videos are discretizations of multidimensional and naturally continuous signals, sampled at a fixed rate in the spatial and temporal dimensions.
[Figure 1 schematic: a multidimensional input u(t) is mapped to an output y(t) through a continuous kernel K(t); panels show the S4ND layer, kernel discretization (step size ∆), and S4 bandlimiting. Caption below.]
Figure 1: (S4ND.) (Parameters in red.) (Top) S4ND can be viewed as a depthwise convolution that maps a multidimensional input (black) to output (green) through a continuous convolution kernel (blue). (Bottom Left) The kernel can be interpreted as a linear combination (controlled by C) of basis functions (controlled by A, B) with flexible width (controlled by step size ∆). For structured C, the kernel can further be factored as a low-rank tensor product of 1D kernels, and can be interpreted as independent S4 transformations on each dimension. (Bottom Right) Choosing A, B appropriately yields Fourier basis functions with controllable frequencies. To avoid aliasing in the final discrete kernels, the coefficients of C corresponding to high frequencies can simply be masked out.
Ideally, we would want approaches that are capable of recognizing this distinction between data and signal, and directly model the underlying continuous signals. This would give them capabilities such as adapting the model to data sampled at different resolutions.
A natural approach to building such models is to parameterize and learn continuous convolutional kernels, which can then be sampled differently for data at different resolutions [16, 20, 21, 50, 54]. Among these, deep state space models (SSMs) [20], in particular S4 [21], have achieved SotA results in modeling sequence data derived from continuous signals, such as audio [17]. However, a key limitation of SSMs is that they were developed for 1D signals, and cannot directly be applied to visual data derived from multidimensional ("ND") signals. Given that 1D SSMs outperform other continuous modeling solutions for sequence data [21], and have had preliminary success on image [21] and video classification [31], we hypothesize that they may be well suited to modeling visual data when appropriately generalized to the setting of multidimensional signals.
Our main contribution is S4ND, a new deep learning layer that extends S4 to multidimensional signals. The
key idea is to turn the standard SSM (a 1D ODE) into a multidimensional PDE governed by an independent
SSM per dimension. By adding additional structure to this ND SSM, we show that it is equivalent to an
ND continuous convolution that can be factored into a separate 1D SSM convolution per dimension. This
results in a model that is efficient and easy to implement, using the standard 1D S4 layer as a black box.
Furthermore, it can be controlled by S4's parameterization, allowing it to model both long-range dependencies and finite windows with a learnable window size that generalize conventional local convolutions [22].
We show that S4ND can be used as a drop-in replacement in strong modern vision architectures while matching or improving performance in 1D, 2D, and 3D. With minimal change to the training procedure, replacing the self-attention in ViT with S4-1D improves top-1 accuracy by 1.5%, and replacing the convolution layers in a 2D ConvNeXt backbone [42] with S4-2D preserves its performance on ImageNet-1k [10]. Simply inflating (temporally) this pretrained S4-2D-ConvNeXt backbone to 3D improves video activity classification results on HMDB-51 [38] by 4 points over the pretrained ConvNeXt baseline. Notably, we use S4ND as global kernels that span the entire input shape, which enables it to have global context (both spatially and temporally) in every layer of a network.
Additionally, we propose a low-pass bandlimiting modification to S4 that encourages the learned convolutional kernels to be smooth. While S4ND can be used at any resolution, performance suffers when moving between resolutions due to aliasing artifacts in the kernel, an issue also noted by prior work on continuous models [50].
While S4 was capable of transferring between different resolutions on audio data [21], visual data presents a greater challenge due to the scale-invariant properties of images in space and time [52], as sampled images with more distant objects are more likely to contain power at frequencies above the Nyquist cutoff frequency. Motivated by this, we propose a simple criterion that masks out frequencies in the S4ND kernel that lie above the Nyquist cutoff frequency.
The continuous-signal modeling capabilities of S4ND open the door to new training recipes, such as the ability to train and test at different resolutions. On the standard CIFAR-10 [36] and Celeb-A [43] datasets, S4ND degrades by as little as 1.3% when upsampling from low- to high-resolution data (e.g., 128×128 → 160×160), and can be used to facilitate progressive resizing to speed up training by 22% with a 1% drop in final accuracy compared to training at the high resolution alone. We also validate that our new bandlimiting method is critical to these capabilities, with ablations showing absolute performance degradation of up to 20%+ without it.
2 Related Work
Image Classification.
There is a long line of work in image classification, with much of the 2010s dominated by ConvNet backbones [25, 37, 56, 58, 60]. Recently, Transformer backbones such as ViT [13] have achieved SotA performance on images using self-attention over a sequence of 1D patches [12, 39, 40, 62, 71]. Their scaling behavior in both model and dataset training size is believed to give them an inherent advantage over ConvNets [13], even with minimal inductive bias. Liu et al. [42] introduce ConvNeXt, which modernizes the standard ResNet architecture [25] using modern training techniques, matching the performance of Transformers on image classification. We select a backbone in the 1D and 2D settings, ViT and ConvNeXt, to convert into continuous-signal models by replacing the multi-headed self-attention layers in ViT and the standard Conv2D layers in ConvNeXt with S4ND layers, maintaining their top-1 accuracy on large-scale image classification.
S4 & Video Classification.
To handle the long-range dependencies inherent in videos, [31] used 1D S4 for video classification on the Long-form Video Understanding dataset [67]. They first applied a Transformer to each frame to obtain a sequence of patch embeddings for each video frame independently, followed by a standard 1D S4 to model across the concatenated sequence of patches. This is akin to previous methods that learned spatial and temporal information separately [33], for example using ConvNets on single frames, followed by an LSTM [27] to aggregate temporal information. In contrast, modern video architectures such as 3D ConvNets and Transformers [1, 2, 15, 24, 32, 35, 41, 49, 64, 67] show stronger results when learning spatiotemporal features simultaneously, which the generalization of S4ND to multiple dimensions now enables us to do.
Continuous-signal Models.
Visual data are discretizations of naturally continuous signals that possess extensive structure in the joint distribution of spatial frequencies, including the properties of scale and translation invariance. For example, an object in an image generates correlations between lower and higher frequencies that arise in part from phase alignment at edges [47]. As an object changes distance in the image, these correlations remain the same but the frequencies shift. This relationship can potentially be learned from a coarsely sampled image and then applied at higher frequencies at higher resolution.

A number of continuous-signal models have been proposed for the visual domain to learn these inductive biases, and have led to additional desirable properties and capabilities. A classic example of continuous-signal driven processing is the fast Fourier transform, which is routinely used for filtering and data consistency in computational and medical imaging [11]. NeRF represents a static scene as a continuous function, allowing it to render scenes smoothly from multiple viewpoints [45]. CKConv [51] learns a continuous representation to create kernels of arbitrary size for several data types including images, with additional benefits such as the ability to handle irregularly sampled data. FlexConv [50] extends this work with a learned kernel size, and shows that images can be trained at low resolution and tested at high resolution if the aliasing problem is addressed. S4 [21] increased the ability to model long-range dependencies using continuous kernels, allowing SSMs to achieve SotA on sequential CIFAR [36]. However, these methods, including 1D S4, have been applied to relatively low-dimensional data, e.g., time series and small image datasets. S4ND is the first continuous-signal model applied to high-dimensional visual data with the ability to maintain SotA performance on large-scale image and video classification.
Progressive Resizing.
Training times for large-scale image classification can be quite long, a trend exacerbated by the emergence of foundation models [4]. A number of strategies have emerged for reducing overall training time. Fix-Res [61] trains entirely at a lower resolution, then fine-tunes at the higher test resolution, speeding up training in a two-stage process. Mix-and-Match [28] randomly samples low and high resolutions during training in an interleaved manner. An effective method to reduce training time on images is progressive resizing: training at a lower resolution and gradually upsampling in stages. For instance, fastai [14] used progressive resizing to train an ImageNet model in under 4 hours. EfficientNetV2 [60] coupled resizing with a progressive regularization schedule, increasing the regularization as resolution grows to maintain accuracy. In EfficientNetV2 and the other approaches described, the models eventually train on the final test resolution. As a continuous-signal model, we demonstrate that S4ND is naturally suited to progressive resizing, while being able to generalize to unseen resolutions at test time.
3 Preliminaries
State space models.
S4 investigated state space models, which are linear time-invariant systems that map signals $u(t) \mapsto y(t)$ and can be represented either as a linear ODE (equation (1)) or as a convolution (equation (2)). Its parameters are $A \in \mathbb{C}^{N \times N}$ and $B, C \in \mathbb{C}^{N}$ for a state size $N$.

$$x'(t) = A x(t) + B u(t), \qquad y(t) = C x(t) \tag{1}$$

$$K(t) = C e^{tA} B, \qquad y(t) = (K \ast u)(t) \tag{2}$$
Basis functions.
For the clearest intuition, we think of the convolution kernel as a linear combination (controlled by $C$) of basis kernels $K_n(t)$ (controlled by $A, B$):

$$K(t) = \sum_{n=0}^{N-1} C_n K_n(t), \qquad K_n(t) = (e^{tA} B)_n \tag{3}$$
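As a concrete illustration of (2) and (3), the continuous kernel can be materialized by sampling the matrix exponential on a time grid. The sketch below uses a random stable $A$ purely as a stand-in; S4 instead prescribes particular $(A, B)$ matrices (see "S4" below).

```python
# A minimal sketch of equations (2)-(3). The random stable A here is a
# stand-in for illustration only; S4 prescribes special (A, B) matrices.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
N = 4                                             # state size
A = rng.standard_normal((N, N)) - 2 * np.eye(N)   # shifted toward stability
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))

def basis_kernels(t):
    """The N basis kernels K_n(t) = (e^{tA} B)_n at time t."""
    return expm(t * A) @ B                        # shape (N, 1)

def kernel(t):
    """K(t) = sum_n C_n K_n(t) = C e^{tA} B."""
    return (C @ basis_kernels(t)).item()

ts = np.linspace(0.0, 1.0, 100)                   # sample the continuous kernel
K = np.array([kernel(t) for t in ts])
```

Changing $C$ reweights the same basis kernels, which is the sense in which $C$ controls the kernel while $(A, B)$ control the basis.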
Discretization.
The SSM (1) is defined over a continuous-time axis and produces continuous-time convolution kernels (2), (3). Given a discrete input sequence $u_0, u_1, \dots$ sampled uniformly from an underlying signal $u(t)$ at a step size $\Delta$ (i.e., $u_k = u(k\Delta)$), the kernel can be sampled to match the rate of the input. Note that instead of directly sampling the kernel, standard discretization rules should be applied to minimize the error from the discrete to the continuous-time kernel [21]. For inputs given at different resolutions, the model can then simply change its $\Delta$ value to compute the kernel at different resolutions.

We note that the step size $\Delta$ does not have to be exactly equal to a "true sampling rate" of the underlying signal; only the relative rate matters. Concretely, the discrete-time kernel depends only on the products $\Delta A$ and $\Delta B$, and S4 learns separate parameters $\Delta, A, B$.
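As a sketch of this step, the bilinear (Tustin) discretization rule used by S4 can be applied to sample a discrete kernel at any step size; the parameters below are random stand-ins, and the handling of $\Delta$ is the point.

```python
# A sketch of discretization and resolution change via the bilinear rule.
# A, B, C are random stand-ins for the continuous parameters of (1).
import numpy as np

rng = np.random.default_rng(0)
N = 4
A = rng.standard_normal((N, N)) - 2 * np.eye(N)
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))

def discrete_kernel(A, B, C, dt, L):
    """Sample an L-step discrete kernel at step size dt (bilinear rule)."""
    I = np.eye(A.shape[0])
    Ab = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)  # discrete-time A
    Bb = np.linalg.solve(I - dt / 2 * A, dt * B)          # discrete-time B
    K, x = [], Bb
    for _ in range(L):
        K.append((C @ x).item())                          # K_k = C Ab^k Bb
        x = Ab @ x
    return np.array(K)

# Only the relative rate matters: to run the same model on input sampled
# at twice the resolution, halve dt and double L -- no retraining needed.
K_lo = discrete_kernel(A, B, C, dt=1 / 32, L=32)
K_hi = discrete_kernel(A, B, C, dt=1 / 64, L=64)
```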
S4.
S4 is a special SSM with prescribed $(A, B)$ matrices that define well-behaved basis functions, and an algorithm that allows the convolution kernel to be computed efficiently. Variants of S4 exist that define different basis functions, such as simple diagonal SSMs [23], or one that defines truncated Fourier functions $K_n(t) = \sin(2\pi n t)\,\mathbb{1}_{[0,1]}(t)$ [22] (Fig. 1). These versions of S4 have easy-to-interpret basis functions that will allow us to control the frequencies in the kernel (Section 4.2).
4 Method
We describe the proposed S4ND model for the 2D case only, for ease of notation and presentation. The results
extend readily to general dimensions; full statements and proofs for the general case are in Appendix A.
Section 4.1 describes the multidimensional S4ND layer, and Section 4.2 describes our simple modification to
restrict frequencies in the kernels. Fig. 1 illustrates the complete S4ND layer.
4.1 S4ND
We begin by generalizing the (linear time-invariant) SSM (1) to higher dimensions. Notationally, we denote the individual time axes with superscripts in parentheses. Let $u = u(t^{(1)}, t^{(2)})$ and $y = y(t^{(1)}, t^{(2)})$ be the input and output, which are signals $\mathbb{R}^2 \to \mathbb{R}$, and let $x = (x^{(1)}(t^{(1)}, t^{(2)}),\, x^{(2)}(t^{(1)}, t^{(2)}))$ be the SSM state of dimension $N^{(1)} \times N^{(2)}$, where $x^{(\tau)} : \mathbb{R}^2 \to \mathbb{C}^{N^{(\tau)}}$.
Definition 1 (Multidimensional SSM). Given parameters $A^{(\tau)} \in \mathbb{C}^{N^{(\tau)} \times N^{(\tau)}}$, $B^{(\tau)} \in \mathbb{C}^{N^{(\tau)} \times 1}$, $C \in \mathbb{C}^{N^{(1)} \times N^{(2)}}$, the 2D SSM is the map $u \mapsto y$ defined by the linear PDE with initial condition $x(0, 0) = 0$:

$$\begin{aligned}
\frac{\partial}{\partial t^{(1)}} x(t^{(1)}, t^{(2)}) &= (A^{(1)} x^{(1)}(t^{(1)}, t^{(2)}),\; x^{(2)}(t^{(1)}, t^{(2)})) + B^{(1)} u(t^{(1)}, t^{(2)}) \\
\frac{\partial}{\partial t^{(2)}} x(t^{(1)}, t^{(2)}) &= (x^{(1)}(t^{(1)}, t^{(2)}),\; A^{(2)} x^{(2)}(t^{(1)}, t^{(2)})) + B^{(2)} u(t^{(1)}, t^{(2)}) \\
y(t^{(1)}, t^{(2)}) &= \langle C,\; x(t^{(1)}, t^{(2)}) \rangle
\end{aligned} \tag{4}$$
Note that Definition 1 differs from the usual notion of multidimensional SSM, which is simply a map from $u(t) \in \mathbb{R}^{n} \mapsto y(t) \in \mathbb{R}^{m}$ for higher-dimensional $n, m > 1$ but still with a single time axis. However, Definition 1 is a map from $u(t^{(1)}, t^{(2)}) \in \mathbb{R} \mapsto y(t^{(1)}, t^{(2)}) \in \mathbb{R}$ for scalar inputs/outputs but over multiple time axes. When thinking of the input $u(t^{(1)}, t^{(2)})$ as a function over a 2D grid, Definition 1 can be thought of as a simple linear PDE that just runs a standard 1D SSM over each axis independently.
Analogous to equation (2), the 2D SSM can also be viewed as a multidimensional convolution.

Theorem 1. (4) is a time-invariant system that is equivalent to a 2D convolution $y = K \ast u$ by the kernel

$$K(t^{(1)}, t^{(2)}) = \langle C,\; (e^{t^{(1)} A^{(1)}} B^{(1)}) \otimes (e^{t^{(2)} A^{(2)}} B^{(2)}) \rangle \tag{5}$$

This kernel is a linear combination of the $N^{(1)} \times N^{(2)}$ basis kernels $\{ K^{(1)}_{n^{(1)}}(t^{(1)}) \, K^{(2)}_{n^{(2)}}(t^{(2)}) : n^{(1)} \in [N^{(1)}],\, n^{(2)} \in [N^{(2)}] \}$, where $K^{(\tau)}$ are the standard 1D SSM kernels (3) for each axis.
However, a limitation of this general form is that the number of basis functions $N^{(1)} \times N^{(2)} \times \cdots$ grows exponentially in the dimension, increasing the parameter count (of $C$) and overall computation dramatically. This can be mitigated by factoring $C$ as a low-rank tensor.
Corollary 4.1. Suppose that $C \in \mathbb{C}^{N^{(1)} \times N^{(2)}}$ is a low-rank tensor $C = \sum_{i=1}^{r} C^{(1)}_i \otimes C^{(2)}_i$, where each $C^{(\tau)}_i \in \mathbb{C}^{N^{(\tau)}}$. Then the kernel (5) also factors as a tensor product of 1D kernels:

$$K(t^{(1)}, t^{(2)}) = \sum_{i=1}^{r} K^{(1)}_i(t^{(1)}) \, K^{(2)}_i(t^{(2)}) := \sum_{i=1}^{r} (C^{(1)}_i e^{t^{(1)} A^{(1)}} B^{(1)}) (C^{(2)}_i e^{t^{(2)} A^{(2)}} B^{(2)})$$
In our experiments, we choose $C$ as a rank-1 tensor, but the rank can be freely adjusted to trade off parameters and computation for expressivity. Using the equivalence between (1) and (2), Corollary 4.1 also has the simple interpretation of defining an independent 1D SSM along each axis of the multidimensional input; a small numerical sketch of this rank-1 case follows.
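For intuition, the rank-1 case can be sketched numerically: two 1D kernels are combined by an outer product and applied as a global depthwise convolution via the 2D FFT. The kernels below are random placeholders standing in for the sampled per-axis S4 kernels (e.g., from the discretization sketch above), and the circular boundary handling is a simplification rather than the exact implementation.

```python
# A sketch of Corollary 4.1 with rank r = 1. k1 and k2 are placeholders
# for two independently parameterized 1D S4 kernels sampled along each axis.
import numpy as np

rng = np.random.default_rng(0)
H, W = 32, 32
u = rng.standard_normal((H, W))        # one input channel on a 2D grid
k1 = rng.standard_normal(H)            # 1D kernel along axis 1 (placeholder)
k2 = rng.standard_normal(W)            # 1D kernel along axis 2 (placeholder)
K2d = np.outer(k1, k2)                 # K(t1, t2) = K1(t1) * K2(t2)

# Global depthwise (circular) convolution in O(HW log HW) via the 2D FFT.
# Equivalently: convolve along each axis independently with its 1D kernel.
y = np.fft.ifft2(np.fft.fft2(u) * np.fft.fft2(K2d)).real
```

Because the 2D kernel is never stored beyond this outer product, the layer's parameters and kernel computation stay linear in the number of axes rather than exponential.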
4.2 Resolution Change and Bandlimiting
SSMs in 1D have shown strong performance in the audio domain, and can nearly preserve full accuracy when tested zero-shot on inputs sampled at very different frequencies [21]. This capability relies simply on scaling $\Delta$ by the relative change in frequencies (i.e., if the input resolution is doubled, halve the SSM's $\Delta$ parameter). However, sampling rates in the spatial domain are often much lower than in the temporal domain, leading to potential aliasing when changing resolutions. A standard technique to avoid aliasing is to apply a low-pass filter to remove frequencies above the Nyquist cutoff frequency.
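A minimal sketch of this masking, assuming the truncated Fourier basis of Section 3 and an exact Nyquist cutoff (the cutoff used in practice may be a tunable fraction of it), is:

```python
# A sketch of bandlimiting with the truncated Fourier basis
# K_n(t) = sin(2*pi*n*t). The exact-Nyquist cutoff below is illustrative.
import numpy as np

N, L = 64, 32                          # number of basis functions, kernel length
dt = 1.0 / L                           # step size at this resolution
ts = dt * np.arange(L)
freqs = np.arange(N)                   # basis function n oscillates n times on (0, 1)
basis = np.sin(2 * np.pi * freqs[:, None] * ts)      # (N, L) sampled basis

coef = np.random.default_rng(0).standard_normal(N)   # learned C (stand-in)
nyquist = 1.0 / (2 * dt)               # highest frequency representable at dt
mask = freqs < nyquist                 # drop coefficients above the cutoff
K = (coef * mask) @ basis              # bandlimited discrete kernel, shape (L,)
```

Because the mask depends only on $\Delta$, the same learned coefficients yield a properly bandlimited kernel at every resolution.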