S4ND: Modeling Images and Videos as Multidimensional Signals
Using State Spaces
Eric Nguyen*†, Karan Goel*‡, Albert Gu*‡,
Gordon W. Downs‡, Preey Shah‡, Tri Dao‡, Stephen A. Baccus§, Christopher Ré‡
†Department of BioEngineering, Stanford University
‡Department of Computer Science, Stanford University
§Department of Neurobiology, Stanford University
{etnguyen,albertgu,gwdowns,preey,trid,baccus}@stanford.edu
{kgoel,chrismre}@cs.stanford.edu
*Equal contribution.
October 17, 2022
Abstract
Visual data such as images and videos are typically modeled as discretizations of inherently continuous, multidimensional signals. Existing continuous-signal models attempt to exploit this fact by modeling the underlying signals of visual (e.g., image) data directly. However, these models have not yet been able to achieve competitive performance on practical vision tasks such as large-scale image and video classification. Building on a recent line of work on deep state space models (SSMs), we propose S4ND, a new multidimensional SSM layer that extends the continuous-signal modeling ability of SSMs to multidimensional data including images and videos. We show that S4ND can model large-scale visual data in 1D, 2D, and 3D as continuous multidimensional signals and demonstrates strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models. On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by 1.5% when training with a 1D sequence of patches, and matches ConvNeXt when modeling images in 2D. For videos, S4ND improves on an inflated 3D ConvNeXt in activity classification on HMDB-51 by 4%. S4ND implicitly learns global, continuous convolutional kernels that are resolution invariant by construction, providing an inductive bias that enables generalization across multiple resolutions. By developing a simple bandlimiting modification to S4 to overcome aliasing, S4ND achieves strong zero-shot (unseen at training time) resolution performance, outperforming a baseline Conv2D by 40% on CIFAR-10 when trained on 8×8 and tested on 32×32 images. When trained with progressive resizing, S4ND comes within 1% of a high-resolution model while training 22% faster.
1 Introduction
Modeling visual data such as images and videos is a canonical problem in deep learning. In the last few years, many modern deep learning backbones that achieve strong performance on benchmarks like ImageNet [53] have been proposed. These backbones are diverse, and include 1D sequence models such as the Vision Transformer (ViT) [13], which treats images as sequences of patches, and 2D and 3D models that use local convolutions over images and videos (ConvNets) [15, 24, 25, 32, 37, 42, 49, 56, 58, 60, 64].
A commonality among modern vision models capable of achieving state-of-the-art (SotA) performance is that they treat visual data as discrete pixels rather than continuous signals. However, images and videos are discretizations of multidimensional and naturally continuous signals, sampled at a fixed rate in the spatial and temporal dimensions.
[Figure 1 schematic: a multidimensional input u(t) is mapped to an output y(t) through a continuous kernel K(t); panels show the S4ND layer, kernel discretization (step size ∆), and S4 bandlimiting. Caption below.]
Figure 1: (S4ND.) (Parameters in red.) (Top) S4ND can be viewed as a depthwise convolution that maps a multidimensional input (black) to output (green) through a continuous convolution kernel (blue). (Bottom Left) The kernel can be interpreted as a linear combination (controlled by C) of basis functions (controlled by A, B) with flexible width (controlled by step size ∆). For structured C, the kernel can further be factored as a low-rank tensor product of 1D kernels, and can be interpreted as independent S4 transformations on each dimension. (Bottom Right) Choosing A, B appropriately yields Fourier basis functions with controllable frequencies. To avoid aliasing in the final discrete kernels, the coefficients of C corresponding to high frequencies can simply be masked out.
Ideally, we would want approaches that are capable of recognizing this distinction between data and signal, and directly model the underlying continuous signals. This would give them capabilities such as adapting the model to data sampled at different resolutions.
A natural approach to building such models is to parameterize and learn continuous convolutional kernels, which can then be sampled differently for data at different resolutions [16, 20, 21, 50, 54]. Among these, deep state space models (SSMs) [20], in particular S4 [21], have achieved SotA results in modeling sequence data derived from continuous signals, such as audio [17]. However, a key limitation of SSMs is that they were developed for 1D signals, and cannot directly be applied to visual data derived from multidimensional ("ND") signals. Given that 1D SSMs outperform other continuous modeling solutions for sequence data [21], and have had preliminary success on image [21] and video classification [31], we hypothesize that they may be well suited to modeling visual data when appropriately generalized to the setting of multidimensional signals.
Our main contribution is S4ND, a new deep learning layer that extends S4 to multidimensional signals. The
key idea is to turn the standard SSM (a 1D ODE) into a multidimensional PDE governed by an independent
SSM per dimension. By adding additional structure to this ND SSM, we show that it is equivalent to an
ND continuous convolution that can be factored into a separate 1D SSM convolution per dimension. This
results in a model that is efficient and easy to implement, using the standard 1D S4 layer as a black box.
Furthermore, it can be controlled by S4's parameterization, allowing it to model both long-range dependencies and finite windows with a learnable window size that generalize conventional local convolutions [22].
We show that S4ND can be used as a drop-in replacement in strong modern vision architectures while matching or improving performance in 1D, 2D, and 3D. With minimal change to the training procedure, replacing the self-attention in ViT with S4-1D improves top-1 accuracy by 1.5%, and replacing the convolution layers in a 2D ConvNeXt backbone [42] with S4-2D preserves its performance on ImageNet-1k [10]. Simply inflating (temporally) this pretrained S4-2D-ConvNeXt backbone to 3D improves video activity classification results on HMDB-51 [38] by 4 points over the pretrained ConvNeXt baseline. Notably, we use S4ND as global kernels that span the entire input shape, which enables it to have global context (both spatially and temporally) in every layer of a network.
Additionally, we propose a low-pass bandlimiting modification to S4 that encourages the learned convolutional kernels to be smooth. While S4ND can be used at any resolution, performance suffers when moving between resolutions due to aliasing artifacts in the kernel, an issue also noted by prior work on continuous models [50].
While S4 was capable of transferring between different resolutions on audio data [21], visual data presents a greater challenge due to the scale-invariant properties of images in space and time [52], as sampled images with more distant objects are more likely to contain power at frequencies above the Nyquist cutoff frequency. Motivated by this, we propose a simple criterion that masks out frequencies in the S4ND kernel that lie above the Nyquist cutoff frequency.
The continuous-signal modeling capabilities of S4ND open the door to new training recipes, such as the ability to train and test at different resolutions. On the standard CIFAR-10 [36] and Celeb-A [43] datasets, S4ND degrades by as little as 1.3% when upsampling from low- to high-resolution data (e.g., 128×128 → 160×160), and can be used to facilitate progressive resizing to speed up training by 22% with a 1% drop in final accuracy compared to training at the high resolution alone. We also validate that our new bandlimiting method is critical to these capabilities, with ablations showing absolute performance degradation of up to 20%+ without it.
2 Related Work
Image Classification.
There is a long line of work in image classification, with much of the 2010s dominated by ConvNet backbones [25, 37, 56, 58, 60]. Recently, Transformer backbones such as ViT [13] have achieved SotA performance on images using self-attention over a sequence of 1D patches [12, 39, 40, 62, 71]. Their scaling behavior in both model and dataset training size is believed to give them an inherent advantage over ConvNets [13], even with minimal inductive bias. Liu et al. [42] introduce ConvNeXt, which modernizes the standard ResNet architecture [25] using modern training techniques, matching the performance of Transformers on image classification. We select a backbone in the 1D and 2D settings, ViT and ConvNeXt, to convert into continuous-signal models by replacing the multi-headed self-attention layers in ViT and the standard Conv2D layers in ConvNeXt with S4ND layers, maintaining their top-1 accuracy on large-scale image classification.
S4 & Video Classification.
To handle the long-range dependencies inherent in videos, [31] used 1D S4 for video classification on the Long-form Video Understanding dataset [67]. They first applied a Transformer to each frame to obtain a sequence of patch embeddings for each video frame independently, followed by a standard 1D S4 to model across the concatenated sequence of patches. This is akin to previous methods that learned spatial and temporal information separately [33], for example using ConvNets on single frames, followed by an LSTM [27] to aggregate temporal information. In contrast, modern video architectures such as 3D ConvNets and Transformers [1, 2, 15, 24, 32, 35, 41, 49, 64, 67] show stronger results when learning spatiotemporal features simultaneously, which the generalization of S4ND to multiple dimensions now enables us to do.
Continuous-signal Models.
Visual data are discretizations of naturally continuous signals that possess extensive structure in the joint distribution of spatial frequencies, including the properties of scale and translation invariance. For example, an object in an image generates correlations between lower and higher frequencies that arise in part from phase alignment at edges [47]. As an object changes distance in the image, these correlations remain the same but the frequencies shift. This relationship can potentially be learned from a coarsely sampled image and then applied at higher frequencies at higher resolution.

A number of continuous-signal models have been proposed for the visual domain to learn these inductive biases, and have led to additional desirable properties and capabilities. A classic example of continuous-signal driven processing is the fast Fourier transform, which is routinely used for filtering and data consistency in computational and medical imaging [11]. NeRF represents a static scene as a continuous function, allowing it to render scenes smoothly from multiple viewpoints [45]. CKConv [51] learns a continuous representation to create kernels of arbitrary size for several data types including images, with additional benefits such as the ability to handle irregularly sampled data. FlexConv [50] extends this work with a learned kernel size, and shows that images can be trained at low resolution and tested at high resolution if the aliasing problem is addressed. S4 [21] increased the ability to model long-range dependencies using continuous kernels, allowing SSMs to achieve SotA on sequential CIFAR [36]. However, these methods, including 1D S4, have been applied to relatively low-dimensional data, e.g., time series and small image datasets. S4ND is the first continuous-signal model applied to high-dimensional visual data with the ability to maintain SotA performance on large-scale image and video classification.
Progressive Resizing.
Training times for large-scale image classification can be quite long, a trend exacerbated by the emergence of foundation models [4]. A number of strategies have emerged for reducing overall training time. Fix-Res [61] trains entirely at a lower resolution, then fine-tunes at the higher test resolution, speeding up training in a two-stage process. Mix-and-Match [28] randomly samples low and high resolutions during training in an interleaved manner. An effective method to reduce training time on images is progressive resizing: training at a lower resolution and gradually upsampling in stages. For instance, fastai [14] used progressive resizing to train an ImageNet model in under 4 hours. EfficientNetV2 [60] coupled resizing with a progressive regularization schedule, increasing the regularization as resolution grows to maintain accuracy. In EfficientNetV2 and the other approaches described, the models eventually train on the final test resolution. As a continuous-signal model, we demonstrate that S4ND is naturally suited to progressive resizing, while being able to generalize to unseen resolutions at test time.
3 Preliminaries
State space models.
S4 investigated state space models, which are linear time-invariant systems that map signals $u(t) \mapsto y(t)$ and can be represented either as a linear ODE (equation (1)) or as a convolution (equation (2)). Its parameters are $A \in \mathbb{C}^{N \times N}$ and $B, C \in \mathbb{C}^{N}$ for a state size $N$.

$$x'(t) = A x(t) + B u(t), \qquad y(t) = C x(t) \tag{1}$$

$$K(t) = C e^{tA} B, \qquad y(t) = (K \ast u)(t) \tag{2}$$
Basis functions.
For the clearest intuition, we think of the convolution kernel as a linear combination (controlled by $C$) of basis kernels $K_n(t)$ (controlled by $A, B$):

$$K(t) = \sum_{n=0}^{N-1} C_n K_n(t), \qquad K_n(t) = (e^{tA} B)_n \tag{3}$$
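As a concrete illustration of (2) and (3), the continuous kernel can be materialized by sampling the matrix exponential on a time grid. The sketch below uses a random stable $A$ purely as a stand-in; S4 instead prescribes particular $(A, B)$ matrices (see "S4" below).

```python
# A minimal sketch of equations (2)-(3). The random stable A here is a
# stand-in for illustration only; S4 prescribes special (A, B) matrices.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
N = 4                                             # state size
A = rng.standard_normal((N, N)) - 2 * np.eye(N)   # shifted toward stability
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))

def basis_kernels(t):
    """The N basis kernels K_n(t) = (e^{tA} B)_n at time t."""
    return expm(t * A) @ B                        # shape (N, 1)

def kernel(t):
    """K(t) = sum_n C_n K_n(t) = C e^{tA} B."""
    return (C @ basis_kernels(t)).item()

ts = np.linspace(0.0, 1.0, 100)                   # sample the continuous kernel
K = np.array([kernel(t) for t in ts])
```

Changing $C$ reweights the same basis kernels, which is the sense in which $C$ controls the kernel while $(A, B)$ control the basis.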
Discretization.
The SSM (1) is defined over a continuous-time axis and produces continuous-time convolution kernels (2), (3). Given a discrete input sequence $u_0, u_1, \dots$ sampled uniformly from an underlying signal $u(t)$ at a step size $\Delta$ (i.e., $u_k = u(k\Delta)$), the kernel can be sampled to match the rate of the input. Note that instead of directly sampling the kernel, standard discretization rules should be applied to minimize the error from the discrete to the continuous-time kernel [21]. For inputs given at different resolutions, the model can then simply change its $\Delta$ value to compute the kernel at different resolutions.

We note that the step size $\Delta$ does not have to be exactly equal to a "true sampling rate" of the underlying signal; only the relative rate matters. Concretely, the discrete-time kernel depends only on the products $\Delta A$ and $\Delta B$, and S4 learns separate parameters $\Delta, A, B$.
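As a sketch of this step, the bilinear (Tustin) discretization rule used by S4 can be applied to sample a discrete kernel at any step size; the parameters below are random stand-ins, and the handling of $\Delta$ is the point.

```python
# A sketch of discretization and resolution change via the bilinear rule.
# A, B, C are random stand-ins for the continuous parameters of (1).
import numpy as np

rng = np.random.default_rng(0)
N = 4
A = rng.standard_normal((N, N)) - 2 * np.eye(N)
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))

def discrete_kernel(A, B, C, dt, L):
    """Sample an L-step discrete kernel at step size dt (bilinear rule)."""
    I = np.eye(A.shape[0])
    Ab = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)  # discrete-time A
    Bb = np.linalg.solve(I - dt / 2 * A, dt * B)          # discrete-time B
    K, x = [], Bb
    for _ in range(L):
        K.append((C @ x).item())                          # K_k = C Ab^k Bb
        x = Ab @ x
    return np.array(K)

# Only the relative rate matters: to run the same model on input sampled
# at twice the resolution, halve dt and double L -- no retraining needed.
K_lo = discrete_kernel(A, B, C, dt=1 / 32, L=32)
K_hi = discrete_kernel(A, B, C, dt=1 / 64, L=64)
```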
S4.
S4 is a special SSM with prescribed $(A, B)$ matrices that define well-behaved basis functions, and an algorithm that allows the convolution kernel to be computed efficiently. Variants of S4 exist that define different basis functions, such as simple diagonal SSMs [23], or one that defines truncated Fourier functions $K_n(t) = \sin(2\pi n t)\,\mathbb{1}_{[0,1]}(t)$ [22] (Fig. 1). These versions of S4 have easy-to-interpret basis functions that will allow us to control the frequencies in the kernel (Section 4.2).
4 Method
We describe the proposed S4ND model for the 2D case only, for ease of notation and presentation. The results
extend readily to general dimensions; full statements and proofs for the general case are in Appendix A.
Section 4.1 describes the multidimensional S4ND layer, and Section 4.2 describes our simple modification to
restrict frequencies in the kernels. Fig. 1 illustrates the complete S4ND layer.
4.1 S4ND
We begin by generalizing the (linear time-invariant) SSM (1) to higher dimensions. Notationally, we denote the individual time axes with superscripts in parentheses. Let $u = u(t^{(1)}, t^{(2)})$ and $y = y(t^{(1)}, t^{(2)})$ be the input and output, which are signals $\mathbb{R}^2 \to \mathbb{R}$, and let $x = (x^{(1)}(t^{(1)}, t^{(2)}),\, x^{(2)}(t^{(1)}, t^{(2)}))$ be the SSM state of dimension $N^{(1)} \times N^{(2)}$, where $x^{(\tau)} : \mathbb{R}^2 \to \mathbb{C}^{N^{(\tau)}}$.
Definition 1 (Multidimensional SSM). Given parameters $A^{(\tau)} \in \mathbb{C}^{N^{(\tau)} \times N^{(\tau)}}$, $B^{(\tau)} \in \mathbb{C}^{N^{(\tau)} \times 1}$, $C \in \mathbb{C}^{N^{(1)} \times N^{(2)}}$, the 2D SSM is the map $u \mapsto y$ defined by the linear PDE with initial condition $x(0, 0) = 0$:

$$\begin{aligned}
\frac{\partial}{\partial t^{(1)}} x(t^{(1)}, t^{(2)}) &= (A^{(1)} x^{(1)}(t^{(1)}, t^{(2)}),\; x^{(2)}(t^{(1)}, t^{(2)})) + B^{(1)} u(t^{(1)}, t^{(2)}) \\
\frac{\partial}{\partial t^{(2)}} x(t^{(1)}, t^{(2)}) &= (x^{(1)}(t^{(1)}, t^{(2)}),\; A^{(2)} x^{(2)}(t^{(1)}, t^{(2)})) + B^{(2)} u(t^{(1)}, t^{(2)}) \\
y(t^{(1)}, t^{(2)}) &= \langle C,\; x(t^{(1)}, t^{(2)}) \rangle
\end{aligned} \tag{4}$$
Note that Definition 1 differs from the usual notion of multidimensional SSM, which is simply a map from $u(t) \in \mathbb{R}^{n} \mapsto y(t) \in \mathbb{R}^{m}$ for higher-dimensional $n, m > 1$ but still with a single time axis. However, Definition 1 is a map from $u(t^{(1)}, t^{(2)}) \in \mathbb{R} \mapsto y(t^{(1)}, t^{(2)}) \in \mathbb{R}$ for scalar inputs/outputs but over multiple time axes. When thinking of the input $u(t^{(1)}, t^{(2)})$ as a function over a 2D grid, Definition 1 can be thought of as a simple linear PDE that just runs a standard 1D SSM over each axis independently.
Analogous to equation (2), the 2D SSM can also be viewed as a multidimensional convolution.

Theorem 1. (4) is a time-invariant system that is equivalent to a 2D convolution $y = K \ast u$ by the kernel

$$K(t^{(1)}, t^{(2)}) = \langle C,\; (e^{t^{(1)} A^{(1)}} B^{(1)}) \otimes (e^{t^{(2)} A^{(2)}} B^{(2)}) \rangle \tag{5}$$

This kernel is a linear combination of the $N^{(1)} \times N^{(2)}$ basis kernels $\{ K^{(1)}_{n^{(1)}}(t^{(1)}) \, K^{(2)}_{n^{(2)}}(t^{(2)}) : n^{(1)} \in [N^{(1)}],\, n^{(2)} \in [N^{(2)}] \}$, where $K^{(\tau)}$ are the standard 1D SSM kernels (3) for each axis.
However, a limitation of this general form is that the number of basis functions $N^{(1)} \times N^{(2)} \times \cdots$ grows exponentially in the dimension, increasing the parameter count (of $C$) and overall computation dramatically. This can be mitigated by factoring $C$ as a low-rank tensor.
Corollary 4.1. Suppose that $C \in \mathbb{C}^{N^{(1)} \times N^{(2)}}$ is a low-rank tensor $C = \sum_{i=1}^{r} C^{(1)}_i \otimes C^{(2)}_i$, where each $C^{(\tau)}_i \in \mathbb{C}^{N^{(\tau)}}$. Then the kernel (5) also factors as a tensor product of 1D kernels:

$$K(t^{(1)}, t^{(2)}) = \sum_{i=1}^{r} K^{(1)}_i(t^{(1)}) \, K^{(2)}_i(t^{(2)}) := \sum_{i=1}^{r} (C^{(1)}_i e^{t^{(1)} A^{(1)}} B^{(1)}) (C^{(2)}_i e^{t^{(2)} A^{(2)}} B^{(2)})$$
In our experiments, we choose $C$ as a rank-1 tensor, but the rank can be freely adjusted to trade off parameters and computation for expressivity. Using the equivalence between (1) and (2), Corollary 4.1 also has the simple interpretation of defining an independent 1D SSM along each axis of the multidimensional input; a small numerical sketch of this rank-1 case follows.
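For intuition, the rank-1 case can be sketched numerically: two 1D kernels are combined by an outer product and applied as a global depthwise convolution via the 2D FFT. The kernels below are random placeholders standing in for the sampled per-axis S4 kernels (e.g., from the discretization sketch above), and the circular boundary handling is a simplification rather than the exact implementation.

```python
# A sketch of Corollary 4.1 with rank r = 1. k1 and k2 are placeholders
# for two independently parameterized 1D S4 kernels sampled along each axis.
import numpy as np

rng = np.random.default_rng(0)
H, W = 32, 32
u = rng.standard_normal((H, W))        # one input channel on a 2D grid
k1 = rng.standard_normal(H)            # 1D kernel along axis 1 (placeholder)
k2 = rng.standard_normal(W)            # 1D kernel along axis 2 (placeholder)
K2d = np.outer(k1, k2)                 # K(t1, t2) = K1(t1) * K2(t2)

# Global depthwise (circular) convolution in O(HW log HW) via the 2D FFT.
# Equivalently: convolve along each axis independently with its 1D kernel.
y = np.fft.ifft2(np.fft.fft2(u) * np.fft.fft2(K2d)).real
```

Because the 2D kernel is never stored beyond this outer product, the layer's parameters and kernel computation stay linear in the number of axes rather than exponential.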
4.2 Resolution Change and Bandlimiting
SSMs in 1D have shown strong performance in the audio domain, and can nearly preserve full accuracy when tested zero-shot on inputs sampled at very different frequencies [21]. This capability relies simply on scaling $\Delta$ by the relative change in frequencies (i.e., if the input resolution is doubled, halve the SSM's $\Delta$ parameter). However, sampling rates in the spatial domain are often much lower than in the temporal domain, leading to potential aliasing when changing resolutions. A standard technique to avoid aliasing is to apply a low-pass filter to remove frequencies above the Nyquist cutoff frequency.
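A minimal sketch of this masking, assuming the truncated Fourier basis of Section 3 and an exact Nyquist cutoff (the cutoff used in practice may be a tunable fraction of it), is:

```python
# A sketch of bandlimiting with the truncated Fourier basis
# K_n(t) = sin(2*pi*n*t). The exact-Nyquist cutoff below is illustrative.
import numpy as np

N, L = 64, 32                          # number of basis functions, kernel length
dt = 1.0 / L                           # step size at this resolution
ts = dt * np.arange(L)
freqs = np.arange(N)                   # basis function n oscillates n times on (0, 1)
basis = np.sin(2 * np.pi * freqs[:, None] * ts)      # (N, L) sampled basis

coef = np.random.default_rng(0).standard_normal(N)   # learned C (stand-in)
nyquist = 1.0 / (2 * dt)               # highest frequency representable at dt
mask = freqs < nyquist                 # drop coefficients above the cutoff
K = (coef * mask) @ basis              # bandlimited discrete kernel, shape (L,)
```

Because the mask depends only on $\Delta$, the same learned coefficients yield a properly bandlimited kernel at every resolution.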