While S4 was capable of transferring between different resolutions on audio data [21], visual data presents a greater challenge due to the scale-invariant properties of images in space and time [52], as sampled images with more distant objects are more likely to contain power at frequencies above the Nyquist cutoff frequency.
Motivated by this, we propose a simple criterion that masks out frequencies in the S4ND kernel that lie above the Nyquist cutoff frequency.
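To make the criterion concrete, the following is a minimal sketch of bandlimiting a materialized 2D kernel in the frequency domain; the function name `bandlimit_kernel` and the `cutoff` hyperparameter (the fraction of the Nyquist frequency to retain) are illustrative assumptions, not the exact implementation.

```python
import torch

def bandlimit_kernel(kernel: torch.Tensor, cutoff: float = 1.0) -> torch.Tensor:
    # A sketch only: zero out frequency components of a materialized 2D
    # kernel that lie above a fraction `cutoff` of the Nyquist frequency
    # (0.5 cycles per sample along each axis).
    K = torch.fft.fft2(kernel)
    fy = torch.fft.fftfreq(kernel.shape[-2], device=kernel.device)[:, None]
    fx = torch.fft.fftfreq(kernel.shape[-1], device=kernel.device)[None, :]
    keep = (fy.abs() <= 0.5 * cutoff) & (fx.abs() <= 0.5 * cutoff)
    return torch.fft.ifft2(torch.where(keep, K, torch.zeros_like(K))).real
```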
The continuous-signal modeling capabilities of S4ND open the door to new training recipes, such as the ability to train and test at different resolutions. On the standard CIFAR-10 [36] and Celeb-A [43] datasets, S4ND degrades by as little as 1.3% when upsampling from low- to high-resolution data (e.g. 128×128 → 160×160), and can be used to facilitate progressive resizing to speed up training by 22% with ∼1% drop in final accuracy
compared to training at the high resolution alone. We also validate that our new bandlimiting method is critical
to these capabilities, with ablations showing absolute performance degradations of 20% or more without it.
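To illustrate the recipe, here is a hypothetical sketch of progressive resizing in PyTorch; the two-stage schedule, the resolutions, and the training-loop details are illustrative assumptions, not the exact setup used in our experiments.

```python
import torch.nn.functional as F

def train_progressive(model, loader, optimizer, loss_fn,
                      schedule=((128, 80), (160, 20))):
    # Hypothetical (resolution, epochs) schedule: most epochs run at low
    # resolution, followed by a short stage at the target resolution. A
    # continuous-signal model keeps the same kernels across stages.
    for resolution, epochs in schedule:
        for _ in range(epochs):
            for images, labels in loader:
                # Resample each (B, C, H, W) batch to the stage resolution.
                x = F.interpolate(images, size=(resolution, resolution),
                                  mode="bilinear", align_corners=False)
                optimizer.zero_grad()
                loss_fn(model(x), labels).backward()
                optimizer.step()
```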
2 Related Work
Image Classification.
There is a long line of work in image classification, with much of the 2010s dominated by ConvNet backbones [25, 37, 56, 58, 60]. Recently, Transformer backbones, such as ViT [13], have achieved SotA performance on images using self-attention over a sequence of 1D patches [12, 39, 40, 62, 71]. Their scaling behavior in both model size and training dataset size is believed to give them an inherent advantage over ConvNets [13], even with minimal inductive bias. Liu et al. [42] introduce ConvNeXt, which modernizes the standard ResNet architecture [25] using modern training techniques, matching the performance of Transformers on image classification. We select one backbone in each of the 1D and 2D settings, ViT and ConvNeXt respectively, and convert them into continuous-signal models by replacing the multi-headed self-attention layers in ViT and the standard Conv2D layers in ConvNeXt with S4ND layers, maintaining their top-1 accuracy on large-scale image classification.
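As a sketch of this conversion in the 2D setting, the snippet below recursively swaps ConvNeXt's depthwise Conv2d layers (its spatial-mixing layers) for S4ND layers; `make_s4nd_layer` is a hypothetical factory standing in for the S4ND module defined later, and targeting the depthwise convolutions specifically is an assumption of the sketch.

```python
import torch.nn as nn

def convert_convnext(module: nn.Module, make_s4nd_layer) -> None:
    # Recursively replace depthwise Conv2d layers with S4ND layers that
    # consume the same (B, C, H, W) activations. `make_s4nd_layer` is a
    # hypothetical factory returning an nn.Module for a given channel count.
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d) and child.groups == child.in_channels:
            setattr(module, name, make_s4nd_layer(child.in_channels))
        else:
            convert_convnext(child, make_s4nd_layer)
```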
S4 & Video Classification.
To handle the long-range dependencies inherent in videos, [31] used 1D S4 for video classification on the Long-form Video Understanding dataset [67]. They first applied a Transformer to each frame independently to obtain a sequence of patch embeddings per frame, followed by a standard 1D S4 to model across the concatenated sequence of patches. This is akin to previous methods that learned spatial and temporal information separately [33], for example using ConvNets on single frames followed by an LSTM [27] to aggregate temporal information. In contrast, modern video architectures such as 3D ConvNets and Transformers [1, 2, 15, 24, 32, 35, 41, 49, 64, 67] show stronger results when learning spatiotemporal features simultaneously, which the generalization of S4ND to multiple dimensions now enables us to do.
Continuous-signal Models.
Visual data are discretizations of naturally continuous signals that possess
extensive structure in the joint distribution of spatial frequencies, including the properties of scale and
translation invariance. For example, an object in an image generates correlations between lower and higher
frequencies that arise in part from phase alignment at edges [47]. As an object changes distance in the image, these correlations remain the same but the frequencies shift. This relationship can potentially be learned from a coarsely sampled image and then applied at higher frequencies at higher resolutions.
A number of continuous-signal models have been proposed for the visual domain to learn these inductive
biases, and have led to additional desirable properties and capabilities. A classic example of continuous-signal
driven processing is the fast Fourier transform, which is routinely used for filtering and data consistency in
computational and medical imaging [11]. NeRF represents a static scene as a continuous function, allowing scenes to be rendered smoothly from multiple viewpoints [45]. CKConv [51] learns a continuous representation to create kernels of arbitrary size for several data types, including images, with additional benefits such as the ability to handle irregularly sampled data. FlexConv [50] extends this work with a learned kernel size, and shows that images can be trained at low resolution and tested at high resolution if the aliasing problem is addressed. S4 [21] improved the ability to model long-range dependencies using continuous kernels, allowing SSMs to achieve SotA on sequential CIFAR [36]. However, these methods, including 1D S4, have been applied only to relatively low-dimensional data, e.g., time series and small image datasets. S4ND is the first continuous-signal model applied to high-dimensional visual data with the ability to maintain SotA performance on large-scale image and video classification.