What Makes Convolutional Models Great on Long Sequence
Modeling?
Yuhong Li1, Tianle Cai2, Yi Zhang3, Deming Chen1, and Debadeepta Dey3
1University of Illinois Urbana-Champaign
2Princeton University
3Microsoft Research
October 18, 2022
Abstract
Convolutional models have been widely used in multiple domains. However, most existing models only use local convolution, making the model unable to handle long-range dependency efficiently. Attention overcomes this problem by aggregating global information based on the pair-wise attention score but also makes the computational complexity quadratic to the sequence length. Recently, Gu et al. [2021a] proposed a model called S4 inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. With the Fast Fourier Transform, S4 can model much longer sequences than Transformers and achieve significant gains over SoTA on several long-range tasks. Despite its empirical success, S4 is involved: it requires sophisticated parameterization and initialization schemes that combine the wisdom from several prior works. As a result, S4 is less intuitive and hard to use for researchers with limited prior knowledge. Here we aim to demystify S4 and extract basic principles that contribute to the success of S4 as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to make up an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient in the sense that the number of parameters should scale sub-linearly with sequence length. 2) The kernel needs to satisfy a decaying structure in which the weights for convolving with closer neighbors are larger than those for more distant ones. Based on the two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance over several tasks: 1) With faster speed, SGConv surpasses S4 on Long Range Arena and Speech Command datasets. 2) When plugged into standard language and vision models, SGConv shows the potential to improve both efficiency and performance. Code is available at https://github.com/ctlllll/SGConv.
Equal contribution. Work done during an internship at Microsoft Research.
1 Introduction
Handling Long-Range Dependency (LRD) is a key challenge in long-sequence modeling tasks such as time-series forecasting, language modeling, and pixel-level image generation. Unfortunately, standard deep learning models fail to solve this problem for different reasons: the Recurrent Neural Network (RNN) suffers from vanishing gradients, the Transformer has complexity quadratic in the sequence length, and the Convolutional Neural Network (CNN) usually only has a local receptive field in each layer.
A recently proposed benchmark called Long-Range Arena (LRA) [Tay et al., 2020b] reveals
that all existing models perform poorly in modeling LRD. Notably, on one spatial-level sequence
modeling task called Pathfinder-X from LRA, all models fail except a new Structured State Space
sequence model (S4) [Gu et al., 2021a]. The S4 model is inspired by the state space model widely
used in control theory and can be computed efficiently with a special parameterization based on
the Cauchy kernel. The exact implementation of the S4 model can be viewed as a (depthwise) global convolutional model with an involved computation of the global convolution kernel. Thanks to the global receptive field of the convolution kernel, S4 is able to handle tasks that require LRD,
such as Pathfinder [Tay et al., 2020b], where classic local CNNs fail [Linsley et al., 2018, Kim
et al., 2019]. Also, the use of Fast Fourier Transform (FFT) and techniques from numerical
linear algebra make the computational complexity of S4 tractable compared to the quadratic
complexity of attention. Together, S4 shows the potential of global convolutional models to
model LRD and advances the SoTA on LRA.
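To make the global-convolution view concrete, the sketch below computes a depthwise global convolution with the FFT in O(L log L) time. It is a minimal illustration of the general technique, not S4's actual kernel computation; the function name, shapes, and zero-padding scheme are our own choices for this example.

```python
import numpy as np

def fft_global_conv(u, k):
    """u: inputs of shape (H, L); k: per-channel kernels of shape (H, L)."""
    _, L = u.shape
    n = 2 * L                                  # zero-pad to avoid circular wrap-around
    u_f = np.fft.rfft(u, n=n, axis=-1)
    k_f = np.fft.rfft(k, n=n, axis=-1)
    y = np.fft.irfft(u_f * k_f, n=n, axis=-1)  # full linear convolution
    return y[:, :L]                            # keep the first L (causal) outputs

u = np.random.randn(4, 1024)   # 4 channels, sequence length 1024
k = np.random.randn(4, 1024)   # one global kernel per channel
print(fft_global_conv(u, k).shape)  # (4, 1024)
```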
Despite its accomplishments, the delicate design of S4 makes it unfriendly even to knowledgeable researchers. In particular, the empirical success of S4 relies on 1) a Diagonal Plus Low Rank (DPLR) parameterization whose efficient implementation requires several numerical linear algebra tricks, and 2) an initialization scheme based on the HiPPO matrix derived in prior work [Gu et al., 2020]. Therefore, aiming to reduce the complications of the model and highlight minimal principles, we raise the following questions:
What contributes to the success of the S4 model? Can we establish a simpler model based on
minimal principles to handle long-range dependency?
To answer these questions, we focus on the design of the global convolution kernel. We extract two simple and intuitive principles that contribute to the success of the S4 kernel. The first principle is that the parameterization of the global convolution kernel should be efficient in terms of the sequence length: the number of parameters should scale slowly with the sequence length. For example, classic CNNs use a fixed kernel size. S4 also uses a fixed number of parameters to compute the convolution kernel, though the number is greater than in classic CNNs. Both models satisfy the first principle, as the number of parameters does not scale with the input length. The efficiency of parameterization is also necessary because the naive implementation of a global convolution kernel whose size equals the sequence length is intractable for inputs with thousands of tokens. Too many parameters will also cause overfitting, thus hurting performance. The second principle is the decaying structure of the convolution kernel, meaning that the weights for convolving with closer neighbors are larger than those for more distant ones. This structure appears ubiquitously in signal processing, with the well-known Gaussian filter as an example. The intuition is clear: closer neighbors provide a more helpful signal. S4 inherently enjoys
this decaying property because of the exponential decay of the spectrum of matrix powers (See
Figure 2), and we find this inductive bias improves the model performance (See Section 4.1.2).
Figure 1: Illustration of the parameterization used in SGConv (Eq. (1)). The convolution kernel is composed of multi-scale sub-kernels. Parameterization Efficiency. Every larger sub-kernel doubles the size of the previous sub-kernel while the same number of parameters are used for every scale, ensuring a logarithmic dependency of the number of parameters to the input length. Decaying. We use a weighted combination of sub-kernels where the weights are decaying, and smaller weights are assigned to larger scales.
We show that these two principles are sufficient for designing a global convolutional model that captures LRD well. To verify this, we introduce a class of global convolution kernels with a simple multiscale structure, as shown in Figure 1. Specifically, we compose the convolution kernel from a sequence of sub-kernels of increasing sizes, yet every sub-kernel is upsampled from the same number of parameters. This parameterization ensures that the number of parameters scales only logarithmically with the input length, which satisfies the first principle. In addition, we add a decaying weight to each scale during the combination step to fulfill the second principle. We name our method Structured Global Convolution kernels (SGConv). Empirically, SGConv improves over S4 by more than 1% and achieves SoTA results on the LRA benchmark. On the Speech Command datasets, SGConv achieves comparable results in the ten-class classification task and significantly better results than the previous SoTA in the 35-class classification task. We further show that SGConv is more efficient than S4 and can be used as a general-purpose module in different domains. For example, a hybrid model of classic attention and SGConv shows promising performance on both autoregressive language modeling and sentence classification tasks. Moreover, on a large-scale image classification task, replacing the 2D convolution kernel of the ConvNeXt model with 1D SGConv matches the performance of the original model.
2 Related Work
Efficient Transformers. The Transformer architecture [Vaswani et al., 2017] has been successful across a wide range of applications in machine learning. However, the computation and memory complexity of the Transformer scales quadratically with the input length, making it intractable for modeling long-range interactions in very long sequences. Therefore, several efficient variants of the Transformer model have been proposed recently to overcome this issue [Child et al., 2019, Wang et al., 2020, Kitaev et al., 2019, Zaheer et al., 2020, Tay et al., 2020a, Peng et al., 2021, Qin et al., 2021, Luo et al., 2021]. Nevertheless, few of these methods perform well on benchmarks such as Long Range Arena [Tay et al., 2020b] and SCROLLS [Shaham et al., 2022], which require long-range modeling ability.
(Re-)parameterization. Parameterization is a crucial but underrated part of architecture design because different parameterizations usually provide different inductive biases. For example, weight normalization [Salimans and Kingma, 2016] parameterizes the norm and direction of the weight matrices separately and thus reaches faster convergence. On the other hand, Zagoruyko and Komodakis [2017] proposed a Dirac weight re-parameterization to train deep networks without explicit skip-connections and matched the performance of ResNets [He et al., 2016]. In computer vision, a line of works [Ding et al., 2019, Guo et al., 2020, Ding et al., 2021, Cao et al., 2022] explored using structural re-parameterization to create 2D convolution kernels. However, most of these works are limited to the vision domain and utilize only short-range convolution kernels (e.g., 7×7), with only one exception [Ding et al., 2022], which scales the size of convolution to 31×31 with an optimized CUDA kernel. Our SGConv kernel is a special parameterization of global convolution kernels that tackles LRD and showcases the extensibility of re-parameterized kernels.
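For concreteness, here is a tiny sketch of the weight-normalization re-parameterization mentioned above, which expresses a weight vector through a separate scalar norm g and direction v (w = g · v / ||v||). The snippet illustrates the general idea from Salimans and Kingma [2016]; it is not code from any of the cited works.

```python
import numpy as np

def weight_norm(v, g):
    """Re-parameterize a weight vector by its direction v and a scalar norm g."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])
print(weight_norm(v, g=2.0))  # [1.2 1.6] -> same direction as v, norm equal to g
```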
State Space Models. The state space model (SSM) uses a set of linear differential equa-
tions to model physical systems with input, output, and state variables. It is widely used
in control, neuroscience, and statistics. Recently, Gu et al. [2021b] introduced a deep SSM-
based model that can outperform prior approaches on several long sequence modeling tasks
with a specially structured state transition matrix. However, the expensive computation and
memory requirements make it impractical. A follow-up work of Gu et al. [2021b] proposed a
new parameterization of SSM [Gu et al., 2021a], which decomposes the state transition matrix
into the sum of low-rank and normal matrices and implements SSM as a global convolutional
model. Under this parameterization, the authors then combine the techniques of diagonalizing
the Cauchy kernel and performing low-rank corrections with the Woodbury identity to compute
the global convolution kernel. While achieving promising results, S4 is theoretically involved
and practical implementations of S4 require accelerator-specific dedicated code optimization for
the Cauchy kernel computation. This makes it difficult to readily implement in deep learning
frameworks [Abadi et al., 2016, Chen et al., 2015, Chen, 2021, Ma et al., 2019] and hardware
targets.
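As a hedged illustration of why a linear SSM can be read as a global convolutional model: for a discretized SSM x_{t+1} = A x_t + B u_t, y_t = C x_t with zero initial state (and the feed-through term omitted), unrolling the recurrence yields a convolution of the input with the kernel K = (CB, CAB, CA^2B, ...). The sketch below materializes this kernel for a toy system; it mirrors the general principle rather than S4's structured parameterization.

```python
import numpy as np

def ssm_kernel(A, B, C, L):
    """Materialize the length-L convolution kernel K_j = C A^j B of a linear SSM."""
    K, x = [], B.copy()
    for _ in range(L):
        K.append((C @ x).item())  # K_j = C A^j B (a scalar for 1-D input/output)
        x = A @ x                 # advance from A^j B to A^(j+1) B
    return np.array(K)

n = 4                             # state dimension
A = 0.9 * np.eye(n)               # toy stable transition matrix, rho(A) = 0.9
B = np.ones((n, 1))
C = np.ones((1, n)) / n
print(ssm_kernel(A, B, C, L=8))   # [1.0, 0.9, 0.81, ...]: decays geometrically
```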
3 Design of Global Convolutional Models
We summarize the design principles that enable the global convolutional model to be both
efficient and effective. Then we introduce the proposed Structured Global Convolution (SGConv)
based on the highlighted principles.
3.1 Design Principles
The two intuitive design principles that contribute to the success of S4 are efficient parameterization and decaying structure.
Figure 2: Visualization of S4 kernels on (a) Pathfinder-X and (b) Speech Command 10-class (SC-10). The values in the convolution kernel exhibit a decaying behavior. We only plot the first 4096 positions for better illustration.
Efficient Parameterization. Different from local convolution, where the kernel size is fixed, global convolution requires a kernel size that is the same as the sequence length. Naively parameterizing the convolution kernel as in classic local convolutions is therefore intractable for long sequences. For instance, the Pathfinder-X task has a length of 16K, so a single layer would impractically require 4M parameters to model the depth-wise global convolution kernel with a standard channel size of 256. Thus, an efficient convolution kernel parameterization is necessary, especially when the sequence is extremely long. For example, S4 uses a well-designed Normal Plus Low-Rank (NPLR) parameterization to model the whole kernel with two special matrices, where the number of parameters is fixed.
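To make the counting explicit, the back-of-the-envelope computation below contrasts the naive per-position parameterization with a multiscale parameterization of the kind introduced in Section 3.2. The sub-kernel dimension d = 32 is an assumed value chosen only for illustration.

```python
import math

L, H = 16_384, 256                    # Pathfinder-X length and a standard channel size
naive = L * H                         # one weight per position and channel: ~4M
d = 32                                # assumed sub-kernel dimension (illustrative only)
num_scales = math.ceil(math.log2(L / d)) + 1
multiscale = num_scales * d * H       # d parameters per scale, per channel
print(naive, multiscale)              # 4194304 81920
```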
Decaying Structure. Apart from the efficiency of the parameterization, we find that a decaying structure of the convolution kernel provides a good inductive bias for long-sequence modeling and contributes to the performance (See Section 4.1.2 for a detailed ablation study). Concretely, the magnitude of the values in the convolution kernel should decay so that more weight is assigned to close neighbors. The S4 model inherently satisfies this property because the spectrum of the power of a matrix decays exponentially:
Fact 1. For a square matrix A, the spectral radius satisfies ρ(A^k) ≤ ρ(A)^k. In particular, if ρ(A) < 1, then ρ(A^k) decays exponentially in k.
We can also directly observe the decaying structure of S4 in different tasks in Figure 2.
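As a quick numerical check of Fact 1 (our own illustration, not from the paper), the snippet below rescales a random matrix so that ρ(A) = 0.9 and verifies that ρ(A^k) shrinks geometrically with k.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))
A = 0.9 * A / np.max(np.abs(np.linalg.eigvals(A)))   # rescale so that rho(A) = 0.9

for k in [1, 2, 4, 8, 16]:
    rho_k = np.max(np.abs(np.linalg.eigvals(np.linalg.matrix_power(A, k))))
    print(k, rho_k, 0.9 ** k)  # rho(A^k) matches rho(A)^k, shrinking geometrically
```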
3.2 SGConv
Putting the two principles together, we propose a simple global depth-wise convolution, dubbed Structured Global Convolution (SGConv), based on multiscale sub-kernels and weighted combinations (see Figure 1).
Formally, let L be the length of the input sequence. We define the parameter set of a single channel as S = {w_i | 0 ≤ i < ⌈log_2(L/d)⌉ + 1}, where w_i ∈ R^d is the parameter for the i-th sub-kernel.