What Makes Convolutional Models Great on Long Sequence
Modeling?
Yuhong Li1, Tianle Cai2, Yi Zhang3, Deming Chen1, and Debadeepta Dey3
1University of Illinois Urbana-Champaign
2Princeton University
3Microsoft Research
October 18, 2022
Abstract
Convolutional models have been widely used in multiple domains. However, most existing models only use local convolution, making the model unable to handle long-range dependency efficiently. Attention overcomes this problem by aggregating global information based on the pair-wise attention score but also makes the computational complexity quadratic to the sequence length. Recently, Gu et al. [2021a] proposed a model called S4 inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. With the Fast Fourier Transform, S4 can model much longer sequences than Transformers and achieve significant gains over SoTA on several long-range tasks. Despite its empirical success, S4 is involved: it requires sophisticated parameterization and initialization schemes that combine the wisdom from several prior works. As a result, S4 is less intuitive and hard to use for researchers with limited prior knowledge. Here we aim to demystify S4 and extract basic principles that contribute to the success of S4 as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to make up an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient in the sense that the number of parameters should scale sub-linearly with sequence length. 2) The kernel needs to satisfy a decaying structure in which the weights for convolving with closer neighbors are larger than those for more distant ones. Based on the two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance over several tasks: 1) With faster speed, SGConv surpasses S4 on Long Range Arena and Speech Command datasets. 2) When plugged into standard language and vision models, SGConv shows the potential to improve both efficiency and performance. Code is available at https://github.com/ctlllll/SGConv.
Equal contribution. Work done during an internship at Microsoft Research.
1 Introduction
Handling Long-Range Dependency (LRD) is a key challenge in long-sequence modeling tasks such as time-series forecasting, language modeling, and pixel-level image generation. Unfortunately, standard deep learning models fail to solve this problem for different reasons: the Recurrent Neural Network (RNN) suffers from vanishing gradients, the Transformer has complexity quadratic in the sequence length, and the Convolutional Neural Network (CNN) usually only has a local receptive field in each layer.
A recently proposed benchmark called Long-Range Arena (LRA) [Tay et al., 2020b] reveals
that all existing models perform poorly in modeling LRD. Notably, on one spatial-level sequence
modeling task called Pathfinder-X from LRA, all models fail except a new Structured State Space
sequence model (S4) [Gu et al., 2021a]. The S4 model is inspired by the state space model widely
used in control theory and can be computed efficiently with a special parameterization based on
the Cauchy kernel. The exact implementation of the S4 model can be viewed as a (depthwise) global convolutional model with an involved computation of the global convolution kernel. Thanks to the global receptive field of the convolution kernel, S4 is able to handle tasks that require LRD,
such as Pathfinder [Tay et al., 2020b], where classic local CNNs fail [Linsley et al., 2018, Kim
et al., 2019]. Also, the use of Fast Fourier Transform (FFT) and techniques from numerical
linear algebra make the computational complexity of S4 tractable compared to the quadratic
complexity of attention. Together, S4 shows the potential of global convolutional models to
model LRD and advances the SoTA on LRA.
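To make the global-convolution view concrete, the sketch below computes a depthwise global convolution with the FFT in O(L log L) time. It is a minimal illustration of the general technique, not S4's actual kernel computation; the function name, shapes, and zero-padding scheme are our own choices for this example.

```python
import numpy as np

def fft_global_conv(u, k):
    """u: inputs of shape (H, L); k: per-channel kernels of shape (H, L)."""
    _, L = u.shape
    n = 2 * L                                  # zero-pad to avoid circular wrap-around
    u_f = np.fft.rfft(u, n=n, axis=-1)
    k_f = np.fft.rfft(k, n=n, axis=-1)
    y = np.fft.irfft(u_f * k_f, n=n, axis=-1)  # full linear convolution
    return y[:, :L]                            # keep the first L (causal) outputs

u = np.random.randn(4, 1024)   # 4 channels, sequence length 1024
k = np.random.randn(4, 1024)   # one global kernel per channel
print(fft_global_conv(u, k).shape)  # (4, 1024)
```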
Despite its accomplishments, the delicate design of S4 makes it unfriendly even to knowledgeable researchers. In particular, the empirical success of S4 relies on 1) a Diagonal Plus Low Rank (DPLR) parameterization whose efficient implementation requires several numerical linear algebra tricks, and 2) an initialization scheme based on the HiPPO matrix derived in prior work [Gu et al., 2020]. Therefore, aiming to reduce the complications of the model and highlight minimal principles, we raise the following questions:
What contributes to the success of the S4 model? Can we establish a simpler model based on
minimal principles to handle long-range dependency?
To answer these questions, we focus on the design of the global convolution kernel. We extract two simple and intuitive principles that contribute to the success of the S4 kernel. The first principle is that the parameterization of the global convolution kernel should be efficient in terms of the sequence length: the number of parameters should scale slowly with the sequence length. For example, classic CNNs use a fixed kernel size. S4 also uses a fixed number of parameters to compute the convolution kernel, though the number is greater than in classic CNNs. Both models satisfy the first principle, as the number of parameters does not scale with the input length. The efficiency of parameterization is also necessary because the naive implementation of a global convolution kernel whose size equals the sequence length is intractable for inputs with thousands of tokens. Too many parameters will also cause overfitting, thus hurting performance. The second principle is the decaying structure of the convolution kernel, meaning that the weights for convolving with closer neighbors are larger than those for more distant ones. This structure appears ubiquitously in signal processing, with the well-known Gaussian filter as an example. The intuition is clear: closer neighbors provide a more helpful signal. S4 inherently enjoys
this decaying property because of the exponential decay of the spectrum of matrix powers (See
Figure 2), and we find this inductive bias improves the model performance (See Section 4.1.2).
Figure 1: Illustration of the parameterization used in SGConv (Eq. (1)). The convolution kernel is composed of multi-scale sub-kernels. Parameterization Efficiency. Every larger sub-kernel doubles the size of the previous sub-kernel while the same number of parameters are used for every scale, ensuring a logarithmic dependency of the number of parameters to the input length. Decaying. We use a weighted combination of sub-kernels where the weights are decaying, and smaller weights are assigned to larger scales.
We show that these two principles are sufficient for designing a global convolutional model that captures LRD well. To verify this, we introduce a class of global convolution kernels with a simple multiscale structure, as shown in Figure 1. Specifically, we compose the convolution kernel from a sequence of sub-kernels of increasing sizes, yet every sub-kernel is upsampled from the same number of parameters. This parameterization ensures that the number of parameters scales only logarithmically with the input length, which satisfies the first principle. In addition, we add a decaying weight to each scale during the combination step to fulfill the second principle. We name our method Structured Global Convolution kernels (SGConv). Empirically, SGConv improves over S4 by more than 1% and achieves SoTA results on the LRA benchmark. On the Speech Command datasets, SGConv achieves comparable results in the ten-class classification task and significantly better results than the previous SoTA in the 35-class classification task. We further show that SGConv is more efficient than S4 and can be used as a general-purpose module in different domains. For example, a hybrid model of classic attention and SGConv shows promising performance on both autoregressive language modeling and sentence classification tasks. Moreover, on a large-scale image classification task, replacing the 2D convolution kernel of the ConvNeXt model with 1D SGConv matches the performance of the original model.
2 Related Work
Efficient Transformers. The Transformer architecture [Vaswani et al., 2017] has been successful across a wide range of applications in machine learning. However, the computation and memory complexity of the Transformer scales quadratically with the input length, making it intractable for modeling long-range interactions in very long sequences. Therefore, several efficient variants of the Transformer model have been proposed recently to overcome this issue [Child et al., 2019, Wang et al., 2020, Kitaev et al., 2019, Zaheer et al., 2020, Tay et al., 2020a, Peng et al., 2021, Qin et al., 2021, Luo et al., 2021]. Nevertheless, few of these methods perform well on benchmarks such as Long Range Arena [Tay et al., 2020b] and SCROLLS [Shaham et al., 2022], which require long-range modeling ability.
(Re-)parameterization. Parameterization is a crucial but underrated part of architecture design because different parameterizations usually provide different inductive biases. For example, weight normalization [Salimans and Kingma, 2016] parameterizes the norm and direction of the weight matrices separately and thus reaches faster convergence. On the other hand, Zagoruyko and Komodakis [2017] proposed a Dirac weight re-parameterization to train deep networks without explicit skip-connections and matched the performance of ResNets [He et al., 2016]. In computer vision, a line of works [Ding et al., 2019, Guo et al., 2020, Ding et al., 2021, Cao et al., 2022] explored using structural re-parameterization to create 2D convolution kernels. However, most of these works are limited to the vision domain and utilize only short-range convolution kernels (e.g., 7×7), with only one exception [Ding et al., 2022], which scales the size of convolution to 31×31 with an optimized CUDA kernel. Our SGConv kernel is a special parameterization of global convolution kernels that tackles LRD and showcases the extensibility of re-parameterized kernels.
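For concreteness, here is a tiny sketch of the weight-normalization re-parameterization mentioned above, which expresses a weight vector through a separate scalar norm g and direction v (w = g · v / ||v||). The snippet illustrates the general idea from Salimans and Kingma [2016]; it is not code from any of the cited works.

```python
import numpy as np

def weight_norm(v, g):
    """Re-parameterize a weight vector by its direction v and a scalar norm g."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])
print(weight_norm(v, g=2.0))  # [1.2 1.6] -> same direction as v, norm equal to g
```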
State Space Models. The state space model (SSM) uses a set of linear differential equa-
tions to model physical systems with input, output, and state variables. It is widely used
in control, neuroscience, and statistics. Recently, Gu et al. [2021b] introduced a deep SSM-
based model that can outperform prior approaches on several long sequence modeling tasks
with a specially structured state transition matrix. However, the expensive computation and
memory requirements make it impractical. A follow-up work of Gu et al. [2021b] proposed a
new parameterization of SSM [Gu et al., 2021a], which decomposes the state transition matrix
into the sum of low-rank and normal matrices and implements SSM as a global convolutional
model. Under this parameterization, the authors then combine the techniques of diagonalizing
the Cauchy kernel and performing low-rank corrections with the Woodbury identity to compute
the global convolution kernel. While achieving promising results, S4 is theoretically involved
and practical implementations of S4 require accelerator-specific dedicated code optimization for
the Cauchy kernel computation. This makes it difficult to readily implement in deep learning
frameworks [Abadi et al., 2016, Chen et al., 2015, Chen, 2021, Ma et al., 2019] and hardware
targets.
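As a hedged illustration of why a linear SSM can be read as a global convolutional model: for a discretized SSM x_{t+1} = A x_t + B u_t, y_t = C x_t with zero initial state (and the feed-through term omitted), unrolling the recurrence yields a convolution of the input with the kernel K = (CB, CAB, CA^2B, ...). The sketch below materializes this kernel for a toy system; it mirrors the general principle rather than S4's structured parameterization.

```python
import numpy as np

def ssm_kernel(A, B, C, L):
    """Materialize the length-L convolution kernel K_j = C A^j B of a linear SSM."""
    K, x = [], B.copy()
    for _ in range(L):
        K.append((C @ x).item())  # K_j = C A^j B (a scalar for 1-D input/output)
        x = A @ x                 # advance from A^j B to A^(j+1) B
    return np.array(K)

n = 4                             # state dimension
A = 0.9 * np.eye(n)               # toy stable transition matrix, rho(A) = 0.9
B = np.ones((n, 1))
C = np.ones((1, n)) / n
print(ssm_kernel(A, B, C, L=8))   # [1.0, 0.9, 0.81, ...]: decays geometrically
```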
3 Design of Global Convolutional Models
We summarize the design principles that enable the global convolutional model to be both
efficient and effective. Then we introduce the proposed Structured Global Convolution (SGConv)
based on the highlighted principles.
3.1 Design Principles
The two intuitive design principles that contribute to the success of S4 are efficient parameterization and decaying structure.
Figure 2: Visualization of S4 kernels on (a) Pathfinder-X and (b) Speech Command 10-class (SC-10). The values in the convolution kernel exhibit a decaying behavior. We only plot the first 4096 positions for better illustration.
Efficient Parameterization. Different from local convolution, where the kernel size is fixed, global convolution requires a kernel size that is the same as the sequence length. Naively parameterizing the convolution kernel as in classic local convolutions is therefore intractable for long sequences. For instance, the Pathfinder-X task has a length of 16K, so a single layer would impractically require 4M parameters to model the depth-wise global convolution kernel with a standard channel size of 256. Thus, an efficient convolution kernel parameterization is necessary, especially when the sequence is extremely long. For example, S4 uses a well-designed Normal Plus Low-Rank (NPLR) parameterization to model the whole kernel with two special matrices, where the number of parameters is fixed.
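To make the counting explicit, the back-of-the-envelope computation below contrasts the naive per-position parameterization with a multiscale parameterization of the kind introduced in Section 3.2. The sub-kernel dimension d = 32 is an assumed value chosen only for illustration.

```python
import math

L, H = 16_384, 256                    # Pathfinder-X length and a standard channel size
naive = L * H                         # one weight per position and channel: ~4M
d = 32                                # assumed sub-kernel dimension (illustrative only)
num_scales = math.ceil(math.log2(L / d)) + 1
multiscale = num_scales * d * H       # d parameters per scale, per channel
print(naive, multiscale)              # 4194304 81920
```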
Decaying Structure. Apart from the efficiency of the parameterization, we find that a decaying structure of the convolution kernel provides a good inductive bias for long-sequence modeling and contributes to the performance (See Section 4.1.2 for a detailed ablation study). Concretely, the magnitude of the values in the convolution kernel should decay so that more weight is assigned to close neighbors. The S4 model inherently satisfies this property because the spectrum of the power of a matrix decays exponentially:
Fact 1. For a square matrix A, the spectral radius satisfies ρ(A^k) ≤ ρ(A)^k. In particular, if ρ(A) < 1, then ρ(A^k) decays exponentially in k.
We can also directly observe the decaying structure of S4 in different tasks in Figure 2.
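As a quick numerical check of Fact 1 (our own illustration, not from the paper), the snippet below rescales a random matrix so that ρ(A) = 0.9 and verifies that ρ(A^k) shrinks geometrically with k.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))
A = 0.9 * A / np.max(np.abs(np.linalg.eigvals(A)))   # rescale so that rho(A) = 0.9

for k in [1, 2, 4, 8, 16]:
    rho_k = np.max(np.abs(np.linalg.eigvals(np.linalg.matrix_power(A, k))))
    print(k, rho_k, 0.9 ** k)  # rho(A^k) matches rho(A)^k, shrinking geometrically
```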
3.2 SGConv
Putting the two principles together, we propose a simple global depth-wise convolution, dubbed Structured Global Convolution (SGConv), based on multiscale sub-kernels and weighted combinations (see Figure 1).
Formally, let L be the length of the input sequence. We define the parameter set of a single channel as S = {w_i | 0 ≤ i < ⌈log_2(L/d)⌉ + 1}, where w_i ∈ R^d is the parameter for the i-th sub-kernel.