WavSpA: Wavelet Space Attention for Boosting Transformers’ Long Sequence Learning Ability
Yufan Zhuang (UC San Diego), Zihan Wang (UC San Diego), Fangbo Tao (Mindverse), Jingbo Shang (UC San Diego)
Abstract
Transformer and its variants are fundamental neural architectures in deep learning. Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers. We argue that the wavelet transform should be a better choice because it captures both position and frequency information with linear time complexity. Therefore, in this paper, we systematically study the synergy between the wavelet transform and Transformers. We propose Wavelet Space Attention (WavSpA), which facilitates attention learning in a learnable wavelet coefficient space. It replaces the attention in Transformers by (1) applying a forward wavelet transform to project the input sequences onto multi-resolution bases, (2) conducting attention learning in the wavelet coefficient space, and (3) reconstructing the representation in the input space via a backward wavelet transform. Extensive experiments on the Long Range Arena demonstrate that learning attention in the wavelet space, using either fixed or adaptive wavelets, consistently improves Transformer’s performance and also significantly outperforms learning in the Fourier space. We further show our method can enhance Transformer’s reasoning extrapolation capability over distance on the LEGO chain-of-reasoning task.
1 Introduction
Transformer [39] has become one of the most influential neural architectures in deep learning. Large language models such as ChatGPT [26] have reshaped people’s imagination of what an AI model can do in making conversation with humans, solving nontrivial math problems, writing code, and even co-authoring a paper [16]. In image processing, vision transformers have become the backbone for a wide array of applications [9, 29]. Similarly, for source code understanding, Codex [3] can finish people’s code given the helper text of the function or just the function name. All of these accomplishments are built upon the foundational Transformer.
Nevertheless, the effective handling of long sequences remains a challenge for Transformers due
to the intricate relationships that can exist within such sequences. To address this limitation, recent
research has focused on enhancing the Transformers’ long-range capabilities through attention
learning in transformed sequence spaces. One approach involves low-cost token-mixing, which
utilizes forward Fourier transformation to achieve notable accuracy improvements while maintaining
quasi-linear time complexity [18]. However, without incorporating a backward transformation, the model might inadvertently mix information from both the input and transformed spaces. To overcome this limitation, researchers have leveraged the forward and backward Fourier transformations to learn large filters with linear weights [31] and non-linearities [11] for vision tasks, exploiting the equivalence between multiplication in the Fourier space and direct convolution in the input space.
In light of these developments, it is evident that attention learning in transformed sequence spaces
holds significant promise for enhancing the effectiveness of Transformers’ handling of long-range
dependencies. We propose Wavelet Space Attention (WavSpA), which facilitates attention learning in a learnable wavelet coefficient space, as shown in Figure 1(a). Specifically, we first apply a forward wavelet transform to project the input sequence onto multi-resolution bases, then conduct attention (e.g., full attention [39], random feature kernel [30]) in the wavelet coefficient space, and finally reconstruct the representation in the input space via a backward wavelet transform. We implement the transform using the Fast Wavelet Transform (FWT) [22], so both transform steps are linear in time, leading to a small overhead.
Figure 1: An overview of our proposed WavSpA. (a) The only difference between a Transformer block and a WavSpA block is the attention computation; the block comprises Forward Wavelet, Attention, Backward Wavelet, Add & Normalize, and Dense MLP sub-layers. (b) The general flow of computation in WavSpA with learnable forward and backward wavelet transform: the input sequence x (length n) is projected onto multi-resolution orthogonal wavelet bases ψ_{i,j} via an O(n) fast wavelet transform, the attention mechanism operates in the wavelet coefficient space, and the output sequence y (length n) is reconstructed via an O(n) backward fast wavelet transform (different colors in the figure denote different bases; 1+2+4+8=15 bases in total).
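To make the transform step concrete, the following is a minimal sketch (not the authors’ released implementation) of the forward and backward fast wavelet transform round trip using the PyWavelets library; the sequence length, decomposition level, and boundary mode are illustrative assumptions.

```python
# Minimal sketch: forward/backward fast wavelet transform with PyWavelets.
# Sequence length, level, and boundary mode are illustrative assumptions.
import numpy as np
import pywt

n = 1024                                   # sequence length (power of two gives clean dyadic levels)
x = np.random.randn(n)                     # stand-in for one attention-head channel

# Forward FWT: O(n) decomposition into multi-resolution coefficients (coarse to fine).
coeffs = pywt.wavedec(x, "db2", mode="periodization", level=4)
print([len(c) for c in coeffs])            # [64, 64, 128, 256, 512]

# ... in WavSpA, attention would operate on these coefficients ...

# Backward FWT: O(n) reconstruction; exact for a fixed orthogonal wavelet such as Daubechies-2.
x_rec = pywt.waverec(coeffs, "db2", mode="periodization")
print(np.allclose(x, x_rec))               # True: perfect reconstruction
```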
Performing attention on a sequence in a wavelet-transformed space can offer several advantages. Firstly, it can enhance the representation of the input sequence by capturing relevant features and patterns. By applying the transformation, the sequence is mapped to a new space where certain characteristics might be easier to capture. Attention mechanisms can then be applied in this transformed space to effectively weigh these transformed features, leading to improved representation learning. Secondly, it can enable the attention mechanism to capture different types of relationships between the elements of the sequence, such as associative relationships. By operating in the transformed space, attention can effectively capture the underlying structure of the data and reason over it, leading to improved performance on long sequences. Finally, it is orthogonal to existing work that attempts to replace attention, and hence can be combined with any Transformer design.
Besides applying fixed wavelets, we further propose three ways to construct learnable wavelets: direct
wavelet parameterization, orthogonal wavelet parameterization, and wavelet lifting. We give detailed
explanations of the three schemes and discuss their individual advantages and drawbacks.
We conduct extensive experiments on the Long Range Arena (LRA) benchmark to validate and justify our proposed WavSpA. By combining a fixed wavelet space with various representative attention methods, we observe significant performance improvements without introducing additional time complexity. Furthermore, we analyze the performance of WavSpA’s three parameterization schemes when coupled with the attention methods, demonstrating even stronger performance boosts. Additionally, our investigation demonstrates that equipping the Transformer with our proposed WavSpA enhances its reasoning extrapolation capacity, as evidenced by improved performance on the LEGO dataset [47]. These findings highlight the superior long-range understanding capabilities achieved by learning in the wavelet coefficient space compared to the input space or Fourier space.
In summary, our major contributions are as follows.
• We propose WavSpA to facilitate learning in the wavelet space following a forward-backward paradigm, which can be paired with various attention methods and boost their long-range understanding capabilities.
• We further propose three adaptive wavelet parameterization schemes (AdaWavSpA, OrthoWavSpA, LiftWavSpA) to maximize the flexibility of the wavelet transformation.
• Extensive experiments on the Long Range Arena benchmark demonstrate the effectiveness and also justify the design of WavSpA.
• We show WavSpA enhances the reasoning extrapolation capacity to longer sequence lengths.
Reproducibility. We will release our code on GitHub.
2 Learning Attention in a Transformed Space
Inspired by recent work, we begin our study with sequence space transformation using Fourier transforms. FNet [18] replaces the attention with solely a forward Fourier transform; it performs well empirically, but mixing Fourier coefficients with the input of the original data space is not an intuitive approach. Typical space transforms consist of a forward step and a backward step [31, 11]. Hence, we are interested in comparing sequence learning in a forward-only versus a forward-backward mode.
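As a concrete illustration of the two modes (a sketch, not the exact baselines evaluated below), the snippet contrasts a simplified FNet-style forward-only mixing step along the sequence axis with a forward-backward step that modulates the spectrum with a filter and maps back to the input space; the random filter stands in for learnable weights.

```python
# Forward-only vs. forward-backward Fourier mixing along the sequence axis.
# Shapes and the random "filter" are illustrative assumptions.
import numpy as np

n, d = 128, 16                             # sequence length, hidden dimension
x = np.random.randn(n, d)

# Forward only (FNet-style, simplified): keep the real part of the Fourier coefficients,
# so subsequent layers mix spectral quantities with input-space quantities.
forward_only = np.fft.fft(x, axis=0).real

# Forward-backward (global-filter style): modulate the spectrum, then invert,
# so the output lives in the same space as the input.
filt = np.random.randn(n // 2 + 1, d)      # one weight per non-negative frequency and channel
forward_backward = np.fft.irfft(filt * np.fft.rfft(x, axis=0), n=n, axis=0)

print(forward_only.shape, forward_backward.shape)   # (128, 16) (128, 16)
```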
Table 1: Transformed spaces vs. the original space (N/A) on the Long Range Arena Text task. We color a number green if it surpasses the baseline (i.e., N/A) and red otherwise.
Transformation Transformer Linformer Linear Att. Longformer Performer
Original Space (N/A) 64.27 53.94 65.90 62.85 65.40
Fourier - Forward Only [18] 54.65 51.27 65.25 53.51 53.39
Fourier [31, 11] 56.42 57.06 71.66 55.36 65.52
Fixed Daubechies-2 Wavelet 74.82 55.22 71.93 74.99 75.60
Figure 2: A chirp signal sweeping from 1 Hz to 4 Hz, its continuous Fourier transform, and its continuous wavelet transform. From the Fourier transform one can only infer the presence of signal energy in the 1-4 Hz range, with no time information; in the wavelet transform, both time and frequency information are present, so one can tell this is a chirp signal.
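The figure can be reproduced in spirit with a few lines of code (the sampling rate, duration, and wavelet scales below are assumptions): the Fourier magnitude reveals only which frequencies are present, whereas the continuous wavelet transform produces a time-frequency map in which the low-to-high sweep is visible.

```python
# A 1-4 Hz chirp, its Fourier magnitude, and its continuous wavelet transform (CWT).
# Sampling rate, duration, and scales are illustrative assumptions.
import numpy as np
from scipy.signal import chirp
import pywt

fs = 100.0                                  # sampling rate in Hz
t = np.arange(0, 10, 1 / fs)                # 10 seconds of signal
x = chirp(t, f0=1.0, t1=10.0, f1=4.0)       # linear sweep from 1 Hz to 4 Hz

# Fourier view: frequency content only, no time localization.
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
spectrum = np.abs(np.fft.rfft(x))
print("spectral peak at %.2f Hz" % freqs[spectrum.argmax()])   # inside 1-4 Hz, but no time axis

# Wavelet view: coefficients indexed by both scale (frequency) and time.
scales = np.arange(8, 128)
cwt_coeffs, cwt_freqs = pywt.cwt(x, scales, "morl", sampling_period=1 / fs)
print(cwt_coeffs.shape)                     # (len(scales), len(t)): a time-frequency map
```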
We conduct pilot studies on the Text task of Long Range Arena [35], combining various attention mechanisms with a forward-only or a forward-backward Fourier transform. The results are summarized in Table 1; experiment details can be found in Section 4. Notably, we observe that learning in the forward-backward mode consistently outperforms the forward-only mode. While the Fourier transform occasionally outperforms the original space, its improvement is not consistent across attention mechanisms.
This phenomenon is understandable since the Fourier transform maps signals into the frequency domain, resulting in the loss of time information. In the deep learning context, losing time information is analogous to losing positional information, and positional information is vital in many tasks, as it pins down associative relationships among elements of the sequence. Hence, preserving and leveraging time information becomes vital for effectively capturing the dependencies within the sequence.
Based on this observation, we propose WavSpA, which facilitates attention learning in a wavelet coefficient space; the detailed methodology is explained in Section 3. The wavelet transform is a sequence projection method in which both frequency and time information are captured. As an illustration, we show an example of the wavelet transform to demonstrate its time-frequency localization ability compared to the Fourier transform (see Figure 2). Furthermore, the wavelet transform is multi-level, with decomposition levels corresponding to low-to-high frequencies. In the deep learning context, low-frequency signals represent global features and high-frequency signals represent local features, which has been shown useful in prior attention methods [1, 46].
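As a brief sketch of this coarse-to-fine structure (synthetic input; the Haar wavelet and four levels are chosen only so the dyadic lengths are easy to read), the coarsest level carries a handful of coefficients summarizing global structure, while each finer level doubles the number of coefficients describing increasingly local detail.

```python
# Multi-level DWT: levels go from coarse (global features) to fine (local features).
# Synthetic input; Haar wavelet and 4 levels are chosen purely for readable dyadic lengths.
import numpy as np
import pywt

x = np.random.randn(16)
coeffs = pywt.wavedec(x, "haar", mode="periodization", level=4)

labels = ["approximation (coarsest, most global)"] + [
    "detail level %d" % lvl for lvl in range(4, 0, -1)
]
for name, c in zip(labels, coeffs):
    print("%-40s %d coefficients" % (name, len(c)))
# Coefficient counts halve per level: 1, 1, 2, 4, 8 for a length-16 input.
```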
This multi-level decomposition capability corresponds to the multi-level nature of long inputs such as human text. Associative relationships in text occur at various levels, starting from individual words within a sentence. For instance, in the sentence “The cat chased the mouse”, the words “cat” and “mouse” are associated in terms of their roles in the action.
Associative relationships also extend beyond sentence boundaries. Texts are organized in hierarchical
structures, such as paragraphs, sections, and documents, where higher-level associations emerge.
Within a paragraph, sentences are associated, contributing to a coherent idea. In longer texts like news
articles, sections and chapters form hierarchical connections, uniting them under common topics.
This hierarchical structure is not unique to text but also exists in other sequential inputs, including
source code, formulas, and more. Recognizing and understanding this multi-level hierarchy is crucial
as it enables models to capture rich relationships within the sequence, facilitating more advanced
extrapolation reasoning capabilities.
To validate our intuition, we perform experiments on the LRA benchmark (the Fixed Daubechies-2 Wavelet row of Table 1); the results indicate that the wavelet transform delivers consistent performance boosts across a wide range of attention mechanisms. Furthermore, we present a comprehensive comparison of attention learning in Fourier space and fixed wavelet spaces in Appendix Table 3.
3 WavSpA: Learning Attention in Parametrized Wavelet Space
In this section, we introduce the details of WavSpA. As shown in Figure 1(a), the only difference
between a Transformer block and a WavSpA block is the attention computation. The general flow of
WavSpA is shown in Figure 1(b), which constitutes the forward wavelet transform, the attention in
the middle, and the backward wavelet transform.
We list our notations here: we denote scalars as $x$, vectors as $\mathbf{x}$, and matrices as $\mathbf{X}$; we denote a function $f$'s transformation in the coefficient space as $\hat{f}$.
3.1 WavSpA Paradigm
We propose the WavSpA paradigm to conduct attention learning in the wavelet coefficient space between a forward and a backward transformation. The forward transformation decomposes the input sequence into coefficients over a set of wavelet bases. We then conduct attention in the coefficient space. In the backward transformation, we reconstruct the target representation in the original function space. For fixed wavelet families, we require the forward-backward transformation pair to be invertible and exact, meaning that one can perfectly reconstruct the same input from the derived coefficients. However, this constraint does not always hold for adaptive wavelets.
The general framework is shown below. In practice, we deal with vectors whose dimension equals the attention head dimension; here, we limit ourselves to 1-d functions for a clear illustration. Given input and output functions $x(t), y(t): \mathbb{R} \to \mathbb{R}$ on the time domain $t$, a wavelet basis $\psi(\omega, t)$ over both the frequency and time domains $\omega, t$ (e.g., the basis for a Daubechies-2 wavelet), and an attention module $\mathrm{Attention}$,
$$\text{(forward)} \qquad \hat{x}(\omega) = \sum_i x(t_i)\,\overline{\psi(\omega, t_i)} \tag{1}$$
$$\text{(attention)} \qquad \hat{h}(\omega) = \mathrm{Attention}\big(\hat{x}(\omega)\big) \tag{2}$$
$$\text{(backward)} \qquad y(t) = \sum_j \hat{h}(\omega_j)\,\psi(\omega_j, t) \tag{3}$$
where $\overline{\psi(\omega, t)}$ denotes the complex conjugate of $\psi$.
Learning carried out in this space will correspond to gathering and processing information in a
coarse to fine-grained fashion. Furthermore, the wavelet transform enjoys $O(n)$ time complexity [22], an already desirable property compared to the Fourier transform’s $O(n \log n)$ complexity.
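Below is a compact sketch of Equations 1-3 with a fixed Daubechies-2 wavelet and plain softmax attention; the single-head setting, the level choice, and the random projections are assumptions for illustration, and the learnable-wavelet variants of Section 3.2 are not shown.

```python
# Sketch of the WavSpA paradigm (Eqs. 1-3) with a fixed Daubechies-2 wavelet:
# forward FWT -> softmax attention on the wavelet coefficients -> backward FWT.
# Single head, small dimensions, random projections: illustration only.
import numpy as np
import pywt

rng = np.random.default_rng(0)
n, d = 256, 32                                   # sequence length, head dimension
X = rng.standard_normal((n, d))                  # input sequence representation

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Z, Wq, Wk, Wv):
    # Plain scaled dot-product attention over whatever representation Z it is given.
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))

# (1) Forward: project each channel onto multi-resolution wavelet bases.
per_channel = [pywt.wavedec(X[:, j], "db2", mode="periodization", level=3) for j in range(d)]
splits = np.cumsum([len(c) for c in per_channel[0]])[:-1]        # where to re-split after attention
C = np.stack([np.concatenate(c) for c in per_channel], axis=1)   # (n, d) coefficient "sequence"

# (2) Attention in the wavelet coefficient space.
H = attention(C, Wq, Wk, Wv)

# (3) Backward: reconstruct the input-space representation from the attended coefficients.
Y = np.stack(
    [pywt.waverec(np.split(H[:, j], splits), "db2", mode="periodization") for j in range(d)],
    axis=1,
)
print(X.shape, C.shape, Y.shape)                 # (256, 32) (256, 32) (256, 32)
```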
3.2 Direct Wavelet Parameterization - AdaWavSpA
One key benefit of the wavelet transformation is its flexibility in choosing the wavelets for an application; for example, Daubechies wavelets [8] are optimized to have the most compact support, while symlets [7] are designed to have better symmetry properties. Therefore, it is natural to consider parameterizing the wavelet coefficients and making the wavelet transformation part of the learning process.
The direct parameterization scheme is the most intuitive approach. We make the wavelet coefficients learnable parameters and update them during training. The key problem here is maintaining the structure between the scaling coefficients and the wavelet coefficients, i.e., the quadrature mirror filter (QMF) relationship [7]. We parameterize the scaling coefficients $\phi^{(n)} \in \mathbb{R}^n$, where $n$ denotes the wavelet length, and expand the system according to the QMF relationship to obtain the full set of wavelet coefficients $\psi^{(n)} \in \mathbb{R}^n$, as shown in Equation 4:
$$\psi^{(n)}_j = (-1)^j\,\phi^{(n)}_j, \qquad j \in \mathbb{Z} \tag{4}$$
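A minimal sketch of the direct parameterization idea under stated assumptions: the scaling coefficients are stored as free parameters (here initialized from the Daubechies-2 filter and perturbed) and the wavelet coefficients are derived from them via the sign alternation of Equation 4; the filter length, the per-dimension filter bank, and the absence of an explicit orthogonality constraint are illustrative choices, not the paper’s exact training setup.

```python
# Direct wavelet parameterization (AdaWavSpA-style sketch): learnable scaling coefficients
# phi, with wavelet coefficients psi derived through the sign alternation of Eq. 4.
# Filter length, initialization, and the per-dimension filter bank are assumptions.
import numpy as np
import pywt

filter_len, d = 4, 8                               # wavelet length n and number of hidden dimensions

# Initialize the learnable scaling coefficients from a known wavelet (db2), one filter per
# hidden dimension; during training these entries would be updated by gradient descent.
init = np.asarray(pywt.Wavelet("db2").dec_lo)      # length-4 scaling (low-pass) filter
phi = np.tile(init, (d, 1)) + 0.01 * np.random.randn(d, filter_len)

# Eq. 4: psi_j = (-1)^j * phi_j, expanding each scaling filter into its wavelet filter.
signs = (-1.0) ** np.arange(filter_len)
psi = signs * phi                                  # shape (d, filter_len)

# Each hidden dimension now has its own (scaling, wavelet) filter pair defining the
# forward transform used by WavSpA; inspect one pair as a sanity check.
print("phi[0]:", np.round(phi[0], 3))
print("psi[0]:", np.round(psi[0], 3))
```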
Further strengthening the learning power of adaptive parameterizations, we use different sets (i.e., $d$ sets) of learnable wavelets for individual hidden dimensions of the input $\mathbf{X} \in \mathbb{R}^{n \times d}$. At the same