WavSpA: Wavelet Space Attention for Boosting Transformers’ Long Sequence Learning Ability
Yufan Zhuang (UC San Diego), Zihan Wang (UC San Diego), Fangbo Tao (Mindverse), Jingbo Shang (UC San Diego)
Abstract
Transformer and its variants are fundamental neural architectures in deep learning. Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers. We argue that the wavelet transform should be a better choice because it captures both position and frequency information with linear time complexity. Therefore, in this paper, we systematically study the synergy between the wavelet transform and Transformers. We propose Wavelet Space Attention (WavSpA), which facilitates attention learning in a learnable wavelet coefficient space. It replaces the attention in Transformers by (1) applying a forward wavelet transform to project the input sequences onto multi-resolution bases, (2) conducting attention learning in the wavelet coefficient space, and (3) reconstructing the representation in the input space via a backward wavelet transform. Extensive experiments on the Long Range Arena demonstrate that learning attention in the wavelet space, using either fixed or adaptive wavelets, consistently improves Transformer’s performance and also significantly outperforms learning in the Fourier space. We further show our method can enhance Transformer’s reasoning extrapolation capability over distance on the LEGO chain-of-reasoning task.
1 Introduction
Transformer [39] has become one of the most influential neural architectures in deep learning. Large language models such as ChatGPT [26] have reshaped people’s imagination of what an AI model can do in making conversation with humans, solving nontrivial math problems, writing code, and even co-authoring a paper [16]. In image processing, vision transformers have become the backbone for a wide array of applications [9, 29]. Similarly, for source code understanding, Codex [3] can finish people’s code given the helper text of the function or just the function name. All of these accomplishments are built upon the foundational Transformer.
Nevertheless, the effective handling of long sequences remains a challenge for Transformers due
to the intricate relationships that can exist within such sequences. To address this limitation, recent
research has focused on enhancing the Transformers’ long-range capabilities through attention
learning in transformed sequence spaces. One approach involves low-cost token-mixing, which
utilizes forward Fourier transformation to achieve notable accuracy improvements while maintaining
quasi-linear time complexity [18]. However, without incorporating a backward transformation, the model might inadvertently mix information from both the input and transformed spaces. To overcome this limitation, researchers have leveraged the forward and backward Fourier transformations to learn large filters with linear weights [31] and non-linearities [11] for vision tasks, exploiting the equivalence between multiplication in the Fourier space and direct convolution in the input space.
In light of these developments, it is evident that attention learning in transformed sequence spaces
holds significant promise for enhancing the effectiveness of Transformers’ handling of long-range
dependencies. We propose Wavelet Space Attention (WavSpA), which facilitates attention learning in a learnable wavelet coefficient space, as shown in Figure 1(a). Specifically, we first apply a forward wavelet transform to project the input sequence onto multi-resolution bases, then conduct attention (e.g., full attention [39], random feature kernel [30]) in the wavelet coefficient space, and finally reconstruct the representation in the input space via a backward wavelet transform. We implement the transform using the Fast Wavelet Transform (FWT) [22], so both transform steps are linear in time, leading to a small overhead.
Figure 1: An overview of our proposed WavSpA. (a) The only difference between a Transformer block and a WavSpA block is the attention computation; the block comprises Forward Wavelet, Attention, Backward Wavelet, Add & Normalize, and Dense MLP sub-layers. (b) The general flow of computation in WavSpA with learnable forward and backward wavelet transform: the input sequence x (length n) is projected onto multi-resolution orthogonal wavelet bases ψ_{i,j} via an O(n) fast wavelet transform, the attention mechanism operates in the wavelet coefficient space, and the output sequence y (length n) is reconstructed via an O(n) backward fast wavelet transform (different colors in the figure denote different bases; 1+2+4+8=15 bases in total).
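To make the transform step concrete, the following is a minimal sketch (not the authors’ released implementation) of the forward and backward fast wavelet transform round trip using the PyWavelets library; the sequence length, decomposition level, and boundary mode are illustrative assumptions.

```python
# Minimal sketch: forward/backward fast wavelet transform with PyWavelets.
# Sequence length, level, and boundary mode are illustrative assumptions.
import numpy as np
import pywt

n = 1024                                   # sequence length (power of two gives clean dyadic levels)
x = np.random.randn(n)                     # stand-in for one attention-head channel

# Forward FWT: O(n) decomposition into multi-resolution coefficients (coarse to fine).
coeffs = pywt.wavedec(x, "db2", mode="periodization", level=4)
print([len(c) for c in coeffs])            # [64, 64, 128, 256, 512]

# ... in WavSpA, attention would operate on these coefficients ...

# Backward FWT: O(n) reconstruction; exact for a fixed orthogonal wavelet such as Daubechies-2.
x_rec = pywt.waverec(coeffs, "db2", mode="periodization")
print(np.allclose(x, x_rec))               # True: perfect reconstruction
```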
Performing attention on a sequence in a wavelet-transformed space can offer several advantages. Firstly, it can enhance the representation of the input sequence by capturing relevant features and patterns. By applying the transformation, the sequence is mapped to a new space where certain characteristics might be easier to capture. Attention mechanisms can then be applied in this transformed space to effectively weigh these transformed features, leading to improved representation learning. Secondly, it can enable the attention mechanism to capture different types of relationships between the elements of the sequence, such as associative relationships. By operating in the transformed space, attention can effectively capture the underlying structure of the data and reason over it, leading to improved performance on long sequences. Finally, it is orthogonal to existing work that attempts to replace attention, and hence can be combined with any Transformer design.
Besides applying fixed wavelets, we further propose three ways to construct learnable wavelets: direct
wavelet parameterization, orthogonal wavelet parameterization, and wavelet lifting. We give detailed
explanations of the three schemes and discuss their individual advantages and drawbacks.
We conduct extensive experiments on the Long Range Arena (LRA) benchmark to validate and justify our proposed WavSpA. By combining a fixed wavelet space with various representative attention methods, we observe significant performance improvements without introducing additional time complexity. Furthermore, we analyze the performance of WavSpA’s three parameterization schemes when coupled with the attention methods, demonstrating even stronger performance boosts. Additionally, our investigation demonstrates that equipping the Transformer with our proposed WavSpA enhances its reasoning extrapolation capacity, as evidenced by improved performance on the LEGO dataset [47]. These findings highlight the superior long-range understanding capabilities achieved by learning in the wavelet coefficient space compared to the input space or Fourier space.
In summary, our major contributions are as follows.
• We propose WavSpA to facilitate learning in the wavelet space following a forward-backward paradigm, which can be paired with various attention methods and boost their long-range understanding capabilities.
• We further propose three adaptive wavelet parameterization schemes (AdaWavSpA, OrthoWavSpA, LiftWavSpA) to maximize the flexibility of the wavelet transformation.
• Extensive experiments on the Long Range Arena benchmark demonstrate the effectiveness and also justify the design of WavSpA.
• We show WavSpA enhances the reasoning extrapolation capacity to longer sequence lengths.
Reproducibility. We will release our code on GitHub.
2 Learning Attention in a Transformed Space
Inspired by recent work, we begin our study with sequence space transformation using Fourier transforms. FNet [18] replaces the attention with solely a forward Fourier transform; it performs well empirically, but mixing Fourier coefficients with the input of the original data space is not an intuitive approach. Typical space transforms consist of a forward step and a backward step [31, 11]. Hence, we are interested in comparing sequence learning in a forward-only versus a forward-backward mode.
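As a concrete illustration of the two modes (a sketch, not the exact baselines evaluated below), the snippet contrasts a simplified FNet-style forward-only mixing step along the sequence axis with a forward-backward step that modulates the spectrum with a filter and maps back to the input space; the random filter stands in for learnable weights.

```python
# Forward-only vs. forward-backward Fourier mixing along the sequence axis.
# Shapes and the random "filter" are illustrative assumptions.
import numpy as np

n, d = 128, 16                             # sequence length, hidden dimension
x = np.random.randn(n, d)

# Forward only (FNet-style, simplified): keep the real part of the Fourier coefficients,
# so subsequent layers mix spectral quantities with input-space quantities.
forward_only = np.fft.fft(x, axis=0).real

# Forward-backward (global-filter style): modulate the spectrum, then invert,
# so the output lives in the same space as the input.
filt = np.random.randn(n // 2 + 1, d)      # one weight per non-negative frequency and channel
forward_backward = np.fft.irfft(filt * np.fft.rfft(x, axis=0), n=n, axis=0)

print(forward_only.shape, forward_backward.shape)   # (128, 16) (128, 16)
```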
Table 1: Transformed spaces vs. the original space (N/A) on the Long Range Arena Text task. We color a number green if it surpasses the baseline (i.e., N/A) and red otherwise.
Transformation Transformer Linformer Linear Att. Longformer Performer
Original Space (N/A) 64.27 53.94 65.90 62.85 65.40
Fourier - Forward Only [18] 54.65 51.27 65.25 53.51 53.39
Fourier [31, 11] 56.42 57.06 71.66 55.36 65.52
Fixed Daubechies-2 Wavelet 74.82 55.22 71.93 74.99 75.60
Figure 2: A chirp signal sweeping from 1 Hz to 4 Hz, its continuous Fourier transform, and its continuous wavelet transform. From the Fourier transform one can only infer the presence of signal energy in the 1-4 Hz range, with no time information; in the wavelet transform, both time and frequency information are present, so one can tell this is a chirp signal.
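The figure can be reproduced in spirit with a few lines of code (the sampling rate, duration, and wavelet scales below are assumptions): the Fourier magnitude reveals only which frequencies are present, whereas the continuous wavelet transform produces a time-frequency map in which the low-to-high sweep is visible.

```python
# A 1-4 Hz chirp, its Fourier magnitude, and its continuous wavelet transform (CWT).
# Sampling rate, duration, and scales are illustrative assumptions.
import numpy as np
from scipy.signal import chirp
import pywt

fs = 100.0                                  # sampling rate in Hz
t = np.arange(0, 10, 1 / fs)                # 10 seconds of signal
x = chirp(t, f0=1.0, t1=10.0, f1=4.0)       # linear sweep from 1 Hz to 4 Hz

# Fourier view: frequency content only, no time localization.
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
spectrum = np.abs(np.fft.rfft(x))
print("spectral peak at %.2f Hz" % freqs[spectrum.argmax()])   # inside 1-4 Hz, but no time axis

# Wavelet view: coefficients indexed by both scale (frequency) and time.
scales = np.arange(8, 128)
cwt_coeffs, cwt_freqs = pywt.cwt(x, scales, "morl", sampling_period=1 / fs)
print(cwt_coeffs.shape)                     # (len(scales), len(t)): a time-frequency map
```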
We conduct pilot studies on the Text task of Long Range Arena [35], combining various attention mechanisms with a forward-only or a forward-backward Fourier transform. The results are summarized in Table 1; experiment details can be found in Section 4. Notably, we observe that learning in the forward-backward mode consistently outperforms the forward-only mode. While the Fourier transform occasionally outperforms the original space, its improvement is not consistent across attention mechanisms.
This phenomenon is understandable since the Fourier transform maps signals into the frequency domain, resulting in the loss of time information. In the deep learning context, losing time information is analogous to losing positional information, and positional information is vital in many tasks, as it pins down associative relationships among elements of the sequence. Hence, preserving and leveraging time information becomes vital for effectively capturing the dependencies within the sequence.
Based on this observation, we propose WavSpA, which facilitates attention learning in a wavelet coefficient space; the detailed methodology is explained in Section 3. The wavelet transform is a sequence projection method in which both frequency and time information are captured. As an illustration, we show an example of the wavelet transform to demonstrate its time-frequency localization ability compared to the Fourier transform (see Figure 2). Furthermore, the wavelet transform is multi-level, with decomposition levels corresponding to low-to-high frequencies. In the deep learning context, low-frequency signals represent global features and high-frequency signals represent local features, which has been shown useful in prior attention methods [1, 46].
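As a brief sketch of this coarse-to-fine structure (synthetic input; the Haar wavelet and four levels are chosen only so the dyadic lengths are easy to read), the coarsest level carries a handful of coefficients summarizing global structure, while each finer level doubles the number of coefficients describing increasingly local detail.

```python
# Multi-level DWT: levels go from coarse (global features) to fine (local features).
# Synthetic input; Haar wavelet and 4 levels are chosen purely for readable dyadic lengths.
import numpy as np
import pywt

x = np.random.randn(16)
coeffs = pywt.wavedec(x, "haar", mode="periodization", level=4)

labels = ["approximation (coarsest, most global)"] + [
    "detail level %d" % lvl for lvl in range(4, 0, -1)
]
for name, c in zip(labels, coeffs):
    print("%-40s %d coefficients" % (name, len(c)))
# Coefficient counts halve per level: 1, 1, 2, 4, 8 for a length-16 input.
```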
This multi-level decomposition capability corresponds to the multi-level nature of long inputs such as human text. Associative relationships in text occur at various levels, starting from individual words within a sentence. For instance, in the sentence “The cat chased the mouse”, the words “cat” and “mouse” are associated in terms of their roles in the action.
Associative relationships also extend beyond sentence boundaries. Texts are organized in hierarchical
structures, such as paragraphs, sections, and documents, where higher-level associations emerge.
Within a paragraph, sentences are associated, contributing to a coherent idea. In longer texts like news
articles, sections and chapters form hierarchical connections, uniting them under common topics.
This hierarchical structure is not unique to text but also exists in other sequential inputs, including
source code, formulas, and more. Recognizing and understanding this multi-level hierarchy is crucial
as it enables models to capture rich relationships within the sequence, facilitating more advanced
extrapolation reasoning capabilities.
To validate our intuition, we perform experiments on the LRA benchmark (the Fixed Daubechies-2 Wavelet row of Table 1); the results indicate that the wavelet transform delivers consistent performance boosts across a wide range of attention mechanisms. Furthermore, we present a comprehensive comparison of attention learning in Fourier space and fixed wavelet spaces in Appendix Table 3.
3 WavSpA: Learning Attention in Parametrized Wavelet Space
In this section, we introduce the details of WavSpA. As shown in Figure 1(a), the only difference
between a Transformer block and a WavSpA block is the attention computation. The general flow of
WavSpA is shown in Figure 1(b), which constitutes the forward wavelet transform, the attention in
the middle, and the backward wavelet transform.
We list our notations here: we denote scalars as $x$, vectors as $\mathbf{x}$, and matrices as $\mathbf{X}$; we denote a function $f$'s transformation in the coefficient space as $\hat{f}$.
3.1 WavSpA Paradigm
We propose the WavSpA paradigm to conduct attention learning in the wavelet coefficient space between a forward and a backward transformation. The forward transformation decomposes the input sequence into coefficients over a set of wavelet bases. We then conduct attention in the coefficient space. In the backward transformation, we reconstruct the target representation in the original function space. For fixed wavelet families, we require the forward-backward transformation pair to be invertible and exact, meaning that one can perfectly reconstruct the same input from the derived coefficients. However, this constraint does not always hold for adaptive wavelets.
The general framework is shown below. In practice, we deal with vectors whose dimension equals the attention head dimension; here, we limit ourselves to 1-d functions for a clear illustration. Given input and output functions $x(t), y(t): \mathbb{R} \to \mathbb{R}$ on the time domain $t$, a wavelet basis $\psi(\omega, t)$ over both the frequency and time domains $\omega, t$ (e.g., the basis for a Daubechies-2 wavelet), and an attention module $\mathrm{Attention}$,
$$\text{(forward)} \qquad \hat{x}(\omega) = \sum_i x(t_i)\,\overline{\psi(\omega, t_i)} \tag{1}$$
$$\text{(attention)} \qquad \hat{h}(\omega) = \mathrm{Attention}\big(\hat{x}(\omega)\big) \tag{2}$$
$$\text{(backward)} \qquad y(t) = \sum_j \hat{h}(\omega_j)\,\psi(\omega_j, t) \tag{3}$$
where $\overline{\psi(\omega, t)}$ denotes the complex conjugate of $\psi$.
Learning carried out in this space will correspond to gathering and processing information in a
coarse to fine-grained fashion. Furthermore, the wavelet transform enjoys $O(n)$ time complexity [22], an already desirable property compared to the Fourier transform’s $O(n \log n)$ complexity.
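Below is a compact sketch of Equations 1-3 with a fixed Daubechies-2 wavelet and plain softmax attention; the single-head setting, the level choice, and the random projections are assumptions for illustration, and the learnable-wavelet variants of Section 3.2 are not shown.

```python
# Sketch of the WavSpA paradigm (Eqs. 1-3) with a fixed Daubechies-2 wavelet:
# forward FWT -> softmax attention on the wavelet coefficients -> backward FWT.
# Single head, small dimensions, random projections: illustration only.
import numpy as np
import pywt

rng = np.random.default_rng(0)
n, d = 256, 32                                   # sequence length, head dimension
X = rng.standard_normal((n, d))                  # input sequence representation

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Z, Wq, Wk, Wv):
    # Plain scaled dot-product attention over whatever representation Z it is given.
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

Wq, Wk, Wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))

# (1) Forward: project each channel onto multi-resolution wavelet bases.
per_channel = [pywt.wavedec(X[:, j], "db2", mode="periodization", level=3) for j in range(d)]
splits = np.cumsum([len(c) for c in per_channel[0]])[:-1]        # where to re-split after attention
C = np.stack([np.concatenate(c) for c in per_channel], axis=1)   # (n, d) coefficient "sequence"

# (2) Attention in the wavelet coefficient space.
H = attention(C, Wq, Wk, Wv)

# (3) Backward: reconstruct the input-space representation from the attended coefficients.
Y = np.stack(
    [pywt.waverec(np.split(H[:, j], splits), "db2", mode="periodization") for j in range(d)],
    axis=1,
)
print(X.shape, C.shape, Y.shape)                 # (256, 32) (256, 32) (256, 32)
```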
3.2 Direct Wavelet Parameterization - AdaWavSpA
One key benefit of the wavelet transformation is its flexibility in choosing the wavelets for an application; for example, Daubechies wavelets [8] are optimized to have the most compact support, while symlets [7] are designed to have better symmetry properties. Therefore, it is natural to consider parameterizing the wavelet coefficients and making the wavelet transformation part of the learning process.
The direct parameterization scheme is the most intuitive approach. We make the wavelet coefficients learnable parameters and update them during training. The key problem here is maintaining the structure between the scaling coefficients and the wavelet coefficients, i.e., the quadrature mirror filter (QMF) relationship [7]. We parameterize the scaling coefficients $\phi^{(n)} \in \mathbb{R}^n$, where $n$ denotes the wavelet length, and expand the system according to the QMF relationship to obtain the full set of wavelet coefficients $\psi^{(n)} \in \mathbb{R}^n$, as shown in Equation 4:
$$\psi^{(n)}_j = (-1)^j\,\phi^{(n)}_j, \qquad j \in \mathbb{Z} \tag{4}$$
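A minimal sketch of the direct parameterization idea under stated assumptions: the scaling coefficients are stored as free parameters (here initialized from the Daubechies-2 filter and perturbed) and the wavelet coefficients are derived from them via the sign alternation of Equation 4; the filter length, the per-dimension filter bank, and the absence of an explicit orthogonality constraint are illustrative choices, not the paper’s exact training setup.

```python
# Direct wavelet parameterization (AdaWavSpA-style sketch): learnable scaling coefficients
# phi, with wavelet coefficients psi derived through the sign alternation of Eq. 4.
# Filter length, initialization, and the per-dimension filter bank are assumptions.
import numpy as np
import pywt

filter_len, d = 4, 8                               # wavelet length n and number of hidden dimensions

# Initialize the learnable scaling coefficients from a known wavelet (db2), one filter per
# hidden dimension; during training these entries would be updated by gradient descent.
init = np.asarray(pywt.Wavelet("db2").dec_lo)      # length-4 scaling (low-pass) filter
phi = np.tile(init, (d, 1)) + 0.01 * np.random.randn(d, filter_len)

# Eq. 4: psi_j = (-1)^j * phi_j, expanding each scaling filter into its wavelet filter.
signs = (-1.0) ** np.arange(filter_len)
psi = signs * phi                                  # shape (d, filter_len)

# Each hidden dimension now has its own (scaling, wavelet) filter pair defining the
# forward transform used by WavSpA; inspect one pair as a sanity check.
print("phi[0]:", np.round(phi[0], 3))
print("psi[0]:", np.round(psi[0], 3))
```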
Further strengthening the learning power of adaptive parameterizations, we use different sets (i.e., $d$ sets) of learnable wavelets for individual hidden dimensions of the input $\mathbf{X} \in \mathbb{R}^{n \times d}$. At the same