[Figure 1 graphic. Panel (a), WavSpA block: Input → Forward Wavelet → Attention → Backward Wavelet → Add & Normalize → Dense MLP → Output. Panel (b): forward wavelet projection x → x̂ via O(n) fast wavelet transform; attention mechanism F applied in the wavelet coefficient space over multi-resolution orthogonal wavelet bases ψ_{i,j} (different colors denote different bases; 1+2+4+8 = 15 bases in total); backward wavelet reconstruction F(x̂) → y via O(n) fast wavelet transform. The input sequence x and output sequence y both have length n.]
Figure 1: An overview of our proposed WavSpA. (a) The only difference between a Transformer
block and a WavSpA block is the attention computation. (b) The general flow of computation in
WavSpA with learnable forward and backward wavelet transform.
(e.g., full attention [39], random feature kernel [30]) in the wavelet coefficient space, and finally, reconstruct the representation in the input space via the backward wavelet transform. We implement the transform using the Fast Wavelet Transform (FWT) [22], so both transform steps are linear in time, leading to only a small overhead.
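To make this flow concrete, the following is a minimal NumPy sketch of the forward-attend-backward pipeline using a fixed Haar wavelet, with a plain softmax self-attention standing in for the attention mechanism F; the function names and layout are illustrative only, not our actual implementation.

```python
import numpy as np

def haar_fwt(x):
    """Forward Haar fast wavelet transform along the sequence axis.
    x: (n, d) with n a power of two. Returns coefficients of the same shape,
    laid out as [overall approximation | detail levels], in O(n) per channel."""
    coeffs = x.astype(float).copy()
    n = coeffs.shape[0]
    while n > 1:
        a = (coeffs[0:n:2] + coeffs[1:n:2]) / np.sqrt(2)  # low-pass (approximation)
        d = (coeffs[0:n:2] - coeffs[1:n:2]) / np.sqrt(2)  # high-pass (detail)
        coeffs[: n // 2] = a
        coeffs[n // 2 : n] = d
        n //= 2
    return coeffs

def haar_ifwt(coeffs):
    """Backward (inverse) Haar transform; reconstruction is exact since the basis is orthogonal."""
    x = coeffs.astype(float).copy()
    n, N = 1, coeffs.shape[0]
    while n < N:
        a, d = x[:n].copy(), x[n : 2 * n].copy()
        x[0 : 2 * n : 2] = (a + d) / np.sqrt(2)
        x[1 : 2 * n : 2] = (a - d) / np.sqrt(2)
        n *= 2
    return x

def softmax_attention(x):
    """Generic single-head self-attention used as a stand-in for F."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ x

def wavspa_attention(x):
    """Forward wavelet -> attention in coefficient space -> backward wavelet."""
    x_hat = haar_fwt(x)               # x -> x_hat, O(n)
    f_hat = softmax_attention(x_hat)  # F applied to the wavelet coefficients
    return haar_ifwt(f_hat)           # F(x_hat) -> y, O(n)

x = np.random.randn(16, 8)                     # toy sequence: length 16, width 8
y = wavspa_attention(x)
assert np.allclose(haar_ifwt(haar_fwt(x)), x)  # backward exactly inverts forward
```

Because the Haar filters are orthogonal and each level halves the remaining work, both transform passes run in linear time, matching the cost argument above.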
Performing attention on a sequence in a wavelet-transformed space offers several advantages. Firstly, it can enhance the representation of the input sequence: the wavelet transform maps the sequence to a multi-resolution space in which relevant features and patterns are easier to isolate, and attention applied in this space can weigh these transformed features directly, leading to improved representation learning. Secondly, it enables the attention mechanism to capture different types of relationships between the elements of the sequence, such as associative relationships; by operating in the transformed space, attention can exploit the underlying structure of the data and reason over it, leading to improved performance on long sequences. Finally, the approach is orthogonal to existing work that attempts to replace attention, and hence can be combined with any Transformer design.
Besides applying fixed wavelets, we further propose three ways to construct learnable wavelets: direct
wavelet parameterization, orthogonal wavelet parameterization, and wavelet lifting. We give detailed
explanations of the three schemes and discuss their individual advantages and drawbacks.
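As a rough illustration of the first scheme, a direct parameterization can treat the low-pass filter as a free learnable vector and tie the high-pass filter to it through the standard alternating-flip (quadrature mirror) relation. The PyTorch sketch below is hypothetical and simplified; in particular, it does not by itself enforce orthogonality or perfect reconstruction, which is the natural drawback of leaving the filter fully unconstrained.

```python
import torch
import torch.nn as nn

class DirectWaveletFilter(nn.Module):
    """Illustrative direct parameterization: the low-pass filter h is learned freely,
    and the high-pass filter g is derived as g[k] = (-1)^k * h[L - 1 - k]."""

    def __init__(self, filter_length: int = 4):
        super().__init__()
        h0 = torch.zeros(filter_length)
        h0[:2] = 1.0 / (2 ** 0.5)      # initialize near the Haar low-pass filter
        self.h = nn.Parameter(h0)

    def filters(self):
        L = self.h.numel()
        signs = torch.tensor([(-1.0) ** k for k in range(L)], device=self.h.device)
        g = signs * torch.flip(self.h, dims=[0])   # alternating flip of h
        return self.h, g

h, g = DirectWaveletFilter(filter_length=4).filters()
```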
We conduct extensive experiments on the Long Range Arena (LRA) benchmark to validate and justify our proposed WavSpA. By combining a fixed wavelet space with various representative attention methods, we observe significant performance improvements without introducing additional time complexity. Furthermore, we analyze the performance of WavSpA’s three parameterization schemes when coupled with these attention methods, demonstrating even stronger performance boosts. We also show that equipping the Transformer with WavSpA enhances its reasoning extrapolation capacity, as evidenced by improved performance on the LEGO dataset [47]. These findings highlight the superior long-range understanding achieved by learning in the wavelet coefficient space compared to the input space or Fourier space.
In summary, our major contributions are as follows.
• We propose WavSpA to facilitate learning in the wavelet space following a forward-backward paradigm; it can be paired with various attention methods and boosts their long-range understanding capabilities.
• We further propose three adaptive wavelet parameterization schemes (AdaWavSpA, OrthoWavSpA, LiftWavSpA) to maximize the flexibility of the wavelet transformation.
• Extensive experiments on the Long-Range Arena benchmark demonstrate the effectiveness and justify the design of WavSpA.
• We show WavSpA enhances the reasoning extrapolation capacity to longer sequence lengths.
Reproducibility. We will release our code on GitHub.
2 Learning Attention in a Transformed Space
Inspired by recent work, we begin our study with sequence-space transformation using Fourier transforms. FNet [18] replaces attention with a forward-only Fourier transform; it performs well empirically, but mixing Fourier coefficients with inputs from the original data space is not an intuitive approach. Typical space transforms consist of a forward step and a backward step [31, 11]. Hence, we are interested in comparing sequence learning in a forward-only mode versus a forward-backward mode.
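To make the two modes concrete, the following PyTorch sketch contrasts an FNet-style forward-only Fourier mixer with a forward-backward wrapper that returns to the input space; the `mix` placeholder and function names are illustrative, and a practical mixer would need to handle complex coefficients (real-valued wavelet coefficients avoid this complication).

```python
import torch

def fourier_forward_only(x):
    """FNet-style mixing: a forward DFT over the sequence and hidden dimensions,
    keeping only the real part, so Fourier coefficients are mixed directly back
    into the original data space with no inverse transform."""
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

def fourier_forward_backward(x, mix):
    """Forward-backward mode: transform along the sequence, apply a mixing
    function (e.g., an attention layer) on the coefficients, then return to
    the input space with the inverse transform."""
    x_hat = torch.fft.fft(x, dim=-2)   # forward transform along the sequence
    y_hat = mix(x_hat)                 # operate in coefficient space
    return torch.fft.ifft(y_hat, dim=-2).real

x = torch.randn(2, 16, 8)                       # (batch, sequence, hidden)
y1 = fourier_forward_only(x)
y2 = fourier_forward_backward(x, lambda c: c)   # identity mixing recovers x
assert torch.allclose(y2, x, atol=1e-5)
```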