Figure 1: Network architecture illustrating the separation of attention modules on the scattering transform. The leftmost block represents the output of the scattering transform on the input. The separate operator isolates a single channel, e.g., $c_0$, and passes the normalized scattering coefficients, $\tilde{S}_2$, through channel attention and spatial attention before fusion. There are $C_{\text{total}}$ attention modules in the network. Figure modified from [12].
performance and interpretability as well as improve the confidence of post hoc explainability methods [11, 12]. Most similar to this work are the studies in [13, 8]. In [13], residual layers mix the input channels before applying attention, and [8] applies a scattering attention module after each step in a U-Net. However, our approach differs in that we introduce a separation scheme that applies attention to the individual input channels directly following the scattering transform.
2 Methodology
Figure 1 illustrates the primary components of our network, starting with the output of the scattering transform and showing an attention module separated by input channel. The implementation and design choices for each part are described in detail below.
Scattering Transform
Scattering representations yield signal descriptors that are informative, invariant, and stable to noise and deformations, computed by a cascade of wavelet decompositions, each followed by a non-linear modulus and spatial averaging. Using the Kymatio package [14], we compute a 2D transform with a predetermined filter bank of Morlet wavelets at $J=3$ scales and $L=6$ orientations. For each input channel, we apply a second-order transform to obtain the scattering coefficients $S_2$. These channels are processed independently and combined later in the network. Additional details on the scattering transform can be found in Appendix A.1.
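As a concrete illustration, the front end could be implemented as below with Kymatio's torch frontend; the batch size, input resolution, and channel count are placeholder values, and only $J$, $L$, and the second-order setting come from the text.

```python
# A minimal sketch of the scattering front end, assuming Kymatio's torch
# frontend; the batch size, input resolution (64x64), and channel count (C=4)
# are placeholder values -- only J, L, and max_order come from the text.
import torch
from kymatio.torch import Scattering2D

J, L = 3, 6                    # scales and orientations used in the paper
x = torch.randn(8, 4, 64, 64)  # hypothetical batch with C=4 input channels

# Second-order 2D scattering transform with a Morlet filter bank.
scattering = Scattering2D(J=J, shape=(64, 64), L=L, max_order=2)

# S2 has shape (batch, C, K, h, w) with h = w = 64 / 2**J = 8 and
# K = 1 + J*L + L**2 * J*(J - 1) / 2 = 127 coefficients per input channel.
S2 = scattering(x)
print(S2.shape)  # torch.Size([8, 4, 127, 8, 8])
```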
Channel Separation
Local attention methods routinely process their input using all the channel information at once, e.g., feature maps from RGB color channels. However, the scattering transform yields a 5-dimensional tensor, $S_2$, in which each of the $C$ input channels has its own set of $K$ scattering coefficients. Rather than stacking the result and passing it all through the subsequent layers together, we propose to first separate the input channels and process the coefficients individually. This creates $C$ new attention modules, each with independent weights, that are processed in parallel; a sketch of this scheme is given below. Following this separation scheme adds the benefit of localizing patterns in the input before joining high-level features. Thus, the interpretation of attention over individual input channels is improved significantly, especially if the channels have different meanings, e.g., temporal, visible, infrared, derived products, etc.
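The separation scheme could be realized as in the sketch below, assuming the $(B, C, K, h, w)$ layout produced by the scattering front end. `SeparatedAttention` is our hypothetical name for the wrapper, and the per-channel `AttentionModule` is left as an identity placeholder here; a fuller sketch of it follows in the next section.

```python
# Hypothetical sketch of the channel-separation scheme: S2 is split along the
# input-channel axis and each slice gets its own attention module with
# independent weights.
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Identity placeholder for the per-channel attention block (see next section)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x

class SeparatedAttention(nn.Module):
    """Applies one independent attention module to each separated input channel."""
    def __init__(self, num_channels: int):
        super().__init__()
        # C attention modules with independent weights, run in parallel.
        self.per_channel = nn.ModuleList(AttentionModule() for _ in range(num_channels))

    def forward(self, S2: torch.Tensor) -> torch.Tensor:
        # S2: (batch, C, K, h, w); slice out each channel's K coefficient maps.
        attended = [attn(S2[:, c]) for c, attn in enumerate(self.per_channel)]
        # Re-join the per-channel features for the rest of the network.
        return torch.stack(attended, dim=1)  # (batch, C, K, h, w)
```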
Attention Modules
The attention modules encompass three primary components: (i) channel attention, (ii) spatial attention, and (iii) feature fusion. The channel attention features are used to inform the spatial attention module before fusion via feature recalibration. Specifically, the network learns to use the spatial information over the $K$ channels to selectively emphasize the more informative coefficients over the less useful ones. Not only does this improve the performance of our network, it also adds a further layer of interpretability, with channels corresponding to particular coefficients. The spatial attention features highlight the salient features in the spatial resolution of independent input channels. This differs from most computer vision problems with RGB imagery, which have only one heat map for the full image. As such, our network provides a more transparent interpretation of how the spatial information in each input channel is used to form a prediction. Implementation details of each component can be found in Appendices A.2, A.3, and A.4.
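The text does not pin down the exact layers, so the following is a hedged sketch of one per-channel module in a CBAM-like style: channel attention recalibrates the $K$ coefficient maps, the recalibrated features inform a spatial heat map, and a learnable weight $w_1$ fuses the two streams, mirroring the $w_1$ / $1-w_1$ labels in Figure 1. The reduction ratio and kernel size are our assumptions.

```python
# A hedged sketch of one per-channel attention module in a CBAM-like style;
# the reduction ratio, 7x7 kernel, and learnable fusion weight w1 are our
# assumptions, not details given in the text.
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    def __init__(self, k: int = 127, reduction: int = 4):
        super().__init__()
        # (i) Channel attention over the K scattering-coefficient maps.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(k, k // reduction), nn.ReLU(),
            nn.Linear(k // reduction, k), nn.Sigmoid(),
        )
        # (ii) Spatial attention producing one heat map per input channel.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid()
        )
        # (iii) Feature fusion with a learnable weight w1 (cf. Figure 1).
        self.w1 = nn.Parameter(torch.tensor(0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, K, h, w) -- coefficients for one separated input channel.
        ca = self.channel_mlp(x).unsqueeze(-1).unsqueeze(-1)  # (batch, K, 1, 1)
        x_ca = x * ca  # recalibrated coefficients
        # The channel-attended features inform the spatial attention map.
        pooled = torch.cat(
            [x_ca.mean(dim=1, keepdim=True), x_ca.amax(dim=1, keepdim=True)],
            dim=1,
        )
        sa = self.spatial_conv(pooled)  # (batch, 1, h, w) heat map
        # Weighted fusion of the channel- and spatial-attended streams.
        return self.w1 * x_ca + (1 - self.w1) * (x_ca * sa)
```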