
adversarial training, Gaussian noise augmentation) exhibit different sensitivities to perturbations along
Fourier frequency components. [Dif+21] investigates the relationship between frequency sensitivity
and natural corruption robustness for models compressed with various weight pruning techniques, and
introduces ensembling algorithms where the frequency statistics of a test image are compared to those
of various image augmentation methods, and models trained on the augmentations most spectrally
similar to the test image are used in inference. [Sun+22] designs an augmentation procedure that
explicitly introduces variation in both the amplitude and phase of the DFT of input images, finding it
improves certified robustness and robustness to common corruptions. [Ber+21] investigated the extent
to which constraining models to use only the lowest (or highest) Fourier frequency components of input
data provided perturbation robustness, also nding signicant variability across datasets. [AHW21]
tested the extent to which CNNs relied on various frequency bands by measuring model error on inputs
where certain frequencies were removed, again finding a striking amount of variability across datasets.
[Mai+22] analyzed the sensitivity of networks to perturbations in various frequencies, finding significant
variation across a variety of datasets and model architectures. All of these works suggest that model
frequency sensitivity depends heavily on the underlying training data; our work began as an attempt
to explain this phenomenon mathematically.
Implicit bias and representation cost of CNNs: Our analysis of (linear) convolutional networks
leverages prior work on implicit bias and representation cost of CNNs, especially [Gun+18]. There it
was found that for a linear one-dimensional convolutional network where inputs and all hidden layers
have one channel (in the notation of section 3, $C = C_1 = \dots = C_{L-1} = K = 1$) trained on a binary
linear classification task with exponential loss, with linear effective predictor $\beta$, the Fourier-transformed
predictor $\hat{\beta}$ converges in direction to a first-order stationary point of an optimization problem of the form
$$\min_{\hat{\beta}} \; \tfrac{1}{2}\,|\hat{\beta}|_{2/L} \quad \text{such that } y_n \hat{\beta}^T x_n \geq 1 \text{ for all } n. \tag{2.1}$$
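To make the role of the exponent $2/L$ in (2.1) concrete, here is a small worked comparison of our own (an illustration, not a statement from [Gun+18]), writing $|\hat{\beta}|_{2/L} = \big(\sum_k |\hat{\beta}_k|^{2/L}\big)^{L/2}$ and taking the DFT to be unitary so that Parseval's identity applies. If $\hat{\beta}$ places all of its unit $\ell_2$ mass on a single frequency then $|\hat{\beta}|_{2/L} = 1$ for every depth $L$, whereas if the same mass is spread evenly over $d$ frequencies, each coefficient has magnitude $d^{-1/2}$ and
$$|\hat{\beta}|_{2/L} = \big( d \cdot d^{-1/L} \big)^{L/2} = d^{(L-1)/2},$$
so for $L = 1$ the two predictors are penalized equally (the penalty reduces to $|\beta|_2$), while for $L \geq 2$ the spread-out predictor pays a penalty growing with $d$: the objective increasingly favors predictors supported on few Fourier frequencies.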
A generalization to arbitrary group-equivariant CNNs (of which the usual CNNs are a special case)
appears in [Law+22, Thm. 1]; while we suspect that some of our results generalize to more general
equivariant networks, we leave that to future work. For generalizations in different directions see [LL20;
YKM21], and for additional follow up work see [JRG22]. Our general setup in section 3 closely follows
these authors’, and our theorem 4.10 partially confirms a suspicion of [Gun+18, §6] that “with multiple
outputs, as more layers are added, even fully connected networks exhibit a shrinking sparsity penalty on
the singular values of the effective linear matrix predictor ...”
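To unpack the quoted suspicion, the natural matrix analogue of the penalty in (2.1) is the Schatten $2/L$ quasi-norm
$$\|M\|_{S_{2/L}} := \Big( \sum_i \sigma_i(M)^{2/L} \Big)^{L/2},$$
the $\ell_{2/L}$ quasi-norm of the singular values of the effective linear matrix predictor $M$; as $L$ grows this increasingly rewards low-rank $M$, just as (2.1) increasingly rewards Fourier-sparse $\hat{\beta}$. We record this only as a guide to how such a penalty is typically written, not as a statement of theorem 4.10.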
While the aforementioned works study the implicit regularization imposed by gradient descent, we
instead consider explicit regularization imposed by auxiliary $\ell_2$ norm penalties in objective functions, and
prove equivalences of minimization problems. In this sense our analysis is more closely related to that
of [DKS21], which considers parametrized families of functions $f(x, w)$ and defines the representation
cost of a function $g(x)$ appearing in the parametric family as
$$R(g) := \min_w \left\{ \, |w|_2^2 \; \middle| \; f(x, w) = g(x) \text{ for all } x \, \right\}. \tag{2.2}$$
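A toy instance of (2.2), included only for orientation and not taken from [DKS21]: consider the two-parameter family $f(x, (w_1, w_2)) = w_2 w_1 x$ of scalar linear functions, i.e. a depth-two linear network with one unit per layer. For the target $g(x) = a x$,
$$R(g) = \min_{w_1 w_2 = a} \big( w_1^2 + w_2^2 \big) = 2|a|$$
by the AM-GM inequality, with the minimum attained at $|w_1| = |w_2| = |a|^{1/2}$; the analogous depth-$L$ factorization gives cost $L |a|^{2/L}$, mirroring the exponent $2/L$ appearing in (2.1).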
While this approach lacks the intimate connection with the gradient descent algorithms used to train
modern neural networks, it comes with some benefits: for example, results regarding representation
cost are agnostic to the choice of per-sample loss function (in particular they apply to both squared error
and cross entropy loss). In the case where the number of channels
$C = C_1 = \dots = C_{L-1} = 1$ (but the number of output dimensions may be $> 1$), theorem 4.10 can be
deduced from [DKS21, Thm. 3].
Data-dependent bias: While in this paper we focus on spatial frequency properties of image data,
there is a large and growing body of work on the impact of frequency properties of training data more
broadly interpreted. [Rah+19] gave a formula for the continuous Fourier transform of a ReLU network