Testing predictions of representation cost theory with CNNs
Charles Godfrey, Elise Bishoff, Myles Mckay, Davis Brown,
Grayson Jorgenson, Henry Kvinge & Eleanor Byler
Pacific Northwest National Lab
{first}.{last}@pnnl.gov
Abstract
It is widely acknowledged that trained convolutional neural networks (CNNs) have different levels of sensitivity to signals of different frequency. In particular, a number of empirical studies have documented CNNs' sensitivity to low-frequency signals. In this work we show with theory and experiments that this observed sensitivity is a consequence of the frequency distribution of natural images, which is known to have most of its power concentrated in low-to-mid frequencies. Our theoretical analysis relies on representations of the layers of a CNN in frequency space, an idea that has previously been used to accelerate computations and study implicit bias of network training algorithms, but to the best of our knowledge has not been applied in the domain of model robustness.
1 Introduction
Since their rise to prominence in the early 1990s, convolutional neural networks (CNNs) have formed the backbone of image and video recognition, object detection, and speech-to-text systems ([CPS16; Fuk80; Kar+14; KSH12; LeC+89]). The success of CNNs has largely been attributed to their "hard priors" of spatial translation invariance and local receptive fields [GBC16, §9.3]. On the other hand, more recent research has revealed a number of less desirable and potentially data-dependent biases of CNNs, such as a tendency to make predictions on the basis of texture features ([Gei+19]). Moreover, it has been repeatedly observed that CNNs are sensitive to perturbations in targeted ranges of the Fourier frequency spectrum ([GFW19; SDB19]), and further investigation has shown that these frequency ranges are dependent on training data ([AHW21; Ber+21; Mai+22; Yin+19]). In this work, we provide a mathematical explanation for these frequency space phenomena, showing with theory and experiments that neural network training causes CNNs to be most sensitive to frequencies that are prevalent in the training data distribution.
Our theoretical results rely on representing an idealized CNN in frequency space, a strategy we
borrow from [Gun+18]. This representation is built on the classical convolution theorem,
$\widehat{w \ast x} = \hat{w}\,\hat{x}$    (1.1)

where $\hat{x}$ and $\hat{w}$ denote the Fourier transforms of $x$ and $w$ respectively, and $\ast$ denotes convolution.
Equation 1.1 demonstrates that a Fourier transform converts convolutions into products. As such, in a "cartoon" representation of a CNN in frequency space, the convolution layers become coordinate-wise multiplications (a more precise description is presented in section 3). This suggests that in the presence of some form of weight decay, the weights $\hat{w}$ for high-power frequencies in the training data distribution will grow during training, while weights corresponding to low-power frequencies in the training data will be suppressed. The resulting uneven magnitude of the weights $\hat{w}$ across frequencies can thus account for the observed uneven perturbation-sensitivity of CNNs in frequency space. We formalize this argument for linear CNNs (without biases) in sections 3 and 4.
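As a concrete illustration of eq. (1.1), here is a minimal numerical check (our own sketch, not from the paper) that the 2D DFT turns convolution into a coordinate-wise product; it assumes circular (periodic) convolution, which is the setting in which the identity is exact.

```python
import numpy as np

# Minimal check of the convolution theorem (eq. 1.1) for circular 2D convolution.
rng = np.random.default_rng(0)
H, W = 8, 8
x = rng.standard_normal((H, W))   # toy "image"
w = rng.standard_normal((H, W))   # toy filter on the full H x W grid

def circular_conv2d(w, x):
    """(w * x)[i, j] = sum_{m, n} w[m, n] * x[(i - m) % H, (j - n) % W]."""
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            for m in range(H):
                for n in range(W):
                    out[i, j] += w[m, n] * x[(i - m) % H, (j - n) % W]
    return out

lhs = np.fft.fft2(circular_conv2d(w, x))   # Fourier transform of the convolution
rhs = np.fft.fft2(w) * np.fft.fft2(x)      # coordinate-wise product of transforms
print(np.allclose(lhs, rhs))               # True
```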
One interesting feature of the framework set up in section 4 is that the discrete Fourier transform (DFT) representation of a linear CNN is precisely a feedforward network with block diagonal weight matrices, where each block corresponds to a frequency index. We show in theorem 4.10 that a learning objective for such a network of depth $L$ with an $\ell^2$-norm penalty on weights is equivalent to an objective for the associated linear model with an $\ell^p$ penalty on the singular values of each of its blocks, i.e. each frequency index; this result is new for CNNs with multiple channels and outputs. In particular, the latter penalty is highly sparsity-encouraging, suggesting that as depth increases these linearly-activated CNNs have an even stronger incentive to prioritize frequencies present in the training data.
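To see concretely why such a penalty pushes toward sparsity across frequencies, the following toy computation (our own illustration, not from the paper) compares the penalty assigned to a concentrated versus an evenly spread singular-value vector of equal $\ell^2$ norm, for several values of $p$; smaller $p$ loosely plays the role of greater depth.

```python
import numpy as np

# Two singular-value vectors with the same "energy" (both have l2 norm 1):
concentrated = np.array([1.0, 0.0, 0.0, 0.0])   # all weight on a single frequency
spread = np.full(4, 0.5)                        # weight spread evenly

def lp_penalty(sigma, p):
    """sum_i sigma_i^p, the kind of penalty appearing on singular values."""
    return np.sum(np.abs(sigma) ** p)

for p in [2.0, 1.0, 0.5, 0.25]:
    print(f"p = {p}: concentrated = {lp_penalty(concentrated, p):.3f}, "
          f"spread = {lp_penalty(spread, p):.3f}")
# For p = 2 the two penalties agree; for p < 2 the concentrated vector is strictly
# cheaper, so the objective increasingly favors putting weight on few frequencies.
```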
It has long been known that the frequency content of natural images is concentrated in low-to-mid frequencies, in the sense that the power in Fourier frequency $f$ is well-described by $1/|f|^{\alpha}$ for a coefficient $\alpha \approx 1$ ([LMH01]). Hence, when specialized to training data distributions of natural images, our results explain findings that CNNs are more susceptible to low frequency perturbations in practice ([GFW19; SDB19]).
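A quick way to see this empirically (our own sketch, not the paper's pipeline; `img` below is a placeholder array and should be replaced by an actual grayscale natural image) is to radially average the 2D power spectrum and fit a slope on log-log axes:

```python
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged power spectrum of a 2D grayscale image array."""
    F = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(F) ** 2
    H, W = img.shape
    yy, xx = np.indices((H, W))
    r = np.hypot(yy - H // 2, xx - W // 2).astype(int)  # integer radial frequency bins
    counts = np.bincount(r.ravel())
    sums = np.bincount(r.ravel(), weights=power.ravel())
    spectrum = sums / np.maximum(counts, 1)             # mean power per radial bin
    return spectrum[1: min(H, W) // 2]                  # drop the DC component at r = 0

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))      # placeholder; use a natural image here
spec = radial_power_spectrum(img)
freqs = np.arange(1, len(spec) + 1)
# Fit power ~ 1 / |f|^alpha, i.e. a line of slope -alpha in log-log coordinates.
alpha = -np.polyfit(np.log(freqs), np.log(spec), 1)[0]
print(f"estimated alpha: {alpha:.2f}")   # near 0 for white noise, larger for natural images
```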
We use our theoretical results to derive specific predictions: CNN frequency sensitivity aligns with the frequency content of training data, and deeper models, as well as models trained with substantial weight decay, exhibit frequency sensitivity more closely reflecting the statistics of the underlying images. We confirm these predictions for nonlinear CNNs trained on the CIFAR10 and ImageNette datasets. Figure 1 shows our experimental results for a variety of CNN models trained on CIFAR10 as well as a variant of CIFAR10 preprocessed with high pass filtering (more experimental details will be provided in section 5).
To the best of our knowledge, ours is the first work to connect the following research threads (see section 2 for further discussion):
• equivalences between linear neural networks and sparse linear models (implicit bias and representation cost),
• classical data-dependent “shrinkage” properties of sparse linear models,
• statistical properties of natural images, and
• sensitivity of CNNs to perturbations in certain frequency ranges.
It seems likely that a similar analysis could provide insight into models trained on data from a domain
other than images; hence our work may serve as a template for combining results on implicit bias and/or
representation cost with domain knowledge of input data statistics to understand the behavior of deep
learning models.
2 Related work
CNN sensitivity to Fourier frequency components: [JB17] computed transfer accuracy of image classifiers trained on data preprocessed with various Fourier filtering schemes (e.g. train on low pass filtered images, test on unfiltered images, or vice versa). They found significant generalization gaps, suggesting that models trained on images with different frequency content learned different patterns. [GFW19] proposed algorithms for generating adversarial perturbations constrained to low frequency Fourier components, finding that they allowed for greater query efficiency and higher transferability between different neural networks. [SDB19] demonstrated empirically that constraining to high or midrange frequencies did not produce similar effects, suggesting convolutional networks trained on natural images exhibit a particular sensitivity to low frequency perturbations.
Figure 1: Radial averages $E\big[\,|\nabla_x f(x)^T \hat{e}_{cij}|\;\big|\;|(i,j)| = r\,\big]$ of frequency sensitivities of CNNs trained on (hpf-)CIFAR10, post-processed by dividing each curve by its integral, smoothing by averaging with 3 neighbors on either side, and taking logarithms. Bottom row: frequency statistics of (hpf-)CIFAR10 for comparison. See section 5 for further details. Panels: (a) ConvActually models of varying depth; (b) ConvActually models of depth 4 trained with varying weight decay; (c) VGG models of varying depth; (d) VGG 11 models trained with varying weight decay; (e) Myrtle CNN models trained with varying weight decay.
[Yin+19] showed that different types of corruptions of natural images (e.g. blur, noise, fog) have different effects when viewed in frequency space, and that models trained with different augmentation strategies (e.g. adversarial training, Gaussian noise augmentation) exhibit different sensitivities to perturbations along Fourier frequency components. [Dif+21] investigates the relationship between frequency sensitivity and natural corruption robustness for models compressed with various weight pruning techniques, and introduces ensembling algorithms where the frequency statistics of a test image are compared to those of various image augmentation methods, and models trained on the augmentations most spectrally similar to the test image are used in inference. [Sun+22] designs an augmentation procedure that explicitly introduces variation in both the amplitude and phase of the DFT of input images, finding it improves certified robustness and robustness to common corruptions. [Ber+21] investigated the extent to which constraining models to use only the lowest (or highest) Fourier frequency components of input data provided perturbation robustness, also finding significant variability across datasets. [AHW21] tested the extent to which CNNs relied on various frequency bands by measuring model error on inputs where certain frequencies were removed, again finding a striking amount of variability across datasets. [Mai+22] analyzed the sensitivity of networks to perturbations in various frequencies, finding significant variation across a variety of datasets and model architectures. All of these works suggest that model frequency sensitivity depends heavily on the underlying training data; our work began as an attempt to explain this phenomenon mathematically.
Implicit bias and representation cost of CNNs: Our analysis of (linear) convolutional networks leverages prior work on implicit bias and representational cost of CNNs, especially [Gun+18]. There it was found that for a linear one-dimensional convolutional network where inputs and all hidden layers have one channel (in the notation of section 3, $C = C_1 = \dots = C_{L-1} = K = 1$) trained on a binary linear classification task with exponential loss, with linear effective predictor $\beta$, the Fourier transformed predictor $\hat{\beta}$ converges in direction to a first-order stationary point of an optimization problem of the form

$\min_{\hat{\beta}}\ \tfrac{1}{2}\,|\hat{\beta}|_{2/L}$ such that $y_n \hat{\beta}^T x_n \geq 1$ for all $n$.    (2.1)

A generalization to arbitrary group-equivariant CNNs (of which the usual CNNs are a special case) appears in [Law+22, Thm. 1]; while we suspect that some of our results generalize to more general equivariant networks, we leave that to future work. For generalizations in different directions see [LL20; YKM21], and for additional follow-up work see [JRG22]. Our general setup in section 3 closely follows these authors', and our theorem 4.10 partially confirms a suspicion of [Gun+18, §6] that “with multiple outputs, as more layers are added, even fully connected networks exhibit a shrinking sparsity penalty on the singular values of the effective linear matrix predictor ...”
While the aforementioned works study the implicit regularization imposed by gradient descent, we instead consider explicit regularization imposed by auxiliary $\ell^2$-norm penalties in objective functions, and prove equivalences of minimization problems. In this sense our analysis is more closely related to that of [DKS21], which considers parametrized families of functions $f(x, w)$ and defines the representation cost of a function $g(x)$ appearing in the parametric family as

$R(g) := \min_{w} \{\, |w|_2^2 \mid f(x, w) = g(x) \text{ for all } x \,\}.$    (2.2)

While this approach lacks the intimate connection with the gradient descent algorithms used to train modern neural networks, it comes with some benefits: for example, results regarding representation cost are agnostic to the choice of per-sample loss function (in particular they apply to both squared error and cross entropy loss). In the case where the number of channels $C = C_1 = \dots = C_{L-1} = 1$ (but the number of output dimensions may be $> 1$), theorem 4.10 can be deduced from [DKS21, Thm. 3].
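As a small numerical illustration of this definition in the simplest setting (our own sketch, not an example from [DKS21]): for a depth-2 fully connected linear model $f(x, (A, B)) = ABx$, the representation cost of a linear map $W$ is $2\|W\|_*$, twice the nuclear norm, attained by a "balanced" factorization built from the SVD. The code below checks that the balanced factorization achieves this value and that an unbalanced factorization of the same $W$ costs more.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 5))

U, S, Vt = np.linalg.svd(W)
nuclear = S.sum()

# Balanced factorization W = A @ B with A = U diag(sqrt(S)), B = diag(sqrt(S)) Vt.
A = U * np.sqrt(S)
B = np.sqrt(S)[:, None] * Vt
balanced_cost = np.sum(A ** 2) + np.sum(B ** 2)   # ||A||_F^2 + ||B||_F^2

# An unbalanced factorization with the same product: (A M)(M^{-1} B) = W.
M = np.diag(rng.uniform(2.0, 5.0, size=5))
A2, B2 = A @ M, np.linalg.inv(M) @ B
unbalanced_cost = np.sum(A2 ** 2) + np.sum(B2 ** 2)

print(np.allclose(A @ B, W), np.allclose(A2 @ B2, W))   # True True
print(np.isclose(balanced_cost, 2 * nuclear))           # True: cost equals 2 * nuclear norm
print(unbalanced_cost > balanced_cost)                   # True
```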
Data-dependent bias: While in this paper we focus on spatial frequency properties of image data, there is a large and growing body of work on the impact of frequency properties of training data more broadly interpreted. [Rah+19] gave a formula for the continuous Fourier transform of a ReLU network $f : \mathbb{R}^n \to \mathbb{R}$, and showed in a range of experiments that ReLU networks learn low frequency modes of input data first. [Xia22] proves theoretical results on low frequency components being learned first for networks $f : \prod_i S^{n_i} \to \mathbb{R}$ on products of spheres, where the role of frequency is played by spherical harmonic indices (see also [XP22] for some related results).

Perhaps the work most closely related to ours is that of [HW22] on principal component (PC) bias, where it is shown that rates of convergence in deep (and wide) linear networks are governed by the spectrum of the input data covariance matrix (they also prove a result for shallow ReLU networks). [HW22] also includes experiments connecting PC bias with spectral bias (learning low frequency modes first, as in the preceding paragraph) and a phenomenon known as learning order consistency. However, it is worth noting that in their work there is no explicit theoretical analysis of CNNs and no consideration of the statistics of natural images in Fourier frequency space.
Other applications of Fourier transformed CNNs: [MHL14; Pra+17; Vas+15; Zhu+21] all, in one way or another, leverage frequency space representations of convolutions to accelerate computations, e.g. neural network training. Since this is not our main focus, we omit a more detailed synopsis.
3 The discrete Fourier transform of a CNN
In this section we fix the notation and structures we will be working with. We define a class of idealized, linear convolutional networks and derive a useful representation of these networks in frequency space via the discrete Fourier transform.
Consider a linear, feedforward, multichannel 2D-CNN $f(x)$ of the form

$\mathbb{R}^{C \times H \times W} \xrightarrow{\,w^1 \ast -\,} \mathbb{R}^{C_1 \times H \times W} \xrightarrow{\,w^2 \ast -\,} \mathbb{R}^{C_2 \times H \times W} \xrightarrow{\,w^3 \ast -\,} \cdots \xrightarrow{\,w^{L-1} \ast -\,} \mathbb{R}^{C_{L-1} \times H \times W} \xrightarrow{\,w^{L,T} \cdot -\,} \mathbb{R}^{K}$    (3.1)

where $w^l \ast x$ denotes the convolution operation between tensors $w^l \in \mathbb{R}^{C_l \times H \times W \times C_{l-1}}$ and $x \in \mathbb{R}^{C_{l-1} \times H \times W}$, defined by

$(w^l \ast x)_{cij} = \sum_{m + m' = i,\; n + n' = j} \Big( \sum_{d} w^l_{cmnd}\, x_{dm'n'} \Big)$    (3.2)

and $w^{L,T} \cdot x$ denotes a contraction (a.k.a. Einstein summation) of the tensor $w^L \in \mathbb{R}^{K \times H \times W \times C_{L-1}}$ with the tensor $x \in \mathbb{R}^{C_{L-1} \times H \times W}$ over the last 3 indices (the $(-)^T$ denotes a transpose operation described momentarily). Explicitly,

$(w^{L,T} \cdot x)_k = \sum_{l,m,n} w^L_{kmnl}\, x_{lmn}.$    (3.3)

Thus, the model eq. (3.1) has weights $w^l \in \mathbb{R}^{C_l \times H \times W \times C_{l-1}}$ for $l = 1, \dots, L-1$ and $w^L \in \mathbb{R}^{K \times H \times W \times C_{L-1}}$.
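As a sanity check on this setup (our own sketch; it assumes the index sums in eq. (3.2) are taken modulo $H$ and $W$, i.e. circular convolution, which is the convention under which the DFT analysis is exact), the following code implements a multichannel convolution layer with FFTs and verifies that at each frequency index $(i, j)$ the layer acts as an ordinary $C_l \times C_{l-1}$ matrix multiplication on Fourier coefficients, the per-frequency block structure described in section 1 and developed in section 4.

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, H, W = 3, 4, 8, 8
x = rng.standard_normal((C_in, H, W))
w = rng.standard_normal((C_out, H, W, C_in))   # weight tensor shaped as in eq. (3.2)

def conv_layer(w, x):
    """Multichannel circular convolution (w * x)_{cij}, computed via the FFT."""
    x_hat = np.fft.fft2(x, axes=(1, 2))        # shape (C_in, H, W)
    w_hat = np.fft.fft2(w, axes=(1, 2))        # shape (C_out, H, W, C_in)
    # At each frequency (i, j), apply the C_out x C_in matrix w_hat[:, i, j, :]
    # to the C_in vector x_hat[:, i, j] -- one block per frequency index.
    y_hat = np.einsum('oijd,dij->oij', w_hat, x_hat)
    return np.real(np.fft.ifft2(y_hat, axes=(1, 2)))

def conv_naive(w, x):
    """Direct spatial-domain evaluation of eq. (3.2), indices taken mod H, W."""
    out = np.zeros((C_out, H, W))
    for c in range(C_out):
        for i in range(H):
            for j in range(W):
                for m in range(H):
                    for n in range(W):
                        out[c, i, j] += np.dot(w[c, m, n, :],
                                               x[:, (i - m) % H, (j - n) % W])
    return out

print(np.allclose(conv_layer(w, x), conv_naive(w, x)))   # True
```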
Remarks 3.4. For tensors with at least 3 indices (such as $x$ and the weights $w^l$ above) we will always use the transpose notation $(-)^T$ to denote reversing the second and third tensor indices, which will always be the 2D spatial indices. For matrices and vectors it will be used according to standard practice. In eq. (3.3) the transpose ensures that the indices in Einstein sums move from “inside to out” as is standard practice.

Equivalently, $w^{L,T} \cdot x$ can be described as a usual matrix product $\tilde{w}^L \mathrm{vec}(x)$, where $\mathrm{vec}(x)$ is the vectorization (flattening) of $x$ and $\tilde{w}^L$ is obtained by flattening the last 3 tensor indices of $w^L$ (compatibly with those of $x$, as dictated by eq. (3.3)). Hence it represents a typical “flatten and then apply a linear layer” architecture component. Our reason for adopting the tensor contraction perspective is that it is more amenable to the Fourier analysis described below.
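As a quick check of this equivalence (our own sketch, with arbitrary illustrative tensor shapes), the contraction in eq. (3.3) agrees with flattening $x$ and applying a reshaped weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, H, W = 10, 4, 8, 8
w_L = rng.standard_normal((K, H, W, C))
x = rng.standard_normal((C, H, W))

# Contraction of eq. (3.3): (w^{L,T} . x)_k = sum_{l,m,n} w^L_{kmnl} x_{lmn}.
contracted = np.einsum('kmnl,lmn->k', w_L, x)

# Equivalent "flatten and then apply a linear layer": reorder w^L's last three axes
# to match x's (channel, height, width) layout, then flatten both.
w_tilde = np.transpose(w_L, (0, 3, 1, 2)).reshape(K, -1)   # shape (K, C*H*W)
flattened = w_tilde @ x.reshape(-1)

print(np.allclose(contracted, flattened))   # True
```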