Testing predictions of representation cost theory with CNNs
Charles Godfrey, Elise Bishoff, Myles Mckay, Davis Brown,
Grayson Jorgenson, Henry Kvinge & Eleanor Byler
Pacific Northwest National Lab
{first}.{last}@pnnl.gov
Abstract
It is widely acknowledged that trained convolutional neural networks (CNNs) have different levels of sensitivity to signals of different frequency. In particular, a number of empirical studies have documented CNNs' sensitivity to low-frequency signals. In this work we show with theory and experiments that this observed sensitivity is a consequence of the frequency distribution of natural images, which is known to have most of its power concentrated in low-to-mid frequencies. Our theoretical analysis relies on representations of the layers of a CNN in frequency space, an idea that has previously been used to accelerate computations and study implicit bias of network training algorithms, but to the best of our knowledge has not been applied in the domain of model robustness.
1 Introduction
Since their rise to prominence in the early 1990s, convolutional neural networks (CNNs) have formed the backbone of image and video recognition, object detection, and speech-to-text systems ([CPS16; Fuk80; Kar+14; KSH12; LeC+89]). The success of CNNs has largely been attributed to their "hard priors" of spatial translation invariance and local receptive fields [GBC16, §9.3]. On the other hand, more recent research has revealed a number of less desirable and potentially data-dependent biases of CNNs, such as a tendency to make predictions on the basis of texture features ([Gei+19]). Moreover, it has been repeatedly observed that CNNs are sensitive to perturbations in targeted ranges of the Fourier frequency spectrum ([GFW19; SDB19]), and further investigation has shown that these frequency ranges are dependent on training data ([AHW21; Ber+21; Mai+22; Yin+19]). In this work, we provide a mathematical explanation for these frequency space phenomena, showing with theory and experiments that neural network training causes CNNs to be most sensitive to frequencies that are prevalent in the training data distribution.
Our theoretical results rely on representing an idealized CNN in frequency space, a strategy we
borrow from [Gun+18]. This representation is built on the classical convolution theorem,
$\widehat{w \ast x} = \hat{w}\,\hat{x}$    (1.1)

where $\hat{x}$ and $\hat{w}$ denote the Fourier transforms of $x$ and $w$ respectively, and $\ast$ denotes convolution.
Equation 1.1 demonstrates that a Fourier transform converts convolutions into products. As such, in a "cartoon" representation of a CNN in frequency space, the convolution layers become coordinate-wise multiplications (a more precise description is presented in section 3). This suggests that in the presence of some form of weight decay, the weights $\hat{w}$ for high-power frequencies in the training data distribution will grow during training, while weights corresponding to low-power frequencies in the training data will be suppressed. The resulting uneven magnitude of the weights $\hat{w}$ across frequencies can thus account for the observed uneven perturbation-sensitivity of CNNs in frequency space. We formalize this argument for linear CNNs (without biases) in sections 3 and 4.
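As a concrete illustration of eq. (1.1), here is a minimal numerical check (our own sketch, not from the paper) that the 2D DFT turns convolution into a coordinate-wise product; it assumes circular (periodic) convolution, which is the setting in which the identity is exact.

```python
import numpy as np

# Minimal check of the convolution theorem (eq. 1.1) for circular 2D convolution.
rng = np.random.default_rng(0)
H, W = 8, 8
x = rng.standard_normal((H, W))   # toy "image"
w = rng.standard_normal((H, W))   # toy filter on the full H x W grid

def circular_conv2d(w, x):
    """(w * x)[i, j] = sum_{m, n} w[m, n] * x[(i - m) % H, (j - n) % W]."""
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            for m in range(H):
                for n in range(W):
                    out[i, j] += w[m, n] * x[(i - m) % H, (j - n) % W]
    return out

lhs = np.fft.fft2(circular_conv2d(w, x))   # Fourier transform of the convolution
rhs = np.fft.fft2(w) * np.fft.fft2(x)      # coordinate-wise product of transforms
print(np.allclose(lhs, rhs))               # True
```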
One interesting feature of the framework set up in section 4 is that the discrete Fourier transform (DFT) representation of a linear CNN is precisely a feedforward network with block diagonal weight matrices, where each block corresponds to a frequency index. We show in theorem 4.10 that a learning objective for such a network of depth $L$ with an $\ell^2$-norm penalty on weights is equivalent to an objective for the associated linear model with an $\ell^p$ penalty on the singular values of each of its blocks, i.e. each frequency index; this result is new for CNNs with multiple channels and outputs. In particular, the latter penalty is highly sparsity-encouraging, suggesting that as depth increases these linearly-activated CNNs have an even stronger incentive to prioritize frequencies present in the training data.
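To see concretely why such a penalty pushes toward sparsity across frequencies, the following toy computation (our own illustration, not from the paper) compares the penalty assigned to a concentrated versus an evenly spread singular-value vector of equal $\ell^2$ norm, for several values of $p$; smaller $p$ loosely plays the role of greater depth.

```python
import numpy as np

# Two singular-value vectors with the same "energy" (both have l2 norm 1):
concentrated = np.array([1.0, 0.0, 0.0, 0.0])   # all weight on a single frequency
spread = np.full(4, 0.5)                        # weight spread evenly

def lp_penalty(sigma, p):
    """sum_i sigma_i^p, the kind of penalty appearing on singular values."""
    return np.sum(np.abs(sigma) ** p)

for p in [2.0, 1.0, 0.5, 0.25]:
    print(f"p = {p}: concentrated = {lp_penalty(concentrated, p):.3f}, "
          f"spread = {lp_penalty(spread, p):.3f}")
# For p = 2 the two penalties agree; for p < 2 the concentrated vector is strictly
# cheaper, so the objective increasingly favors putting weight on few frequencies.
```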
It has long been known that the frequency content of natural images is concentrated in low-to-mid frequencies, in the sense that the power in Fourier frequency $f$ is well-described by $1/|f|^{\alpha}$ for a coefficient $\alpha \approx 1$ ([LMH01]). Hence, when specialized to training data distributions of natural images, our results explain findings that CNNs are more susceptible to low frequency perturbations in practice ([GFW19; SDB19]).
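A quick way to see this empirically (our own sketch, not the paper's pipeline; `img` below is a placeholder array and should be replaced by an actual grayscale natural image) is to radially average the 2D power spectrum and fit a slope on log-log axes:

```python
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged power spectrum of a 2D grayscale image array."""
    F = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(F) ** 2
    H, W = img.shape
    yy, xx = np.indices((H, W))
    r = np.hypot(yy - H // 2, xx - W // 2).astype(int)  # integer radial frequency bins
    counts = np.bincount(r.ravel())
    sums = np.bincount(r.ravel(), weights=power.ravel())
    spectrum = sums / np.maximum(counts, 1)             # mean power per radial bin
    return spectrum[1: min(H, W) // 2]                  # drop the DC component at r = 0

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))      # placeholder; use a natural image here
spec = radial_power_spectrum(img)
freqs = np.arange(1, len(spec) + 1)
# Fit power ~ 1 / |f|^alpha, i.e. a line of slope -alpha in log-log coordinates.
alpha = -np.polyfit(np.log(freqs), np.log(spec), 1)[0]
print(f"estimated alpha: {alpha:.2f}")   # near 0 for white noise, larger for natural images
```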
We use our theoretical results to derive specific predictions: CNN frequency sensitivity aligns with the frequency content of training data, and deeper models, as well as models trained with substantial weight decay, exhibit frequency sensitivity more closely reflecting the statistics of the underlying images. We confirm these predictions for nonlinear CNNs trained on the CIFAR10 and ImageNette datasets. Figure 1 shows our experimental results for a variety of CNN models trained on CIFAR10 as well as a variant of CIFAR10 preprocessed with high pass filtering (more experimental details will be provided in section 5).
To the best of our knowledge, ours is the first work to connect the following research threads (see section 2 for further discussion):
• equivalences between linear neural networks and sparse linear models (implicit bias and representation cost),
• classical data-dependent “shrinkage” properties of sparse linear models,
• statistical properties of natural images, and
• sensitivity of CNNs to perturbations in certain frequency ranges.
It seems likely that a similar analysis could provide insight into models trained on data from a domain
other than images; hence our work may serve as a template for combining results on implicit bias and/or
representation cost with domain knowledge of input data statistics to understand the behavior of deep
learning models.
2 Related work
CNN sensitivity to Fourier frequency components: [JB17] computed transfer accuracy of image classifiers trained on data preprocessed with various Fourier filtering schemes (e.g. train on low pass filtered images, test on unfiltered images, or vice versa). They found significant generalization gaps, suggesting that models trained on images with different frequency content learned different patterns. [GFW19] proposed algorithms for generating adversarial perturbations constrained to low frequency Fourier components, finding that they allowed for greater query efficiency and higher transferability between different neural networks. [SDB19] demonstrated empirically that constraining to high or midrange frequencies did not produce similar effects, suggesting convolutional networks trained on natural images exhibit a particular sensitivity to low frequency perturbations.
Figure 1: Radial averages $E\big[\,|\nabla_x f(x)^T \hat{e}_{cij}|\;\big|\;|(i,j)| = r\,\big]$ of frequency sensitivities of CNNs trained on (hpf-)CIFAR10, post-processed by dividing each curve by its integral, smoothing by averaging with 3 neighbors on either side, and taking logarithms. Bottom row: frequency statistics of (hpf-)CIFAR10 for comparison. See section 5 for further details. Panels: (a) ConvActually models of varying depth; (b) ConvActually models of depth 4 trained with varying weight decay; (c) VGG models of varying depth; (d) VGG 11 models trained with varying weight decay; (e) Myrtle CNN models trained with varying weight decay.
[Yin+19] showed that different types of corruptions of natural images (e.g. blur, noise, fog) have different effects when viewed in frequency space, and that models trained with different augmentation strategies (e.g. adversarial training, Gaussian noise augmentation) exhibit different sensitivities to perturbations along Fourier frequency components. [Dif+21] investigates the relationship between frequency sensitivity and natural corruption robustness for models compressed with various weight pruning techniques, and introduces ensembling algorithms where the frequency statistics of a test image are compared to those of various image augmentation methods, and models trained on the augmentations most spectrally similar to the test image are used in inference. [Sun+22] designs an augmentation procedure that explicitly introduces variation in both the amplitude and phase of the DFT of input images, finding it improves certified robustness and robustness to common corruptions. [Ber+21] investigated the extent to which constraining models to use only the lowest (or highest) Fourier frequency components of input data provided perturbation robustness, also finding significant variability across datasets. [AHW21] tested the extent to which CNNs relied on various frequency bands by measuring model error on inputs where certain frequencies were removed, again finding a striking amount of variability across datasets. [Mai+22] analyzed the sensitivity of networks to perturbations in various frequencies, finding significant variation across a variety of datasets and model architectures. All of these works suggest that model frequency sensitivity depends heavily on the underlying training data; our work began as an attempt to explain this phenomenon mathematically.
Implicit bias and representation cost of CNNs: Our analysis of (linear) convolutional networks leverages prior work on implicit bias and representational cost of CNNs, especially [Gun+18]. There it was found that for a linear one-dimensional convolutional network where inputs and all hidden layers have one channel (in the notation of section 3, $C = C_1 = \dots = C_{L-1} = K = 1$) trained on a binary linear classification task with exponential loss, with linear effective predictor $\beta$, the Fourier transformed predictor $\hat{\beta}$ converges in direction to a first-order stationary point of an optimization problem of the form

$\min_{\hat{\beta}}\ \tfrac{1}{2}\,|\hat{\beta}|_{2/L}$ such that $y_n \hat{\beta}^T x_n \geq 1$ for all $n$.    (2.1)

A generalization to arbitrary group-equivariant CNNs (of which the usual CNNs are a special case) appears in [Law+22, Thm. 1]; while we suspect that some of our results generalize to more general equivariant networks, we leave that to future work. For generalizations in different directions see [LL20; YKM21], and for additional follow-up work see [JRG22]. Our general setup in section 3 closely follows these authors', and our theorem 4.10 partially confirms a suspicion of [Gun+18, §6] that “with multiple outputs, as more layers are added, even fully connected networks exhibit a shrinking sparsity penalty on the singular values of the effective linear matrix predictor ...”
While the aforementioned works study the implicit regularization imposed by gradient descent, we instead consider explicit regularization imposed by auxiliary $\ell^2$-norm penalties in objective functions, and prove equivalences of minimization problems. In this sense our analysis is more closely related to that of [DKS21], which considers parametrized families of functions $f(x, w)$ and defines the representation cost of a function $g(x)$ appearing in the parametric family as

$R(g) := \min_{w} \{\, |w|_2^2 \mid f(x, w) = g(x) \text{ for all } x \,\}.$    (2.2)

While this approach lacks the intimate connection with the gradient descent algorithms used to train modern neural networks, it comes with some benefits: for example, results regarding representation cost are agnostic to the choice of per-sample loss function (in particular they apply to both squared error and cross entropy loss). In the case where the number of channels $C = C_1 = \dots = C_{L-1} = 1$ (but the number of output dimensions may be $> 1$), theorem 4.10 can be deduced from [DKS21, Thm. 3].
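As a small numerical illustration of this definition in the simplest setting (our own sketch, not an example from [DKS21]): for a depth-2 fully connected linear model $f(x, (A, B)) = ABx$, the representation cost of a linear map $W$ is $2\|W\|_*$, twice the nuclear norm, attained by a "balanced" factorization built from the SVD. The code below checks that the balanced factorization achieves this value and that an unbalanced factorization of the same $W$ costs more.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 5))

U, S, Vt = np.linalg.svd(W)
nuclear = S.sum()

# Balanced factorization W = A @ B with A = U diag(sqrt(S)), B = diag(sqrt(S)) Vt.
A = U * np.sqrt(S)
B = np.sqrt(S)[:, None] * Vt
balanced_cost = np.sum(A ** 2) + np.sum(B ** 2)   # ||A||_F^2 + ||B||_F^2

# An unbalanced factorization with the same product: (A M)(M^{-1} B) = W.
M = np.diag(rng.uniform(2.0, 5.0, size=5))
A2, B2 = A @ M, np.linalg.inv(M) @ B
unbalanced_cost = np.sum(A2 ** 2) + np.sum(B2 ** 2)

print(np.allclose(A @ B, W), np.allclose(A2 @ B2, W))   # True True
print(np.isclose(balanced_cost, 2 * nuclear))           # True: cost equals 2 * nuclear norm
print(unbalanced_cost > balanced_cost)                   # True
```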
Data-dependent bias: While in this paper we focus on spatial frequency properties of image data, there is a large and growing body of work on the impact of frequency properties of training data more broadly interpreted. [Rah+19] gave a formula for the continuous Fourier transform of a ReLU network $f : \mathbb{R}^n \to \mathbb{R}$, and showed in a range of experiments that ReLU networks learn low frequency modes of input data first. [Xia22] proves theoretical results on low frequency components being learned first for networks $f : \prod_i S^{n_i} \to \mathbb{R}$ on products of spheres, where the role of frequency is played by spherical harmonic indices (see also [XP22] for some related results).

Perhaps the work most closely related to ours is that of [HW22] on principal component (PC) bias, where it is shown that rates of convergence in deep (and wide) linear networks are governed by the spectrum of the input data covariance matrix (they also prove a result for shallow ReLU networks). [HW22] also includes experiments connecting PC bias with spectral bias (learning low frequency modes first, as in the preceding paragraph) and a phenomenon known as learning order consistency. However, it is worth noting that in their work there is no explicit theoretical analysis of CNNs and no consideration of the statistics of natural images in Fourier frequency space.
Other applications of Fourier transformed CNNs: [MHL14; Pra+17; Vas+15; Zhu+21] all, in one way or another, leverage frequency space representations of convolutions to accelerate computations, e.g. neural network training. Since this is not our main focus, we omit a more detailed synopsis.
3 The discrete Fourier transform of a CNN
In this section we fix the notation and structures we will be working with. We define a class of idealized, linear convolutional networks and derive a useful representation of these networks in frequency space via the discrete Fourier transform.
Consider a linear, feedforward, multichannel 2D-CNN $f(x)$ of the form

$\mathbb{R}^{C \times H \times W} \xrightarrow{\,w^1 \ast -\,} \mathbb{R}^{C_1 \times H \times W} \xrightarrow{\,w^2 \ast -\,} \mathbb{R}^{C_2 \times H \times W} \xrightarrow{\,w^3 \ast -\,} \cdots \xrightarrow{\,w^{L-1} \ast -\,} \mathbb{R}^{C_{L-1} \times H \times W} \xrightarrow{\,w^{L,T} \cdot -\,} \mathbb{R}^{K}$    (3.1)

where $w^l \ast x$ denotes the convolution operation between tensors $w^l \in \mathbb{R}^{C_l \times H \times W \times C_{l-1}}$ and $x \in \mathbb{R}^{C_{l-1} \times H \times W}$, defined by

$(w^l \ast x)_{cij} = \sum_{m + m' = i,\; n + n' = j} \Big( \sum_{d} w^l_{cmnd}\, x_{dm'n'} \Big)$    (3.2)

and $w^{L,T} \cdot x$ denotes a contraction (a.k.a. Einstein summation) of the tensor $w^L \in \mathbb{R}^{K \times H \times W \times C_{L-1}}$ with the tensor $x \in \mathbb{R}^{C_{L-1} \times H \times W}$ over the last 3 indices (the $(-)^T$ denotes a transpose operation described momentarily). Explicitly,

$(w^{L,T} \cdot x)_k = \sum_{l,m,n} w^L_{kmnl}\, x_{lmn}.$    (3.3)

Thus, the model eq. (3.1) has weights $w^l \in \mathbb{R}^{C_l \times H \times W \times C_{l-1}}$ for $l = 1, \dots, L-1$ and $w^L \in \mathbb{R}^{K \times H \times W \times C_{L-1}}$.
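As a sanity check on this setup (our own sketch; it assumes the index sums in eq. (3.2) are taken modulo $H$ and $W$, i.e. circular convolution, which is the convention under which the DFT analysis is exact), the following code implements a multichannel convolution layer with FFTs and verifies that at each frequency index $(i, j)$ the layer acts as an ordinary $C_l \times C_{l-1}$ matrix multiplication on Fourier coefficients, the per-frequency block structure described in section 1 and developed in section 4.

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, H, W = 3, 4, 8, 8
x = rng.standard_normal((C_in, H, W))
w = rng.standard_normal((C_out, H, W, C_in))   # weight tensor shaped as in eq. (3.2)

def conv_layer(w, x):
    """Multichannel circular convolution (w * x)_{cij}, computed via the FFT."""
    x_hat = np.fft.fft2(x, axes=(1, 2))        # shape (C_in, H, W)
    w_hat = np.fft.fft2(w, axes=(1, 2))        # shape (C_out, H, W, C_in)
    # At each frequency (i, j), apply the C_out x C_in matrix w_hat[:, i, j, :]
    # to the C_in vector x_hat[:, i, j] -- one block per frequency index.
    y_hat = np.einsum('oijd,dij->oij', w_hat, x_hat)
    return np.real(np.fft.ifft2(y_hat, axes=(1, 2)))

def conv_naive(w, x):
    """Direct spatial-domain evaluation of eq. (3.2), indices taken mod H, W."""
    out = np.zeros((C_out, H, W))
    for c in range(C_out):
        for i in range(H):
            for j in range(W):
                for m in range(H):
                    for n in range(W):
                        out[c, i, j] += np.dot(w[c, m, n, :],
                                               x[:, (i - m) % H, (j - n) % W])
    return out

print(np.allclose(conv_layer(w, x), conv_naive(w, x)))   # True
```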
Remarks 3.4. For tensors with at least 3 indices (such as $x$ and the weights $w^l$ above) we will always use the transpose notation $(-)^T$ to denote reversing the second and third tensor indices, which will always be the 2D spatial indices. For matrices and vectors it will be used according to standard practice. In eq. (3.3) the transpose ensures that the indices in Einstein sums move from “inside to out” as is standard practice.

Equivalently, $w^{L,T} \cdot x$ can be described as a usual matrix product $\tilde{w}^L \mathrm{vec}(x)$, where $\mathrm{vec}(x)$ is the vectorization (flattening) of $x$ and $\tilde{w}^L$ is obtained by flattening the last 3 tensor indices of $w^L$ (compatibly with those of $x$, as dictated by eq. (3.3)). Hence it represents a typical “flatten and then apply a linear layer” architecture component. Our reason for adopting the tensor contraction perspective is that it is more amenable to the Fourier analysis described below.
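As a quick check of this equivalence (our own sketch, with arbitrary illustrative tensor shapes), the contraction in eq. (3.3) agrees with flattening $x$ and applying a reshaped weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, H, W = 10, 4, 8, 8
w_L = rng.standard_normal((K, H, W, C))
x = rng.standard_normal((C, H, W))

# Contraction of eq. (3.3): (w^{L,T} . x)_k = sum_{l,m,n} w^L_{kmnl} x_{lmn}.
contracted = np.einsum('kmnl,lmn->k', w_L, x)

# Equivalent "flatten and then apply a linear layer": reorder w^L's last three axes
# to match x's (channel, height, width) layout, then flatten both.
w_tilde = np.transpose(w_L, (0, 3, 1, 2)).reshape(K, -1)   # shape (K, C*H*W)
flattened = w_tilde @ x.reshape(-1)

print(np.allclose(contracted, flattened))   # True
```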