Preprint
RANDOM WEIGHT FACTORIZATION IMPROVES THE TRAINING OF CONTINUOUS NEURAL REPRESENTATIONS
Sifan Wang, Hanwen Wang, Jacob H. Seidman, Paris Perdikaris
University of Pennsylvania, Philadelphia, PA 19104
{sifanw, wangh19, seidj}@sas.upenn.edu,
pgp@seas.upenn.edu
ABSTRACT
Continuous neural representations have recently emerged as a powerful and flexible alternative to classical discretized representations of signals. However, training them to capture fine details in multi-scale signals is difficult and computationally expensive. Here we propose random weight factorization as a simple drop-in replacement for parameterizing and initializing conventional linear layers in coordinate-based multi-layer perceptrons (MLPs) that significantly accelerates and improves their training. We show how this factorization alters the underlying loss landscape and effectively enables each neuron in the network to learn using its own self-adaptive learning rate. This not only helps with mitigating spectral bias, but also allows networks to quickly recover from poor initializations and reach better local minima. We demonstrate how random weight factorization can be leveraged to improve the training of neural representations on a variety of tasks, including image regression, shape representation, computed tomography, inverse rendering, solving partial differential equations, and learning operators between function spaces.
1 INTRODUCTION
Some of the recent advances in machine learning can be attributed to new developments in the design of continuous neural representations, which employ coordinate-based multi-layer perceptrons (MLPs) to parameterize discrete signals (e.g. images, videos, point clouds) across space and time. Such parameterizations are appealing because they are differentiable and much more memory efficient than grid-sampled representations, naturally allowing smooth interpolations to unseen input coordinates. As such, they have achieved widespread success in a variety of computer vision and graphics tasks, including image representation (Stanley, 2007; Nguyen et al., 2015), shape representation (Chen & Zhang, 2019; Park et al., 2019; Genova et al., 2019; 2020), view synthesis (Sitzmann et al., 2019; Saito et al., 2019; Mildenhall et al., 2020; Niemeyer et al., 2020), texture generation (Oechsle et al., 2019; Henzler et al., 2020), etc. Coordinate-based MLPs have also been applied to scientific computing applications such as physics-informed neural networks (PINNs) for solving forward and inverse partial differential equations (PDEs) (Raissi et al., 2019; 2020; Karniadakis et al., 2021), and Deep Operator Networks (DeepONets) for learning operators between infinite-dimensional function spaces (Lu et al., 2021; Wang et al., 2021e).
Despite their flexibility, it has been shown both empirically and theoretically that coordinate-based MLPs suffer from “spectral bias” (Rahaman et al., 2019; Cao et al., 2019; Xu et al., 2019). This manifests as a difficulty in learning the high-frequency components and fine details of a target function. A popular method to resolve this issue is to embed input coordinates into a higher dimensional space, for example by using Fourier features before the MLP (Mildenhall et al., 2020; Tancik et al., 2020). Another widely used approach is the use of SIREN networks (Sitzmann et al., 2020), which employ MLPs with periodic activations to represent complex natural signals and their derivatives. One main limitation of these methods is that a number of associated hyper-parameters (e.g. scale factors) need to be carefully tuned in order to avoid catastrophic generalization/interpolation errors.
Unfortunately, the selection of appropriate hyper-parameters typically requires some prior knowledge about the target signals, which may not be available in some applications.

More general approaches to improve the training and performance of MLPs involve different types of normalizations, such as Batch Normalization (Ioffe & Szegedy, 2015), Layer Normalization (Ba et al., 2016) and Weight Normalization (Salimans & Kingma, 2016). However, despite their remarkable success in deep learning benchmarks, these techniques are not widely used in MLP-based neural representations. Here we draw motivation from the work of Salimans & Kingma (2016) and Wang et al. (2021a), and investigate a simple yet remarkably effective re-parameterization of weight vectors in MLP networks, coined random weight factorization, which provides a generalization of Weight Normalization and demonstrates significant performance gains. Our main contributions are summarized as follows:
• We show that random weight factorization alters the loss landscape of a neural representation in a way that can drastically reduce the distance between different parameter configurations, and effectively assigns a self-adaptive learning rate to each neuron in the network.

• We empirically illustrate that random weight factorization can effectively mitigate spectral bias, as well as enable coordinate-based MLP networks to escape from poor initializations and find better local minima.

• We demonstrate that random weight factorization can be used as a simple drop-in enhancement to conventional linear layers, yielding consistent and robust improvements across a wide range of tasks in computer vision, graphics and scientific computing.
2 WEIGHT FACTORIZATION
Let $x \in \mathbb{R}^d$ be the input, $g^{(0)}(x) = x$ and $d_0 = d$. We consider a standard multi-layer perceptron (MLP) $f_\theta(x)$ recursively defined by
$$f^{(l)}_\theta(x) = W^{(l)} \cdot g^{(l-1)}(x) + b^{(l)}, \quad g^{(l)}(x) = \sigma\left(f^{(l)}_\theta(x)\right), \quad l = 1, 2, \dots, L, \qquad (2.1)$$
with a final layer
$$f_\theta(x) = W^{(L+1)} \cdot g^{(L)}(x) + b^{(L+1)}, \qquad (2.2)$$
where $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix of the $l$-th layer and $\sigma$ is an element-wise activation function. Here, $\theta = \{W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)}\}$ represents all trainable parameters in the network.
MLPs are commonly trained by minimizing an appropriate loss function $\mathcal{L}(\theta)$ via gradient descent. To improve convergence, we propose to factorize the weight parameters associated with each neuron in the network as follows:
$$w^{(k,l)} = s^{(k,l)} \cdot v^{(k,l)}, \quad k = 1, 2, \dots, d_l, \quad l = 1, 2, \dots, L+1, \qquad (2.3)$$
where $w^{(k,l)} \in \mathbb{R}^{d_{l-1}}$ is the weight vector representing the $k$-th row of the weight matrix $W^{(l)}$, $s^{(k,l)} \in \mathbb{R}$ is a trainable scale factor assigned to each individual neuron, and $v^{(k,l)} \in \mathbb{R}^{d_{l-1}}$. Consequently, the proposed weight factorization can be written as
$$W^{(l)} = \mathrm{diag}(s^{(l)}) \cdot V^{(l)}, \quad l = 1, 2, \dots, L+1, \qquad (2.4)$$
with $s^{(l)} \in \mathbb{R}^{d_l}$.
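To make this parameterization concrete, the following is a minimal sketch of a single factorized layer in plain JAX (the parameter names and functional layout are our own; the paper's reference implementation, referenced later in Appendix C, uses Flax). The trainable parameters are the per-neuron scales $s^{(l)}$ and the factorized matrix $V^{(l)}$; the full weight matrix $W^{(l)} = \mathrm{diag}(s^{(l)}) \cdot V^{(l)}$ is never stored explicitly.

import jax.numpy as jnp

def factorized_linear(params, g):
    # One factorized layer, equation 2.4: f = diag(s) V g + b.
    # params: dict with "s" of shape (d_l,), "V" of shape (d_l, d_lm1), "b" of shape (d_l,)
    # g: output of the previous layer, shape (d_lm1,)
    s, V, b = params["s"], params["V"], params["b"]
    # diag(s) @ V @ g is simply a per-neuron (row-wise) rescaling of V @ g
    return s * (V @ g) + b

Stacking such layers and applying gradient descent to (s, V, b) directly, rather than to W, is all that the method requires.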
2.1 A GEOMETRIC PERSPECTIVE
In this section, we provide a geometric motivation for the proposed weight factorization. To this end, we consider the simplest setting of a one-parameter loss function $\ell(w)$. For this case, the weight factorization is reduced to $w = s \cdot v$ with two scalars $s, v$. Note that for a given $w \neq 0$ there are infinitely many pairs $(s, v)$ such that $w = s \cdot v$. The set of such pairs forms a family of hyperbolas in the $sv$-plane (one for each choice of signs for both $s$ and $v$). As such, the loss function in the $sv$-plane is constant along these hyperbolas.
Figure 1: Weight factorization transforms loss landscapes and shortens the distance to minima.

Figure 1 gives a visual illustration of the difference between the original loss landscape as a function of $w$ versus the loss landscape in the factorized $sv$-plane. In the left panel, we plot the original loss function as well as an initial parameter point, the local minimum, and the global minimum. The right panel shows how, in the factorized parameter space, each of these three points corresponds to two hyperbolas in the $sv$-plane. Note how the distance between the initialization and the global minimum is reduced from the top to the bottom panel upon an appropriate choice of factorization. The key observation is that the distance between factorizations representing the initial parameter and the global minimum becomes arbitrarily small in the $sv$-plane for larger values of $s$. Indeed, we can prove that this holds for any general loss function in arbitrary parameter dimensions (the proof is provided in Appendix A.1).
Theorem 1. Suppose that $\mathcal{L}(\theta)$ is the associated loss function of a neural network defined in equation 2.1 and equation 2.2. For a given $\theta$, we define $U_\theta$ as the set containing all possible weight factorizations
$$U_\theta = \left\{ \left(s^{(l)}, V^{(l)}\right)_{l=1}^{L+1} \;:\; \mathrm{diag}(s^{(l)}) \cdot V^{(l)} = W^{(l)},\; l = 1, \dots, L+1 \right\}. \qquad (2.5)$$
Then for any $\theta, \theta'$, we have
$$\mathrm{dist}(U_\theta, U_{\theta'}) = 0. \qquad (2.6)$$
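As a minimal illustration of why this holds (a one-parameter sketch; the general proof is in Appendix A.1), take scalars $w \neq w'$ with factorization sets $U_w$ and $U_{w'}$. For any $s > 0$, the pairs $(s, w/s) \in U_w$ and $(s, w'/s) \in U_{w'}$ satisfy
$$\left\| \left(s, \frac{w}{s}\right) - \left(s, \frac{w'}{s}\right) \right\| = \frac{|w - w'|}{s} \to 0 \quad \text{as } s \to \infty,$$
so the infimum distance between the two families of hyperbolas is zero.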
2.2 SELF-ADAPTIVE LEARNING RATE FOR EACH NEURON
A different way to examine the effect of the proposed weight factorization is by studying its associated gradient updates. Recall that a standard gradient descent update with a learning rate $\eta$ takes the form
$$w^{(k,l)}_{n+1} = w^{(k,l)}_n - \eta \frac{\partial \mathcal{L}}{\partial w^{(k,l)}_n}. \qquad (2.7)$$
The following theorem derives the corresponding gradient descent update expressed in the original
parameter space for models using the proposed weight factorization.
Theorem 2. Under the weight factorization of equation 2.3, the gradient descent update is given by
$$w^{(k,l)}_{n+1} = w^{(k,l)}_n - \eta \left( \left[s^{(k,l)}_n\right]^2 + \left\| v^{(k,l)}_n \right\|_2^2 \right) \frac{\partial \mathcal{L}}{\partial w^{(k,l)}_n} + O(\eta^2), \qquad (2.8)$$
for $l = 1, 2, \dots, L+1$ and $k = 1, 2, \dots, d_l$.
The proof is provided in Appendix A.2. By comparing equation 2.7 and equation 2.8, we observe that the weight factorization $w = s \cdot v$ re-scales the learning rate of $w$ by a factor of $(s^2 + \|v\|_2^2)$. Since $s, v$ are trainable parameters, this analysis suggests that the weight factorization effectively assigns a self-adaptive learning rate to each neuron in the network. In the following sections, we will demonstrate that the proposed weight factorization (under an appropriate initialization of the scale factors) not only helps with mitigating spectral bias (Rahaman et al., 2019; Bietti & Mairal, 2019; Tancik et al., 2020; Wang et al., 2021c), but also allows networks to quickly move away from a poor initialization and reach better local minima faster.
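To see where this factor comes from in the simplest setting (a one-parameter sketch with scalars $w = s \cdot v$, as in Section 2.1; the general case is handled in Appendix A.2), the chain rule gives
$$\frac{\partial \mathcal{L}}{\partial s} = v\, \mathcal{L}'(w), \qquad \frac{\partial \mathcal{L}}{\partial v} = s\, \mathcal{L}'(w),$$
so a single gradient descent step on $(s, v)$, recombined into $w = s \cdot v$, reads
$$w_{n+1} = \left(s_n - \eta\, v_n \mathcal{L}'(w_n)\right)\left(v_n - \eta\, s_n \mathcal{L}'(w_n)\right) = w_n - \eta \left(s_n^2 + v_n^2\right) \mathcal{L}'(w_n) + O(\eta^2).$$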
2.3 RELATION TO EXISTING WORKS
The proposed weight factorization is largely motivated by weight normalization (Salimans & Kingma, 2016), which decouples the norm and the direction of the weights associated with each neuron as
$$w = g \frac{v}{\|v\|}, \qquad (2.9)$$
where $g = \|w\|$, and gradient descent updates are applied directly to the new parameters $v, g$. Indeed, this can be viewed as a special case of the proposed weight factorization by setting $s = \|w\|$ in equation 2.3. In contrast to weight normalization, our weight factorization scheme allows for more flexibility in the choice of the scale factors $s$.
We note that SIREN networks (Sitzmann et al., 2020) also employ a special weight factorization for each hidden-layer weight matrix,
$$W = \omega_0 \hat{W}, \qquad (2.10)$$
where the scale factor $\omega_0 \in \mathbb{R}$ is a user-defined hyper-parameter. Although the authors attribute the success of SIREN to the periodic activation functions in conjunction with a tailored initialization scheme, here we will demonstrate that the specific choice of $\omega_0$ is the most crucial element in SIREN's performance; see Appendix G and H for more details.
It is worth pointing out that the proposed weight factorization also bears some resemblance to the adaptive activation functions introduced in (Jagtap et al., 2020), which modify the activation of each neuron by introducing an additional trainable parameter $a$ as
$$g^{(l)}(x) = \sigma\left(a f^{(l)}(x)\right). \qquad (2.11)$$
These adaptive activations aim to help networks learn sharp gradients and transitions of the target functions. In practice, the scale factor is generally initialized as $a = 1$, yielding a trivial weight factorization. As illustrated in the next section, this is fundamentally different from our approach, as we initialize the scale factors $s$ from a random distribution and re-parameterize the weight matrix accordingly. In Section 4 we demonstrate that, by initializing $s$ using an appropriate distribution, we can consistently outperform weight normalization, SIREN, and adaptive activations across a broad range of supervised and self-supervised learning tasks.
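To summarize these distinctions, the short sketch below (plain JAX; biases omitted and helper names our own) spells out how each scheme forms the effective weights or pre-activations of one layer. Only RWF trains a randomly initialized per-neuron scale.

import jax.numpy as jnp

# W_hat, V: (d_out, d_in); g, s: (d_out,); omega0, a: scalars; x: (d_in,)

def weight_norm(g, V, x):                        # Salimans & Kingma (2016), equation 2.9
    W = g[:, None] * V / jnp.linalg.norm(V, axis=1, keepdims=True)
    return W @ x

def siren_scale(omega0, W_hat, x):               # Sitzmann et al. (2020), equation 2.10; omega0 is a fixed hyper-parameter
    return (omega0 * W_hat) @ x

def adaptive_activation(a, W, x, act=jnp.tanh):  # Jagtap et al. (2020), equation 2.11; a is initialized to 1
    return act(a * (W @ x))

def rwf(s, V, x):                                # this work, equation 2.4; s is trainable and randomly initialized
    return (s[:, None] * V) @ x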
3 RANDOM WEIGHT FACTORIZATION IN PRACTICE
Here we illustrate the use of weight factorization through the lens of a simple regression task.
Specifically, we consider a smooth scalar-valued function $f$ sampled from a Gaussian random field using a squared exponential kernel with a length scale of $l = 0.02$. This generates a data-set of $N = 256$ observation pairs $\{x_i, f(x_i)\}_{i=1}^{N}$, where the $\{x_i\}_{i=1}^{N}$ lie on a uniform grid in $[0, 1]$. The goal is to train a network $f_\theta$ to learn $f$ by minimizing the mean square error loss $\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} |f_\theta(x_i) - f(x_i)|^2$.
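One way to reproduce this setup in JAX is sketched below (the Cholesky-based sampler, the jitter, and the random seed are our own choices; apply_fn stands for any coordinate-based network with signature apply_fn(params, x)).

import jax
import jax.numpy as jnp

N, length_scale = 256, 0.02
x = jnp.linspace(0.0, 1.0, N)                       # uniform grid on [0, 1]

# Squared exponential covariance matrix with a small jitter for numerical stability
K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / length_scale**2)
L_chol = jnp.linalg.cholesky(K + 1e-6 * jnp.eye(N))

# One draw from the Gaussian random field: f ~ N(0, K)
f = L_chol @ jax.random.normal(jax.random.PRNGKey(0), (N,))

def mse_loss(params, apply_fn):
    # Mean square error over the N observation pairs
    preds = jax.vmap(lambda xi: apply_fn(params, xi))(x)
    return jnp.mean((preds - f) ** 2)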
The proposed random weight factorization is applied as follows. We first initialize the parameters of an MLP network via the Glorot scheme (Glorot & Bengio, 2010). Then, for every weight matrix $W$, we proceed by initializing a scale vector $\exp(s)$, where $s$ is sampled from a multivariate normal distribution $\mathcal{N}(\mu, \sigma I)$. Finally, every weight matrix is factorized by the associated scale factor as $W = \mathrm{diag}(\exp(s)) \cdot V$ at initialization. We train this network by gradient descent on the new parameters $s, V$ directly. This procedure is summarized in Appendix B, along with a simple JAX Flax implementation (Heek et al., 2020) in Appendix C.
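A condensed sketch of this initialization in plain JAX is given below (function names and the specific Glorot variant are our own choices; the reference Flax implementation is the one described in Appendix C).

import jax
import jax.numpy as jnp

def rwf_init(key, d_in, d_out, mu=1.0, sigma=0.1):
    # Random weight factorization of a Glorot-initialized layer.
    # Returns (s, V, b) such that diag(exp(s)) @ V equals the Glorot kernel at initialization.
    w_key, s_key = jax.random.split(key)
    W = jax.random.normal(w_key, (d_out, d_in)) * jnp.sqrt(2.0 / (d_in + d_out))  # Glorot normal
    s = mu + sigma * jax.random.normal(s_key, (d_out,))   # scale exponents s ~ N(mu, sigma^2 I)
    V = W / jnp.exp(s)[:, None]                           # so that diag(exp(s)) @ V == W initially
    b = jnp.zeros(d_out)
    return s, V, b

def rwf_apply(layer, x):
    # Forward pass; gradient descent acts on (s, V, b) directly
    s, V, b = layer
    return (jnp.exp(s)[:, None] * V) @ x + b

Dividing the Glorot kernel by exp(s) guarantees that the factorized network coincides with the unfactorized one at initialization, so any difference in performance stems from the altered optimization dynamics rather than a different starting point.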
In Figure 2, we train networks (3 layers, 128 neurons per layer, ReLU activations) to learn the
target function using: (a) a conventional MLP, (b) an MLP with adaptive activations (AA) (Jagtap
et al., 2020), (c) an MLP with weight normalization (WN) (Salimans & Kingma, 2016), and (d) an
MLP with the proposed random weight factorization scheme (RWF). Evidently, RWF yields the best
predictive accuracy and loss convergence. Moreover, we plot the relative change of the weights in
the original (unfactorized) parameter space during training in the bottom middle panel. We observe
that RWF leads to the largest weight change during training, thereby enabling the network to find
better local minima further away from its initialization. To further emphasize the benefit of weight
factorization, we compute the eigenvalues of the resulting empirical Neural Tangent Kernel (NTK)
(Jacot et al., 2018)
$$K_\theta = \left[ \left\langle \frac{\partial f_\theta}{\partial \theta}(x_i), \frac{\partial f_\theta}{\partial \theta}(x_j) \right\rangle \right]_{ij}, \qquad (3.1)$$
at the last step of training and visualize them in the bottom right panel. Notice how RWF exhibits a flatter NTK spectrum and slower eigenvalue decay than the other methods, indicating better-conditioned training dynamics and less severe spectral bias; see (Rahaman et al., 2019; Bietti & Mairal, 2019; Tancik et al., 2020; Wang et al., 2021c) for more details.

Figure 2: 1D regression. Top: model predictions using different parameterizations. Plain: standard MLP; AA: adaptive activations; WN: weight normalization; RWF: random weight factorization. Bottom left: mean square error (MSE) during training. Bottom middle: relative change of the weights during training, measured in the original parameter space. Bottom right: eigenvalues (in descending order) of the empirical NTK at the end of training.

To explore the robustness of the proposed RWF, we conduct a systematic study on the effect of $\mu$ and $\sigma$ in the initialization of the scale factor $s$. The results suggest that the choice of $\mu, \sigma$ plays an important role: too small values of $\mu, \sigma$ may lead to performance that is similar to a conventional MLP, while setting $\mu, \sigma$ too large can result in an unstable training process. We empirically find that $\mu = 1, \sigma = 0.1$ consistently improves the loss convergence and model accuracy for the vast majority of tasks considered in this work. Additional details are presented in Appendix D.
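As a practical note, the empirical NTK spectrum in equation 3.1 can be computed directly with automatic differentiation. A minimal JAX sketch (assuming a scalar-output network apply_fn(params, x); the naming is our own) is:

import jax
import jax.numpy as jnp

def empirical_ntk_eigenvalues(apply_fn, params, xs):
    # Eigenvalues of K_ij = <df/dtheta(x_i), df/dtheta(x_j)>, equation 3.1
    def jac_flat(x):
        # Gradient of the scalar output with respect to all parameters, flattened to a vector
        grads = jax.grad(apply_fn)(params, x)
        return jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(grads)])

    J = jax.vmap(jac_flat)(xs)           # (N, num_params)
    K = J @ J.T                          # empirical NTK Gram matrix
    return jnp.linalg.eigvalsh(K)[::-1]  # eigenvalues in descending order

Materializing this dense Jacobian is only practical for the small networks and batch sizes used in this 1D example.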
4 EXPERIMENTS
In this section, we demonstrate the effectiveness and robustness of random weight factorization
for training continuous neural representations across a range of tasks in computer vision, graphics,
and scientific computing. More precisely, we compare the performance of plain MLPs, MLPs with
adaptive activations (AA) (Jagtap et al., 2020), weight normalization (WN) (Salimans & Kingma,
2016), and the proposed random weight factorization (RWF). The comparison is performed over
a collection of MLP architectures, including conventional MLPs, SIREN (Sitzmann et al., 2020),
modified MLPs (Wang et al., 2021b), as well as MLPs with positional encodings (Mildenhall et al.,
2020) and Gaussian Fourier features (Tancik et al., 2020). The hyper-parameters of our experiments
along with the computational cost associated with each experiment are presented in Appendix E
and Appendix F, respectively. Notice that the computational overhead of our method is marginal, and RWF can therefore be considered a drop-in enhancement to any architecture that uses linear layers. Table 1 summarizes the results obtained for each benchmark, corresponding to the optimal input mapping and network architecture. Overall, RWF consistently achieves the best performance across tasks and architectures. All code and data will be made publicly available. A summary of each benchmark study is presented below, with more details provided in the Appendix.
4.1 2D IMAGE REGRESSION
We train coordinate-based MLPs to learn a map from 2D input pixel coordinates to the corresponding
RGB values of an image, using the benchmarks put forth in (Tancik et al., 2020). We conduct
experiments using two data-sets: Natural and Text, each containing 16 images. The Natural data-