Preprint
RANDOM WEIGHT FACTORIZATION IMPROVES THE TRAINING OF CONTINUOUS NEURAL REPRESENTATIONS
Sifan Wang, Hanwen Wang, Jacob H. Seidman, Paris Perdikaris
University of Pennsylvania, Philadelphia, PA 19104
{sifanw, wangh19, seidj}@sas.upenn.edu,
pgp@seas.upenn.edu
ABSTRACT
Continuous neural representations have recently emerged as a powerful and flexible alternative to classical discretized representations of signals. However, training them to capture fine details in multi-scale signals is difficult and computationally expensive. Here we propose random weight factorization as a simple drop-in replacement for parameterizing and initializing conventional linear layers in coordinate-based multi-layer perceptrons (MLPs) that significantly accelerates and improves their training. We show how this factorization alters the underlying loss landscape and effectively enables each neuron in the network to learn using its own self-adaptive learning rate. This not only helps with mitigating spectral bias, but also allows networks to quickly recover from poor initializations and reach better local minima. We demonstrate how random weight factorization can be leveraged to improve the training of neural representations on a variety of tasks, including image regression, shape representation, computed tomography, inverse rendering, solving partial differential equations, and learning operators between function spaces.
1 INTRODUCTION
Some of the recent advances in machine learning can be attributed to new developments in the design of continuous neural representations, which employ coordinate-based multi-layer perceptrons (MLPs) to parameterize discrete signals (e.g. images, videos, point clouds) across space and time. Such parameterizations are appealing because they are differentiable and much more memory efficient than grid-sampled representations, naturally allowing smooth interpolations to unseen input coordinates. As such, they have achieved widespread success in a variety of computer vision and graphics tasks, including image representation (Stanley, 2007; Nguyen et al., 2015), shape representation (Chen & Zhang, 2019; Park et al., 2019; Genova et al., 2019; 2020), view synthesis (Sitzmann et al., 2019; Saito et al., 2019; Mildenhall et al., 2020; Niemeyer et al., 2020), texture generation (Oechsle et al., 2019; Henzler et al., 2020), etc. Coordinate-based MLPs have also been applied to scientific computing applications such as physics-informed neural networks (PINNs) for solving forward and inverse partial differential equations (PDEs) (Raissi et al., 2019; 2020; Karniadakis et al., 2021), and Deep Operator Networks (DeepONets) for learning operators between infinite-dimensional function spaces (Lu et al., 2021; Wang et al., 2021e).
Despite their flexibility, it has been shown both empirically and theoretically that coordinate-based MLPs suffer from “spectral bias” (Rahaman et al., 2019; Cao et al., 2019; Xu et al., 2019). This manifests as a difficulty in learning the high-frequency components and fine details of a target function. A popular method to resolve this issue is to embed input coordinates into a higher dimensional space, for example by using Fourier features before the MLP (Mildenhall et al., 2020; Tancik et al., 2020). Another widely used approach is the use of SIREN networks (Sitzmann et al., 2020), which employ MLPs with periodic activations to represent complex natural signals and their derivatives. One main limitation of these methods is that a number of associated hyper-parameters (e.g. scale factors) need to be carefully tuned in order to avoid catastrophic generalization/interpolation errors.
Unfortunately, the selection of appropriate hyper-parameters typically requires some prior knowledge about the target signals, which may not be available in some applications.

More general approaches to improve the training and performance of MLPs involve different types of normalizations, such as Batch Normalization (Ioffe & Szegedy, 2015), Layer Normalization (Ba et al., 2016) and Weight Normalization (Salimans & Kingma, 2016). However, despite their remarkable success in deep learning benchmarks, these techniques are not widely used in MLP-based neural representations. Here we draw motivation from the work of Salimans & Kingma (2016) and Wang et al. (2021a), and investigate a simple yet remarkably effective re-parameterization of weight vectors in MLP networks, coined random weight factorization, which provides a generalization of Weight Normalization and demonstrates significant performance gains. Our main contributions are summarized as follows:
• We show that random weight factorization alters the loss landscape of a neural representation in a way that can drastically reduce the distance between different parameter configurations, and effectively assigns a self-adaptive learning rate to each neuron in the network.

• We empirically illustrate that random weight factorization can effectively mitigate spectral bias, as well as enable coordinate-based MLP networks to escape from poor initializations and find better local minima.

• We demonstrate that random weight factorization can be used as a simple drop-in enhancement to conventional linear layers, yielding consistent and robust improvements across a wide range of tasks in computer vision, graphics and scientific computing.
2 WEIGHT FACTORIZATION
Let $x \in \mathbb{R}^d$ be the input, $g^{(0)}(x) = x$ and $d_0 = d$. We consider a standard multi-layer perceptron (MLP) $f_\theta(x)$ recursively defined by
$$f^{(l)}_\theta(x) = W^{(l)} \cdot g^{(l-1)}(x) + b^{(l)}, \quad g^{(l)}(x) = \sigma\left(f^{(l)}_\theta(x)\right), \quad l = 1, 2, \dots, L, \qquad (2.1)$$
with a final layer
$$f_\theta(x) = W^{(L+1)} \cdot g^{(L)}(x) + b^{(L+1)}, \qquad (2.2)$$
where $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix of the $l$-th layer and $\sigma$ is an element-wise activation function. Here, $\theta = \{W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)}\}$ represents all trainable parameters in the network.
MLPs are commonly trained by minimizing an appropriate loss function $\mathcal{L}(\theta)$ via gradient descent. To improve convergence, we propose to factorize the weight parameters associated with each neuron in the network as follows:
$$w^{(k,l)} = s^{(k,l)} \cdot v^{(k,l)}, \quad k = 1, 2, \dots, d_l, \quad l = 1, 2, \dots, L+1, \qquad (2.3)$$
where $w^{(k,l)} \in \mathbb{R}^{d_{l-1}}$ is the weight vector representing the $k$-th row of the weight matrix $W^{(l)}$, $s^{(k,l)} \in \mathbb{R}$ is a trainable scale factor assigned to each individual neuron, and $v^{(k,l)} \in \mathbb{R}^{d_{l-1}}$. Consequently, the proposed weight factorization can be written as
$$W^{(l)} = \mathrm{diag}(s^{(l)}) \cdot V^{(l)}, \quad l = 1, 2, \dots, L+1, \qquad (2.4)$$
with $s^{(l)} \in \mathbb{R}^{d_l}$.
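To make this parameterization concrete, the following is a minimal sketch of a single factorized layer in plain JAX (the parameter names and functional layout are our own; the paper's reference implementation, referenced later in Appendix C, uses Flax). The trainable parameters are the per-neuron scales $s^{(l)}$ and the factorized matrix $V^{(l)}$; the full weight matrix $W^{(l)} = \mathrm{diag}(s^{(l)}) \cdot V^{(l)}$ is never stored explicitly.

import jax.numpy as jnp

def factorized_linear(params, g):
    # One factorized layer, equation 2.4: f = diag(s) V g + b.
    # params: dict with "s" of shape (d_l,), "V" of shape (d_l, d_lm1), "b" of shape (d_l,)
    # g: output of the previous layer, shape (d_lm1,)
    s, V, b = params["s"], params["V"], params["b"]
    # diag(s) @ V @ g is simply a per-neuron (row-wise) rescaling of V @ g
    return s * (V @ g) + b

Stacking such layers and applying gradient descent to (s, V, b) directly, rather than to W, is all that the method requires.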
2.1 A GEOMETRIC PERSPECTIVE
In this section, we provide a geometric motivation for the proposed weight factorization. To this end, we consider the simplest setting of a one-parameter loss function $\ell(w)$. For this case, the weight factorization is reduced to $w = s \cdot v$ with two scalars $s, v$. Note that for a given $w \neq 0$ there are infinitely many pairs $(s, v)$ such that $w = s \cdot v$. The set of such pairs forms a family of hyperbolas in the $sv$-plane (one for each choice of signs for both $s$ and $v$). As such, the loss function in the $sv$-plane is constant along these hyperbolas.
Figure 1: Weight factorization transforms loss landscapes and shortens the distance to minima.

Figure 1 gives a visual illustration of the difference between the original loss landscape as a function of $w$ versus the loss landscape in the factorized $sv$-plane. In the left panel, we plot the original loss function as well as an initial parameter point, the local minimum, and the global minimum. The right panel shows how, in the factorized parameter space, each of these three points corresponds to two hyperbolas in the $sv$-plane. Note how the distance between the initialization and the global minimum is reduced from the top to the bottom panel upon an appropriate choice of factorization. The key observation is that the distance between factorizations representing the initial parameter and the global minimum becomes arbitrarily small in the $sv$-plane for larger values of $s$. Indeed, we can prove that this holds for any general loss function in arbitrary parameter dimensions (the proof is provided in Appendix A.1).
Theorem 1. Suppose that $\mathcal{L}(\theta)$ is the associated loss function of a neural network defined in equation 2.1 and equation 2.2. For a given $\theta$, we define $U_\theta$ as the set containing all possible weight factorizations
$$U_\theta = \left\{ \left(s^{(l)}, V^{(l)}\right)_{l=1}^{L+1} \;:\; \mathrm{diag}(s^{(l)}) \cdot V^{(l)} = W^{(l)},\; l = 1, \dots, L+1 \right\}. \qquad (2.5)$$
Then for any $\theta, \theta'$, we have
$$\mathrm{dist}(U_\theta, U_{\theta'}) = 0. \qquad (2.6)$$
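As a minimal illustration of why this holds (a one-parameter sketch; the general proof is in Appendix A.1), take scalars $w \neq w'$ with factorization sets $U_w$ and $U_{w'}$. For any $s > 0$, the pairs $(s, w/s) \in U_w$ and $(s, w'/s) \in U_{w'}$ satisfy
$$\left\| \left(s, \frac{w}{s}\right) - \left(s, \frac{w'}{s}\right) \right\| = \frac{|w - w'|}{s} \to 0 \quad \text{as } s \to \infty,$$
so the infimum distance between the two families of hyperbolas is zero.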
2.2 SELF-ADAPTIVE LEARNING RATE FOR EACH NEURON
A different way to examine the effect of the proposed weight factorization is by studying its associated gradient updates. Recall that a standard gradient descent update with a learning rate $\eta$ takes the form
$$w^{(k,l)}_{n+1} = w^{(k,l)}_n - \eta \frac{\partial \mathcal{L}}{\partial w^{(k,l)}_n}. \qquad (2.7)$$
The following theorem derives the corresponding gradient descent update expressed in the original
parameter space for models using the proposed weight factorization.
Theorem 2. Under the weight factorization of equation 2.3, the gradient descent update is given by
$$w^{(k,l)}_{n+1} = w^{(k,l)}_n - \eta \left( \left[s^{(k,l)}_n\right]^2 + \left\| v^{(k,l)}_n \right\|_2^2 \right) \frac{\partial \mathcal{L}}{\partial w^{(k,l)}_n} + O(\eta^2), \qquad (2.8)$$
for $l = 1, 2, \dots, L+1$ and $k = 1, 2, \dots, d_l$.
The proof is provided in Appendix A.2. By comparing equation 2.7 and equation 2.8, we observe that the weight factorization $w = s \cdot v$ re-scales the learning rate of $w$ by a factor of $(s^2 + \|v\|_2^2)$. Since $s, v$ are trainable parameters, this analysis suggests that the weight factorization effectively assigns a self-adaptive learning rate to each neuron in the network. In the following sections, we will demonstrate that the proposed weight factorization (under an appropriate initialization of the scale factors) not only helps with mitigating spectral bias (Rahaman et al., 2019; Bietti & Mairal, 2019; Tancik et al., 2020; Wang et al., 2021c), but also allows networks to quickly move away from a poor initialization and reach better local minima faster.
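To see where this factor comes from in the simplest setting (a one-parameter sketch with scalars $w = s \cdot v$, as in Section 2.1; the general case is handled in Appendix A.2), the chain rule gives
$$\frac{\partial \mathcal{L}}{\partial s} = v\, \mathcal{L}'(w), \qquad \frac{\partial \mathcal{L}}{\partial v} = s\, \mathcal{L}'(w),$$
so a single gradient descent step on $(s, v)$, recombined into $w = s \cdot v$, reads
$$w_{n+1} = \left(s_n - \eta\, v_n \mathcal{L}'(w_n)\right)\left(v_n - \eta\, s_n \mathcal{L}'(w_n)\right) = w_n - \eta \left(s_n^2 + v_n^2\right) \mathcal{L}'(w_n) + O(\eta^2).$$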
2.3 RELATION TO EXISTING WORKS
The proposed weight factorization is largely motivated by weight normalization (Salimans & Kingma, 2016), which decouples the norm and the direction of the weights associated with each neuron as
$$w = g \frac{v}{\|v\|}, \qquad (2.9)$$
where $g = \|w\|$, and gradient descent updates are applied directly to the new parameters $v, g$. Indeed, this can be viewed as a special case of the proposed weight factorization by setting $s = \|w\|$ in equation 2.3. In contrast to weight normalization, our weight factorization scheme allows for more flexibility in the choice of the scale factors $s$.
We note that SIREN networks (Sitzmann et al., 2020) also employ a special weight factorization for each hidden-layer weight matrix,
$$W = \omega_0 \hat{W}, \qquad (2.10)$$
where the scale factor $\omega_0 \in \mathbb{R}$ is a user-defined hyper-parameter. Although the authors attribute the success of SIREN to the periodic activation functions in conjunction with a tailored initialization scheme, here we will demonstrate that the specific choice of $\omega_0$ is the most crucial element in SIREN's performance; see Appendix G and H for more details.
It is worth pointing out that the proposed weight factorization also bears some resemblance to the adaptive activation functions introduced in (Jagtap et al., 2020), which modify the activation of each neuron by introducing an additional trainable parameter $a$ as
$$g^{(l)}(x) = \sigma\left(a f^{(l)}(x)\right). \qquad (2.11)$$
These adaptive activations aim to help networks learn sharp gradients and transitions of the target functions. In practice, the scale factor is generally initialized as $a = 1$, yielding a trivial weight factorization. As illustrated in the next section, this is fundamentally different from our approach, as we initialize the scale factors $s$ from a random distribution and re-parameterize the weight matrix accordingly. In Section 4 we demonstrate that, by initializing $s$ using an appropriate distribution, we can consistently outperform weight normalization, SIREN, and adaptive activations across a broad range of supervised and self-supervised learning tasks.
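To summarize these distinctions, the short sketch below (plain JAX; biases omitted and helper names our own) spells out how each scheme forms the effective weights or pre-activations of one layer. Only RWF trains a randomly initialized per-neuron scale.

import jax.numpy as jnp

# W_hat, V: (d_out, d_in); g, s: (d_out,); omega0, a: scalars; x: (d_in,)

def weight_norm(g, V, x):                        # Salimans & Kingma (2016), equation 2.9
    W = g[:, None] * V / jnp.linalg.norm(V, axis=1, keepdims=True)
    return W @ x

def siren_scale(omega0, W_hat, x):               # Sitzmann et al. (2020), equation 2.10; omega0 is a fixed hyper-parameter
    return (omega0 * W_hat) @ x

def adaptive_activation(a, W, x, act=jnp.tanh):  # Jagtap et al. (2020), equation 2.11; a is initialized to 1
    return act(a * (W @ x))

def rwf(s, V, x):                                # this work, equation 2.4; s is trainable and randomly initialized
    return (s[:, None] * V) @ x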
3 RANDOM WEIGHT FACTORIZATION IN PRACTICE
Here we illustrate the use of weight factorization through the lens of a simple regression task.
Specifically, we consider a smooth scalar-valued function $f$ sampled from a Gaussian random field using a squared exponential kernel with a length scale of $l = 0.02$. This generates a data-set of $N = 256$ observation pairs $\{x_i, f(x_i)\}_{i=1}^{N}$, where the $\{x_i\}_{i=1}^{N}$ lie on a uniform grid in $[0, 1]$. The goal is to train a network $f_\theta$ to learn $f$ by minimizing the mean square error loss $\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} |f_\theta(x_i) - f(x_i)|^2$.
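One way to reproduce this setup in JAX is sketched below (the Cholesky-based sampler, the jitter, and the random seed are our own choices; apply_fn stands for any coordinate-based network with signature apply_fn(params, x)).

import jax
import jax.numpy as jnp

N, length_scale = 256, 0.02
x = jnp.linspace(0.0, 1.0, N)                       # uniform grid on [0, 1]

# Squared exponential covariance matrix with a small jitter for numerical stability
K = jnp.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / length_scale**2)
L_chol = jnp.linalg.cholesky(K + 1e-6 * jnp.eye(N))

# One draw from the Gaussian random field: f ~ N(0, K)
f = L_chol @ jax.random.normal(jax.random.PRNGKey(0), (N,))

def mse_loss(params, apply_fn):
    # Mean square error over the N observation pairs
    preds = jax.vmap(lambda xi: apply_fn(params, xi))(x)
    return jnp.mean((preds - f) ** 2)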
The proposed random weight factorization is applied as follows. We first initialize the parameters of an MLP network via the Glorot scheme (Glorot & Bengio, 2010). Then, for every weight matrix $W$, we proceed by initializing a scale vector $\exp(s)$, where $s$ is sampled from a multivariate normal distribution $\mathcal{N}(\mu, \sigma I)$. Finally, every weight matrix is factorized by the associated scale factor as $W = \mathrm{diag}(\exp(s)) \cdot V$ at initialization. We train this network by gradient descent on the new parameters $s, V$ directly. This procedure is summarized in Appendix B, along with a simple JAX Flax implementation (Heek et al., 2020) in Appendix C.
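A condensed sketch of this initialization in plain JAX is given below (function names and the specific Glorot variant are our own choices; the reference Flax implementation is the one described in Appendix C).

import jax
import jax.numpy as jnp

def rwf_init(key, d_in, d_out, mu=1.0, sigma=0.1):
    # Random weight factorization of a Glorot-initialized layer.
    # Returns (s, V, b) such that diag(exp(s)) @ V equals the Glorot kernel at initialization.
    w_key, s_key = jax.random.split(key)
    W = jax.random.normal(w_key, (d_out, d_in)) * jnp.sqrt(2.0 / (d_in + d_out))  # Glorot normal
    s = mu + sigma * jax.random.normal(s_key, (d_out,))   # scale exponents s ~ N(mu, sigma^2 I)
    V = W / jnp.exp(s)[:, None]                           # so that diag(exp(s)) @ V == W initially
    b = jnp.zeros(d_out)
    return s, V, b

def rwf_apply(layer, x):
    # Forward pass; gradient descent acts on (s, V, b) directly
    s, V, b = layer
    return (jnp.exp(s)[:, None] * V) @ x + b

Dividing the Glorot kernel by exp(s) guarantees that the factorized network coincides with the unfactorized one at initialization, so any difference in performance stems from the altered optimization dynamics rather than a different starting point.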
In Figure 2, we train networks (3 layers, 128 neurons per layer, ReLU activations) to learn the
target function using: (a) a conventional MLP, (b) an MLP with adaptive activations (AA) (Jagtap
et al., 2020), (c) an MLP with weight normalization (WN) (Salimans & Kingma, 2016), and (d) an
MLP with the proposed random weight factorization scheme (RWF). Evidently, RWF yields the best
predictive accuracy and loss convergence. Moreover, we plot the relative change of the weights in
the original (unfactorized) parameter space during training in the bottom middle panel. We observe
that RWF leads to the largest weight change during training, thereby enabling the network to find
better local minima further away from its initialization. To further emphasize the benefit of weight
factorization, we compute the eigenvalues of the resulting empirical Neural Tangent Kernel (NTK)
(Jacot et al., 2018)
$$K_\theta = \left[ \left\langle \frac{\partial f_\theta}{\partial \theta}(x_i), \frac{\partial f_\theta}{\partial \theta}(x_j) \right\rangle \right]_{ij}, \qquad (3.1)$$
at the last step of training and visualize them in the bottom right panel. Notice how RWF exhibits a flatter NTK spectrum and slower eigenvalue decay than the other methods, indicating better-conditioned training dynamics and less severe spectral bias; see (Rahaman et al., 2019; Bietti & Mairal, 2019; Tancik et al., 2020; Wang et al., 2021c) for more details.

Figure 2: 1D regression. Top: model predictions using different parameterizations. Plain: standard MLP; AA: adaptive activations; WN: weight normalization; RWF: random weight factorization. Bottom left: mean square error (MSE) during training. Bottom middle: relative change of the weights during training, measured in the original parameter space. Bottom right: eigenvalues (in descending order) of the empirical NTK at the end of training.

To explore the robustness of the proposed RWF, we conduct a systematic study on the effect of $\mu$ and $\sigma$ in the initialization of the scale factor $s$. The results suggest that the choice of $\mu, \sigma$ plays an important role: too small values of $\mu, \sigma$ may lead to performance that is similar to a conventional MLP, while setting $\mu, \sigma$ too large can result in an unstable training process. We empirically find that $\mu = 1, \sigma = 0.1$ consistently improves the loss convergence and model accuracy for the vast majority of tasks considered in this work. Additional details are presented in Appendix D.
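As a practical note, the empirical NTK spectrum in equation 3.1 can be computed directly with automatic differentiation. A minimal JAX sketch (assuming a scalar-output network apply_fn(params, x); the naming is our own) is:

import jax
import jax.numpy as jnp

def empirical_ntk_eigenvalues(apply_fn, params, xs):
    # Eigenvalues of K_ij = <df/dtheta(x_i), df/dtheta(x_j)>, equation 3.1
    def jac_flat(x):
        # Gradient of the scalar output with respect to all parameters, flattened to a vector
        grads = jax.grad(apply_fn)(params, x)
        return jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(grads)])

    J = jax.vmap(jac_flat)(xs)           # (N, num_params)
    K = J @ J.T                          # empirical NTK Gram matrix
    return jnp.linalg.eigvalsh(K)[::-1]  # eigenvalues in descending order

Materializing this dense Jacobian is only practical for the small networks and batch sizes used in this 1D example.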
4 EXPERIMENTS
In this section, we demonstrate the effectiveness and robustness of random weight factorization
for training continuous neural representations across a range of tasks in computer vision, graphics,
and scientific computing. More precisely, we compare the performance of plain MLPs, MLPs with
adaptive activations (AA) (Jagtap et al., 2020), weight normalization (WN) (Salimans & Kingma,
2016), and the proposed random weight factorization (RWF). The comparison is performed over
a collection of MLP architectures, including conventional MLPs, SIREN (Sitzmann et al., 2020),
modified MLPs (Wang et al., 2021b), as well as MLPs with positional encodings (Mildenhall et al.,
2020) and Gaussian Fourier features (Tancik et al., 2020). The hyper-parameters of our experiments
along with the computational cost associated with each experiment are presented in Appendix E
and Appendix F, respectively. Notice that the computational overhead of our method is marginal, and RWF can therefore be considered a drop-in enhancement to any architecture that uses linear layers. Table 1 summarizes the results obtained for each benchmark, corresponding to the optimal input mapping and network architecture. Overall, RWF consistently achieves the best performance across tasks and architectures. All code and data will be made publicly available. A summary of each benchmark study is presented below, with more details provided in the Appendix.
4.1 2D IMAGE REGRESSION
We train coordinate-based MLPs to learn a map from 2D input pixel coordinates to the corresponding
RGB values of an image, using the benchmarks put forth in (Tancik et al., 2020). We conduct
experiments using two data-sets: Natural and Text, each containing 16 images. The Natural data-