
Unfortunately, the selection of appropriate hyper-parameters typically requires some prior knowl-
edge about the target signals, which may not be available in some applications.
More general approaches to improve the training and performance of MLPs involve different types
of normalizations, such as Batch Normalization (Ioffe & Szegedy, 2015), Layer Normalization (Ba
et al., 2016) and Weight Normalization (Salimans & Kingma, 2016). However, despite their re-
markable success in deep learning benchmarks, these techniques are not widely used in MLP-based
neural representations. Here we draw motivation from the work of Salimans & Kingma (2016) and
Wang et al. (2021a), and investigate a simple yet remarkably effective re-parameterization of weight
vectors in MLP networks, coined as random weight factorization, which provides a generalization
of Weight Normalization and demonstrates significant performance gains. Our main contributions
are summarized as follows:
• We show that random weight factorization alters the loss landscape of a neural representa-
tion in a way that can drastically reduce the distance between different parameter configu-
rations, and effectively assigns a self-adaptive learning rate to each neuron in the network.
• We empirically illustrate that random weight factorization can effectively mitigate spectral
bias, as well as enable coordinate-based MLP networks to escape from poor initializations
and find better local minima.
• We demonstrate that random weight factorization can be used as a simple drop-in enhance-
ment to conventional linear layers, and yield consistent and robust improvements across a
wide range of tasks in computer vision, graphics and scientific computing.
2 WEIGHT FACTORIZATION
Let $x \in \mathbb{R}^d$ be the input, $g^{(0)}(x) = x$ and $d_0 = d$. We consider a standard multi-layer perceptron (MLP) $f_\theta(x)$ recursively defined by
$$f^{(l)}_{\theta}(x) = W^{(l)} \cdot g^{(l-1)}(x) + b^{(l)}, \quad g^{(l)}(x) = \sigma\big(f^{(l)}_{\theta}(x)\big), \quad l = 1, 2, \dots, L, \tag{2.1}$$
with a final layer
$$f_\theta(x) = W^{(L+1)} \cdot g^{(L)}(x) + b^{(L+1)}, \tag{2.2}$$
where $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix in the $l$-th layer and $\sigma$ is an element-wise activation function. Here, $\theta = \big\{W^{(1)}, b^{(1)}, \dots, W^{(L+1)}, b^{(L+1)}\big\}$ represents all trainable parameters in the network.
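As a point of reference, the recursion in Eqs. (2.1)-(2.2) translates directly into code. The following PyTorch sketch is a minimal illustration; the class name, constructor arguments, and the choice of `torch.tanh` are ours and not prescribed by the text.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Standard MLP of Eqs. (2.1)-(2.2): L hidden layers and a linear output layer."""

    def __init__(self, d_in, d_hidden, d_out, L, activation=torch.tanh):
        super().__init__()
        dims = [d_in] + [d_hidden] * L
        # Hidden layers l = 1, ..., L with weights W^{(l)} of shape (d_l, d_{l-1}).
        self.hidden = nn.ModuleList(
            [nn.Linear(dims[l], dims[l + 1]) for l in range(L)]
        )
        # Final linear layer, Eq. (2.2).
        self.output = nn.Linear(d_hidden, d_out)
        self.activation = activation

    def forward(self, x):
        g = x                              # g^{(0)}(x) = x
        for layer in self.hidden:
            g = self.activation(layer(g))  # g^{(l)} = sigma(W^{(l)} g^{(l-1)} + b^{(l)})
        return self.output(g)              # f_theta(x) = W^{(L+1)} g^{(L)} + b^{(L+1)}
```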
MLPs are commonly trained by minimizing an appropriate loss function $\mathcal{L}(\theta)$ via gradient descent. To improve convergence, we propose to factorize the weight parameters associated with each neuron in the network as follows:
$$w^{(k,l)} = s^{(k,l)} \cdot v^{(k,l)}, \quad k = 1, 2, \dots, d_l, \quad l = 1, 2, \dots, L + 1, \tag{2.3}$$
where $w^{(k,l)} \in \mathbb{R}^{d_{l-1}}$ is a weight vector representing the $k$-th row of the weight matrix $W^{(l)}$, $s^{(k,l)} \in \mathbb{R}$ is a trainable scale factor assigned to each individual neuron, and $v^{(k,l)} \in \mathbb{R}^{d_{l-1}}$. Consequently, the proposed weight factorization can be written as
$$W^{(l)} = \operatorname{diag}\big(s^{(l)}\big) \cdot V^{(l)}, \quad l = 1, 2, \dots, L + 1, \tag{2.4}$$
where $s^{(l)} \in \mathbb{R}^{d_l}$.
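In code, the factorization amounts to storing $s^{(l)}$ and $V^{(l)}$ as separate trainable parameters and multiplying them back together in the forward pass. The sketch below is a minimal PyTorch illustration in this spirit: the class name is ours, and the log-normal sampling of the scale factors is only an assumption made for concreteness, not an initialization scheme specified here.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Linear layer with the weight factorization of Eqs. (2.3)-(2.4):
    W^{(l)} = diag(s^{(l)}) V^{(l)}, with one trainable scale per output neuron."""

    def __init__(self, in_features, out_features, mu=0.5, sigma=0.1):
        super().__init__()
        # Start from a conventionally initialized weight matrix ...
        W = torch.empty(out_features, in_features)
        nn.init.xavier_normal_(W)
        # ... draw random positive scale factors (log-normal here, as an
        # illustrative choice), and split W into s and V so that
        # diag(s) @ V reproduces W at initialization.
        s = torch.exp(mu + sigma * torch.randn(out_features))
        self.s = nn.Parameter(s)
        self.V = nn.Parameter(W / s.unsqueeze(-1))
        self.b = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Apply W = diag(s) V, i.e. compute x W^T + b.
        return x @ (self.s.unsqueeze(-1) * self.V).t() + self.b
```

Because $\operatorname{diag}(s^{(l)}) V^{(l)}$ equals the original weight matrix at initialization, such a layer can replace a conventional linear layer without changing the function the network represents at the start of training; only the parameterization, and hence the optimization trajectory, changes.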
2.1 A GEOMETRIC PERSPECTIVE
In this section, we provide a geometric motivation for the proposed weight factorization. To this end, we consider the simplest setting of a one-parameter loss function $\ell(w)$. For this case, the weight factorization is reduced to $w = s \cdot v$ with two scalars $s, v$. Note that for a given $w \neq 0$ there are infinitely many pairs $(s, v)$ such that $w = s \cdot v$. The set of such pairs forms a family of hyperbolas in the $sv$-plane (one for each choice of signs for both $s$ and $v$). As such, the loss function in the $sv$-plane is constant along these hyperbolas.
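As a brief illustration, write the factorized loss as $\tilde{\ell}(s, v) = \ell(s \cdot v)$. For any constant $c \neq 0$, every point of the level set $\{(s, v) : s \cdot v = c\}$ gives the same loss value, and these level sets are exactly the hyperbolas described above. The chain rule further gives
$$\frac{\partial \tilde{\ell}}{\partial s} = v \, \ell'(s v), \qquad \frac{\partial \tilde{\ell}}{\partial v} = s \, \ell'(s v),$$
so the gradients with respect to $s$ and $v$ are scaled by the current values of $v$ and $s$, respectively; this is the sense in which the factorization effectively assigns a self-adaptive learning rate to each parameter.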