
the literature. For example, the assumption of no nonlinearities and no batch-norm can be found in [26]. In [27], the assumptions of (i) no nonlinearities, (ii) no batch-norm, (iii) the number of units in the different layers being larger than the number of units in the input or output layers, and (iv) whitened data can be found. The assumptions of (i) no batch-norm, (ii) an infinite number of hidden units, and (iii) an infinite number of training samples can be found in [28]. In [29], the assumptions of (i) no nonlinearities, (ii) no batch-norm, (iii) the thinnest layer being either the input layer or the output layer, and (iv) an arbitrary convex differentiable loss can be found. Nonetheless, the results from the aforementioned works have proven quite useful in practice. We show later in our experiments that both non-linear and linear DNNs exhibit similar training characteristics for the generalization gap problem, which is not surprising.
4.2 Preliminaries
For analytical simplicity, we consider MLP networks that represent hidden layer units' activations with matrices. However, we show later on that the extension of the analytical results to Convolutional Neural Networks (CNNs), which represent hidden layer units' activations as 4-dimensional tensors, is straightforward.
The rank of hidden layer units' activations, $H(X)^l \in \mathbb{R}^{h_n^l \times b_s}: h_n^l \geq b_s$, describes the maximum number of linearly independent columns of $H(X)^l$. The rank of $H(X)^l$ can be obtained as its number of non-zero singular values, $S_{nz}^l$, via singular value decomposition (SVD). Specifically, a matrix having full rank means that all its singular values are non-zero. Conversely, the number of zero singular values, $S_{z}^l$, shows the degree of rank loss.
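For instance, the number of non-zero singular values $S_{nz}^l$ (and hence the degree of rank loss $S_z^l$) of an activation matrix can be read off from an SVD. The following minimal NumPy sketch illustrates this; the matrix size and the near-zero tolerance `tol` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an activation matrix H(X)^l of shape (h_n^l, b_s) with h_n^l >= b_s.
H = rng.standard_normal((512, 128))

# Singular values via SVD (returned in descending order).
singular_values = np.linalg.svd(H, compute_uv=False)

tol = 1e-6                                  # illustrative threshold for "zero"
S_nz = int(np.sum(singular_values > tol))   # number of non-zero singular values
S_z = min(H.shape) - S_nz                   # number of zero singular values (rank loss)

print(f"rank(H) = {S_nz}, degree of rank loss = {S_z}")
```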
Definition 1 Near-rank loss in this paper is taken to mean a scenario where the singular values, denoted by $\sigma$, are very small (i.e., $\sigma \ll 1$).
Our main analytical observations for the generalization gap problem are summarized in Section 4.3.
4.3 Units’ activation near-rank loss and optimization
In this section, the limits of the singular values of random matrices are related to their dimensions. To state the first proposition, we consider that the hidden layer representations, $H(X)^l \in \mathbb{R}^{h_n^l \times b_s}$, are samples of some unknown distribution, $p(H(X)^l)$, and then characterize the behaviour of the maximum and minimum singular values for $b_s \to \infty$. In the second proposition, we assume that the entries in $H(X)^l \in \mathbb{R}^{h_n^l \times b_s}$ follow a Gaussian distribution. Subsequently, we characterize the limits of the singular values of $H(X)^l$ for finite $b_s$.
Proposition 1 (The asymptotic behaviour of the singular values of a matrix with increasing dimension): For a matrix $A \in \mathbb{R}^{m \times n}: m \geq n$, from the Marchenko-Pastur law, the singular values, $\sigma$, concentrate in the range $[\sigma_{min}(A) \sim \sqrt{m} - \sqrt{n},\ \sigma_{max}(A) \sim \sqrt{m} + \sqrt{n}]$ as $m, n \to \infty$, where $\sigma_{max}(A)$ and $\sigma_{min}(A)$ are the maximum and minimum singular values of $A$, respectively.
Proof. See Rudelson-Vershynin [30].
Remark 1 In fact, [30] notes that Proposition 1 holds for general distributions. As such, we can conclude from Proposition 1 that $H(X)^l \in \mathbb{R}^{h_n^l \times b_s}: b_s \to \infty$ results in small and large distribution ranges, which are admissible for $\sigma_{min}(H(X)^l)$ and $\sigma_{max}(H(X)^l)$, respectively. Accordingly, as $b_s$ increases, we have the following scenarios: (i) a higher probability for $H(X)^l$ to have a small $\sigma_{min}(H(X)^l)$; and (ii) a higher probability for $H(X)^l$ to have a larger condition number.
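The trend described in Remark 1 can be illustrated with a small simulation in which the activations are replaced by i.i.d. standard Gaussian matrices; the fixed number of hidden units and the batch sizes below are arbitrary illustrative choices. As $b_s$ grows towards $h_n^l$, the smallest singular value shrinks and the condition number grows.

```python
import numpy as np

rng = np.random.default_rng(0)
h_n = 512                                   # fixed number of hidden units (illustrative)

for b_s in (64, 128, 256, 480):             # increasing batch sizes, b_s < h_n
    # Gaussian stand-in for the activations H(X)^l of shape (h_n, b_s).
    H = rng.standard_normal((h_n, b_s))
    s = np.linalg.svd(H, compute_uv=False)  # singular values, descending order
    print(f"b_s={b_s:3d}  sigma_min={s[-1]:6.2f}  condition_number={s[0] / s[-1]:7.2f}")
```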
Proposition 2 (The non-asymptotic behaviour of the singular values of a matrix with increasing dimension): For a Gaussian random matrix $A \in \mathbb{R}^{m \times n}: m \geq n$, the expected minimum and maximum singular values are given as
$$\sqrt{m} - \sqrt{n} \;\leq\; \mathbb{E}\,\sigma_{min}(A) \;\leq\; \mathbb{E}\,\sigma_{max}(A) \;\leq\; \sqrt{m} + \sqrt{n} \qquad (4)$$
Proof. See Theorem 2.6 in Rudelson-Vershynin [30].
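Eqn. (4) can be checked empirically by averaging the extreme singular values of standard Gaussian matrices over random draws; the sizes $m$, $n$ and the number of trials below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 400, 100, 200                # illustrative sizes and sample count

s_min, s_max = [], []
for _ in range(trials):
    A = rng.standard_normal((m, n))
    s = np.linalg.svd(A, compute_uv=False)  # singular values, descending order
    s_max.append(s[0])
    s_min.append(s[-1])

lower, upper = np.sqrt(m) - np.sqrt(n), np.sqrt(m) + np.sqrt(n)
print(f"{lower:.2f} <= E[sigma_min] ~ {np.mean(s_min):.2f}"
      f" <= E[sigma_max] ~ {np.mean(s_max):.2f} <= {upper:.2f}")
```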
In Section A1.1 and Section A1.2 of the supplementary material, we empirically study the distributions of singular values and the expected values of the minimum singular values of random matrices, respectively, where the entries are drawn from popular distributions including the Gaussian, uniform and lognormal. Subsequently, for the aforementioned distributions, we show in Section A1.2 of the supplementary material that, for a fixed $m$, $\mathbb{E}\,\sigma_{min}(A) \to 0$ as $n$ becomes large. This observation is akin to that in Eqn. (4), suggesting that it is applicable to other popular distributions as well.
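The same shrinking trend can be reproduced in a few lines of NumPy for the Gaussian, uniform and lognormal cases; the distribution parameters, the fixed $m$, the values of $n$ and the number of trials are illustrative choices rather than those used in the supplementary material.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 256, 50                               # fixed m, Monte Carlo sample count

samplers = {
    "gaussian":  lambda size: rng.standard_normal(size),
    "uniform":   lambda size: rng.uniform(-1.0, 1.0, size),
    "lognormal": lambda size: rng.lognormal(size=size),
}

for name, draw in samplers.items():
    for n in (32, 128, 224, 256):                 # n grows towards m
        # Average the smallest singular value over random draws of an m x n matrix.
        est = np.mean([np.linalg.svd(draw((m, n)), compute_uv=False)[-1]
                       for _ in range(trials)])
        print(f"{name:9s} n={n:3d}  E[sigma_min] ~ {est:6.2f}")
```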
Remark 2 Given $H(X)^l_1 \in \mathbb{R}^{h_n^l \times b_{s_1}}: h_n^l > b_{s_1}$ and $H(X)^l_2 \in \mathbb{R}^{h_n^l \times b_{s_2}}: h_n^l > b_{s_2}$ with $b_{s_2} > b_{s_1}$, then $\mathbb{E}\,\sigma_{min}(H(X)^l_1) > \mathbb{E}\,\sigma_{min}(H(X)^l_2)$ using Proposition 2. Importantly, given that $h_n^l$ is fixed as it is typical for