
which is estimated using the reparametrization trick [13] that allows the use of regular gradient computation.
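To make this concrete, the following minimal sketch (our own illustration, not the implementation of [13]) draws a Gaussian weight as $w = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, so that gradients of the loss reach the variational parameters through the sample:

```python
import torch

# Variational parameters of a factorized Gaussian posterior over one weight tensor.
mu = torch.zeros(4, 3, requires_grad=True)
rho = torch.zeros(4, 3, requires_grad=True)   # sigma = softplus(rho) keeps the scale positive

def sample_weight() -> torch.Tensor:
    sigma = torch.nn.functional.softplus(rho)
    eps = torch.randn_like(sigma)             # eps ~ N(0, I), independent of the parameters
    return mu + sigma * eps                   # reparametrized sample: differentiable in mu and rho

w = sample_weight()
loss = (w ** 2).sum()                         # stand-in for the data-dependent part of the loss
loss.backward()                               # regular backprop delivers gradients to mu and rho
```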
Variational Neural Networks (VNNs) [14] can be considered in the same group as MCD and BBB from the Bayesian Model Averaging perspective, where sampled models can lie in the same loss-basin and be similar, i.e., describing the problem from the same point of view, as explained in [2]. VNNs consider a Gaussian distribution over each layer, which is parameterized by the outputs of the corresponding sub-layers. Hypermodels [6] consider an additional hypermodel $\theta = g_\nu(z)$ to generate the parameters of a base model $f_\theta(x)$ using a random variable $z \sim \mathcal{N}(0, I)$ as an input to the hypermodel.
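As a rough sketch of this idea (the layer sizes and the two-layer hypernetwork below are our own illustrative choices, not those of [6]), a hypermodel maps a noise sample $z$ to the full parameter vector of a base layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperLinear(nn.Module):
    """Hypermodel g_nu(z) that outputs the weights of a single base linear layer f_theta (sketch)."""

    def __init__(self, in_features: int, out_features: int, z_dim: int = 8):
        super().__init__()
        self.in_features, self.out_features, self.z_dim = in_features, out_features, z_dim
        n_params = out_features * in_features + out_features   # weight matrix + bias of the base layer
        self.g = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, n_params))

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        theta = self.g(z)                                       # base-model parameters generated from z
        split = self.out_features * self.in_features
        W = theta[:split].view(self.out_features, self.in_features)
        b = theta[split:]
        return F.linear(x, W, b)                                # base layer applied with generated weights

layer = HyperLinear(in_features=16, out_features=4)
z = torch.randn(layer.z_dim)                                    # z ~ N(0, I)
y = layer(torch.randn(2, 16), z)                                # a new z yields a new base model
```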
Deep Ensembles [8] have a better uncertainty quality than
all other discussed methods and can be viewed as a BNN with
a Categorical distribution over weights, with the ideal num-
ber of weight samples being equal to the number of networks
in the ensemble. The addition of prior untrained models to
Deep Ensembles, as described in [8], improves the uncer-
tainty quality of the network. Deep Sub-Ensembles [15] split
the neural network into two parts, where the first part contains
only a single trunk network, and the second part is a regular
Deep Ensemble network operating on features generated by
the trunk network. This reduces the memory and computational load compared to Deep Ensembles and provides a trade-off between uncertainty quality and resource requirements. Batch Ensembles [16] optimize Deep Ensembles by using all weights in a single matrix operation and using the Hadamard product instead of matrix multiplication, which increases inference speed and reduces memory usage.
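The following sketch illustrates the Hadamard-product idea behind Batch Ensembles (a simplified rank-1 parameterization under our own assumptions; the exact formulation in [16] may differ in details such as initialization and how members are grouped within a batch):

```python
import torch
import torch.nn as nn

class BatchEnsembleLinear(nn.Module):
    """Shared weight W plus per-member rank-1 factors r_i, s_i, combined with Hadamard products."""

    def __init__(self, in_features: int, out_features: int, n_members: int):
        super().__init__()
        self.W = nn.Parameter(torch.randn(out_features, in_features) * 0.02)  # shared by all members
        self.r = nn.Parameter(torch.ones(n_members, out_features))            # output-side factors
        self.s = nn.Parameter(torch.ones(n_members, in_features))             # input-side factors
        self.b = nn.Parameter(torch.zeros(n_members, out_features))

    def forward(self, x: torch.Tensor, member: torch.Tensor) -> torch.Tensor:
        # Equivalent to applying W_i = W * (r_i s_i^T) per input, but computed with
        # element-wise products around a single shared matrix multiplication.
        x = x * self.s[member]            # (batch, in_features)
        y = x @ self.W.t()                # one shared matmul for the whole batch
        return y * self.r[member] + self.b[member]

layer = BatchEnsembleLinear(16, 4, n_members=3)
members = torch.randint(0, 3, (8,))       # one ensemble member index per input in the batch
out = layer(torch.randn(8, 16), members)
```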
Kushibar et al. [17] propose to use a single deterministic network that generates different outputs through multiple early-exit branches [18] from the same network, and to compute the variance of those outputs. Contrary to this approach, which uses a single deterministic network, we propose Layer Ensembles in the context of Bayesian Neural Networks by considering an independent Categorical distribution over the weights of each layer, as described in the following section.
3. LAYER ENSEMBLES
In this section, we first provide a mathematical definition of the Layer Ensembles structure. Then, we provide the intuition behind Layer Ensembles and the relations between Layer Ensembles and other network structures. Furthermore, we show how the inference of Layer Ensembles can be optimized by reusing common layer outputs. Finally, we propose a method for selecting the best sample combinations based on an uncertainty quality metric. This metric is computed using the Epistemic Neural Networks (ENNs) [9] experiments on a synthetic dataset with ground-truth uncertainty values, as described in detail in Section 3.2.
We consider a neural network $F(x, w)$ with $N$ layers, which takes $x$ as input and is parameterized by the weights $w$.
Fig. 1: Example structures of (a) Deep Ensembles and (b) Layer Ensembles for a 3-layer network with 3 ensembles ($N = 3$, $K = 3$), and (c) two samples of a Layer Ensemble network with common first two layer options. While the memory structure remains identical, Layer Ensembles have many more options for sampling, which can be optimized by considering the common layers in the samples. Layer Ensembles with common layers earlier in the architecture lead to faster inference (c).
A Deep Ensemble network is formed by $K$ identically structured networks, each formed by $N$ layers, and the weights of each network, i.e., $w_i$, $i \in [1, K]$, are trained independently. We formulate Layer Ensembles as a stochastic neural network $F(x, w)$ with $N$ layers $\mathrm{LE}_i(x, w^i_q)$, $i \in [1, K]$, $q \in [1, N]$, and $K$ weight options for each layer:

$$w^i_q \sim \mathrm{Categorical}(K). \tag{1}$$
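A minimal sketch of this definition is given below (the fully-connected layers, ReLU activations, and the uniform Categorical distribution are our own illustrative assumptions, not the configuration used in the experiments): each of the $N$ layers stores $K$ weight options, and a network sample draws one option index per layer.

```python
import torch
import torch.nn as nn

class LayerEnsemble(nn.Module):
    """N layers with K weight options each; a network sample picks one option per layer (sketch)."""

    def __init__(self, dims: list, K: int = 3):
        super().__init__()
        # options[q][i] is weight option i of layer q: the same K*N weight sets a Deep Ensemble stores.
        self.options = nn.ModuleList(
            nn.ModuleList(nn.Linear(dims[q], dims[q + 1]) for _ in range(K))
            for q in range(len(dims) - 1))
        self.K = K

    def sample_indices(self) -> list:
        # One Categorical(K) draw per layer, assumed uniform over the K options here.
        return torch.randint(0, self.K, (len(self.options),)).tolist()

    def forward(self, x: torch.Tensor, indices: list) -> torch.Tensor:
        for q, i in enumerate(indices):
            x = self.options[q][i](x)
            if q < len(self.options) - 1:     # illustrative ReLU between hidden layers
                x = torch.relu(x)
        return x

net = LayerEnsemble(dims=[16, 32, 32, 4], K=3)   # N = 3 layers, K = 3 options per layer
idx = net.sample_indices()                        # e.g. [2, 0, 1] selects one network sample
y = net(torch.randn(8, 16), idx)
```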
This results in the same memory structure as for Deep Ensembles, with $KN$ weight sets for an ensemble of $K$ networks, each formed by $N$ layers. However, Layer Ensembles allow for connections between the layers of different weight sets by sampling different layer options to form a network in the ensemble. This greatly increases the number of possible weight samples, and the sampled networks can contain identical subnetworks, which can be exploited to speed up the inference of a set of samples.
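One way to exploit such identical subnetworks, sketched below under our own assumptions (not necessarily the scheduling used later in the paper), is to cache the activations of every already-computed prefix of layer options, so that samples sharing their leading layers reuse those outputs:

```python
import torch

def forward_samples(layers, samples, x):
    """Evaluate several sampled networks, computing each shared leading-layer prefix only once.
    `layers[q][i]` is option i of layer q (any callables); `samples` is a list of index tuples."""
    cache = {(): x}                                # activations keyed by the prefix of sampled indices
    outputs = []
    for sample in samples:
        prefix = ()
        h = x
        for q, i in enumerate(sample):
            nxt = prefix + (i,)
            if nxt not in cache:                   # only prefixes not seen before are computed
                cache[nxt] = layers[q][i](cache[prefix])
            h = cache[nxt]
            prefix = nxt
        outputs.append(h)
    return torch.stack(outputs)

# Example: 2 layers with 2 options each; the two samples share layer-0 option 0,
# so that layer is evaluated only once for both samples.
layers = [[torch.nn.Linear(16, 32), torch.nn.Linear(16, 32)],
          [torch.nn.Linear(32, 4), torch.nn.Linear(32, 4)]]
out = forward_samples(layers, samples=[(0, 0), (0, 1)], x=torch.randn(8, 16))
```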
The intuition behind Layer Ensembles is to define an en-
semble for each layer. These ensembles are independent and can each have a different number of members, leading to a highly flexible network structure. If a single ran-
dom variable controls all layer-wise ensembles, and they have
an identical number of members, this leads to the well-known
Deep Ensemble neural network. Fig. 1 illustrates how the
same memory structure of the ensembles is used in Deep En-
sembles (Fig. 1a) and Layer Ensembles (Fig. 1b).
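This special case can be seen directly in how the layer-wise indices are drawn (a small illustration with uniform sampling, which we assume here for simplicity):

```python
import torch

N, K = 3, 3                                        # 3 layers, 3 weight options per layer

# Layer Ensembles: an independent Categorical(K) draw for every layer.
layer_ensemble_sample = torch.randint(0, K, (N,))  # e.g. tensor([2, 0, 1])

# Deep Ensembles as a special case: a single draw controls all layer-wise ensembles,
# so every layer uses the option that belongs to the same ensemble member.
member = torch.randint(0, K, (1,))
deep_ensemble_sample = member.expand(N)            # e.g. tensor([1, 1, 1])
```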
Training of Layer Ensembles is done by using a regular
loss function and averaging over the outputs of different net-
work samples. The number of network samples for Deep Ensembles is usually equal to the number of networks $K$, meaning that all the ensemble members are used for inference. The same
strategy is not required for Layer Ensembles, as each layer
option can be included in multiple networks. This means that