the weight-space uncertainty/ensembles can be a limiting assumption, since the architecture choice also contributes to the epistemic (model-form) uncertainty of the prediction. The importance of architecture choice over other considerations in Bayesian neural networks has been rightly highlighted in [11].
On the other hand, Neural Architecture Search (NAS) has received tremendous attention recently because of its promise to democratize machine learning and enable the learning of custom, data-specific neural architectures. The most popular approaches in this context are reinforcement learning [12], Bayesian optimization [13], and evolutionary optimization [14], but these typically incur a large computational overhead. More recently, a differentiable neural architecture search framework, DARTS [15], was proposed that adopts a continuous relaxation of the categorical space to facilitate architecture search through gradient-based optimizers. Distribution-based learning of architecture parameters has recently been explored in DrNAS [16], BayesNAS [17], and BaLeNAS [18] to avoid the suboptimal exploration observed with deterministic optimization [18] by introducing stochasticity and encouraging exploration. However, these works were tasked with learning a point estimate of the architecture and weights rather than uncertainty quantification, ensembling, or robustness.
In this work, we develop Unified probabilistic architecture and weight ensembling Neural Architecture Search (UraeNAS) to improve the accuracy and robustness of neural network models.
We employ a distribution learning approach to differentiable NAS, which allows us to move beyond
ad hoc architecture selection and point estimation of architecture parameters to treat them as random
variables and estimate their distributions. This property of distribution learning of architectures, when
combined with the Bayesian formulation of neural network weights, allows us to characterize the
full epistemic uncertainty arising from the modeling choices of neural networks. With UraeNAS,
we are able to generate rich samples/ensembles from the joint distribution of the architecture and
weight parameters, which provides significant improvement in uncertainty/calibration, accuracy, and
robustness in both in-distribution and out-of-distribution scenarios compared to deterministic models
and weight ensemble models.
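As a concrete illustration of this ensembling idea, the following sketch averages predictive probabilities over joint samples of architecture and weight parameters; the helpers build_model, arch_dist, and weight_dist are hypothetical placeholders for illustration, not the actual UraeNAS implementation.

import torch

@torch.no_grad()
def ensemble_predict(build_model, arch_dist, weight_dist, x, num_samples=10):
    # Illustrative joint ensemble: average the predictive softmax over samples
    # (theta, w) drawn from the learned architecture and weight distributions.
    # build_model, arch_dist, and weight_dist are hypothetical helpers.
    probs = 0.0
    for _ in range(num_samples):
        theta = arch_dist.sample()        # one architecture sample
        w = weight_dist.sample()          # one weight sample
        member = build_model(theta, w)    # instantiate this ensemble member
        probs = probs + torch.softmax(member(x), dim=-1)
    return probs / num_samples            # ensemble predictive distribution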
2 Unified probabilistic architecture and weight ensembling NAS
2.1 Distributional formulation of differentiable NAS
In the differentiable NAS setup, the neural network search space is designed by repeatedly stacking building blocks called cells [12, 15, 16]. The cells can be normal cells or reduction cells. Normal cells maintain the spatial resolution of the inputs, and reduction cells halve the spatial resolution but double the number of channels. Different neural network architectures are generated by changing the basic cell structure. Each cell is represented by a directed acyclic graph with $N$ ordered nodes and $E$ edges. The feature maps are denoted by $x^{(j)}$, $0 \le j \le N-1$, and each edge corresponds to an operation $o^{(i,j)}$. The feature map for each node is given by $x^{(j)} = \sum_{i<j} o^{(i,j)}(x^{(i)})$, with $x^{(0)}$ and $x^{(1)}$ fixed to be the outputs of the previous two cells. The final output of each cell is obtained by concatenating the outputs of the intermediate nodes, that is, $(x^{(2)}, x^{(3)}, \ldots, x^{(N-1)})$.
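This cell computation can be summarized by the short PyTorch sketch below; the class name Cell and the make_op factory are illustrative choices under the definitions above, not code from the referenced implementations.

import torch
import torch.nn as nn

class Cell(nn.Module):
    # Illustrative DARTS-style cell: node j sums the operations applied to all
    # earlier nodes i < j, and the cell output concatenates the intermediate nodes.
    def __init__(self, num_nodes, channels, make_op):
        super().__init__()
        self.num_nodes = num_nodes
        # one operation o(i,j) per directed edge (i, j) with i < j
        self.ops = nn.ModuleDict({
            f"{i}->{j}": make_op(channels)
            for j in range(2, num_nodes) for i in range(j)
        })

    def forward(self, s0, s1):
        # x(0) and x(1) are fixed to the outputs of the previous two cells
        states = [s0, s1]
        for j in range(2, self.num_nodes):
            # x(j) = sum_{i<j} o(i,j)(x(i))
            states.append(sum(self.ops[f"{i}->{j}"](states[i]) for i in range(j)))
        # concatenate the intermediate nodes x(2), ..., x(N-1) along the channel axis
        return torch.cat(states[2:], dim=1)

For instance, Cell(num_nodes=6, channels=16, make_op=lambda c: nn.Conv2d(c, c, 3, padding=1)) builds a toy cell whose edges all apply a 3x3 convolution that preserves the spatial resolution.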
The operation selection problem is inherently discrete in nature. However, continuous relaxation of the discrete space [15] leads to continuous architecture mixing weights ($\hat{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \theta_o^{(i,j)} \, o(x)$) that can be learned through gradient-based optimization. The transformed operation $\hat{o}^{(i,j)}$ is a weighted average of the operations selected from a finite candidate space $\mathcal{O}$. The input features are denoted by $x$, and $\theta_o^{(i,j)}$ represents the weight of operation $o$ for the edge $(i,j)$. The operation mixing weights $\theta^{(i,j)} = (\theta_1^{(i,j)}, \theta_2^{(i,j)}, \ldots, \theta_{|\mathcal{O}|}^{(i,j)})$ belong to a probability simplex, i.e., $\sum_{o \in \mathcal{O}} \theta_o^{(i,j)} = 1$. Throughout this paper, we use the terms architecture parameters and operation mixing weights interchangeably.
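A minimal sketch of this relaxation for a single edge is given below; following the softmax parameterization of DARTS [15], unconstrained logits are mapped onto the probability simplex, and the class name MixedOp is illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    # Illustrative continuous relaxation of one edge (i, j):
    # o_hat(x) = sum over o in O of theta_o * o(x), with theta on the simplex.
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)   # finite candidate space O
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))  # unconstrained logits

    def forward(self, x):
        # the softmax yields operation mixing weights that sum to one
        theta = F.softmax(self.alpha, dim=-1)
        return sum(t * op(x) for t, op in zip(theta, self.ops))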
NAS as Bi-level Optimization: With a differentiable architecture search (DAS) formulation, NAS can be posed as a bi-level optimization problem over neural network weights $w$ and architecture parameters $\theta$ [15] in the following manner:

$$\min_{\theta} \; \mathcal{L}_{\mathrm{val}}(w^{*}(\theta), \theta) \quad \text{s.t.} \quad w^{*} \in \arg\min_{w} \; \mathcal{L}_{\mathrm{train}}(w, \theta) \qquad (1)$$
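To make the bi-level formulation concrete, the sketch below performs one first-order approximation step in the style of DARTS [15]: the weights $w$ take a gradient step on the training loss, then the architecture parameters $\theta$ take a gradient step on the validation loss. The function name and arguments (model, criterion, w_opt, theta_opt) are assumptions for illustration, with w_opt and theta_opt being optimizers over the weight and architecture parameter groups, respectively.

def search_step(model, criterion, w_opt, theta_opt, train_batch, val_batch):
    # One first-order step for the bi-level problem in Eq. (1): the network
    # weights w descend L_train, then the architecture parameters theta descend L_val.
    x_tr, y_tr = train_batch
    x_val, y_val = val_batch

    # inner problem: update w on the training loss with theta held fixed
    w_opt.zero_grad()
    criterion(model(x_tr), y_tr).backward()
    w_opt.step()

    # outer problem: update theta on the validation loss with w held fixed
    theta_opt.zero_grad()
    criterion(model(x_val), y_val).backward()
    theta_opt.step()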
However, it was observed in recent works [16] that optimizing directly over architecture parameters can lead to overfitting due to insufficient exploration of the architecture space. To alleviate this, different DAS strategies were employed [16, 19, 20]. Among them, the most versatile is the distribution learning approach [16], in which the architecture parameters are sampled from a distribution such as the Dirichlet distribution, $\theta^{(i,j)} \overset{\mathrm{iid}}{\sim} \mathrm{Dirichlet}(\beta^{(i,j)})$, that can inherently satisfy the simplex constraint