Unified Probabilistic Neural Architecture and Weight
Ensembling Improves Model Robustness
Sumegha Premchandar
Michigan State University
premchan@msu.edu
Sandeep Madireddy
Argonne National Laboratory
smadireddy@anl.gov
Sanket Jantre
Brookhaven National Laboratory
sjantre@bnl.gov
Prasanna Balaprakash
Argonne National Laboratory
pbalapra@anl.gov
Abstract
Robust machine learning models with accurately calibrated uncertainties are crucial for safety-critical applications. Probabilistic machine learning, and especially the Bayesian formalism, provides a systematic framework to incorporate robustness through distributional estimates and to reason about uncertainty. Recent works have shown that approximate inference approaches that use the weight-space uncertainty of neural networks to generate ensemble predictions are the state of the art. However, architecture choices have mostly been ad hoc, which essentially ignores the epistemic uncertainty from the architecture space. To this end, we propose Unified probabilistic architecture and weight ensembling Neural Architecture Search (UraeNAS), which leverages advances in probabilistic neural architecture search and approximate Bayesian inference to generate ensembles from the joint distribution of neural network architectures and weights. The proposed approach showed a significant improvement both in-distribution (0.86% in accuracy, 42% in ECE) on CIFAR-10 and out-of-distribution (2.43% in accuracy, 30% in ECE) on CIFAR-10-C compared to the baseline deterministic approach.
1 Introduction
Bayesian neural networks have recently seen a lot of interest due to their potential to provide improved predictions with quantified uncertainty and robustness, which is crucial to designing safe and reliable systems [1], especially for safety-critical applications such as autonomous driving, medicine, and scientific applications such as model-based control of nuclear fusion reactors. Even though modern Bayesian neural networks have great potential for robustness, their inference is challenging due to the presence of millions of parameters and a multi-modal posterior landscape. For this reason, approximate inference techniques such as variational inference (VI) and stochastic gradient Markov chain Monte Carlo are being increasingly adopted. However, VI, which typically makes a unimodal approximation of the multimodal posterior, can be limiting. Recent works in the realm of probabilistic deep learning have shown that ensembles of neural networks [2] achieve superior accuracy and robustness over single models. This kind of ensembling has been shown to be analogous to sampling models from different modes of multimodal Bayesian posteriors [3, 4] and hence enjoys these superior properties.
While different techniques for ensembling neural networks have been explored, in both Bayesian and non-Bayesian settings, a key limitation is that the ensembles are primarily in the weight space, where the architecture of the neural network is fixed arbitrarily. For example, techniques such as Monte Carlo dropout [5], DropConnect [6], Swapout [7], and SSIG [8] deactivate certain units/connections during training and testing. They are "implicit", as the ensembling happens internally within a single model, and hence are efficient, but the gain in robustness is not significant. On the other hand, "explicit" ensembling techniques such as Deep Ensembles [2], BatchEnsemble [9], and MIMO [10] have shown superior accuracy and robustness gains over single models. Considering just
the weight-space uncertainty/ensembles can be a limiting assumption, since the architecture choice also contributes to the epistemic (model-form) uncertainty of the prediction. The importance of architecture choice over other considerations in Bayesian neural networks has been highlighted in [11].
On the other hand, Neural Architecture Search (NAS) has received tremendous attention recently because of its promise to democratize machine learning and enable the learning of custom, data-specific neural architectures. The most popular approaches in this context are reinforcement learning [12], Bayesian optimization [13], and evolutionary optimization [14], but these usually incur a large computational overhead. More recently, a differentiable neural architecture search framework, DARTS [15], was proposed that adopts a continuous relaxation of the categorical space to facilitate architecture search through gradient-based optimizers. Distribution-based learning of architecture parameters has recently been explored in DrNAS [16], BayesNAS [17], and BaLeNAS [18] to avoid the suboptimal exploration observed with deterministic optimization [18] by introducing stochasticity and encouraging exploration. However, these works were aimed at learning a point estimate of the architecture and weights rather than uncertainty quantification, ensembling, or robustness.
In this work, we develop Unified probabilistic architecture and weight ensembling Neural Architecture Search (UraeNAS) to improve the accuracy and robustness of neural network models. We employ a distribution learning approach to differentiable NAS, which allows us to move beyond ad hoc architecture selection and point estimation of architecture parameters and instead treat them as random variables and estimate their distributions. This distribution learning of architectures, when combined with the Bayesian formulation of neural network weights, allows us to characterize the full epistemic uncertainty arising from the modeling choices of neural networks. With UraeNAS, we are able to generate rich samples/ensembles from the joint distribution of the architecture and weight parameters, which provides significant improvements in uncertainty/calibration, accuracy, and robustness in both in-distribution and out-of-distribution scenarios compared to deterministic models and weight-ensemble models.
2 Unified probabilistic architecture and weight ensembling NAS
2.1 Distributional formulation of differentiable NAS
In the differentiable NAS setup, the neural network search space is designed by repeatedly stacking building blocks called cells [12, 15, 16]. The cells can be normal cells or reduction cells. Normal cells maintain the spatial resolution of the inputs, while reduction cells halve the spatial resolution but double the number of channels. Different neural network architectures are generated by changing the basic cell structure. Each cell is represented by a directed acyclic graph with $N$ ordered nodes and $E$ edges. The feature maps are denoted by $x^{(j)}$, $0 \le j \le N-1$, and each edge corresponds to an operation $o^{(i,j)}$. The feature map of each node is given by $x^{(j)} = \sum_{i<j} o^{(i,j)}(x^{(i)})$, with $x^{(0)}$ and $x^{(1)}$ fixed to be the outputs of the previous two cells. The final output of each cell is obtained by concatenating the outputs of the intermediate nodes, that is, $(x^{(2)}, x^{(3)}, \ldots, x^{(N-1)})$.
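As a concrete illustration of the cell computation above, the following sketch (in PyTorch-style Python; the function name, the `edge_ops` dictionary, and the toy usage are illustrative assumptions, not taken from the paper's code) builds each intermediate node as the sum of transformed feature maps from its predecessors and concatenates the intermediate nodes along the channel dimension to form the cell output.

```python
import torch

def cell_forward(x0, x1, edge_ops):
    """Compute one cell. edge_ops[(i, j)] is the operation o^(i,j) on edge (i, j)."""
    nodes = [x0, x1]  # x(0), x(1): outputs of the two previous cells
    num_nodes = max(j for _, j in edge_ops) + 1
    for j in range(2, num_nodes):
        # x(j) = sum_{i<j} o^(i,j)(x(i))
        nodes.append(sum(edge_ops[(i, j)](nodes[i]) for i in range(j)))
    # Cell output: concatenation of the intermediate nodes x(2), ..., x(N-1)
    return torch.cat(nodes[2:], dim=1)

# Toy usage: identity operations on every edge of a 4-node cell (N = 4)
ops = {(i, j): torch.nn.Identity() for j in range(2, 4) for i in range(j)}
out = cell_forward(torch.randn(1, 8, 16, 16), torch.randn(1, 8, 16, 16), ops)
print(out.shape)  # torch.Size([1, 16, 16, 16]): two intermediate nodes concatenated
```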
The operation selection problem is inherently discrete in nature. However, continuous relaxation of the discrete space [15] leads to continuous architecture mixing weights, $\hat{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \theta^{(i,j)}_{o}\, o(x)$, that can be learned through gradient-based optimization. The transformed operation $\hat{o}^{(i,j)}$ is a weighted average of the operations selected from a finite candidate space $\mathcal{O}$. The input features are denoted by $x$, and $\theta^{(i,j)}_{o}$ represents the weight of operation $o$ on edge $(i,j)$. The operation mixing weights $\theta^{(i,j)} = (\theta^{(i,j)}_{1}, \theta^{(i,j)}_{2}, \ldots, \theta^{(i,j)}_{|\mathcal{O}|})$ lie on a probability simplex, i.e., $\sum_{o \in \mathcal{O}} \theta^{(i,j)}_{o} = 1$. Throughout this paper, we use the terms architecture parameters and operation mixing weights interchangeably.
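This continuous relaxation is straightforward to express in code. Below is a minimal sketch; the `MixedOp` class name, the candidate operations, and the use of a softmax over unconstrained logits to obtain simplex-valued mixing weights are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation of one edge: o_hat(x) = sum_o theta_o * o(x)."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # Unconstrained logits; a softmax maps them onto the probability simplex
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))

    def forward(self, x):
        theta = F.softmax(self.alpha, dim=0)  # theta_o >= 0 and sum_o theta_o = 1
        return sum(t * op(x) for t, op in zip(theta, self.ops))

# Toy candidate space O: a 3x3 convolution, a 1x1 convolution, and the identity
C = 8
mixed = MixedOp([nn.Conv2d(C, C, 3, padding=1), nn.Conv2d(C, C, 1), nn.Identity()])
y = mixed(torch.randn(2, C, 16, 16))  # weighted average of the candidate outputs
```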
NAS as Bi-level Optimization: With a differentiable architecture search (DAS) formulation, NAS can be posed as a bi-level optimization problem over the neural network weights $w$ and the architecture parameters $\theta$ [15] in the following manner:
$$\min_{\theta} \; \mathcal{L}_{\mathrm{val}}\big(w^{*}(\theta), \theta\big) \quad \text{s.t.} \quad w^{*}(\theta) \in \arg\min_{w} \mathcal{L}_{\mathrm{train}}(w, \theta) \qquad (1)$$
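In practice, this bi-level problem is typically handled with alternating first-order updates: one step on the weights using the training loss, followed by one step on the architecture parameters using the validation loss. The sketch below is a minimal first-order approximation of Eq. (1); the function name, the optimizer arguments, and the batch arguments are illustrative assumptions rather than the paper's training loop.

```python
def alternating_step(model, w_optimizer, arch_optimizer, train_batch, val_batch, loss_fn):
    """One alternating update. w_optimizer holds the weight parameters w;
    arch_optimizer holds the architecture parameters theta (e.g., the alpha
    logits of each MixedOp)."""
    # Inner problem: one step approximating argmin_w L_train(w, theta)
    x_tr, y_tr = train_batch
    w_optimizer.zero_grad()
    loss_fn(model(x_tr), y_tr).backward()
    w_optimizer.step()

    # Outer problem: update theta on held-out validation data,
    # i.e., a first-order step on L_val(w*(theta), theta)
    x_val, y_val = val_batch
    arch_optimizer.zero_grad()
    loss_fn(model(x_val), y_val).backward()
    arch_optimizer.step()
```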
However, it was observed in recent works [16] that optimizing directly over the architecture parameters can lead to overfitting due to insufficient exploration of the architecture space. To alleviate this, different DAS strategies were employed [16, 19, 20]. Among them, the most versatile is the distribution learning approach [16], in which the architecture parameters are sampled from a distribution such as the Dirichlet distribution, $\theta^{(i,j)} \overset{\mathrm{iid}}{\sim} \mathrm{Dirichlet}(\beta^{(i,j)})$, which can inherently satisfy the simplex constraint.