Deep Combinatorial Aggregation
Yuesong Shen 1,2Daniel Cremers 1,2
1Technical University of Munich, Germany
2Munich Center for Machine Learning, Germany
{yuesong.shen, cremers}@tum.de
Abstract
Neural networks are known to produce poor uncertainty estimations, and a variety of approaches have been proposed to remedy this issue. This includes deep ensemble, a simple and effective method that achieves state-of-the-art results for uncertainty-aware learning tasks. In this work, we explore a combinatorial generalization of deep ensemble called deep combinatorial aggregation (DCA). DCA creates multiple instances of network components and aggregates their combinations to produce diversified model proposals and predictions. DCA components can be defined at different levels of granularity, and we discover that coarse-grain DCAs can outperform deep ensemble for uncertainty-aware learning both in terms of predictive performance and uncertainty estimation. For fine-grain DCAs, we discover that an average parameterization approach named deep combinatorial weight averaging (DCWA) can improve the baseline training. It is on par with stochastic weight averaging (SWA) but does not require any custom training schedule or adaptation of BatchNorm layers. Furthermore, we propose a consistency enforcing loss that helps the training of DCWA and modelwise DCA. We experiment on in-domain, distributional shift, and out-of-distribution image classification tasks, and empirically confirm the effectiveness of DCWA and DCA approaches.1
1 Introduction
Deep learning has achieved groundbreaking progress and neural networks are now widely used in various domains [27]. However, they are known to produce poor uncertainty estimations [11, 41, 12], which can be problematic for challenges like safety-critical applications [20, 29] or active learning [38]. Numerous approaches have been proposed to tackle this issue, among which an effective yet simple method is deep ensemble [26]. Deep ensemble yields state-of-the-art results for uncertainty-aware learning [41, 12], and it does not require elaborate architectural design and hyperparameter search. However, while it aggregates multiple separately trained models, it can neither generate new samples from the posterior to obtain more diverse predictions, nor produce a summarizing average model which improves on the individual models in the ensemble.
Motivated by the success of deep ensemble, in this paper we propose deep combinatorial aggregation (DCA), which generalizes it from a combinatorial perspective: given the hierarchical structure of neural networks, we explore the idea of ensembling components of the network architecture and combining them to form an enriched collection of model proposals. DCA inherits the simplicity and effectiveness of deep ensemble, and additionally leads to several new possibilities: apart from generating a diversified set of model proposals, we discover that fine-grain DCA can lead to a new average proposal via deep combinatorial weight averaging (DCWA). Furthermore, DCA training can benefit from a consistency enforcing loss, which can produce DCA models that surpass standard deep ensemble both in classification performance and uncertainty estimation.
1Source code is available at https://github.com/tum-vision/dca.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.06436v1 [cs.LG] 12 Oct 2022
1.1 Related work
Improving not only the predictive performance but also the uncertainty estimation of neural networks has been a core objective of Bayesian deep learning, where an abundance of prior work exists. This includes methods based on variational inference and weight perturbation such as Bayes by backprop [10, 4] and its variants [45, 6], Bayesian interpretation of dropout including MC dropout [8] and variational dropout [3, 33], expectation propagation which leads to probabilistic backpropagation [16], Markov chain Monte Carlo (MCMC) with methods like stochastic gradient Langevin dynamics (SGLD) [44] and stochastic gradient Hamiltonian Monte Carlo (SGHMC) [5], stochastic gradient descent (SGD) as approximate MCMC which results in SWAG [31], as well as Bayesian noisy optimizers such as variational online Gauss-Newton (VOGN) [21] and approaches using Laplace approximation [23, 36]. Beyond the Bayesian formulation, methods like post-hoc calibration [11, 43] readjust trained networks to produce more calibrated predictions, while approaches like evidential deep learning [37] also make use of ideas like subjective logic.

Most relevant to our work is the deep ensemble method [26], which aggregates multiple independently trained network models with different initial parameters. Deep ensemble has been shown to produce state-of-the-art results for uncertainty estimation [41, 12]. It can be combined with methods like MC dropout [7] and SWAG [48], or extended to the hyperparameter space [47]. Several variants have also been proposed, which often aim at providing more computationally or memory-efficient alternatives. This includes snapshot ensemble [17], BatchEnsemble [46], fast geometric ensembling [9], TreeNet [28], etc. In contrast, this work aims at exploring a combinatorial generalization of deep ensemble to obtain new features and improve the performance of uncertainty estimation.
Lastly, our proposed DCWA method provides an alternative weight averaging scheme comparable to
stochastic weight averaging (SWA) [19].
1.2 Contributions
The main contributions of this paper are the following:
• We propose deep combinatorial aggregation, a combinatorial generalization of deep ensemble that can produce more diverse model proposals and predictions.
• We explore DCA at different levels of granularity and propose deep combinatorial weight averaging (DCWA) for fine-grain DCA models. It produces a new average model that improves on standard training and is competitive w.r.t. alternatives like stochastic weight averaging (SWA) [19].
• We introduce a consistency enforcing loss adapted for DCA training. It strengthens the predictive consistency of DCA model proposals and consistently improves the performance of DCA and DCWA models.
• We conduct experiments on in-domain, distributional shift, and out-of-distribution image classification tasks, which validate our analysis and demonstrate the effectiveness of DCA for uncertainty-aware learning.
2 Deep combinatorial aggregation (DCA)
In this section, we introduce the deep combinatorial aggregation (DCA) method. To simplify our discussion, we start with a layerwise setting and assume that our base model is a neural network with $L$ layers. This setting can easily be generalized to the other DCA variants discussed in Section 2.3.
Figure 1: Illustration of layerwise deep combinatorial aggregation using a three-layer neural network: two sets of DCA parameter instances result in $2^3 = 8$ model proposals. A random proposal is chosen for each forward pass during both training and test time to generate diverse prediction samples.
2.1 Methodology
The main idea of layerwise deep combinatorial aggregation is straightforward: while deep ensemble [26] creates multiple copies of the entire model with different initializations, layerwise DCA instead creates multiple instances for each layer in the model. Randomly selected instances from network layers can be combined to form a variety of model proposals. As shown in Figure 1, this results in an exponential number of proposals w.r.t. the network depth, since $n$ sets of layer instances lead to $n^L$ total network proposals.
To ensure consistency among the instance combinations, all layer instances are jointly trained: during each feed-forward pass, a random instance choice from each layer is made to construct a model proposal, and the parameters belonging to the selected layer instances are then updated via backpropagation. This differs from deep ensemble [26], where model copies are independently trained. In Appendix A we provide pseudo-code for layerwise DCA training.
At inference time, to obtain uncertainty-aware predictions, one can simply sample multiple DCA model proposals and aggregate their predictions.
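To make this procedure concrete, below is a minimal PyTorch sketch of layerwise DCA for a small multilayer perceptron. It is an illustrative reconstruction, not the authors' reference implementation (see the linked repository for that); the names DCALinear, DCAMLP, and dca_predict are our own. Because only the randomly selected instances participate in a given forward pass, backpropagation automatically updates only their parameters.

```python
# Minimal, illustrative PyTorch sketch of layerwise DCA (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DCALinear(nn.Module):
    """A linear layer with several independently initialized weight instances.
    Each forward pass uses one randomly selected instance."""

    def __init__(self, in_features, out_features, n_instances=2):
        super().__init__()
        self.instances = nn.ModuleList(
            [nn.Linear(in_features, out_features) for _ in range(n_instances)]
        )

    def forward(self, x):
        # Random instance choice per forward pass, during both training and test time.
        idx = torch.randint(len(self.instances), (1,)).item()
        return self.instances[idx](x)


class DCAMLP(nn.Module):
    """Three-layer MLP where every layer is a DCA component: with n_instances=2
    this realizes the 2^3 = 8 combinatorial model proposals of Figure 1."""

    def __init__(self, d_in, d_hidden, n_classes, n_instances=2):
        super().__init__()
        self.l1 = DCALinear(d_in, d_hidden, n_instances)
        self.l2 = DCALinear(d_hidden, d_hidden, n_instances)
        self.l3 = DCALinear(d_hidden, n_classes, n_instances)

    def forward(self, x):
        return self.l3(F.relu(self.l2(F.relu(self.l1(x)))))


@torch.no_grad()
def dca_predict(model, x, n_samples=8):
    """Aggregate softmax predictions over several randomly sampled model proposals."""
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0)
```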
2.2 Understanding DCA
While DCA is a simple and intuitive procedure, it is helpful to conduct a more in-depth theoretical
analysis to understand the assumptions it implicitly makes and their implications.
Being able to freely combine samples of layer parameters assumes that they follow mutually independent distributions. This translates to the assumption that the weight posterior $p(\theta \mid y, x)$ should be layerwise decomposable:
$$ p(\theta \mid y, x) \;\approx\; \prod_{l=1}^{L} \varphi_l(\theta_l). \tag{1} $$
Is this a valid assumption? Not exactly. Admittedly, the posterior is proportional to the product of prior $p(\theta)$ and likelihood $p(y \mid x, \theta)$ (i.e. $p(\theta \mid y, x) \propto p(\theta)\,p(y \mid x, \theta)$), and the weight prior $p(\theta)$ often satisfies a layerwise independence assumption, e.g., commonly used weight decay is equivalent to a Gaussian prior with constant diagonal covariance matrix. However, the likelihood term $p(y \mid x, \theta)$ is in general not decomposable. In fact, following the directed graphical model [24] interpretation of standard neural networks [39], the base network represents the overall distribution $p(h_{1:L-1}, y \mid x, \theta)$, where $h_l$ represents the hidden neurons in layer $l$. Interestingly, $p(h_{1:L-1}, y \mid x, \theta)$ is actually layerwise decomposable itself:
$$ p(h_{1:L-1}, y \mid x, \theta) = \prod_{l=1}^{L} p(h_l \mid h_{l-1}, \theta_l), \qquad (x, y := h_0, h_L). \tag{2} $$
Nevertheless, the likelihood term $p(y \mid x, \theta)$ requires marginalization of all hidden neurons $h_{1:L-1}$:
$$ p(y \mid x, \theta) = \int_{h_{1:L-1}} p(h_{1:L-1}, y \mid x, \theta)\, \mathrm{d}h_{1:L-1}. \tag{3} $$
This entangles the layer parameters, and $p(y \mid x, \theta)$ is no longer decomposable in general.
Thus, approximations are made when we perform layerwise DCA, and experiments show that this indeed results in some performance penalty. The above analysis also implies that weight aggregation at a coarser, multilayer level could alleviate the issue. This is also confirmed by our empirical findings (cf. Section 5.3).
2.3 Granularity of aggregation
The analysis in Section 2.2 raises an interesting question about performing deep combinatorial aggregation at different levels of granularity: from finest to coarsest, DCA can be defined at the neuronwise, layerwise, multilayer, or modelwise level. For convolutional neural networks, due to weight sharing along the spatial dimensions, DCA can work with channel components instead of neurons. Note that modelwise DCA is similar to deep ensemble, except that in each epoch the DCA model copies are trained on distinct subsets of the training data resulting from the random component selection, and that it can benefit from the consistency enforcing loss introduced in Section 4 (a modelwise sketch follows below).
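As a concrete contrast to the layerwise sketch in Section 2.1, the snippet below illustrates the coarse-grain, modelwise variant, where the sampled DCA component is an entire model copy. This is again our own illustrative code with hypothetical names, not the authors' implementation.

```python
# Illustrative sketch of modelwise (coarse-grain) DCA.
import torch
import torch.nn as nn


class ModelwiseDCA(nn.Module):
    """Coarse-grain DCA: the DCA component is a full model copy. Each forward
    pass routes the batch through one randomly chosen copy, so within an epoch
    the copies see distinct random subsets of the training data."""

    def __init__(self, make_model, n_instances=2):
        super().__init__()
        # make_model is a factory returning a freshly initialized base network.
        self.instances = nn.ModuleList([make_model() for _ in range(n_instances)])

    def forward(self, x):
        idx = torch.randint(len(self.instances), (1,)).item()
        return self.instances[idx](x)
```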
In general, DCA with finer granularity generates a greater number of model proposals and more varied predictions. However, its DCA components are more tightly coupled and deviate more significantly from the assumption of a decomposable posterior, which can lead to worse performance. This results in a tradeoff between performance and prediction variety. Alternatively, one can also enrich the set of model proposals by using more DCA instances, at the cost of a higher computational budget. This can also lead to improved performance (cf. Section 5.3).
Among the broad range of granularities, there exists a notable dichotomy between (sub)layerwise DCA and multilayer DCA. It comes from the fact that component instances of multilayer DCA variants can have similar behaviors but dissimilar weights: neural networks admit a large number of equivalent reparameterizations via the reordering of hidden-layer neurons. While the joint training of DCA enforces consistency among the different DCA components of the network model, it does not prevent equivalent reorderings of hidden neurons inside multilayer components. This issue does not occur in the (sub)layerwise case. To make this distinction, we refer to the (sub)layerwise cases as fine-grain DCA and the multilayer cases as coarse-grain DCA.
3 Deep combinatorial weight averaging (DCWA) for fine-grain aggregation
For fine-grain DCA models, it turns out that averaging the learned weights of DCA components leads
to an improved parameterization of the base model. We discuss this procedure here in detail.
Deep combinatorial weight averaging
Since component instances of a fine-grain DCA model
have compatible weights after the joint training, it is sensible to consider their mean value. This
produces a new average parameterization for the base network model. We refer to this process as
deep combinatorial weight averaging (DCWA).
Consider as an example the layerwise DCA model for a base neural network with $L$ layers. After the joint training of $n$ sets of DCA layer instances parameterized by $\Theta = (\theta^1_{1:n}, \dots, \theta^L_{1:n})$, DCWA simply computes the average parameterization $\bar\theta = (\bar\theta^1, \dots, \bar\theta^L)$ for the base network model, where for each layer $l$ we have $\bar\theta^l = \frac{1}{n} \sum_{i=1}^{n} \theta^l_i$.
Experiments show that DCWA achieves test accuracy comparable to the corresponding DCA predictions (cf. Section 5.3). It also consistently outperforms standard training of the base network, and delivers results comparable to SWA [19] (cf. Section 5.1).
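Given a trained layerwise DCA model such as the DCAMLP sketch above, DCWA amounts to a per-layer mean of the instance parameters. The helpers below are again our own illustrative code under the same assumptions, not the authors' implementation.

```python
# Illustrative DCWA sketch, building on the DCALinear/DCAMLP classes above.
import torch
import torch.nn as nn


@torch.no_grad()
def dcwa_average(dca_layer: "DCALinear") -> nn.Linear:
    """Collapse a trained DCALinear into a single nn.Linear whose parameters
    are the mean of the n instance parameterizations."""
    ref = dca_layer.instances[0]
    avg = nn.Linear(ref.in_features, ref.out_features)
    avg.weight.copy_(torch.stack([inst.weight for inst in dca_layer.instances]).mean(0))
    avg.bias.copy_(torch.stack([inst.bias for inst in dca_layer.instances]).mean(0))
    return avg


@torch.no_grad()
def dcwa_model(dca_mlp: "DCAMLP") -> nn.Sequential:
    """Build the averaged base network from a trained layerwise DCA model."""
    return nn.Sequential(
        dcwa_average(dca_mlp.l1), nn.ReLU(),
        dcwa_average(dca_mlp.l2), nn.ReLU(),
        dcwa_average(dca_mlp.l3),
    )
```

The resulting averaged network is a plain parameterization of the base model and needs no further post-processing before use.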
Comparison to SWA
It is interesting to compare DCWA with SWA [19], since both are weight averaging schemes that improve on standard training. That said, DCWA and SWA are based on different principles: SWA averages over the SGD trajectory, while DCWA averages the jointly trained DCA component instances. In practice, SWA requires a custom learning rate schedule and a careful choice of the end learning rate. Also, SWA requires an additional Batch Normalization update [19] to produce good predictions, which incurs extra overhead. In contrast, DCWA has none of these issues and is simple to implement and deploy.
4 Consistency enforcing loss for DCA and DCWA
Through component combination, DCA is able to produce a combinatorial number of model proposals. However, during the joint training, each DCA component receives gradient updates from different model proposals. These updates can be inconsistent, which can lead to suboptimal training of the DCA model. To remedy this issue, we propose in this section a consistency enforcing loss to encourage consistency among DCA model proposals.
Consistency enforcing loss
To promote consistency among DCA model proposals, we encourage DCA predictions to agree both with the ground truth and with the predictions from other model proposals. To achieve this, given an input $x$ with ground truth $y$ and a DCA model proposal parameterized by $\hat\theta$, instead of minimizing the negative log-likelihood $\ell_{\mathrm{NLL}}(x, y; \hat\theta) = -\log p(y \mid x; \hat\theta)$, the consistency enforcing loss includes an additional KL divergence term between the predictive output probability $p(y \mid x; \hat\theta)$ and a reference output probability $\tilde p$:
$$ \ell(x, y, \tilde p; \hat\theta) = -\log p(y \mid x; \hat\theta) + D_{\mathrm{KL}}\big(\tilde p \,\big\|\, p(\cdot \mid x; \hat\theta)\big). \tag{4} $$
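A possible PyTorch realization of Eq. (4) for classification is sketched below. It assumes `ref_probs` holds the reference distribution $\tilde p$ over classes (e.g. obtained from predictions of other model proposals, as motivated above); the function name and this particular packaging are our own illustration, not the authors' code.

```python
# Illustrative sketch of the consistency enforcing loss of Eq. (4).
import torch
import torch.nn.functional as F


def consistency_enforcing_loss(logits, target, ref_probs):
    """NLL w.r.t. the ground-truth labels plus KL(ref_probs || p), pulling the
    proposal's predictive distribution p towards a reference distribution."""
    nll = F.cross_entropy(logits, target)          # -log p(y|x; theta_hat), batch mean
    log_p = F.log_softmax(logits, dim=-1)
    # F.kl_div expects log-probabilities as input and the reference distribution
    # as target, yielding sum ref * (log ref - log p) = KL(ref || p).
    kl = F.kl_div(log_p, ref_probs, reduction="batchmean")
    return nll + kl
```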