
In general, DCA with finer granularity generates a greater number of model proposals and more varied
predictions. However, the DCA components are more tightly coupled, deviating more significantly
from the assumption of a decomposable posterior, which could lead to worse performance. This results
in a tradeoff between performance and prediction variety. Alternatively, one can also enrich the set of
model proposals by using more DCA instances, at the cost of a higher computational budget. This
can also lead to improved performance (cf. Section 5.3).
Among the broad range of granularities, there exists a notable dichotomy between (sub)layerwise
DCA and multilayered DCA. This comes from the fact that component instances from multilayered
DCA variants can have similar behaviors but dissimilar weights. Neural networks admit a large
number of equivalent reparameterizations via the reordering of hidden layer neurons. While the joint
training of DCA enforces consistency among different DCA components of the network model, it
does not prevent equivalent reordering of hidden neurons inside multilayer components. This issue
does not occur for (sub)layerwise DCA. To make this distinction, we refer to the (sub)layerwise
cases as fine-grain DCA and to the multilayered cases as coarse-grain DCA.
3 Deep combinatorial weight averaging (DCWA) for fine-grain aggregation
For fine-grain DCA models, it turns out that averaging the learned weights of DCA components leads
to an improved parameterization of the base model. We discuss this procedure here in detail.
Deep combinatorial weight averaging
Since component instances of a fine-grain DCA model
have compatible weights after the joint training, it is sensible to consider their mean value. This
produces a new average parameterization for the base network model. We refer to this process as
deep combinatorial weight averaging (DCWA).
Consider as an example the layerwise DCA model for a base neural network with $L$ layers. After
the joint training of $n$ sets of DCA layer instances parameterized by $\Theta = (\theta^1_{1:n}, \dots, \theta^L_{1:n})$, DCWA
simply computes the average parameterization $\bar{\theta} = (\bar{\theta}^1, \dots, \bar{\theta}^L)$ for the base network model, where
for each layer $l$, we have $\bar{\theta}^l = \frac{1}{n} \sum_{i=1}^{n} \theta^l_i$.
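As a concrete illustration, the following is a minimal sketch of this averaging step, assuming a PyTorch-style setup in which `theta[l]` holds the $n$ parameter tensors of the DCA instances of layer $l$; the function name `dcwa_average` and this data layout are hypothetical, not the authors' implementation:

```python
import torch

def dcwa_average(theta):
    """Average the n DCA instances of each layer.
    Hypothetical layout: theta[l] is a list of n tensors holding the
    parameters of the n instances of layer l."""
    theta_bar = []
    for layer_instances in theta:                      # layers l = 1, ..., L
        stacked = torch.stack(layer_instances, dim=0)  # shape (n, *param_shape)
        theta_bar.append(stacked.mean(dim=0))          # bar(theta)^l = (1/n) sum_i theta^l_i
    return theta_bar
```

The resulting per-layer averages then serve as the parameterization $\bar{\theta}$ of a single copy of the base network.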
Experiments show that DCWA achieves test accuracy comparable to the corresponding DCA predic-
tions (cf. Section 5.3). It also consistently outperforms the standard training of the base network, and
delivers results comparable to SWA [19] (cf. Section 5.1).
Comparison to SWA
It is interesting to compare DCWA with SWA [19], since they are both
weight averaging schemes that improve upon standard training. That said, DCWA and SWA are based
on different principles: SWA averages weights along the SGD trajectory, whereas DCWA averages the
weights of components obtained from DCA joint training. In practice, SWA requires custom learning
rate scheduling and a careful choice of the final learning rate. SWA also requires an additional Batch
Normalization update [19] to produce good predictions, which incurs extra overhead. In contrast,
DCWA has none of these issues and is simple to implement and deploy.
4 Consistency enforcing loss for DCA and DCWA
Through component combination, DCA is able to produce a combinatorial amount of model proposals.
However, during the joint training, each DCA component receives gradient updates from different
model proposals. These updates can be inconsistent, which can lead to suboptimal training of
the DCA model. To remedy this issue, we propose in this section a consistency enforcing loss to
encourage consistency among DCA model proposals.
Consistency enforcing loss
To promote consistency among DCA model proposals, we encourage
DCA predictions to agree with both the ground truth and the predictions from other model proposals.
To achieve this, given an input $x$ with ground truth $y$ and a DCA model proposal parameterized by $\hat{\theta}$,
instead of minimizing the negative log-likelihood $\ell_{\mathrm{NLL}}(x, y; \hat{\theta}) = -\log p(y \mid x; \hat{\theta})$, the consistency
enforcing loss includes an additional KL divergence term between the predictive output probability
$p(y \mid x; \hat{\theta})$ and a reference output probability $\tilde{p}$:
$$\ell(x, y, \tilde{p}; \hat{\theta}) = -\log p(y \mid x; \hat{\theta}) + D_{\mathrm{KL}}\big(\tilde{p} \,\|\, p(\cdot \mid x; \hat{\theta})\big). \tag{4}$$
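As an illustration of Eq. (4), the sketch below combines the negative log-likelihood with the KL term using standard PyTorch primitives; the function name `consistency_enforcing_loss` and the tensor layout are assumptions for illustration, not the authors' code:

```python
import torch.nn.functional as F

def consistency_enforcing_loss(logits, target, ref_probs):
    """Sketch of Eq. (4): NLL of the ground truth plus D_KL(p_tilde || p),
    where p = softmax(logits) is the proposal's predictive distribution
    and ref_probs is the reference distribution p_tilde.
    Shapes: logits (batch, classes), target (batch,), ref_probs (batch, classes)."""
    log_p = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_p, target)                      # -log p(y | x; theta_hat)
    # F.kl_div expects log-probabilities as input and probabilities as target,
    # computing D_KL(target || input), i.e. D_KL(p_tilde || p) here.
    kl = F.kl_div(log_p, ref_probs, reduction="batchmean")
    return nll + kl
```

Following the text above, a natural choice for `ref_probs` is the prediction (or averaged predictions) of other model proposals for the same input.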