
In general, DCA with finer granularity generates a greater number of model proposals and more varied
predictions. However, the DCA components are more tightly coupled, deviating more significantly
from the assumption of a decomposable posterior, which could lead to worse performance. This results
in a tradeoff between performance and prediction variety. Alternatively, one can also enrich the set of
model proposals by using more DCA instances, at the cost of a higher computational budget. This
can also lead to improved performance (cf. Section 5.3).
Among the broad range of granularities, there exists a notable dichotomy between (sub)layerwise
DCA and multilayered DCA. This comes from the fact that component instances from multilayered
DCA variants can have similar behaviors but dissimilar weights. Neural networks admit a large
number of equivalent reparameterizations via the reordering of hidden layer neurons. While the joint
training of DCA enforces consistency among different DCA components of the network model, it
does not prevent equivalent reordering of hidden neurons inside multilayer components. This issue
does not occur for (sub)layerwise DCA. To make this distinction, we refer to the (sub)layerwise
cases as fine-grain DCA and to the multilayered cases as coarse-grain DCA.
3 Deep combinatorial weight averaging (DCWA) for fine-grain aggregation
For fine-grain DCA models, it turns out that averaging the learned weights of DCA components leads
to an improved parameterization of the base model. We discuss this procedure here in detail.
Deep combinatorial weight averaging
Since component instances of a fine-grain DCA model
have compatible weights after the joint training, it is sensible to consider their mean value. This
produces a new average parameterization for the base network model. We refer to this process as
deep combinatorial weight averaging (DCWA).
Consider as an example the layerwise DCA model for a base neural network with $L$ layers. After
the joint training of $n$ sets of DCA layer instances parameterized by $\Theta = (\theta^1_{1:n}, \dots, \theta^L_{1:n})$, DCWA
simply computes the average parameterization $\bar{\theta} = (\bar{\theta}^1, \dots, \bar{\theta}^L)$ for the base network model, where
for each layer $l$, we have $\bar{\theta}^l = \frac{1}{n} \sum_{i=1}^{n} \theta^l_i$.
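As a concrete illustration, the following is a minimal sketch of this averaging step, assuming a PyTorch-style setup in which `theta[l]` holds the $n$ parameter tensors of the DCA instances of layer $l$; the function name `dcwa_average` and this data layout are hypothetical, not the authors' implementation:

```python
import torch

def dcwa_average(theta):
    """Average the n DCA instances of each layer.
    Hypothetical layout: theta[l] is a list of n tensors holding the
    parameters of the n instances of layer l."""
    theta_bar = []
    for layer_instances in theta:                      # layers l = 1, ..., L
        stacked = torch.stack(layer_instances, dim=0)  # shape (n, *param_shape)
        theta_bar.append(stacked.mean(dim=0))          # bar(theta)^l = (1/n) sum_i theta^l_i
    return theta_bar
```

The resulting per-layer averages then serve as the parameterization $\bar{\theta}$ of a single copy of the base network.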
Experiments show that DCWA achieves test accuracy comparable to the corresponding DCA predic-
tions (cf. Section 5.3). It also consistently outperforms the standard training of the base network, and
delivers results comparable to SWA [19] (cf. Section 5.1).
Comparison to SWA
It is interesting to compare DCWA with SWA [19], since they are both
weight averaging schemes that improve upon standard training. That said, DCWA and SWA are based
on different principles: SWA averages weights along the SGD trajectory, whereas DCWA averages the
weights of components obtained from DCA joint training. In practice, SWA requires custom learning
rate scheduling and a careful choice of the final learning rate. SWA also requires an additional Batch
Normalization update [19] to produce good predictions, which incurs extra overhead. In contrast,
DCWA has none of these issues and is simple to implement and deploy.
4 Consistency enforcing loss for DCA and DCWA
Through component combination, DCA is able to produce a combinatorial amount of model proposals.
However, during the joint training, each DCA component receives gradient updates from different
model proposals. These updates can be inconsistent, which can lead to suboptimal training of
the DCA model. To remedy this issue, we propose in this section a consistency enforcing loss to
encourage consistency among DCA model proposals.
Consistency enforcing loss
To promote consistency among DCA model proposals, we encourage
DCA predictions to agree with both the ground truth and the predictions from other model proposals.
To achieve this, given an input $x$ with ground truth $y$ and a DCA model proposal parameterized by $\hat{\theta}$,
instead of minimizing the negative log-likelihood $\ell_{\mathrm{NLL}}(x, y; \hat{\theta}) = -\log p(y \mid x; \hat{\theta})$, the consistency
enforcing loss includes an additional KL divergence term between the predictive output probability
$p(y \mid x; \hat{\theta})$ and a reference output probability $\tilde{p}$:
$$\ell(x, y, \tilde{p}; \hat{\theta}) = -\log p(y \mid x; \hat{\theta}) + D_{\mathrm{KL}}\big(\tilde{p} \,\|\, p(\cdot \mid x; \hat{\theta})\big). \tag{4}$$
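As an illustration of Eq. (4), the sketch below combines the negative log-likelihood with the KL term using standard PyTorch primitives; the function name `consistency_enforcing_loss` and the tensor layout are assumptions for illustration, not the authors' code:

```python
import torch.nn.functional as F

def consistency_enforcing_loss(logits, target, ref_probs):
    """Sketch of Eq. (4): NLL of the ground truth plus D_KL(p_tilde || p),
    where p = softmax(logits) is the proposal's predictive distribution
    and ref_probs is the reference distribution p_tilde.
    Shapes: logits (batch, classes), target (batch,), ref_probs (batch, classes)."""
    log_p = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_p, target)                      # -log p(y | x; theta_hat)
    # F.kl_div expects log-probabilities as input and probabilities as target,
    # computing D_KL(target || input), i.e. D_KL(p_tilde || p) here.
    kl = F.kl_div(log_p, ref_probs, reduction="batchmean")
    return nll + kl
```

Following the text above, a natural choice for `ref_probs` is the prediction (or averaged predictions) of other model proposals for the same input.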