
Table 1: (Un-)Related fields and how they compare in terms of: A) whether they make instance-wise
predictions; B) the distributional information which they require; C) the form of that information; and
D) the target quantity they are focused on. References given as: [1] Torrey and Shavlik (2010); [2]
Raftery et al. (1997); [3] Ren et al. (2019); [4] Ruta and Gabrys (2005); [5] Jaffe et al. (2016).

Problem                          | Ref. | A) Instance-wise | B) Distribution            | C) Information | D) Target
Transfer Learning                | [1]  | ✓                | p^train(X,Y), p^test(X,Y)  | D_Train        | p^test(Y|X)
Bayesian Model Averaging         | [2]  | ✗                | p^test(X,Y)                | D_Val          | p(M_i|D)
Out-of-distribution Detection    | [3]  | ✗/✓              | p^test(X)                  | -              | p^train_X(x^test) <
Majority Voting                  | [4]  | ✗                | p^train(X,Y)               | -              | p^test(Y|X)
Unsupervised Ensemble Regression | [5]  | ✗                | p^test(Y)                  | E[Y], Var[Y]   | ŵ
Synthetic Model Combination      | [Us] | ✓                | {p_{M_j}(X)}_{j=1}^N       | {I_j}_{j=1}^N  | w(x)
In our case, we assume information from considerably less informative distributions: the training-time
feature distributions {p_{M_j}(X)}_{j=1}^N, where the information can take a variety of forms, most
practically samples of the features or details of their first and second moments. This appears to be
the minimal set of information with which we can do something useful, since if we are given only a
set of models and no accompanying information, there is no way to determine which models may be
best in general, let alone for specific features.
How are models normally ensembled?
The literature on ensemble methods is vast, and we do not intend to provide a survey; a number of
surveys already exist (Sagi and Rokach, 2018; Dong et al., 2020). The focus is often on training
one's own set of models, which can then be ensembled for insight into epistemic uncertainty
(Rahaman et al., 2021) or boosted for performance (Chen and Guestrin, 2016).
Among methods for ensembling models that are provided to a practitioner (rather than trained by
them as well), the closest is the setting of unsupervised ensemble regression, which, like us, does
not consider any joint feature-label distribution. To make progress, though, some information needs
to be provided: Dror et al. (2017) consider the marginal label distribution p^test(Y), making the
strong assumption that the first two moments of the response are known. Instead of being directly
provided the mean and variance, another strand of work assumes conditional independence of the
predictors given the label (Dawid and Skene, 1979), meaning that any predictors that agree
consistently will be more likely to be accurate. Platanios et al. (2014) and Jaffe et al. (2016)
relax this assumption through the use of graphical models and meta-learner construction respectively,
with a Bayesian approach proposed by Platanios et al. (2016). Recent work of Shaham et al. (2016)
attempts to learn the dependence structure using restricted Boltzmann machines.
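To make the role of the known moments concrete, consider the following toy sketch (illustrative only, and not the actual estimator of Dror et al. (2017); the synthetic data and names are our own assumptions): with E[Y] and Var[Y] known, each black-box regressor can at least be scored by how far the empirical moments of its predictions on unlabelled features fall from the known values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Known first two moments of the response (the strong assumption of this setting).
known_mean, known_var = 0.0, 4.0

# Synthetic ground truth and predictions from three black-box regressors.
y = rng.normal(known_mean, np.sqrt(known_var), size=5000)
preds = np.stack([
    y + rng.normal(0.0, 0.5, size=5000),  # accurate: moments roughly match
    y + 1.5,                              # systematically biased mean
    rng.normal(0.0, 1.0, size=5000),      # uninformative, wrong variance
])

# Score each model by the distance of its empirical output moments
# from the known moments of Y.
mismatch = [(p.mean() - known_mean) ** 2 + (p.var() - known_var) ** 2 for p in preds]
print(int(np.argmin(mismatch)))  # → 0: the moment-consistent model scores best
```

Note that moment matching alone cannot distinguish an accurate model from an uninformative one whose output moments happen to be correct, which is part of why the conditional-independence and graphical-model assumptions discussed above are introduced.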
Model averaging when validation data is available.
Moving on from the unsupervised methods mentioned above, which at most consider information
from marginal distributions, we come to the case where we have some validation data from the
joint distribution p^test(X,Y) (or one assumed to be the same). This data can effectively be used
to rank the given models and so build more informed ensembles. Such an ensemble can be created
in a number of ways (Huang et al., 2009), including work that moves in the direction of
instance-wise predictions by dividing the space into regions before calculating per-region weights
(Verikas et al., 1999). A practical and more common approach is Bayesian Model Averaging (BMA)
(Raftery et al., 1997). Given an appropriate prior, we calculate the posterior probability that a
given model is the optimal one; once this is obtained, the models can be marginalised out during
test-time predictions, creating an ensemble weighted by each model's posterior probability. The
posterior being intractable, it is approximated using the Bayesian Information Criterion (BIC)
(Neath and Cavanaugh, 2012), which requires a likelihood estimate over some validation set, giving:

p(M_i | D) = exp(-BIC(M_i)/2) / Σ_{j=1}^N exp(-BIC(M_j)/2).
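For concreteness, this weighting amounts to a softmax over -BIC/2, which can be computed stably by shifting by the minimum BIC (a minimal sketch; the function name and example scores are our own, illustrative choices):

```python
import numpy as np

def bma_weights(bics):
    """Posterior model probabilities p(M_i|D) proportional to exp(-BIC(M_i)/2).

    Subtracting the minimum BIC leaves the normalised weights unchanged
    while avoiding numerical underflow for large BIC values.
    """
    bics = np.asarray(bics, dtype=float)
    w = np.exp(-0.5 * (bics - bics.min()))
    return w / w.sum()

# Three models with BIC scores 1000, 1004 and 1010: a 4-point BIC gap
# already concentrates most of the posterior mass on the first model.
print(bma_weights([1000.0, 1004.0, 1010.0]).round(3))  # → [0.876 0.118 0.006]
```

The resulting weights are global constants; they do not vary with the test input x, which is exactly the limitation discussed next.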
With this, as with all the ensemble methods previously mentioned, it is important to note the
subtle difference in setup from the problem we are trying to solve. In all cases, it is assumed
that there is some ordering of the models that holds across the feature space, and so a global
ensemble is produced with a fixed weighting ŵ such that w(x) = ŵ for all x ∈ X. This causes
failure cases when the models' relative performance varies across the feature space, since it is a
key point that BMA is not model combination (Minka, 2000). This is an important distinction and
one of the main reasons BMA has been shown to perform badly under covariate-shifted tasks
(Izmailov et al., 2021). That being said, it can be extended by