Synthetic Model Combination: An Instance-wise
Approach to Unsupervised Ensemble Learning
Alex J. Chan
University of Cambridge
Cambridge, UK
ajc340@cam.ac.uk
Mihaela van der Schaar
University of Cambridge
Cambridge Centre for AI in Medicine
Cambridge, UK
mv472@cam.ac.uk
Abstract
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data - instead being given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them. In scenarios from finance to the medical sciences, and even consumer practice, stakeholders have developed models on private data they either cannot, or do not want to, share. Given the value of and legislation surrounding personal information, it is not surprising that only the models, and not the data, will be released - the pertinent question becoming: how best to use these models? Previous work has focused on global model selection or ensembling, with the result of a single final model across the feature space. However, machine learning models perform notoriously poorly on data outside their training domain, and so we argue that when ensembling models the weightings for individual instances must reflect their respective domains - in other words, models that are more likely to have seen information on that instance should have more attention paid to them. We introduce a method for such an instance-wise ensembling of models, including a novel representation learning step for handling sparse high-dimensional domains. Finally, we demonstrate the need for and generalisability of our method on classical machine learning tasks, as well as highlighting a real-world use case in the pharmacological setting of vancomycin precision dosing.
1 Introduction
Sharing data is often a very problematic affair - before we even arrive at whether a stakeholder will want to, given its modern-day value, it may not even be allowed. In particular, when the data contains identifiable and personal information it may be inappropriate, and indeed illegal, to do so. A common solution is to provide some proxy of the true data in the form of a generated fully synthetic dataset (Alaa et al., 2020) or one that has undergone some privatisation or anonymisation process (Elliot et al., 2018; Chan et al., 2021), but this can often lead to low-quality or extremely noisy data (Alaa et al., 2021) that is hard to gain insight from. Alternatively, groups may release models that they have trained on their private data. If for a given task there are multiple such models, we are left with the task of how best to use these models in combination when they have potentially conflicting predictions. This problem is known as unsupervised ensemble learning, since we have no explicit signal on the task, and we aim to construct a combination of the models for making future predictions (Jaffe et al., 2016). Without having trained the models ourselves, it is very challenging to know how well the individual models should perform, making the task of choosing the most appropriate model (or ensemble) difficult. This is compounded by the fact that the provided models could perform poorly for not one, but two main reasons: firstly, the model itself may not have been flexible
enough to properly capture the true underlying function present in the data; and secondly, in the area where it is making a prediction there may not have been sufficient training data for the model to have learnt appropriately - i.e. the model is extrapolating (potentially unreasonably) to cover a new feature point - the main issue in covariate-shifted problems (Bickel et al., 2009). Current practice is usually to try to select the globally optimal model (Shaham et al., 2016; Dror et al., 2017) - that is to say, of those made available, which model should be used to make predictions for any given test point in the feature space. While this approach potentially addresses the first point, it completely overlooks the second - and this is what we will focus on. In order to consider this problem of extrapolation amongst the individual models, the important question is: what might a solution look like? Our desiderata are that, given no augmentation of the individual models, the ensemble weights must vary depending on the test features; additionally, these weights should reflect the confidence that a model will be able to make an appropriate prediction, and we should be able to tell generally when our confidence is low.
[Figure 1: Instance-wise Ensembles. Here we represent the density of the training features for three separate models - $\mathcal{M}_1$, $\mathcal{M}_2$, and $\mathcal{M}_3$. Given new test points A, B, and C, we need to construct predictions from these models. A is well represented by both $\mathcal{M}_2$ and $\mathcal{M}_3$, while B only has significant density under $\mathcal{M}_3$. C looks like none of the models will be able to make confident predictions.]
Contributions  In this work we make a three-fold contribution. First, we establish and document the need for instance-wise predictions in the setting of unsupervised ensemble learning, in doing so introducing the concept of Synthetic Model Combination (SMC), shown in Figure 1. Second, we introduce a novel unsupervised representation learning procedure that can be incorporated into SMC - allowing for more appropriate separation of models and estimation of ensemble weights in sparse high-dimensional settings. Finally, we provide practical demonstrations of both the success and failure cases of traditional methods and SMC, in synthetic examples as well as a real example of precision dosing - code for which is made available at https://github.com/XanderJC/synthetic-model-combination, along with the group codebase at https://github.com/vanderschaarlab/mlforhealthlabpub/synthetic-model-combination.
2 Background
Formulation  Consider having access only to $N$ provided tuples $\{(\mathcal{M}_j, \mathcal{I}_j)\}_{j=1}^N$ of models $\mathcal{M}_j$ and associated information $\mathcal{I}_j$. Each $\mathcal{M}_j : \mathcal{X} \mapsto \mathcal{Y}$ is a mapping from some covariate space $\mathcal{X}$ to another target space $\mathcal{Y}$, having been trained on some dataset $\mathcal{D}_j$ which is not observed by us, although it is in some way summarised by $\mathcal{I}_j$. We are then presented with some other test dataset $\mathcal{D}_T = \{x_i\}_{i=1}^M$ and consequently tasked with making predictions.
Goal  Our aim is to construct a convex combination of models so as to produce optimal model predictions $\hat{y} = \mathcal{M}(x) = \sum_{j=1}^N w(x)_j \mathcal{M}_j(x)$, with $\sum_{j=1}^N w(x)_j = 1$ and $w(x)_j$ taking the $j$th index of the output of the function $w : \mathcal{X} \mapsto \Delta^N$ that maps features to a set of weights on the probability $N$-simplex. We can summarise our task process as:

$$\text{Given: } \{(\mathcal{M}_j, \mathcal{I}_j)\}_{j=1}^N \text{ and } \{x_i\}_{i=1}^M \qquad \text{Obtain: } w \text{ to predict } \{\hat{y}_i\}_{i=1}^M$$
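To make the objective concrete, the following is a minimal sketch of such an instance-wise convex combination. All names here (smc_predict, weight_fn, the toy constant models) are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of an instance-wise convex combination of provided models.
# Function and variable names are illustrative, not the paper's code.
import numpy as np

def smc_predict(x, models, weight_fn):
    """Compute y_hat = sum_j w(x)_j * M_j(x) for a single feature vector x.

    models:    list of callables M_j mapping features to a prediction
    weight_fn: callable w mapping features to weights on the N-simplex
    """
    w = np.asarray(weight_fn(x))
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0), "w(x) must lie on the simplex"
    preds = np.array([m(x) for m in models])
    return float(w @ preds)

# Toy usage: three constant "experts" combined with uniform weights.
models = [lambda x: 0.0, lambda x: 1.0, lambda x: 2.0]
uniform = lambda x: np.ones(len(models)) / len(models)
print(smc_predict(np.array([0.5]), models, uniform))  # -> 1.0
```

The entire problem then reduces to choosing a good weight function $w$, which is what the remainder of the paper addresses.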
Having established the setting we are focused on, we can explore contemporary methods and compare how they relate to our work. We consider methods that are ultimately interested in the test-time distribution of labels conditional on feature covariates, $p_{test}(Y|X)$. We discuss how the methods differ in the distributional information available to them, summarised in Table 1. For example, in standard supervised learning, samples of the feature-label joint distribution $p_{train}(X,Y)$ are given - the training-time conditional distribution is then estimated and assumed to be equivalent at test time.
Table 1: (Un-)Related Fields and how they compare in terms of: A) whether they make instance-wise predictions; B) the distributional information which they require; C) the form of the information; and D) the target quantity they are focused on. References given as: [1] Torrey and Shavlik (2010); [2] Raftery et al. (1997); [3] Ren et al. (2019); [4] Ruta and Gabrys (2005); [5] Jaffe et al. (2016).

Problem                          | Ref. | A) Instance-wise | B) Distribution                    | C) Information              | D) Target
Transfer Learning                | [1]  | ✓                | $p_{train}(X,Y)$, $p_{test}(X,Y)$  | $\mathcal{D}_{train}$       | $p_{test}(Y|X)$
Bayesian Model Averaging         | [2]  | ✗                | $p_{test}(X,Y)$                    | $\mathcal{D}_{val}$         | $p(\mathcal{M}_i|\mathcal{D})$
Out-of-distribution Detection    | [3]  | ✗/✓              | $p_{test}(X)$                      | -                           | $p^X_{train}(x_{test}) < \epsilon$
Majority Voting                  | [4]  | ✗                | $p_{train}(X,Y)$                   | -                           | $p_{test}(Y|X)$
Unsupervised Ensemble Regression | [5]  | ✗                | $p_{test}(Y)$                      | $E[Y], Var[Y]$              | $\hat{w}$
Synthetic Model Combination      | [Us] | ✓                | $\{p_{\mathcal{M}_j}(X)\}_{j=1}^N$ | $\{\mathcal{I}_j\}_{j=1}^N$ | $w(x)$
In our case, we assume information from considerably less informative distributions, the training-time feature distributions $\{p_{\mathcal{M}_j}(X)\}_{j=1}^N$, where the information can take a variety of forms - most practically, though, through samples of the features or details of the first and second moments. This appears to be the minimal set of information with which we can do something useful, since if we were only given a set of models and no accompanying information, then there would be no way to determine which models may be best in general, let alone for specific features.
How are models normally ensembled?  The literature on ensemble methods is vast, and we do not intend to provide a survey - a number of which already exist (Sagi and Rokach, 2018; Dong et al., 2020). The focus is often on training one's own set of models that can then be ensembled for epistemic uncertainty insight (Rahaman et al., 2021) or boosted for performance (Chen and Guestrin, 2016). In terms of methods for ensembling models that are provided to a practitioner (instead of also being trained by them), the closest is the setting of unsupervised ensemble regression - which, like us, does not consider any joint feature-label distribution. To make progress, though, some information needs to be provided, with Dror et al. (2017) considering the marginal label distribution $p_{test}(Y)$, making the strong assumption that the first two moments of the response are known. Instead of being directly provided the mean and variance, another strand of work assumes conditional independence given the label (Dawid and Skene, 1979), meaning that any predictors that agree consistently will be more likely to be accurate. Platanios et al. (2014) and Jaffe et al. (2016) relax this assumption through the use of graphical models and meta-learner construction respectively, with a Bayesian approach proposed by Platanios et al. (2016). Recent work of Shaham et al. (2016) attempts to learn the dependence structure using restricted Boltzmann machines.
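As a rough illustration of why the conditional-independence assumption is useful, the sketch below weights each model by its average agreement with the ensemble's majority vote on unlabelled predictions - a crude heuristic in the spirit of these methods, not the actual estimator of Dawid and Skene (1979) or Jaffe et al. (2016).

```python
# Heuristic sketch only: if predictors are conditionally independent given the
# label, those agreeing with the crowd more often tend to be more accurate,
# so agreement with the majority vote gives a crude global weight.
import numpy as np

def agreement_weights(preds):
    """preds: (n_models, n_samples) array of binary {0, 1} predictions."""
    majority = (preds.mean(axis=0) > 0.5).astype(int)  # per-sample majority vote
    agreement = (preds == majority).mean(axis=1)       # per-model agreement rate
    return agreement / agreement.sum()                 # normalise onto the simplex

preds = np.array([[1, 0, 1, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1]])
print(agreement_weights(preds))  # -> [0.5, 0.375, 0.125]
```

Note that, like the methods above, this produces a single global weighting with no dependence on the features of the instance being predicted.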
Model averaging when validation data is available.  Moving on from the unsupervised methods mentioned above, which at most considered information from marginal distributions, we come to the case where we may have some validation data from the joint distribution $p_{test}(X,Y)$ (or one assumed to be the same). This can effectively be used to evaluate a ranking of the given models for more informed ensembles. This ensemble can be created in a number of ways (Huang et al., 2009), including work that moves in the direction of instance-wise predictions by dividing the space into regions before calculating per-region weights (Verikas et al., 1999). A practical and more common approach is Bayesian Model Averaging (BMA) (Raftery et al., 1997). Given an appropriate prior, we calculate the posterior probability that a given model is the optimal one - once this is obtained, the models can be marginalised out during test-time predictions, creating an ensemble weighted by each model's posterior probability. The posterior being intractable, the probability is approximated using the Bayesian Information Criterion (BIC) (Neath and Cavanaugh, 2012) - which requires a likelihood estimate over some validation set and is estimated as:

$$p(\mathcal{M}_i|\mathcal{D}) = \exp\left(-\tfrac{1}{2}\mathrm{BIC}(\mathcal{M}_i)\right) \Big/ \sum_{j=1}^N \exp\left(-\tfrac{1}{2}\mathrm{BIC}(\mathcal{M}_j)\right)$$

With this, along with all the ensemble methods previously mentioned, it is important to note the subtle difference in setup from the problem we are trying to work with. In all cases, it is assumed that there is some ordering of the models that holds across the feature space, and so a global ensemble is produced with a fixed weighting $\hat{w}$ such that $w(x) = \hat{w} \ \forall x \in \mathcal{X}$. This causes failure cases when there is variation in the models across the feature space, since it is a key point that BMA is not model combination (Minka, 2000). This is an important distinction and one of the main reasons BMA has been shown to perform badly under covariate-shifted tasks (Izmailov et al., 2021). That being said, it can be extended by considering the set of models being averaged to be every possible combination of the provided models (Kim and Ghahramani, 2012), although this becomes even more computationally infeasible.
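For concreteness, the following is a hedged sketch of the BIC-weighted averaging quoted above; it assumes each model exposes a validation log-likelihood and a parameter count, names which are ours rather than from the paper.

```python
# Sketch of global BMA weights via the BIC approximation quoted above:
# p(M_i | D) ~ exp(-BIC_i / 2) / sum_j exp(-BIC_j / 2), with
# BIC = k * ln(n) - 2 * ln(L_hat). Inputs are assumed, not from the paper.
import numpy as np

def bma_weights(log_likelihoods, n_params, n_val):
    """log_likelihoods: per-model log-likelihood on n_val validation samples;
    n_params: per-model count of free parameters."""
    bic = np.asarray(n_params) * np.log(n_val) - 2.0 * np.asarray(log_likelihoods)
    z = -0.5 * bic
    z -= z.max()                 # stabilise the softmax numerically
    w = np.exp(z)
    return w / w.sum()

# Note the limitation discussed above: this is a single w_hat applied at every
# x, so it cannot adapt when model quality varies across the feature space.
print(bma_weights([-120.0, -118.5, -150.0], n_params=[4, 9, 4], n_val=200))
```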
Is this some form of unsupervised domain adaptation or transfer learning then?  Given the focus on models performing on some region of the feature space outside their training domain, this may seem like a natural question. Unsupervised domain adaptation represents this task at an individual-model level, but usually considers access to unlabelled data in the target domain as well as labelled data from a (different) source domain, i.e. $p_{train}(X,Y)$ and $p_{test}(X)$ (Chan et al., 2020). We refer the interested reader to Kouw and Loog (2019) for a detailed review, given space constraints. In a very similar vein, the transfer learning (Torrey and Shavlik, 2010) task also involves a change in feature distribution, but instead of being completely unsupervised it tends to include some labels on the target set, thus requiring some information on the joint distribution at both test and training time: $p_{train}(X,Y)$ and $p_{test}(X,Y)$. A great deal of work then involves learning a prior on the first domain that can be updated appropriately on the target domain (Raina et al., 2006; Karbalayghareh et al., 2018). In contrast to both of these areas, we do not aim to improve the performance of the individual models, but rather combine them based on how well we expect them to perform in the new domain.
3 Introducing Synthetic Model Combination
Our success hinges on the assumption that the quality of a model's prediction will depend on the context features - that is to say, the performance ordering of the models will not stay constant across the feature space. With this being the case, it should seem obvious that the weightings of individual models should depend on the features presented. As such, we introduce our notion of Synthetic Model Combination¹ - a method that constructs a representation space within which we can practically reason, enabling it to select weights for the models based on a given feature's location within the representation space. Recalling our starting point of $\{(\mathcal{M}_j, \mathcal{I}_j)\}_{j=1}^N$ and $\{x_i\}_{i=1}^M$, we proceed in three main steps, which are outlined below:
1. Estimate densities for models: from each information $\mathcal{I}_j$, we generate a density $p_j^X(x)$.
2. Learn low-dimensional representation space: using $\{x_i\}_{i=1}^M$ and $\{p_j^X(x)\}_{j=1}^N$, learn a mapping to a low-dimensional representation space $f_\theta : \mathcal{X} \mapsto \mathcal{Z}$.
3. Calculate ensemble weights for predictions: evaluate weights $w(x)$ so that predictions can then be made as $\hat{y} = \sum_{j=1}^N w(x)_j \mathcal{M}_j(x)$.
These steps are highlighted again in the full Algorithm 1 as well as shown pictorially in Figure 2.
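The skeleton below sketches these three steps under strong simplifying assumptions: information arrives as raw feature samples, the representation map is left as the identity, and step 3 normalises per-model densities onto the simplex - one plausible choice, not necessarily the weighting of Algorithm 1.

```python
# Skeletal sketch of the three SMC steps; every body below is an assumption
# standing in for Algorithm 1, kept only to make the pipeline concrete.
import numpy as np
from scipy.stats import gaussian_kde

def fit_densities(info_samples):
    """Step 1: one density estimate per model, here from raw feature samples
    (each element of info_samples has shape (n_samples, n_dims))."""
    return [gaussian_kde(s.T) for s in info_samples]

def learn_representation(test_features, densities):
    """Step 2 placeholder: the paper learns f_theta: X -> Z; we return the
    identity map purely to keep this sketch self-contained."""
    return lambda x: x

def ensemble_weights(x, densities, f):
    """Step 3: weight each model by its density at the representation of x,
    normalised onto the simplex."""
    z = f(x)
    d = np.array([max(p(z).item(), 1e-12) for p in densities])
    return d / d.sum()

# Toy usage: three models trained on shifted 2-D Gaussian blobs.
rng = np.random.default_rng(0)
info = [rng.normal(loc=m, size=(200, 2)) for m in (-2.0, 0.0, 2.0)]
dens = fit_densities(info)
f = learn_representation(None, dens)
print(ensemble_weights(np.array([1.8, 2.1]), dens, f))  # mass on the third model
```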
3.1 From Information to Probability Densities

The first step in SMC is to use the information $\mathcal{I}_j$ to produce a density estimate such that we can sample from each model's effective support. Given the flexibility in the form of what we allow $\mathcal{I}_j$ to take, SMC must remain relatively agnostic in this step. A common example of the type of information we expect will simply be example feature samples, in which case a simple kernel density estimate (Terrell and Scott, 1992) or other density estimation method could be employed. On the other hand, in the medical setting for example, when models are published authors will often also provide demographic information on the patients that were involved in the study, such as the mean and variance of each covariate recorded. In this case we may simply want to approximate the density using a Gaussian and moment-matching. When the information is provided in the form of samples it is possible to skip this step, as they can be used directly in the losses of the subsequent representation learning step.
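Two hedged examples of this step follow, matching the cases in the text: raw feature samples handled with a kernel density estimate, and published summary statistics handled with a moment-matched Gaussian. The library choices and covariate values are our assumptions, not the paper's.

```python
# Sketch of turning I_j into a density p_j^X; libraries and values are assumed.
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal

# (a) Information = raw feature samples -> kernel density estimate.
samples = np.random.default_rng(1).normal(size=(500, 3))  # (n_samples, n_dims)
p_kde = gaussian_kde(samples.T)                           # scipy expects (d, n)

# (b) Information = published summary statistics (e.g. a study's demographics
# table) -> moment-matched Gaussian, here assuming independent covariates.
mean = np.array([62.0, 80.5, 1.1])       # hypothetical covariate means
var = np.array([15.0, 20.0, 0.3]) ** 2   # hypothetical covariate variances
p_gauss = multivariate_normal(mean=mean, cov=np.diag(var))

x = np.array([60.0, 75.0, 1.0])
print(p_kde(x).item(), p_gauss.pdf(x))   # density of x under each model's support
```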
3.2 Learning a Separable and Informative Space

The point of learning a new representation space - and not simply using the original feature space - is twofold. Firstly, we would like to reduce the dimensionality, leading to a more compact representation...

¹ Our name is a nod towards synthetic control (Abadie et al., 2010), as we construct a new synthetic model as a convex combination of others that we think will be most appropriate for a given instance.