
Table 1: (Un-)Related fields and how they compare in terms of: A) whether they make instance-wise
predictions; B) the distributional information which they require; C) the form of that information; and
D) the target quantity they are focused on. References given as: [1] Torrey and Shavlik (2010); [2]
Raftery et al. (1997); [3] Ren et al. (2019); [4] Ruta and Gabrys (2005); [5] Jaffe et al. (2016).

Problem                          | Ref. | A) Instance-wise | B) Distribution            | C) Information | D) Target
Transfer Learning                | [1]  | ✓                | p^train(X,Y), p^test(X,Y)  | D_Train        | p^test(Y|X)
Bayesian Model Averaging         | [2]  | ✗                | p^test(X,Y)                | D_Val          | p(M_i|D)
Out-of-distribution Detection    | [3]  | ✗/✓              | p^test(X)                  | -              | p^train_X(x^test) <
Majority Voting                  | [4]  | ✗                | p^train(X,Y)               | -              | p^test(Y|X)
Unsupervised Ensemble Regression | [5]  | ✗                | p^test(Y)                  | E[Y], Var[Y]   | ŵ
Synthetic Model Combination      | [Us] | ✓                | {p_{M_j}(X)}_{j=1}^N       | {I_j}_{j=1}^N  | w(x)
In our case, we assume information from considerably less informative distributions: the training-time
feature distributions {p_{M_j}(X)}_{j=1}^N, where the information can take a variety of forms, most
practically samples of the features or details of their first and second moments. This appears to be
the minimal set of information with which we can do something useful, since if we are given only a
set of models and no accompanying information, there is no way to determine which models may be
best in general, let alone for specific features.
How are models normally ensembled?
The literature on ensemble methods is vast, and we do not intend to provide a survey; a number of
surveys already exist (Sagi and Rokach, 2018; Dong et al., 2020). The focus is often on training
one's own set of models, which can then be ensembled for insight into epistemic uncertainty
(Rahaman et al., 2021) or boosted for performance (Chen and Guestrin, 2016).
Among methods for ensembling models that are provided to a practitioner (rather than trained by
them as well), the closest is the setting of unsupervised ensemble regression, which, like us, does
not consider any joint feature-label distribution. To make progress, though, some information needs
to be provided: Dror et al. (2017) consider the marginal label distribution p^test(Y), making the
strong assumption that the first two moments of the response are known. Instead of being directly
provided the mean and variance, another strand of work assumes conditional independence of the
predictors given the label (Dawid and Skene, 1979), meaning that any predictors that agree
consistently will be more likely to be accurate. Platanios et al. (2014) and Jaffe et al. (2016)
relax this assumption through the use of graphical models and meta-learner construction respectively,
with a Bayesian approach proposed by Platanios et al. (2016). Recent work of Shaham et al. (2016)
attempts to learn the dependence structure using restricted Boltzmann machines.
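To make the role of the known moments concrete, consider the following toy sketch (illustrative only, and not the actual estimator of Dror et al. (2017); the synthetic data and names are our own assumptions): with E[Y] and Var[Y] known, each black-box regressor can at least be scored by how far the empirical moments of its predictions on unlabelled features fall from the known values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Known first two moments of the response (the strong assumption of this setting).
known_mean, known_var = 0.0, 4.0

# Synthetic ground truth and predictions from three black-box regressors.
y = rng.normal(known_mean, np.sqrt(known_var), size=5000)
preds = np.stack([
    y + rng.normal(0.0, 0.5, size=5000),  # accurate: moments roughly match
    y + 1.5,                              # systematically biased mean
    rng.normal(0.0, 1.0, size=5000),      # uninformative, wrong variance
])

# Score each model by the distance of its empirical output moments
# from the known moments of Y.
mismatch = [(p.mean() - known_mean) ** 2 + (p.var() - known_var) ** 2 for p in preds]
print(int(np.argmin(mismatch)))  # → 0: the moment-consistent model scores best
```

Note that moment matching alone cannot distinguish an accurate model from an uninformative one whose output moments happen to be correct, which is part of why the conditional-independence and graphical-model assumptions discussed above are introduced.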
Model averaging when validation data is available.
Moving on from the unsupervised methods mentioned above, which at most consider information
from marginal distributions, we come to the case where we have some validation data from the
joint distribution p^test(X,Y) (or one assumed to be the same). This data can effectively be used
to rank the given models and so build more informed ensembles. Such an ensemble can be created
in a number of ways (Huang et al., 2009), including work that moves in the direction of
instance-wise predictions by dividing the space into regions before calculating per-region weights
(Verikas et al., 1999). A practical and more common approach is Bayesian Model Averaging (BMA)
(Raftery et al., 1997). Given an appropriate prior, we calculate the posterior probability that a
given model is the optimal one; once this is obtained, the models can be marginalised out during
test-time predictions, creating an ensemble weighted by each model's posterior probability. The
posterior being intractable, it is approximated using the Bayesian Information Criterion (BIC)
(Neath and Cavanaugh, 2012), which requires a likelihood estimate over some validation set, giving:

p(M_i | D) = exp(-BIC(M_i)/2) / Σ_{j=1}^N exp(-BIC(M_j)/2).
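For concreteness, this weighting amounts to a softmax over -BIC/2, which can be computed stably by shifting by the minimum BIC (a minimal sketch; the function name and example scores are our own, illustrative choices):

```python
import numpy as np

def bma_weights(bics):
    """Posterior model probabilities p(M_i|D) proportional to exp(-BIC(M_i)/2).

    Subtracting the minimum BIC leaves the normalised weights unchanged
    while avoiding numerical underflow for large BIC values.
    """
    bics = np.asarray(bics, dtype=float)
    w = np.exp(-0.5 * (bics - bics.min()))
    return w / w.sum()

# Three models with BIC scores 1000, 1004 and 1010: a 4-point BIC gap
# already concentrates most of the posterior mass on the first model.
print(bma_weights([1000.0, 1004.0, 1010.0]).round(3))  # → [0.876 0.118 0.006]
```

The resulting weights are global constants; they do not vary with the test input x, which is exactly the limitation discussed next.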
With this, as with all the ensemble methods previously mentioned, it is important to note the
subtle difference in setup from the problem we are trying to solve. In all cases, it is assumed
that there is some ordering of the models that holds across the feature space, and so a global
ensemble is produced with a fixed weighting ŵ such that w(x) = ŵ for all x ∈ X. This causes
failure cases when the models' relative performance varies across the feature space, since it is a
key point that BMA is not model combination (Minka, 2000). This is an important distinction and
one of the main reasons BMA has been shown to perform badly under covariate-shifted tasks
(Izmailov et al., 2021). That being said, it can be extended by