Targeted Active Learning for Probabilistic Models
Christopher Tosh1, Mauricio Tec2, and Wesley Tansey1
1Memorial Sloan Kettering Cancer Center, New York, NY
2Harvard University, Cambridge, MA
E-mail: christopher.j.tosh@gmail.com, mauriciogtec@hsph.harvard.edu, tanseyw@mskcc.org
October 24, 2022
Abstract
A fundamental task in science is to design experiments that yield valuable insights about the system
under study. Mathematically, these insights can be represented as a utility or risk function that shapes
the value of conducting each experiment. We present PDBAL, a targeted active learning method that
adaptively designs experiments to maximize scientific utility. PDBAL takes a user-specified risk function
and combines it with a probabilistic model of the experimental outcomes to choose designs that rapidly
converge on a high-utility model. We prove theoretical bounds on the label complexity of PDBAL and
provide fast closed-form solutions for designing experiments with common exponential family likelihoods.
In simulation studies, PDBAL consistently outperforms standard untargeted approaches that focus on
maximizing expected information gain over the design space. Finally, we demonstrate the scientific
potential of PDBAL through a study on a large cancer drug screen dataset where PDBAL quickly recovers
the most efficacious drugs with a small fraction of the total number of experiments.
1 Introduction
Scientific experiments are often expensive, laborious, and time-consuming to conduct. In practice, this limits
the capacity of many studies to only a small subset of possible experiments. Limited experimental capacity
poses a risk: the sample size may be too small to learn meaningful aspects about the system under study.
However, when experiments can be conducted sequentially or in batches, there is an opportunity to alleviate
this risk by adaptively designing each batch. The hope is that the results of previous experiments can be used
to design a maximally-informative batch of experiments to conduct next.
In machine learning, the sequential experimental design task is often posed as an active learning problem.
The active learning paradigm allows a learner to adaptively choose on which data points it wants feedback.
The objective is to fit a high-quality model while spending as little as possible on data collection. Modern
active learning algorithms have shown substantial gains when optimizing models for aggregate objectives,
such as accuracy (Ash et al., 2021) or parameter estimation (Tong and Koller, 2000).
Many scientific studies have a more targeted objective than a simple aggregate metric. For example,
one may be interested in assessing the prognostic value of a collection of potential biomarkers. Accurately
modeling the distribution of the biomarker variables may require modeling nuisance variables about the
patient, environment, and disease status. While optimizing an aggregate objective like accuracy can lead to
recovery of the parameters of interest, it is merely a surrogate for our true objective. Consequently, it may
lead to a less efficient data collection strategy.
Here we consider the task of targeted active learning. The goal in targeted active learning is to efficiently
gather data to produce a model with high utility. Optimizing data collection for utility, rather than model
performance, better aligns the model with the scientific objective of the study. It can also dramatically reduce
the sample complexity of the active learning algorithm. For instance, in the case of $d$-dimensional linear
regression, at least $\Omega(d)$ observations are required to learn the entire parameter vector. However, if the
targeted objective is to estimate $k \ll d$ coordinates, there is an active learning strategy that can do so with
$O(k)$ queries, provided it is given access to enough unlabeled data (see Appendix A for a formal proof). This
toy example shows the potential savings in active learning when the end objective is explicitly taken into
account.
We propose Probabilistic Diameter-based Active Learning (PDBAL), a targeted active learning algorithm
compatible with any probabilistic model. PDBAL builds on diameter-based active learning (Tosh and Dasgupta,
2017; Tosh and Hsu, 2020), a framework that allows a scientist to explicitly encode the targeted objective as
a distance function between two hypothetical models of the data. Parts of the model that are not important
to the scientific study can be ignored in the distance function, resulting in a targeted distance that directly
encodes scientific utility. PDBAL generalizes DBAL from the simple finite outcome setting (e.g. multiclass
classification) to arbitrary probabilistic models, greatly expanding the scope of its applicability.
We provide a theoretical analysis that bounds the number of queries for PDBAL to recover a model that is
close to the ground-truth with respect to the target distance. We additionally prove lower bounds showing
that under certain conditions, PDBAL is nearly optimal. In a suite of empirical evaluations on synthetic data,
PDBAL consistently outperforms untargeted active learning approaches based on expected information gain
and variance sampling. In a study using real cancer drug data to find drugs with high therapeutic index,
PDBAL learns a model that accurately detects effective drugs after seeing only 10% of the total dataset. The
generality and empirical success of PDBAL suggest it has the potential to significantly increase the scale of
modern scientific studies.
1.1 Related work
There is a substantial body of work on active learning and Bayesian experimental design. Here, we outline
some of the most relevant lines of work.
Bayesian active learning.
The seminal work of Lindley (1956) introduced expected information gain (EIG)
as a measure for the value of a new experiment in a Bayesian context. Roughly, EIG measures the change in
the entropy of the posterior distribution of a parameter after conditioning on new data. Inspired by this work,
others have proposed maximizing EIG as a Bayesian active learning strategy (MacKay, 1992; Lawrence
et al., 2002). Noting that computing entropy in parameter space can be expensive for non-parametric models,
Houlsby et al. (2011) rewrite EIG as a mutual information problem over outcomes. Their method, Bayesian
active learning by disagreement (BALD), is used for Gaussian process classification. BALD has inspired
a large body of work in developing EIG-based active learning strategies, particularly for Bayesian neural
networks (Gal et al., 2017; Kirsch et al., 2019). However, despite its popularity, EIG can be shown to be
suboptimal for reducing prediction error in general (Freund et al., 1997).
One alternative to such information gain strategies is query-by-committee (QBC) (Seung et al., 1992;
Freund et al., 1997), which more directly seeks to shrink the parameter space by querying points that elicit
maximum disagreement among a committee of predictors. Recently, Riis et al. (2022) applied QBC to
a Bayesian regression setup. Their method, B-QBC, reduces to choosing experiments that maximize the
posterior variance of the mean predictor. For the special case of Gaussian models with homoscedastic
observation noise, this is equivalent to EIG.
Another Bayesian active learning departure from EIG is the decision-theoretic approach of Fisher et al.
(2021), called GAUSSED, based on Bayes’ risk minimization. The objective function in GAUSSED is similar
to the PDBAL objective in the special case of homoscedastic location models with an untargeted squared-error
distance function over the entire latent parameter space.
Bayesian optimization.
Black-box function optimization is another classical area of interest in the sequential experimental design
literature. In this setting, there is an unknown global utility function that can
be queried in a black box manner, and the goal is to find the set of inputs that maximize this function. A
standard approach to this problem is to posit a Bayesian non-parametric model (such as a Gaussian process)
of the underlying function, and then to adaptively make queries that trade off exploration of uncertainty and
exploitation of suspected maxima (Hennig and Schuler, 2012; Hernández-Lobato et al., 2015; Kandasamy
et al., 2018). One of the key differences between black-box (Bayesian) optimization and targeted (Bayesian)
active learning is that in black-box optimization, the underlying utility function is being directly modeled.
In targeted active learning, utility may only be indirectly expressed as a function of the underlying prob-
abilistic models. One work that helps to bridge Bayesian optimization and targeted active learning is the
decision-theoretic Bayesian optimization framework of Neiswanger et al. (2022), which considers a richer set
of objectives related to the underlying utility function than simply finding a single maximum.
Active learning of probabilistic models.
Beyond the Bayesian methods outlined above, others have
considered alternate approaches to active learning for probabilistic models. Sabato and Munos (2014) studied
active linear regression in the misspecified setting. Agarwal (2013) designed active learning algorithms for
generalized linear models for multiclass classification in a streaming setting. Chaudhuri et al. (2015) studied a
two-stage active learning procedure for maximum-likelihood estimators for a variety of probabilistic models.
Ash et al. (2021) built on this two-stage approach to design an active maximum-likelihood method for deep
learning-based models.
2 Setting
Let $\mathcal{X}$ denote a data space and $\mathcal{Y}$ denote a response space, and let $\mathcal{D}$ denote a marginal distribution over $\mathcal{X}$. Our
goal is to model our data with some parametric probabilistic model $P_\theta(\cdot\,;\cdot)$, where $\theta$ lies in a parameter space $\Theta$
and $P_\theta(y; x)$ denotes the probability (or density) of observing $y \in \mathcal{Y}$ at data point $x \in \mathcal{X}$. We will use the
notation $y \sim P_\theta(x)$ to denote drawing $y$ from the density $P_\theta(\cdot\,; x)$.
We consider models that factorize across data points; that is, for $(x_1, \dots, x_n) \in \mathcal{X}^n$ and $(y_1, \dots, y_n) \in \mathcal{Y}^n$,
we have
$$P_\theta(y_{1:n}; x_{1:n}) := P_\theta(y_1, \dots, y_n; x_1, \dots, x_n) = \prod_{i=1}^n P_\theta(y_i; x_i).$$
For a data point $x \in \mathcal{X}$ and parameter $\theta \in \Theta$, we denote the entropy of the response to $x$ under model $\theta$ as
$$H_\theta(x) := \mathbb{E}_{y \sim P_\theta(x)} \log \frac{1}{P_\theta(y; x)}.$$
We will take a Bayesian approach to learning. To that end, let $\pi$ denote a prior distribution over $\Theta$. Given
observations $(x_1, y_1), \dots, (x_n, y_n)$, denote the posterior distribution as
$$\pi_n(\theta) = \frac{1}{Z_n}\, \pi(\theta) \prod_{i=1}^n P_\theta(y_i; x_i),$$
where $Z_n = \mathbb{E}_{\theta' \sim \pi}\big[\prod_{i=1}^n P_{\theta'}(y_i; x_i)\big]$ is the normalizing constant that makes $\pi_n$ integrate to one. In this paper,
we will assume that we are in the well-specified Bayesian setting, i.e. there is some ground-truth $\theta^\star \sim \pi$, and
when we query point $x_i$, the observation $y_i$ is drawn from $P_{\theta^\star}(\cdot\,; x_i)$. We will also use the notation $\pi_n(y; x)$
to denote the posterior predictive density
$$\pi_n(y; x) = \mathbb{E}_{\theta \sim \pi_n} P_\theta(y; x),$$
and the notation $y \sim \pi_n(x)$ to denote drawing $y$ from the density $\pi_n(y; x)$.
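To make this posterior update concrete, the following sketch evaluates $\pi_n$ on a discretized parameter grid for a toy one-dimensional Gaussian model. The grid, the standard normal prior, and the unit noise scale are our illustrative assumptions, not part of the paper's setup.

```python
import numpy as np
from scipy.stats import norm

# Toy model: theta is a scalar, responses are y ~ N(theta * x, 1).
# Discretizing Theta to a grid makes the update
#   pi_n(theta) ∝ pi(theta) * prod_i P_theta(y_i; x_i)
# exact up to discretization.
thetas = np.linspace(-3.0, 3.0, 601)              # candidate parameters
log_prior = norm.logpdf(thetas, loc=0.0, scale=1.0)

def posterior(xs, ys):
    """Normalized posterior pi_n over the grid given observations (x_i, y_i)."""
    log_lik = np.zeros_like(thetas)
    for x, y in zip(xs, ys):
        log_lik += norm.logpdf(y, loc=thetas * x, scale=1.0)
    lp = log_prior + log_lik
    lp -= lp.max()                                # numerical stability
    w = np.exp(lp)                                # normalization plays the role of 1/Z_n
    return w / w.sum()
```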
A risk-aligned distance is a function $d : \Theta \times \Theta \to [0, 1]$ satisfying two properties for all $\theta, \theta' \in \Theta$:
Identity: $d(\theta, \theta) = 0$.
Symmetry: $d(\theta, \theta') = d(\theta', \theta)$.
The requirement that $d(\theta, \theta') \le 1$ is not onerous: any smooth distance over a bounded space can be
rescaled to satisfy it. In our setup, a risk-aligned distance encodes our objective: if we committed to the model
$\theta$ when the true model was $\theta^\star$, then we expect to suffer a loss of $d(\theta, \theta^\star)$.
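As an illustration (our construction, not one from the paper), suppose $\theta$ stacks the scientifically relevant coordinates ahead of nuisance coordinates. A targeted risk-aligned distance can then simply ignore the nuisance block; the choice of `k` and the tanh squashing below are illustrative.

```python
import numpy as np

def targeted_distance(theta, theta_prime, k=2, scale=1.0):
    """Risk-aligned distance comparing only the first k coordinates.

    tanh squashes the value into [0, 1); d(t, t) = 0 and symmetry hold
    by construction, so this satisfies the definition above.
    """
    diff = np.asarray(theta)[:k] - np.asarray(theta_prime)[:k]
    return np.tanh(scale * np.dot(diff, diff))
```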
The goal in our setting is to find a posterior distribution $\pi_n$ with small average diameter:
$$\mathrm{avg\text{-}diam}(\pi_n) = \mathbb{E}_{\theta, \theta' \sim \pi_n}[d(\theta, \theta')].$$
To see that this is a reasonable objective, observe that if $\theta^\star \sim \pi$ and $(x_1, y_1), \dots, (x_n, y_n)$ is generated
according to $P_{\theta^\star}$, then after observing this data, $\theta^\star$ is distributed according to $\pi_n$. If we make predictions by
sampling a model from $\pi_n$, the expected risk of this strategy is exactly the average diameter. Moreover, even
without this Bayesian assumption, the risk of this strategy can still be bounded above as a function of the
average diameter (Tosh and Dasgupta, 2017, Lemma 2).
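Given draws from $\pi_n$, the average diameter is straightforward to estimate by Monte Carlo. A minimal sketch, assuming `samples` holds posterior draws and `dist` is a risk-aligned distance such as the one above:

```python
import numpy as np

def avg_diameter(samples, dist, n_pairs=1000, rng=None):
    """Monte Carlo estimate of E_{theta, theta' ~ pi_n}[d(theta, theta')]."""
    rng = rng or np.random.default_rng(0)
    m = len(samples)
    total = 0.0
    for _ in range(n_pairs):
        # two distinct draws stand in for an independent pair from pi_n
        i, j = rng.choice(m, size=2, replace=False)
        total += dist(samples[i], samples[j])
    return total / n_pairs
```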
3 Probabilistic DBAL (PDBAL)
We first recall the standard (functional) diameter-based active learning algorithm. Let $\mathcal{Y}$ be a finite set,
let $\mathcal{F} \subset \{f : \mathcal{X} \to \mathcal{Y}\}$ denote some function class, let $d(\cdot, \cdot)$ denote a distance over $\mathcal{F}$, and let $\pi_n$ denote
a posterior distribution over $\mathcal{F}$. The DBAL approach is to score candidate queries $x \in \mathcal{X}$ according to the
function
$$v_n(x) = \max_{y \in \mathcal{Y}}\; \mathbb{E}_{f, f' \sim \pi_n}\big[ d(f, f')\, \mathbb{1}[f(x) = y = f'(x)] \big], \tag{1}$$
where $\mathbb{1}[\cdot]$ denotes the indicator function, and then choose the available query $x$ that minimizes $v_n(x)$. To
practically implement this, one can sample $f_1, \dots, f_m \sim \pi_n$ and compute the Monte Carlo approximation
$$\hat{v}_n(x) = \max_{y \in \mathcal{Y}} \binom{m}{2}^{-1} \sum_{i < j} d(f_i, f_j)\, \mathbb{1}\big[f_i(x) = y = f_j(x)\big]. \tag{2}$$
The idea behind eq. (1) is that in the realizable setting where the true model is in $\mathcal{F}$, the posterior satisfies
$$\pi_n(f) \propto \pi_{n-1}(f)\, \mathbb{1}[f(x_n) = y_n].$$
By choosing queries according to eq. (1), DBAL minimizes a function of the diameter of the posterior $\pi_{n+1}$,
while also hedging its bets against the worst possible outcome.
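To make eq. (2) concrete, here is a minimal sketch of the Monte Carlo DBAL score; `models` (posterior samples $f_1, \dots, f_m$), `dist`, and `labels` are user-supplied stand-ins rather than code from the paper.

```python
import itertools

def dbal_score(x, models, dist, labels):
    """Monte Carlo DBAL score (eq. 2): for each candidate label y, sum the
    distance over sampled pairs that both predict y at x, and take the worst y."""
    preds = [f(x) for f in models]
    best = 0.0
    for y in labels:
        s = sum(dist(fi, fj)
                for (i, fi), (j, fj) in itertools.combinations(enumerate(models), 2)
                if preds[i] == y == preds[j])
        best = max(best, s)
    m = len(models)
    return best / (m * (m - 1) / 2)   # normalize by (m choose 2)

# DBAL then queries the candidate minimizing this score:
# x_next = min(candidates, key=lambda x: dbal_score(x, models, dist, labels))
```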
When moving from discrete-valued functions to probabilistic models, a few issues arise. First, unless
our probabilistic models are deterministic, $P_\theta(\cdot\,; x)$ will in general be a distribution over outcomes rather
than a point mass, so we should not use the indicator function to approximate the posterior update. Second,
when our outcomes are continuous, it may be intractable to compute a maximization over potential outcomes.
Finally, even if computation over potential outcomes were not an issue, the outcomes that achieve the maximum
may be so unlikely that they should not really be considered at all. Indeed, in the Bayesian setting it only
makes sense to consider outcomes that have reasonable probability under $\pi_n$.
Extension to probabilistic models.
With the aim of addressing these issues, we propose the PDBAL objective: for $x \in \mathcal{X}$,
$$s_n(x) = \mathbb{E}_{\theta^\star, \theta, \theta' \sim \pi_n}\, \mathbb{E}_{y \sim P_{\theta^\star}(x)}\big[ d(\theta, \theta')\, P_\theta(y; x)\, P_{\theta'}(y; x)\, e^{2 H_{\theta^\star}(x)} \big]. \tag{3}$$
By using $P_\theta(y; x)$, we are again minimizing some function of the diameter of the posterior $\pi_{n+1}$. Moreover, by
switching from a maximization to an expectation over potential outcomes $y$, we avoid the tricky optimization
problem. Finally, the entropy term in eq. (3) balances out the possibility of $\theta^\star$ generating unlikely outcomes.
Despite these changes, in Section 4 we show that PDBAL enjoys nice optimality properties, even as the scope
of its applications has grown considerably over DBAL.
Constant entropy models.
When $\Theta$ parameterizes location models with fixed scale parameters, the entropy
term is constant. We rewrite the objective for these models as
$$s_n(x) = \mathbb{E}_{\theta, \theta' \sim \pi_n,\, y \sim \pi_n(x)}\big[ d(\theta, \theta')\, P_\theta(y; x)\, P_{\theta'}(y; x) \big]. \tag{4}$$
For readability, we will discuss approximations of eq. (4). Extending these ideas to eq. (3) is straightforward
when we can compute $H_\theta(x)$ in closed form (as is the case for Gaussian, Laplace, $t$-distributed, and
other common likelihoods) or approximate it sufficiently well.
A general approach to approximating eq. (4) is to draw $\theta_1, \dots, \theta_m \sim \pi_n$ and $y_i \sim P_{\theta_i}(x)$ for
$i = 1, \dots, m$, and use the estimate
$$\hat{s}_n(x) = \binom{m}{3}^{-1} \sum_{i < j < k} d(\theta_i, \theta_j)\, P_{\theta_i}(y_k; x)\, P_{\theta_j}(y_k; x). \tag{5}$$
Although eq. (5) is an unbiased estimator, it may have large variance due to the sampling $y_i \sim P_{\theta_i}(x)$. In
some cases, we can reduce this variance by avoiding sampling the $y_i$'s whenever we can compute the function
$$M(x; \theta_1, \theta_2, \theta_3) = \mathbb{E}_{y \sim P_{\theta_1}(x)}\big[ P_{\theta_2}(y; x)\, P_{\theta_3}(y; x) \big].$$
This allows us to compute the alternate approximation
$$\hat{s}_n(x) = \binom{m}{3}^{-1} \sum_{i < j < k} d(\theta_i, \theta_j)\, M(x; \theta_i, \theta_j, \theta_k). \tag{6}$$
Computing the sums in eq. (5) or eq. (6) exactly would take $O(m^3)$ time. In practice, we approximate these
sums via Monte Carlo by subsampling $N_{\mathrm{mc}}$ triples $(i, j, k)$. Thus, computing our approximation of eq. (3) for
a set of $B$ potential queries takes time $O(B N_{\mathrm{mc}})$. Algorithm 1 presents the full PDBAL selection procedure.
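As a sketch of this estimator (the paper's Algorithm 1 is not reproduced here), the following computes eq. (5) with subsampled triples; `likelihood(theta, y, x)` evaluating $P_\theta(y; x)$ and `sample_y(theta, x)` drawing $y \sim P_\theta(x)$ are hypothetical stand-ins for a user-supplied model.

```python
import numpy as np

def pdbal_score(x, thetas, dist, likelihood, sample_y, n_mc=500, rng=None):
    """Subsampled Monte Carlo estimate of s_n(x) (eq. 5, constant-entropy case).

    thetas: posterior samples theta_1, ..., theta_m from pi_n.
    Subsamples n_mc triples (i, j, k) instead of the full O(m^3) sum.
    """
    rng = rng or np.random.default_rng(0)
    m = len(thetas)
    ys = [sample_y(theta, x) for theta in thetas]     # y_i ~ P_{theta_i}(x)
    total = 0.0
    for _ in range(n_mc):
        i, j, k = rng.choice(m, size=3, replace=False)
        total += (dist(thetas[i], thetas[j])
                  * likelihood(thetas[i], ys[k], x)
                  * likelihood(thetas[j], ys[k], x))
    return total / n_mc

# In analogy with DBAL's rule of minimizing v_n(x) in eq. (1), the next
# query is the candidate with the smallest score:
# x_next = min(candidates, key=lambda x: pdbal_score(x, thetas, d, lik, samp))
```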
The following proposition shows that for the important case of Gaussian likelihoods, we can indeed
compute the function $M$ in closed form.
Proposition 1. Fix $d \ge 1$, and let $\mu_i \in \mathbb{R}^d$ and $\sigma_i^2 > 0$ for $i = 1, 2, 3$. Then
$$\mathbb{E}_{y \sim \mathcal{N}(\mu_1, \sigma_1^2 I_d)}\big[ \mathcal{N}(y; \mu_2, \sigma_2^2 I_d)\, \mathcal{N}(y; \mu_3, \sigma_3^2 I_d) \big] = \frac{1}{\big(\alpha (2\pi)^2\big)^{d/2}} \exp\left( -\frac{\sigma_1^2 \sigma_2^2 \sigma_3^2}{2\alpha} \sum_{i < j} \frac{\|\mu_i - \mu_j\|^2}{\sigma_i^2 \sigma_j^2} \right),$$
where $\alpha = \sigma_1^2 \sigma_2^2 + \sigma_2^2 \sigma_3^2 + \sigma_1^2 \sigma_3^2$.
All proofs are deferred to the appendix. In Appendix B, we work out closed-form solutions for other
important likelihoods, including the multinomial and exponential.
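Below is a sketch of the closed form in Proposition 1, paired with a Monte Carlo sanity check; `gaussian_M` and the test values are our own illustrative choices rather than code from the paper.

```python
import numpy as np

def gaussian_M(mus, variances):
    """Closed-form M(x; theta_1, theta_2, theta_3) for isotropic Gaussians:
    E_{y ~ N(mu_1, var_1 I_d)}[ N(y; mu_2, var_2 I_d) N(y; mu_3, var_3 I_d) ].
    """
    m1, m2, m3 = [np.atleast_1d(np.asarray(m, dtype=float)) for m in mus]
    v1, v2, v3 = variances
    d = m1.size
    alpha = v1 * v2 + v2 * v3 + v1 * v3
    # each squared mean difference is weighted by the remaining variance,
    # since v1*v2*v3 / (v_i * v_j) = v_k
    quad = (v3 * np.sum((m1 - m2) ** 2)
            + v2 * np.sum((m1 - m3) ** 2)
            + v1 * np.sum((m2 - m3) ** 2))
    return np.exp(-quad / (2.0 * alpha)) / (alpha * (2.0 * np.pi) ** 2) ** (d / 2.0)

# Monte Carlo sanity check in d = 1
rng = np.random.default_rng(0)
mus, variances = ([0.0], [1.0], [3.0]), (4.0, 1.0, 1.0)
y = rng.normal(mus[0][0], np.sqrt(variances[0]), size=500_000)
pdf = lambda y, m, v: np.exp(-(y - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
mc = np.mean(pdf(y, mus[1][0], variances[1]) * pdf(y, mus[2][0], variances[2]))
print(gaussian_M(mus, variances), mc)   # both should be close to exp(-13/9)/(6*pi)
```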