Targeted Active Learning for Probabilistic Models
Christopher Tosh1, Mauricio Tec2, and Wesley Tansey1
1Memorial Sloan Kettering Cancer Center, New York, NY
2Harvard University, Cambridge, MA
E-mail: christopher.j.tosh@gmail.com, mauriciogtec@hsph.harvard.edu, tanseyw@mskcc.org
October 24, 2022
Abstract
A fundamental task in science is to design experiments that yield valuable insights about the system
under study. Mathematically, these insights can be represented as a utility or risk function that shapes
the value of conducting each experiment. We present PDBAL, a targeted active learning method that
adaptively designs experiments to maximize scientific utility. PDBAL takes a user-specified risk function
and combines it with a probabilistic model of the experimental outcomes to choose designs that rapidly
converge on a high-utility model. We prove theoretical bounds on the label complexity of PDBAL and
provide fast closed-form solutions for designing experiments with common exponential family likelihoods.
In simulation studies, PDBAL consistently outperforms standard untargeted approaches that focus on
maximizing expected information gain over the design space. Finally, we demonstrate the scientific
potential of PDBAL through a study on a large cancer drug screen dataset where PDBAL quickly recovers
the most efficacious drugs with a small fraction of the total number of experiments.
1 Introduction
Scientific experiments are often expensive, laborious, and time-consuming to conduct. In practice, this limits
the capacity of many studies to only a small subset of possible experiments. Limited experimental capacity
poses a risk: the sample size may be too small to learn meaningful aspects about the system under study.
However, when experiments can be conducted sequentially or in batches, there is an opportunity to alleviate
this risk by adaptively designing each batch. The hope is that the results of previous experiments can be used
to design a maximally-informative batch of experiments to conduct next.
In machine learning, the sequential experimental design task is often posed as an active learning problem.
The active learning paradigm allows a learner to adaptively choose on which data points it wants feedback.
The objective is to fit a high-quality model while spending as little as possible on data collection. Modern
active learning algorithms have shown substantial gains when optimizing models for aggregate objectives,
such as accuracy (Ash et al., 2021) or parameter estimation (Tong and Koller, 2000).
Many scientific studies have a more targeted objective than a simple aggregate metric. For example,
one may be interested in assessing the prognostic value of a collection of potential biomarkers. Accurately
modeling the distribution of the biomarker variables may require modeling nuisance variables about the
patient, environment, and disease status. While optimizing an aggregate objective like accuracy can lead to
recovery of the parameters of interest, it is merely a surrogate for our true objective. Consequently, it may
lead to a less efficient data collection strategy.
Here we consider the task of targeted active learning. The goal in targeted active learning is to efficiently
gather data to produce a model with high utility. Optimizing data collection for utility, rather than model
performance, better aligns the model with the scientific objective of the study. It can also dramatically reduce
the sample complexity of the active learning algorithm. For instance, in the case of $d$-dimensional linear
regression, at least $\Omega(d)$ observations are required to learn the entire parameter vector. However, if the
targeted objective is to estimate $k \ll d$ coordinates, there is an active learning strategy that can do so with
$O(k)$ queries, provided it is given access to enough unlabeled data (see Appendix A for a formal proof). This
toy example shows the potential savings in active learning when the end objective is explicitly taken into
account.
We propose Probabilistic Diameter-based Active Learning (PDBAL), a targeted active learning algorithm
compatible with any probabilistic model. PDBAL builds on diameter-based active learning (Tosh and Dasgupta,
2017; Tosh and Hsu, 2020), a framework that allows a scientist to explicitly encode the targeted objective as
a distance function between two hypothetical models of the data. Parts of the model that are not important
to the scientific study can be ignored in the distance function, resulting in a targeted distance that directly
encodes scientific utility. PDBAL generalizes DBAL from the simple finite outcome setting (e.g. multiclass
classification) to arbitrary probabilistic models, greatly expanding the scope of its applicability.
We provide a theoretical analysis that bounds the number of queries for PDBAL to recover a model that is
close to the ground-truth with respect to the target distance. We additionally prove lower bounds showing
that under certain conditions, PDBAL is nearly optimal. In a suite of empirical evaluations on synthetic data,
PDBAL consistently outperforms untargeted active learning approaches based on expected information gain
and variance sampling. In a study using real cancer drug data to find drugs with high therapeutic index,
PDBAL learns a model that accurately detects effective drugs after seeing only 10% of the total dataset. The
generality and empirical success of PDBAL suggest it has the potential to significantly increase the scale of
modern scientific studies.
1.1 Related work
There is a substantial body of work on active learning and Bayesian experimental design. Here, we outline
some of the most relevant lines of work.
Bayesian active learning.
The seminal work of Lindley (1956) introduced expected information gain (EIG)
as a measure for the value of a new experiment in a Bayesian context. Roughly, EIG measures the change in
the entropy of the posterior distribution of a parameter after conditioning on new data. Inspired by this work,
others have proposed maximizing EIG as a Bayesian active learning strategy (MacKay, 1992; Lawrence
et al., 2002). Noting that computing entropy in parameter space can be expensive for non-parametric models,
Houlsby et al. (2011) rewrite EIG as a mutual information problem over outcomes. Their method, Bayesian
active learning by disagreement (BALD), is used for Gaussian process classification. BALD has inspired
a large body of work in developing EIG-based active learning strategies, particularly for Bayesian neural
networks (Gal et al., 2017; Kirsch et al., 2019). However, despite its popularity, EIG can be shown to be
suboptimal for reducing prediction error in general (Freund et al., 1997).
One alternative to such information gain strategies is query-by-committee (QBC) (Seung et al., 1992;
Freund et al., 1997), which more directly seeks to shrink the parameter space by querying points that elicit
maximum disagreement among a committee of predictors. Recently, Riis et al. (2022) applied QBC to
a Bayesian regression setup. Their method, B-QBC, reduces to choosing experiments that maximize the
posterior variance of the mean predictor. For the special case of Gaussian models with homoscedastic
observation noise, this is equivalent to EIG.
Another Bayesian active learning departure from EIG is the decision-theoretic approach of Fisher et al.
(2021), called GAUSSED, based on Bayes’ risk minimization. The objective function in GAUSSED is similar
to the PDBAL objective in the special case of homoscedastic location models with an untargeted squared-error
distance function over the entire latent parameter space.
Bayesian optimization.
Black-box function optimization is another classical area of interest in the sequential experimental design
literature. In this setting, there is an unknown global utility function that can
be queried in a black box manner, and the goal is to find the set of inputs that maximize this function. A
standard approach to this problem is to posit a Bayesian non-parametric model (such as a Gaussian process)
of the underlying function, and then to adaptively make queries that trade off exploration of uncertainty and
exploitation of suspected maxima (Hennig and Schuler, 2012; Hernández-Lobato et al., 2015; Kandasamy
et al., 2018). One of the key differences between black-box (Bayesian) optimization and targeted (Bayesian)
active learning is that in black-box optimization, the underlying utility function is being directly modeled.
In targeted active learning, utility may only be indirectly expressed as a function of the underlying prob-
abilistic models. One work that helps to bridge Bayesian optimization and targeted active learning is the
decision-theoretic Bayesian optimization framework of Neiswanger et al. (2022), which considers a richer set
of objectives related to the underlying utility function than simply finding a single maximum.
Active learning of probabilistic models.
Beyond the Bayesian methods outlined above, others have
considered alternate approaches to active learning for probabilistic models. Sabato and Munos (2014) studied
active linear regression in the misspecified setting. Agarwal (2013) designed active learning algorithms for
generalized linear models for multiclass classification in a streaming setting. Chaudhuri et al. (2015) studied a
two-stage active learning procedure for maximum-likelihood estimators for a variety of probabilistic models.
Ash et al. (2021) built on this two-stage approach to design an active maximum-likelihood method for deep
learning-based models.
2 Setting
Let $\mathcal{X}$ denote a data space and $\mathcal{Y}$ denote a response space, and let $\mathcal{D}$ denote a marginal distribution over $\mathcal{X}$. Our
goal is to model our data with some parametric probabilistic model $P_\theta(\cdot\,;\cdot)$, where $\theta$ lies in a parameter space $\Theta$
and $P_\theta(y; x)$ denotes the probability (or density) of observing $y \in \mathcal{Y}$ at data point $x \in \mathcal{X}$. We will use the
notation $y \sim P_\theta(x)$ to denote drawing $y$ from the density $P_\theta(\cdot\,; x)$.
We consider models that factorize across data points; that is, for $(x_1, \dots, x_n) \in \mathcal{X}^n$ and $(y_1, \dots, y_n) \in \mathcal{Y}^n$,
we have
$$P_\theta(y_{1:n}; x_{1:n}) := P_\theta(y_1, \dots, y_n; x_1, \dots, x_n) = \prod_{i=1}^n P_\theta(y_i; x_i).$$
For a data point $x \in \mathcal{X}$ and parameter $\theta \in \Theta$, we denote the entropy of the response to $x$ under model $\theta$ as
$$H_\theta(x) := \mathbb{E}_{y \sim P_\theta(x)} \log \frac{1}{P_\theta(y; x)}.$$
We will take a Bayesian approach to learning. To that end, let $\pi$ denote a prior distribution over $\Theta$. Given
observations $(x_1, y_1), \dots, (x_n, y_n)$, denote the posterior distribution as
$$\pi_n(\theta) = \frac{1}{Z_n}\, \pi(\theta) \prod_{i=1}^n P_\theta(y_i; x_i),$$
where $Z_n = \mathbb{E}_{\theta' \sim \pi}\big[\prod_{i=1}^n P_{\theta'}(y_i; x_i)\big]$ is the normalizing constant that makes $\pi_n$ integrate to one. In this paper,
we will assume that we are in the well-specified Bayesian setting, i.e. there is some ground-truth $\theta^\star \sim \pi$, and
when we query point $x_i$, the observation $y_i$ is drawn from $P_{\theta^\star}(\cdot\,; x_i)$. We will also use the notation $\pi_n(y; x)$
to denote the posterior predictive density
$$\pi_n(y; x) = \mathbb{E}_{\theta \sim \pi_n} P_\theta(y; x),$$
and the notation $y \sim \pi_n(x)$ to denote drawing $y$ from the density $\pi_n(y; x)$.
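To make this posterior update concrete, the following sketch evaluates $\pi_n$ on a discretized parameter grid for a toy one-dimensional Gaussian model. The grid, the standard normal prior, and the unit noise scale are our illustrative assumptions, not part of the paper's setup.

```python
import numpy as np
from scipy.stats import norm

# Toy model: theta is a scalar, responses are y ~ N(theta * x, 1).
# Discretizing Theta to a grid makes the update
#   pi_n(theta) ∝ pi(theta) * prod_i P_theta(y_i; x_i)
# exact up to discretization.
thetas = np.linspace(-3.0, 3.0, 601)              # candidate parameters
log_prior = norm.logpdf(thetas, loc=0.0, scale=1.0)

def posterior(xs, ys):
    """Normalized posterior pi_n over the grid given observations (x_i, y_i)."""
    log_lik = np.zeros_like(thetas)
    for x, y in zip(xs, ys):
        log_lik += norm.logpdf(y, loc=thetas * x, scale=1.0)
    lp = log_prior + log_lik
    lp -= lp.max()                                # numerical stability
    w = np.exp(lp)                                # normalization plays the role of 1/Z_n
    return w / w.sum()
```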
A risk-aligned distance is a function $d : \Theta \times \Theta \to [0, 1]$ satisfying two properties for all $\theta, \theta' \in \Theta$:
Identity: $d(\theta, \theta) = 0$.
Symmetry: $d(\theta, \theta') = d(\theta', \theta)$.
The requirement that $d(\theta, \theta') \le 1$ is not onerous: any smooth distance over a bounded space can be
rescaled to satisfy it. In our setup, a risk-aligned distance encodes our objective: if we committed to the model
$\theta$ when the true model was $\theta^\star$, then we expect to suffer a loss of $d(\theta, \theta^\star)$.
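As an illustration (our construction, not one from the paper), suppose $\theta$ stacks the scientifically relevant coordinates ahead of nuisance coordinates. A targeted risk-aligned distance can then simply ignore the nuisance block; the choice of `k` and the tanh squashing below are illustrative.

```python
import numpy as np

def targeted_distance(theta, theta_prime, k=2, scale=1.0):
    """Risk-aligned distance comparing only the first k coordinates.

    tanh squashes the value into [0, 1); d(t, t) = 0 and symmetry hold
    by construction, so this satisfies the definition above.
    """
    diff = np.asarray(theta)[:k] - np.asarray(theta_prime)[:k]
    return np.tanh(scale * np.dot(diff, diff))
```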
The goal in our setting is to find a posterior distribution $\pi_n$ with small average diameter:
$$\mathrm{avg\text{-}diam}(\pi_n) = \mathbb{E}_{\theta, \theta' \sim \pi_n}[d(\theta, \theta')].$$
To see that this is a reasonable objective, observe that if $\theta^\star \sim \pi$ and $(x_1, y_1), \dots, (x_n, y_n)$ is generated
according to $P_{\theta^\star}$, then after observing this data, $\theta^\star$ is distributed according to $\pi_n$. If we make predictions by
sampling a model from $\pi_n$, the expected risk of this strategy is exactly the average diameter. Moreover, even
without this Bayesian assumption, the risk of this strategy can still be bounded above as a function of the
average diameter (Tosh and Dasgupta, 2017, Lemma 2).
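Given draws from $\pi_n$, the average diameter is straightforward to estimate by Monte Carlo. A minimal sketch, assuming `samples` holds posterior draws and `dist` is a risk-aligned distance such as the one above:

```python
import numpy as np

def avg_diameter(samples, dist, n_pairs=1000, rng=None):
    """Monte Carlo estimate of E_{theta, theta' ~ pi_n}[d(theta, theta')]."""
    rng = rng or np.random.default_rng(0)
    m = len(samples)
    total = 0.0
    for _ in range(n_pairs):
        # two distinct draws stand in for an independent pair from pi_n
        i, j = rng.choice(m, size=2, replace=False)
        total += dist(samples[i], samples[j])
    return total / n_pairs
```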
3 Probabilistic DBAL (PDBAL)
We first recall the standard (functional) diameter-based active learning algorithm. Let $\mathcal{Y}$ be a finite set,
let $\mathcal{F} \subset \{f : \mathcal{X} \to \mathcal{Y}\}$ denote some function class, let $d(\cdot, \cdot)$ denote a distance over $\mathcal{F}$, and let $\pi_n$ denote
a posterior distribution over $\mathcal{F}$. The DBAL approach is to score candidate queries $x \in \mathcal{X}$ according to the
function
$$v_n(x) = \max_{y \in \mathcal{Y}}\; \mathbb{E}_{f, f' \sim \pi_n}\big[ d(f, f')\, \mathbb{1}[f(x) = y = f'(x)] \big], \tag{1}$$
where $\mathbb{1}[\cdot]$ denotes the indicator function, and then choose the available query $x$ that minimizes $v_n(x)$. To
practically implement this, one can sample $f_1, \dots, f_m \sim \pi_n$ and compute the Monte Carlo approximation
$$\hat{v}_n(x) = \max_{y \in \mathcal{Y}} \binom{m}{2}^{-1} \sum_{i < j} d(f_i, f_j)\, \mathbb{1}\big[f_i(x) = y = f_j(x)\big]. \tag{2}$$
The idea behind eq. (1) is that in the realizable setting where the true model is in $\mathcal{F}$, the posterior satisfies
$$\pi_n(f) \propto \pi_{n-1}(f)\, \mathbb{1}[f(x_n) = y_n].$$
By choosing queries according to eq. (1), DBAL minimizes a function of the diameter of the posterior $\pi_{n+1}$,
while also hedging its bets against the worst possible outcome.
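To make eq. (2) concrete, here is a minimal sketch of the Monte Carlo DBAL score; `models` (posterior samples $f_1, \dots, f_m$), `dist`, and `labels` are user-supplied stand-ins rather than code from the paper.

```python
import itertools

def dbal_score(x, models, dist, labels):
    """Monte Carlo DBAL score (eq. 2): for each candidate label y, sum the
    distance over sampled pairs that both predict y at x, and take the worst y."""
    preds = [f(x) for f in models]
    best = 0.0
    for y in labels:
        s = sum(dist(fi, fj)
                for (i, fi), (j, fj) in itertools.combinations(enumerate(models), 2)
                if preds[i] == y == preds[j])
        best = max(best, s)
    m = len(models)
    return best / (m * (m - 1) / 2)   # normalize by (m choose 2)

# DBAL then queries the candidate minimizing this score:
# x_next = min(candidates, key=lambda x: dbal_score(x, models, dist, labels))
```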
When moving from discrete-valued functions to probabilistic models, a few issues arise. First, unless
our probabilistic models are deterministic, $P_\theta(\cdot\,; x)$ will in general be a distribution over outcomes rather
than a point mass, so we should not use the indicator function to approximate the posterior update. Second,
when our outcomes are continuous, it may be intractable to compute a maximization over potential outcomes.
Finally, even if computation over potential outcomes were not an issue, the outcomes that achieve the maximum
may be so unlikely that they should not really be considered at all. Indeed, in the Bayesian setting it only
makes sense to consider outcomes that have reasonable probability under $\pi_n$.
Extension to probabilistic models.
With the aim of addressing these issues, we propose the PDBAL objective: for $x \in \mathcal{X}$,
$$s_n(x) = \mathbb{E}_{\theta^\star, \theta, \theta' \sim \pi_n}\, \mathbb{E}_{y \sim P_{\theta^\star}(x)}\big[ d(\theta, \theta')\, P_\theta(y; x)\, P_{\theta'}(y; x)\, e^{2 H_{\theta^\star}(x)} \big]. \tag{3}$$
By using $P_\theta(y; x)$, we are again minimizing some function of the diameter of the posterior $\pi_{n+1}$. Moreover, by
switching from a maximization to an expectation over potential outcomes $y$, we avoid the tricky optimization
problem. Finally, the entropy term in eq. (3) balances out the possibility of $\theta^\star$ generating unlikely outcomes.
Despite these changes, in Section 4 we show that PDBAL enjoys nice optimality properties, even as the scope
of its applications has grown considerably over DBAL.
Constant entropy models.
When $\Theta$ parameterizes location models with fixed scale parameters, the entropy
term is constant. We rewrite the objective for these models as
$$s_n(x) = \mathbb{E}_{\theta, \theta' \sim \pi_n,\, y \sim \pi_n(x)}\big[ d(\theta, \theta')\, P_\theta(y; x)\, P_{\theta'}(y; x) \big]. \tag{4}$$
For readability, we will discuss approximations of eq. (4). Extending these ideas to eq. (3) is straightforward
when we can compute $H_\theta(x)$ in closed form (as is the case for Gaussian, Laplace, $t$-distributed, and
other common likelihoods) or approximate it sufficiently well.
A general approach to approximating eq. (4) is to draw $\theta_1, \dots, \theta_m \sim \pi_n$ and $y_i \sim P_{\theta_i}(x)$ for
$i = 1, \dots, m$, and use the estimate
$$\hat{s}_n(x) = \binom{m}{3}^{-1} \sum_{i < j < k} d(\theta_i, \theta_j)\, P_{\theta_i}(y_k; x)\, P_{\theta_j}(y_k; x). \tag{5}$$
Although eq. (5) is an unbiased estimator, it may have large variance due to the sampling $y_i \sim P_{\theta_i}(x)$. In
some cases, we can reduce this variance by avoiding sampling the $y_i$'s whenever we can compute the function
$$M(x; \theta_1, \theta_2, \theta_3) = \mathbb{E}_{y \sim P_{\theta_1}(x)}\big[ P_{\theta_2}(y; x)\, P_{\theta_3}(y; x) \big].$$
This allows us to compute the alternate approximation
$$\hat{s}_n(x) = \binom{m}{3}^{-1} \sum_{i < j < k} d(\theta_i, \theta_j)\, M(x; \theta_i, \theta_j, \theta_k). \tag{6}$$
Computing the sums in eq. (5) or eq. (6) exactly would take $O(m^3)$ time. In practice, we approximate these
sums via Monte Carlo by subsampling $N_{\mathrm{mc}}$ triples $(i, j, k)$. Thus, computing our approximation of eq. (3) for
a set of $B$ potential queries takes time $O(B N_{\mathrm{mc}})$. Algorithm 1 presents the full PDBAL selection procedure.
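As a sketch of this estimator (the paper's Algorithm 1 is not reproduced here), the following computes eq. (5) with subsampled triples; `likelihood(theta, y, x)` evaluating $P_\theta(y; x)$ and `sample_y(theta, x)` drawing $y \sim P_\theta(x)$ are hypothetical stand-ins for a user-supplied model.

```python
import numpy as np

def pdbal_score(x, thetas, dist, likelihood, sample_y, n_mc=500, rng=None):
    """Subsampled Monte Carlo estimate of s_n(x) (eq. 5, constant-entropy case).

    thetas: posterior samples theta_1, ..., theta_m from pi_n.
    Subsamples n_mc triples (i, j, k) instead of the full O(m^3) sum.
    """
    rng = rng or np.random.default_rng(0)
    m = len(thetas)
    ys = [sample_y(theta, x) for theta in thetas]     # y_i ~ P_{theta_i}(x)
    total = 0.0
    for _ in range(n_mc):
        i, j, k = rng.choice(m, size=3, replace=False)
        total += (dist(thetas[i], thetas[j])
                  * likelihood(thetas[i], ys[k], x)
                  * likelihood(thetas[j], ys[k], x))
    return total / n_mc

# In analogy with DBAL's rule of minimizing v_n(x) in eq. (1), the next
# query is the candidate with the smallest score:
# x_next = min(candidates, key=lambda x: pdbal_score(x, thetas, d, lik, samp))
```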
The following proposition shows that for the important case of Gaussian likelihoods, we can indeed
compute the function $M$ in closed form.
Proposition 1. Fix $d \ge 1$, and let $\mu_i \in \mathbb{R}^d$ and $\sigma_i^2 > 0$ for $i = 1, 2, 3$. Then
$$\mathbb{E}_{y \sim \mathcal{N}(\mu_1, \sigma_1^2 I_d)}\big[ \mathcal{N}(y; \mu_2, \sigma_2^2 I_d)\, \mathcal{N}(y; \mu_3, \sigma_3^2 I_d) \big] = \frac{1}{\big(\alpha (2\pi)^2\big)^{d/2}} \exp\left( -\frac{\sigma_1^2 \sigma_2^2 \sigma_3^2}{2\alpha} \sum_{i < j} \frac{\|\mu_i - \mu_j\|^2}{\sigma_i^2 \sigma_j^2} \right),$$
where $\alpha = \sigma_1^2 \sigma_2^2 + \sigma_2^2 \sigma_3^2 + \sigma_1^2 \sigma_3^2$.
All proofs are deferred to the appendix. In Appendix B, we work out closed-form solutions for other
important likelihoods, including the multinomial and exponential.
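Below is a sketch of the closed form in Proposition 1, paired with a Monte Carlo sanity check; `gaussian_M` and the test values are our own illustrative choices rather than code from the paper.

```python
import numpy as np

def gaussian_M(mus, variances):
    """Closed-form M(x; theta_1, theta_2, theta_3) for isotropic Gaussians:
    E_{y ~ N(mu_1, var_1 I_d)}[ N(y; mu_2, var_2 I_d) N(y; mu_3, var_3 I_d) ].
    """
    m1, m2, m3 = [np.atleast_1d(np.asarray(m, dtype=float)) for m in mus]
    v1, v2, v3 = variances
    d = m1.size
    alpha = v1 * v2 + v2 * v3 + v1 * v3
    # each squared mean difference is weighted by the remaining variance,
    # since v1*v2*v3 / (v_i * v_j) = v_k
    quad = (v3 * np.sum((m1 - m2) ** 2)
            + v2 * np.sum((m1 - m3) ** 2)
            + v1 * np.sum((m2 - m3) ** 2))
    return np.exp(-quad / (2.0 * alpha)) / (alpha * (2.0 * np.pi) ** 2) ** (d / 2.0)

# Monte Carlo sanity check in d = 1
rng = np.random.default_rng(0)
mus, variances = ([0.0], [1.0], [3.0]), (4.0, 1.0, 1.0)
y = rng.normal(mus[0][0], np.sqrt(variances[0]), size=500_000)
pdf = lambda y, m, v: np.exp(-(y - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
mc = np.mean(pdf(y, mus[1][0], variances[1]) * pdf(y, mus[2][0], variances[2]))
print(gaussian_M(mus, variances), mc)   # both should be close to exp(-13/9)/(6*pi)
```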