performance, better aligns the model with the scientific objective of the study. It can also dramatically reduce
the sample complexity of the active learning algorithm. For instance, in the case of d-dimensional linear
regression, at least Ω(d) observations are required to learn the entire parameter vector. However, if the
targeted objective is to estimate k ≪ d coordinates, there is an active learning strategy that can do so with
O(k) queries, provided it is given access to enough unlabeled data (see Appendix A for a formal proof). This
toy example shows the potential savings in active learning when the end objective is explicitly taken into
account.
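As a rough illustration of this toy example (a minimal sketch of our own, not the formal argument in Appendix A; it assumes noiseless observations and an unlabeled pool rich enough to contain points supported only on the targeted coordinates), the following code recovers k targeted coefficients of a d-dimensional linear model from just k labeled queries:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 3                      # ambient dimension, number of targeted coordinates
target = np.arange(k)             # targeted coordinates: the first k (illustrative choice)
theta = rng.normal(size=d)        # ground-truth parameter vector (unknown to the learner)

# Unlabeled pool: assume it contains points whose mass lies entirely on the
# targeted coordinates (this plays the role of "enough unlabeled data").
pool = rng.normal(size=(1000, d))
pool[:200, k:] = 0.0              # part of the pool is supported only on the target coords

# Active strategy: query only points supported on the targeted coordinates,
# taking k of them (assumed linearly independent when restricted to those coords).
supported = pool[np.all(pool[:, k:] == 0.0, axis=1)]
X = supported[:k]
y = X @ theta                     # noiseless labels returned by the oracle

# Only the targeted block of theta influences these labels, so a k x k solve recovers it.
theta_target_hat = np.linalg.solve(X[:, :k], y)
print(np.allclose(theta_target_hat, theta[target]))   # True: k queries, k coordinates
```

A passive or untargeted learner would instead need on the order of d labeled points to pin down the full parameter vector before reading off the k coordinates of interest.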
We propose Probabilistic Diameter-based Active Learning (PDBAL), a targeted active learning algorithm
compatible with any probabilistic model. PDBAL builds on diameter-based active learning (Tosh and Dasgupta,
2017; Tosh and Hsu, 2020), a framework that allows a scientist to explicitly encode the targeted objective as
a distance function between two hypothetical models of the data. Parts of the model that are not important
to the scientific study can be ignored in the distance function, resulting in a targeted distance that directly
encodes scientific utility. PDBAL generalizes DBAL from the simple finite outcome setting (e.g. multiclass
classification) to arbitrary probabilistic models, greatly expanding the scope of its applicability.
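To make the notion of a targeted distance concrete, here is a minimal sketch (our own illustration, not code from the paper) of a distance that compares two hypothetical parameter vectors only on the coordinates relevant to the study:

```python
import numpy as np

def targeted_distance(theta_a, theta_b, relevant_idx):
    """Squared-error distance restricted to the coordinates the study cares about.

    Coordinates outside `relevant_idx` are ignored, so two models that agree on
    the scientifically relevant quantities are treated as equivalent.
    """
    diff = np.asarray(theta_a)[relevant_idx] - np.asarray(theta_b)[relevant_idx]
    return float(np.sum(diff ** 2))

# Example: only coordinates 0 and 3 matter to the study.
d_same = targeted_distance([1.0, 9.0, -2.0, 0.5], [1.0, -7.0, 4.0, 0.5], relevant_idx=[0, 3])
print(d_same)  # 0.0 -- the models differ, but not on the targeted coordinates
```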
We provide a theoretical analysis that bounds the number of queries for PDBAL to recover a model that is
close to the ground-truth with respect to the target distance. We additionally prove lower bounds showing
that under certain conditions, PDBAL is nearly optimal. In a suite of empirical evaluations on synthetic data,
PDBAL consistently outperforms untargeted active learning approaches based on expected information gain
and variance sampling. In a study using real cancer drug data to find drugs with high therapeutic index,
PDBAL learns a model that accurately detects effective drugs after seeing only 10% of the total dataset. The
generality and empirical success of PDBAL suggest it has the potential to significantly increase the scale of
modern scientific studies.
1.1 Related work
There is a substantial body of work on active learning and Bayesian experimental design. Here, we outline
some of the most relevant lines of work.
Bayesian active learning.
The seminal work of Lindley (1956) introduced expected information gain (EIG)
as a measure for the value of a new experiment in a Bayesian context. Roughly, EIG measures the change in
the entropy of the posterior distribution of a parameter after conditioning on new data. Inspired by this work,
others have proposed maximizing EIG as a Bayesian active learning strategy (MacKay, 1992; Lawrence
et al., 2002). Noting that computing entropy in parameter space can be expensive for non-parametric models,
Houlsby et al. (2011) rewrite EIG as a mutual information problem over outcomes. Their method, Bayesian
active learning by disagreement (BALD), is used for Gaussian process classification. BALD has inspired
a large body of work in developing EIG-based active learning strategies, particularly for Bayesian neural
networks (Gal et al., 2017; Kirsch et al., 2019). However, despite its popularity, EIG can be shown to be
suboptimal for reducing prediction error in general (Freund et al., 1997).
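As a sketch of the outcome-space form of EIG used by BALD (our own hypothetical code, assuming access to per-sample predictive class probabilities from an approximate posterior), the acquisition score is the mutual information between the label and the parameters: the entropy of the posterior-averaged predictive distribution minus the average entropy of the per-sample predictive distributions.

```python
import numpy as np

def bald_score(probs):
    """BALD-style mutual information for one candidate query.

    probs: array of shape (n_posterior_samples, n_classes); each row is the
    predictive distribution p(y | x, theta_s) under one posterior sample theta_s.
    """
    eps = 1e-12
    mean_p = probs.mean(axis=0)
    # Entropy of the posterior-averaged predictive distribution.
    entropy_of_mean = -np.sum(mean_p * np.log(mean_p + eps))
    # Average entropy of the per-sample predictive distributions.
    mean_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return entropy_of_mean - mean_entropy

# Candidates where posterior samples disagree score higher than ones where they agree.
disagree = np.array([[0.9, 0.1], [0.1, 0.9]])
agree = np.array([[0.6, 0.4], [0.6, 0.4]])
print(bald_score(disagree) > bald_score(agree))  # True
```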
One alternative to such information gain strategies is query by committee (QBC) (Seung et al., 1992;
Freund et al., 1997), which more directly seeks to shrink the parameter space by querying points that elicit
maximum disagreement among a committee of predictors. Recently, Riis et al. (2022) applied QBC to
a Bayesian regression setup. Their method, B-QBC, reduces to choosing experiments that maximize the
posterior variance of the mean predictor. For the special case of Gaussian models with homoscedastic
observation noise, this is equivalent to EIG.
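A minimal sketch of the B-QBC-style acquisition described above (our own illustration, assuming posterior samples of the mean predictor's value at each candidate input): score each candidate by the posterior variance of its predicted mean and query the highest-variance point.

```python
import numpy as np

def bqbc_acquisition(mean_predictions):
    """Posterior variance of the mean predictor at each candidate input.

    mean_predictions: array of shape (n_posterior_samples, n_candidates), where
    entry (s, i) is E[y | x_i, theta_s] under posterior sample theta_s.
    Returns the index of the highest-variance candidate and all variances.
    """
    variances = mean_predictions.var(axis=0)
    return int(np.argmax(variances)), variances

# Three candidates; the posterior samples disagree most about the last one.
preds = np.array([[0.1, 1.0, -2.0],
                  [0.2, 1.1,  2.5],
                  [0.1, 0.9,  0.3]])
best, var = bqbc_acquisition(preds)
print(best)   # 2 -- the "committee" of posterior samples disagrees most there
```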
Another Bayesian active learning departure from EIG is the decision-theoretic approach of Fisher et al.
(2021), called GAUSSED, based on Bayes’ risk minimization. The objective function in GAUSSED is similar
to the PDBAL objective in the special case of homoscedastic location models with an untargeted squared-error
distance function over the entire latent parameter space.