Empirical Bayes Selection for Value
Maximization
Dominic Coey and Kenneth Hung
Meta
{coey, kenhung}@meta.com
January 3, 2023
Abstract
We study the problem of selecting the best m units from a set of n as m/n → α ∈ (0, 1), where noisy, heteroskedastic measurements of the units' true values are available and the decision-maker wishes to maximize the average true value of the units selected. Given a parametric prior distribution, the empirical Bayes decision rule incurs O_p(n^{-1}) regret relative to the Bayesian oracle that knows the true prior. More generally, if the error in the estimated prior is of order O_p(r_n), regret is O_p(r_n^2). In this sense selecting the best units is easier than estimating their values. We show this regret bound is sharp in the parametric case, by giving an example in which it is attained. Using priors calibrated from a dataset of over four thousand internet experiments, we find that empirical Bayes methods perform well in practice for detecting the best treatments given only a modest number of experiments.
1 Introduction
In many important scientific and economic applications, decision-makers are presented with data on the performance of n units, from which they must select a strict subset for further investigation or treatment. Examples include identifying the best teachers, hospitals, or athletes [6,17,11]; genes associated with particular outcomes [23]; drug candidates [62]; or, in the application we consider in this paper, the treatments in internet experiments with the largest effects on a metric of interest. Each unit is associated with an unobserved true value, which is measured with heteroskedastic noise. The constraint that only m units can be selected arises naturally when the decision-maker has limited resources to devote to the chosen units, and must restrict attention to the most promising candidates.
arXiv:2210.03905v2 [stat.ME] 3 Jan 2023
A desirable feature of a selection procedure is that the aggregate value of its selections is close to the maximum attainable value. Understanding how different selection procedures perform in this respect enables decision-makers to assess how likely they are to succeed in the preceding applications. The empirical Bayes approach to this question involves estimating the unknown prior distribution from which the true values are drawn, and selecting units with the highest estimated posterior means. We show that if the prior distribution is known to lie within some parametric class, empirical Bayes incurs regret of order O_p(n^{-1}) relative to the oracle Bayes procedure in which the prior distribution is known. This is faster than the usual n^{-1/2} rate of convergence in parametric models. In this sense selection is easier than estimation: picking the set of units with the highest values is easier than pinning down the precise values of those units. This generalizes directly to the nonparametric case: regret converges to zero at the square of the rate at which estimation error in the prior converges to zero.
The basic intuition for this result is as follows. First, mistakes, whether of inclusion or exclusion, are only likely to happen for those units whose true values are sufficiently close to a critical threshold. Units comfortably above (or below) that threshold will be correctly selected (or omitted) with high probability (w.h.p.). Second, even those mistakes cannot be too costly, as items incorrectly included or excluded are likely to be marginal — almost good enough to be selected, or almost bad enough to be omitted. The regret, which is the product of two terms corresponding to these factors, will therefore be second-order small. We show that our O_p(n^{-1}) bound is sharp in the parametric case, by constructing an example in which regret is at least Cn^{-1} with non-vanishing probability for some positive constant C.
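As a rough illustration of this rate claim, consider the following toy simulation sketch. It is our own construction, not the paper's code: the Gaussian prior N(0, τ²), the uniform noise distribution, and the method-of-moments estimator of τ² are all illustrative assumptions. It selects the top 10% of units by posterior mean under both the true and the estimated prior, and evaluates the regret defined formally in Section 2.

```python
import numpy as np

def eb_selection_regret(n, tau2=1.0, rng=None):
    """Regret of empirical vs. oracle Bayes top-10% selection under a
    (hypothetical) Gaussian prior mu_i ~ N(0, tau2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    m = n // 10
    sigma = rng.uniform(0.5, 2.0, size=n)        # heteroskedastic noise levels
    mu = rng.normal(0.0, np.sqrt(tau2), size=n)  # true values
    x = rng.normal(mu, sigma)                    # noisy measurements

    # Method-of-moments estimate of tau2, using E[X_i^2] = tau2 + sigma_i^2.
    tau2_hat = max(float(np.mean(x**2 - sigma**2)), 1e-8)

    # Conjugate posterior means: E[mu_i | x_i] = tau2 / (tau2 + sigma_i^2) * x_i.
    theta = tau2 / (tau2 + sigma**2) * x              # oracle
    theta_hat = tau2_hat / (tau2_hat + sigma**2) * x  # empirical Bayes

    def top(t):  # indices of the m largest entries (ties broken arbitrarily)
        return np.argsort(t)[-m:]

    j_bayes, j_eb = top(theta), top(theta_hat)
    # Regret: average oracle posterior mean of the oracle selection,
    # minus that of the empirical Bayes selection.
    return (theta[j_bayes].sum() - theta[j_eb].sum()) / n

rng = np.random.default_rng(1)
for n in (500, 5000, 50000):
    avg = np.mean([eb_selection_regret(n, rng=rng) for _ in range(20)])
    print(n, avg)  # regret should shrink roughly like 1/n
```

By construction the regret is nonnegative, since the oracle selection maximizes the sum of the θ_i, and in this sketch it shrinks much faster than the O_p(n^{-1/2}) error in the estimate of τ² itself, consistent with the squared-rate phenomenon described above.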
We apply this methodology to internet experimentation, where units correspond to experiments, and values to treatment effects. Heteroskedasticity arises because experiments vary in sample size. Technology companies may wish to identify a subset of best- or worst-performing experiments for further investigation, in the former case, as candidates to launch to production, or in the latter case, as candidates to stop early. When the follow-up investigation incurs some cost, it may only be feasible to select a strict subset of experiments for more analysis. In this application, as in others, the cost of mistakes depends on their magnitude, that is, on the difference between the aggregate value of the units selected and the units that should have been selected.
The oracle Bayes decision rule assumes knowledge of the prior distribution of treatment effects.¹ Given the objective of maximizing the expected aggregate value of the selection, the optimal procedure is to select the m units with treatment effects with the highest posterior means. The empirical Bayes approach differs from this only in using the estimated prior instead of the true prior. We compare the performance of the empirical and oracle Bayes procedures in selecting the top 10% of experiments from a sample of 4677. We estimate the prior distribution of true effects with a scale mixture of mean-zero Gaussians, fitted with ebnm [61]. Treating the estimated prior as the truth, we simulate new experiments' true and estimated treatment effects, and evaluate the regret of the empirical Bayes approach on this semi-synthetic data. Consistent with our theoretical results, we find that regret is O_p(n^{-1}). By comparison, identifying the set of the top 10% of experiments with all misclassifications being equally penalized regardless of their magnitude, or estimating the treatment effects of the selected experiments, or estimating the prior distribution itself, are all categorically harder problems, each of which only exhibits convergence at the usual parametric rate.

¹We refer to this as the oracle Bayes decision rule rather than simply the Bayes decision rule throughout, to emphasize that the prior is unknown to the decision-maker.
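For a scale mixture of mean-zero Gaussians, the posterior mean has a convenient closed form: given X | µ ~ N(µ, σ²) and µ ~ Σ_k π_k N(0, τ_k²), each mixture component shrinks X by the factor τ_k²/(τ_k² + σ²), and these shrunken values are averaged with posterior component weights proportional to π_k N(X; 0, τ_k² + σ²). The sketch below implements this standard formula in Python (ebnm itself is an R package; the weights and scales shown are illustrative, not our fitted values):

```python
import numpy as np

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

def posterior_mean_smn(x, sigma2, weights, tau2s):
    """Posterior mean of mu given X = x, where mu ~ sum_k pi_k N(0, tau2_k)
    and X | mu ~ N(mu, sigma2), via conjugacy within each component."""
    weights, tau2s = np.asarray(weights, float), np.asarray(tau2s, float)
    # Posterior weight of component k: proportional to pi_k * N(x; 0, tau2_k + sigma2).
    marg = weights * normal_pdf(x, tau2s + sigma2)
    resp = marg / marg.sum()
    # Within component k, E[mu | x] = tau2_k / (tau2_k + sigma2) * x.
    return float(np.sum(resp * tau2s / (tau2s + sigma2) * x))

# Illustrative prior: a near-null component plus a diffuse component.
w, t2 = [0.7, 0.3], [0.01, 4.0]
print(posterior_mean_smn(3.0, sigma2=1.0, weights=w, tau2s=t2))
```

The heavy shrinkage of the near-null component and the mild shrinkage of the diffuse component produce the soft-thresholding-like behavior characteristic of such priors: small observations are pulled strongly toward zero, while large ones are shrunk only modestly.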
Our work builds on several large and active strands of the statistics and econometrics literature. Foundational work introducing and developing the empirical Bayes approach to statistics includes [51,41,52,21]. Applications of the selection problem have proliferated, as the general problem of discerning between units which perform well or poorly on the basis of noisy, heteroskedastic measurements describes many real-world settings of interest. Previous work has studied identifying the best teachers [39,38,35,10,25], the best medical facilities [56,27,17,36], the best baseball players [22,6]; differentially expressed genes [23,54]; promising drug candidates [62]; geographic areas associated with the greatest intergenerational mobility [4] or mortality [46], and employers exhibiting the most evidence of discrimination [42]. Internet experiments are particularly well-suited to empirical Bayes methods [15,26,2,12,3,30] as datasets are often large enough for accurate estimation of flexibly-specified priors, and the experiment-level sampling error is typically close to normally distributed. For these applications, the aggregate value of the selected units will often be an important component of the decision-maker's utility function. Our results provide theoretical and empirical support for selection based on such methods.
The literature on post-selection inference, including [14,13,33,24,37,1,31], also studies selection problems, but differs from the present work in that its chief focus is estimating the values, differences or ranks of the selected units, rather than analyzing the regret associated with the selection. [14,13] provide estimates for the value of a selected unit. [33,24,37,1,31] largely aim at frequentist inference. While the notion of regret we consider averages over possible draws from the distribution of units' true values, an alternative line of inquiry beyond the scope of this paper would be to characterize admissible and minimax decision rules for the frequentist analog of the regret we define, considering the units' values as fixed constants.
Closely related to our paper are [29,49]. [29] take an empirical Bayes approach to selecting the best units while controlling the marginal false discovery rate; [49] assert frequentist control over the familywise error rate, which amounts to a zero-one loss based on the correctness of the ranks. Both consider loss functions different from ours. In their frameworks, the loss function contains a discontinuity near the oracle Bayes decision rule, e.g. [29, Equation 3.1]. Mistakenly selecting or omitting any unit incurs a discrete cost, whereas in ours the cost of mistakenly selecting or omitting a marginal unit near the selection threshold is small; see also [29, Section 3.2]. We view these as complementary perspectives. While in some decision problems mistakes may be undesirable per se, the aggregate performance of the selected units is typically still a relevant consideration. In the context of teacher evaluation, for example, the policy-maker may rightly be concerned with guarantees over the number of teachers who are incorrectly fired [48], but may also wish to understand how well their selection procedure is performing from the students' perspective, in terms of the aggregate "value-added" of the teachers who keep their jobs. In other contexts, as in internet experimentation or drug discovery, the aggregate value of the selection is the primary concern, and it is harder to justify caring about the number of mistakes per se.
Our selection problem may remind readers of the multi-armed bandit literature: e.g. [53,9] study the problem of identifying the best arm or best m arms with a certain probability. However, to target the probability of correct selection is to consider discontinuous loss functions similar to the ones in [29,49]. We also note that our selection problem is non-sequential, which leads to new challenges, as poor choices of parameters in the prior cannot be overcome with additional samples in the long run.
Finally, our work is particularly tied to the framework introduced in [59], which generalizes the restrictive compound decision framework from [34] to a class of simultaneous decision problems that are permutation invariant. However, the optimal frequentist solution requires knowledge of the empirical c.d.f. (e.c.d.f.), or equivalently the order statistics, of the µ_i's. Thus for practical reasons, we take a more Bayesian approach that also allows analysis of the performance.
Our main contribution relative to this existing literature is to provide the first regret bounds for empirical Bayes selection, and further to show that these bounds are sharp in the parametric case. Our empirical work on internet experimentation complements this by verifying that regret is quantitatively modest in practice, given a reasonable number of experiments of moderate precision. Together, our theoretical and empirical results suggest optimism for empirical Bayes approaches to selection when the decision-maker is primarily concerned with maximizing the aggregate value of the selected units, as opposed to correctly classifying the top units or estimating their values.²

²Convergence and rate results are available for empirical Bayes as applied to compound decision problems, e.g. [34,58,63,32,50], but the compound decision framework is rather restrictive [59], and does not encompass the value maximization problem studied here of selecting the best m of n units.
2 The Top-m Selection Problem
2.1 Setup
There are n units, each of which is associated with an unobserved true value µ_i ∈ R and an observed noise standard deviation σ_i > 0.³ The µ_i and σ_i are distributed independently from each other and independently across experiments with unknown marginal distributions G_0 and H_0, i.e.

(\mu_i, \sigma_i) \sim G_0 \times H_0.

We consider a potentially nonparametric family of nondegenerate prior distributions G(η), where the parameter η belongs to a metric space (M, d), and the family includes the truth G_0, i.e. G_0 = G(η_0) for some η_0 ∈ M. For each unit i, the decision-maker observes a measurement X_i ∈ R, which is distributed as

X_i \mid \mu_i, \sigma_i \sim \mathcal{N}(\mu_i, \sigma_i^2).

The decision-maker must choose m units for some m < n. Their average utility given the index set of choices J ⊂ {1, 2, …, n} is

U(J) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(i \in J)\, \mu_i.

Let

f_{\eta,\sigma_i}(X_i) = \frac{\int_{\mathbb{R}} \mu \, \frac{1}{\sigma_i} \phi\!\left(\frac{X_i - \mu}{\sigma_i}\right) dG(\eta)}{\int_{\mathbb{R}} \frac{1}{\sigma_i} \phi\!\left(\frac{X_i - \mu}{\sigma_i}\right) dG(\eta)}    (1)

denote the posterior mean of µ_i given X_i, σ_i, assuming that the prior distribution of µ_i is G(η), where φ(·) is the probability density function (p.d.f.) of a standard Gaussian. The true posterior mean is f_{η_0,σ_i}(X_i). An estimator η̂ = η̂(X_1, …, X_n, σ_1, …, σ_n) of η_0 is available, where η̂ converges to η_0 at some rate r_n in d(·, ·). It is used to form the empirical Bayes prior, G(η̂), and consequently to construct posterior mean estimates, f_{η̂,σ_i}(X_i). For simplicity we denote f_{η_0,σ_i}(X_i) and f_{η̂,σ_i}(X_i), the oracle and empirical Bayes posterior means for unit i, as θ_i and θ̂_i respectively. Note we can view θ_i as being drawn i.i.d. from a distribution, which we denote P.
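When G(η) has finite support, the ratio of integrals in Equation (1) reduces to a ratio of weighted sums over the atoms of G, which makes the posterior mean easy to evaluate directly. The sketch below uses names of our choosing; a continuous prior would be handled analogously by numerical quadrature.

```python
import numpy as np

def posterior_mean(x, sigma, support, probs):
    """Equation (1) for a prior G with finite support: both integrals
    become weighted sums over the atoms (mu_j, p_j) of G."""
    support, probs = np.asarray(support, float), np.asarray(probs, float)
    # Likelihood of x at each atom mu_j: (1/sigma) * phi((x - mu_j) / sigma).
    lik = np.exp(-0.5 * ((x - support) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    w = probs * lik  # integrand weights, common to numerator and denominator
    return float(np.sum(w * support) / np.sum(w))

# A symmetric two-point prior on {-1, +1}: an observation at x = 0 carries
# no information about the sign, so the posterior mean is 0.
print(posterior_mean(0.0, 1.0, [-1.0, 1.0], [0.5, 0.5]))
```

Note that the posterior mean always lies within the convex hull of the support of G, so observations far outside it are shrunk back toward the prior, which is the mechanism behind the shrinkage-based selection studied here.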
Given the observed data, an oracle Bayesian decision-maker maximizes expected utility (where the expectation is with respect to the posterior distribution over the unknown true values) by selecting the m units with the highest values of θ_i, breaking ties randomly. The empirical Bayes decision-maker mimics this rule, by selecting the m units with the highest values of θ̂_i, breaking ties randomly. Letting J_EB and J_Bayes be the empirical Bayes and oracle Bayes choice sets, the regret from empirical relative to oracle Bayes is

R = \mathbb{E}[U(J_{\mathrm{Bayes}}) \mid X_1, \dots, X_n, \sigma_1, \dots, \sigma_n] - \mathbb{E}[U(J_{\mathrm{EB}}) \mid X_1, \dots, X_n, \sigma_1, \dots, \sigma_n]
  = \frac{1}{n} \sum_{i=1}^{n} \left( \mathbf{1}(i \in J_{\mathrm{Bayes}}) - \mathbf{1}(i \in J_{\mathrm{EB}}) \right) \theta_i.    (2)
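Given vectors of oracle and empirical Bayes posterior means, Equation (2) can be computed directly. In the sketch below (function names are ours) ties are broken arbitrarily rather than randomly, which does not affect the regret when the posterior means are distinct.

```python
import numpy as np

def top_m(values, m):
    """Index set of the m largest entries (ties broken arbitrarily)."""
    return set(np.argsort(values)[-m:])

def regret(theta, theta_hat, m):
    """Equation (2): regret of the empirical Bayes selection relative to
    oracle Bayes, both evaluated in oracle posterior means theta."""
    theta = np.asarray(theta, float)
    j_bayes = top_m(theta, m)
    j_eb = top_m(np.asarray(theta_hat, float), m)
    return (theta[list(j_bayes)].sum() - theta[list(j_eb)].sum()) / len(theta)

# If the estimated posterior means swap one marginal unit into the selection,
# the regret is the gap in oracle posterior means between the swapped units,
# divided by n: here (2.0 - 0.0) / 4.
print(regret([3.0, 2.0, 1.0, 0.0], [3.0, 0.0, 1.0, 2.0], m=2))  # → 0.5
```

Because J_Bayes maximizes the sum of θ_i over all size-m index sets, this quantity is always nonnegative, and it is zero whenever the two rules select the same set, even if the posterior mean estimates themselves are inaccurate.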
³We assume σ_i to be known, in line with past applications of empirical Bayes methods, e.g. [60,30,16].