A lower confidence sequence for the changing mean of
non-negative right heavy-tailed observations with bounded
mean
Paul Mineiro
October 21, 2022
Abstract
A confidence sequence (CS) is an anytime-valid sequential inference primitive which produces
an adapted sequence of sets for a predictable parameter sequence with a time-uniform coverage
guarantee. This work constructs a non-parametric non-asymptotic lower CS for the running average
conditional expectation whose slack converges to zero given non-negative right heavy-tailed
observations with bounded mean. Specifically, when the variance is finite, the approach dominates
the empirical Bernstein supermartingale of Howard et al. [5]; with infinite variance, it can
adapt to a known or unknown (1 + δ)-th moment bound; and it can be efficiently approximated
using a sublinear number of sufficient statistics. In certain cases this lower CS can be converted
into a closed-interval CS whose width converges to zero, e.g., any bounded realization, or post
contextual-bandit inference with bounded rewards and unbounded importance weights. A reference
implementation and example simulations demonstrate the technique.
1 Introduction
Recently the A/B testing and contextual bandit communities have embraced anytime-valid strategies
to facilitate composition of arbitrary decision logic into online experimental procedures [6, 8]. A
confidence sequence (CS) is an anytime-valid sequential inference primitive which, for any $\alpha \in (0, 1)$,
produces an adapted sequence $CI_t$ of sets for a predictable parameter sequence of interest $\theta_t$ with the
guarantee $\Pr\left(\forall t \geq 1 : \theta_t \in CI_t\right) \geq 1 - \alpha$; non-parametric non-asymptotic variants have broad utility.
This work constructs a robust lower CS for the average conditional expectation of an adapted sequence
of non-negative right heavy-tailed observations. The basic approach assumes a non-negative scalar
observation with bounded mean and produces a lower CS for the running mean, i.e., a CS of the
form $CI_t = [L_t, \infty)$. A lower CS has immediate utility for pessimistic decision scenarios such as
gated deployment, i.e., requiring changes to production environments to certify improvement with
high probability [8]. Furthermore, even with unbounded observations, this lower CS can sometimes be
converted into a closed-interval CS, e.g., post contextual-bandit inference with bounded rewards and
unbounded importance weights [15]. Bounded observations always admit a closed-interval CS; applying
the proposed approach remains sensible despite all moments being finite, as it is both theoretically and
empirically superior to the empirical Bernstein supermartingale of Howard et al. [5].
Contributions
In Section 3, we introduce a novel test supermartingale for the running mean. We prove the following:
the supermartingale dominates the empirical Bernstein supermartingale of Howard et al. [5]; does not
require finite variance to converge, but can instead adapt to a known or unknown (1 + δ)-th moment
bound; and can be efficiently approximated using a sublinear number of sufficient statistics. The result
is the eminently practical doubly-discrete robust mixture of Theorem 4. We provide simulations in
Section 4; code to reproduce all results is available at https://github.com/microsoft/csrobust.

arXiv:2210.11133v1 [stat.ML] 20 Oct 2022
2 Related Work
Anytime-valid sequential inference is an active research area with a rich history dating back to Wald
[12]. Here we only discuss aspects relevant to the current work, and we refer the interested reader to
the excellent overview contained in Waudby-Smith and Ramdas [14].
Waudby-Smith and Ramdas [14] derive time-uniform confidence sequences for a fixed parameter
and bounded observations. Their constructions can produce a lower CS when observations are
unbounded above with bounded mean. Unfortunately their techniques are only applicable to a fixed
parameter: when the parameter is changing, their techniques cover a data-dependent weighted
mixture, providing limited utility. The fixed parameter case is not restricted to stationary processes,
e.g., sampling without replacement is governed by a fixed parameter despite the data distributions
predictably changing. Nonetheless, the fixed parameter restriction limits the applicability, e.g., when
the conditional mean is changing over time.
Wang and Ramdas [13] improve upon Waudby-Smith and Ramdas [14] by leveraging Catoni-
style influence functions, including an infinite variance result for a known (1 + δ)-th moment bound.
Adapting to an unknown bound is provably impossible in their setting due to Bahadur and Savage [1],
whereas our more restrictive assumption of bounded mean allows us to adapt. The approach shares
the limitations of Waudby-Smith and Ramdas [14] with respect to a changing parameter.
Howard et al. [5] describe multiple non-asymptotic boundaries for the running average conditional
expectation. In particular, they propose the empirical Bernstein boundary, which has zero asymptotic
slack for lower-bounded right heavy-tailed observations with bounded mean and finite variance, e.g.,
log-normally distributed observations. However, all of their one-sided discrete-time boundaries require
finite variance for asymptotically zero slack.
3 Derivations
The following novel construction is a test supermartingale qua Shafer et al. [10].
Definition 1 (Heavy NSM). Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \in \mathbb{N}}, P)$ be a filtered probability space, and let $\{X_t\}_{t \in \mathbb{N}}$ be
an adapted non-negative $\mathbb{R}$-valued random process with bounded mean $E_{t-1}[X_t] \leq 1$. Let $\{\hat{X}_t\}_{t \in \mathbb{N}}$ be a
predictable $[0, 1]$-valued random process, and let $\lambda \in [0, 1)$ be a constant bet. Then
\[
E_t(\lambda) \doteq \exp\left(\lambda \left(\sum_{s \leq t} \hat{X}_s - \sum_{s \leq t} E_{s-1}[X_s]\right)\right) \prod_{s \leq t} \left(1 + \lambda \left(X_s - \hat{X}_s\right)\right). \tag{1}
\]
Definition 1 is a non-negative supermartingale: letting $Y_t \doteq X_t - E_{t-1}[X_t]$ and $\delta_t \doteq \hat{X}_t - E_{t-1}[X_t]$,
\[
E_{t-1}\left[\frac{E_t(\lambda)}{E_{t-1}(\lambda)}\right] = E_{t-1}\left[\exp\left(\lambda \delta_t\right)\left(1 + \lambda \left(Y_t - \delta_t\right)\right)\right] \overset{(a)}{=} \exp\left(\lambda \delta_t\right)\left(1 - \lambda \delta_t\right) \overset{(b)}{\leq} 1,
\]
where (a) is due to $E_{t-1}[Y_t] = 0$ and $\delta_t$ predictable, and (b) is due to $e^x(1 - x) \leq 1$.
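The wealth process of Definition 1 is straightforward to compute and to sanity-check by simulation. The following sketch (function and variable names are illustrative, not the paper's reference implementation) accumulates the log-wealth of Eq. (1) and Monte-Carlo-checks that the expected one-step wealth is at most one:

```python
import math
import random

def heavy_nsm_log_wealth(xs, xhats, cond_means, lam):
    """log E_t(lam) of Eq. (1) for non-negative observations xs with
    conditional means cond_means <= 1, predictions xhats in [0, 1],
    and a constant bet lam in [0, 1)."""
    assert 0.0 <= lam < 1.0
    log_w = 0.0
    for x, xhat, mu in zip(xs, xhats, cond_means):
        # per-round factor exp(lam * (xhat - mu)) * (1 + lam * (x - xhat));
        # the product term is positive since x >= 0, xhat <= 1, lam < 1
        log_w += lam * (xhat - mu) + math.log1p(lam * (x - xhat))
    return log_w

# Monte-Carlo check of the supermartingale property at t = 1:
# E[wealth] = exp(lam * delta) * (1 - lam * delta) <= 1
random.seed(0)
mu, xhat, lam = 0.5, 0.3, 0.5
draws = [heavy_nsm_log_wealth([random.expovariate(1.0 / mu)], [xhat], [mu], lam)
         for _ in range(100000)]
avg_wealth = sum(math.exp(lw) for lw in draws) / len(draws)
```

Here the observations are exponential with mean $1/2 \leq 1$; the average realized wealth concentrates near $e^{\lambda \delta}(1 - \lambda \delta) \approx 0.995 \leq 1$, consistent with the derivation above.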
Statistical Considerations
Empirical Bernstein Dominance Definition 1 is essentially the empirical Bernstein supermartingale
of Howard et al. [5, Section A.8] without the slack from the Fan et al. [3] inequality. Consequently
it is straightforward to show dominance.
Definition 2 (Empirical Bernstein Supermartingale).
\[
B_t(\lambda) = \exp\left(\lambda \sum_{s \leq t} Y_s - \psi_E(\lambda) \sum_{s \leq t} \left(X_s - \hat{X}_s\right)^2\right), \tag{2}
\]
where $\psi_E(x) \doteq -x - \log(1 - x)$.
Theorem 1. For the same $\lambda$ and $\hat{X}$, Eq. (1) is at least as wealthy as Eq. (2).
Proof. The log wealth is
\[
\begin{aligned}
\log\left(E_t(\lambda)\right) &= \lambda \sum_{s \leq t} Y_s - \sum_{s \leq t} \left(\lambda \left(X_s - \hat{X}_s\right) - \log\left(1 + \lambda \left(X_s - \hat{X}_s\right)\right)\right) \\
&= \lambda \sum_{s \leq t} Y_s - \sum_{s \leq t} \psi_E\left(\lambda \left(\hat{X}_s - X_s\right)\right) && \left(\psi_E(x) \doteq -x - \log(1 - x)\right) \\
&\geq \lambda \sum_{s \leq t} Y_s - \psi_E(\lambda) \sum_{s \leq t} \left(X_s - \hat{X}_s\right)^2 && \text{(Fan et al. [3])}.
\end{aligned}
\]
Thus Eq. (1) inherits the asymptotic guarantee of the mixed empirical Bernstein martingale.
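The pathwise dominance of Theorem 1 can be checked numerically. The sketch below (helper names are illustrative) computes both log-wealths on a simulated path; the difference is non-negative for any $\lambda \in [0, 1)$ regardless of the data:

```python
import math
import random

def log_wealth_heavy(xs, xhats, mus, lam):
    """log E_t(lam) of Eq. (1): per-round lam*(xhat - mu) + log(1 + lam*(x - xhat))."""
    return sum(lam * (xh - mu) + math.log1p(lam * (x - xh))
               for x, xh, mu in zip(xs, xhats, mus))

def log_wealth_eb(xs, xhats, mus, lam):
    """log B_t(lam) of Eq. (2), with psi_E(x) = -x - log(1 - x)."""
    psi_e = -lam - math.log1p(-lam)
    return sum(lam * (x - mu) - psi_e * (x - xh) ** 2
               for x, xh, mu in zip(xs, xhats, mus))

random.seed(1)
xs = [random.expovariate(2.0) for _ in range(1000)]  # non-negative, mean 1/2
xhats = [0.5] * len(xs)
mus = [0.5] * len(xs)
for lam in (0.05, 0.25, 0.5, 0.8):
    assert log_wealth_heavy(xs, xhats, mus, lam) >= log_wealth_eb(xs, xhats, mus, lam)
```

Note the gap between the two log-wealths does not depend on the conditional means, so dominance holds on every realization, not merely in expectation.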
Definition 3 (Mixture boundary). For any probability distribution $F$ on $[0, \lambda_{\max})$ and $\alpha \in (0, 1]$,
\[
M_\alpha\left(X_{s \leq t}, \hat{X}_{s \leq t}\right) \doteq \sup\left\{y \in \mathbb{R} : \int_0^{\lambda_{\max}} \exp\left(\lambda y - \sum_{s \leq t} \psi_E\left(\lambda \left(\hat{X}_s - X_s\right)\right)\right) dF(\lambda) \leq \frac{1}{\alpha}\right\}
\]
is a time-uniform crossing boundary with probability at most $\alpha$ for $Y_t = \sum_{s \leq t}\left(X_s - E_{s-1}[X_s]\right)$.
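A mixture boundary of this form can be approximated numerically: using the log-wealth identity $\log E_t(\lambda) = \lambda Y_t - \sum_{s \leq t} \psi_E(\lambda(\hat{X}_s - X_s))$, substitute a candidate value $y$ for $Y_t$, replace the integral $dF$ by a discrete grid, and bisect over $y$. The sketch below is one such numerical stand-in (the uniform grid mixture and helper names are illustrative, not the paper's choices):

```python
import math

def mixture_log_wealth(xs, xhats, y, lambdas, weights):
    """log of sum_j w_j * exp(lam_j * y - sum_s psi_E(lam_j * (xhat_s - x_s))),
    i.e. the mixture wealth with candidate value y in place of Y_t."""
    terms = []
    for lam, w in zip(lambdas, weights):
        s = sum(math.log1p(lam * (x - xh)) - lam * (x - xh)
                for x, xh in zip(xs, xhats))
        terms.append(math.log(w) + lam * y + s)
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def boundary(xs, xhats, alpha, lam_max=0.8, n_grid=50, n_bisect=80):
    """sup{ y : mixture wealth <= 1/alpha }, by bisection. The wealth is
    increasing in y and at most 1 at y = 0, so 0 is a valid lower bracket."""
    lambdas = [lam_max * (j + 0.5) / n_grid for j in range(n_grid)]
    weights = [1.0 / n_grid] * n_grid  # uniform discrete stand-in for dF
    target = math.log(1.0 / alpha)
    lo, hi = 0.0, 10.0 * (len(xs) + 1.0)
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if mixture_log_wealth(xs, xhats, mid, lambdas, weights) <= target:
            lo = mid
        else:
            hi = mid
    return lo

# lower CS for the running mean via Y_t >= -M_alpha rearrangement:
xs = [0.5] * 100
lower = (sum(xs) - boundary(xs, [0.5] * 100, alpha=0.05)) / len(xs)
```

On this toy path of constant observations $X_s = \hat{X}_s = 1/2$, the resulting lower bound on the running mean sits below $1/2$, as it must.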
Proposition 2 (Howard et al. [5]). If $F$ is absolutely continuous w.r.t. Lebesgue measure in
a neighborhood around zero, the mixture boundary is upper bounded by $\sqrt{v \log\left(\frac{v}{2 \pi \alpha^2 f^2(0)}\right)} + o(1)$ as
$v \to \infty$, where $v = \sum_{s \leq t}\left(X_s - \hat{X}_s\right)^2$, and $f(0) = \frac{dF}{d\lambda}(0)$.
Computationally this is less felicitous, as closed-form conjugate mixtures are not available for Eq. (1).
We revisit computational issues later in this section.
Heavy Tailed Results When the conditional second moment is not bounded, Proposition 2
provides no guarantee because the variance process grows superlinearly. However, unlike the empirical
Bernstein process, Definition 1 can induce confidence sequences that shrink to zero asymptotically even
if the conditional second moment is unbounded, and can adapt to an unknown $(1 + \delta)$-th moment
bound. This is essentially because the function $x - \log(1 + x)$ asymptotically grows more slowly
than $x^q$ for any $q > 1$.
Lemma 1 (q-growth). For any $q \in (1, 2]$ and $\lambda \in \left[0, 1 + W_0\left(-e^{-2}\right)\right] \subseteq [0, 0.841]$, where $W_0(z)$ is the
principal branch of the Lambert W function,
\[
\lambda x - \log(1 + \lambda x) \leq \lambda^q \left(1_{x \leq 0}\, x^2 + 1_{x > 0} \min\left(x^2, c(q)\, x^q\right)\right),
\]
where for $q < 2$, $c(q) \doteq x(q)^{2 - q}$ where $x(q) > 0$ uniquely solves
\[
q = \frac{x(q)^2}{\left(1 + x(q)\right)\left(x(q) - \log\left(1 + x(q)\right)\right)},
\]
and $\lim_{q \to 2} c(q) = 1$ defines $c(2)$.
Proof. See Appendix A.
For example, when $q = \frac{3}{2}$, $c(q) \approx 1.35$. Combining the q-growth lemma with a modified version of
Laplace's method yields Theorem 2.
Theorem 2 (q-asymptotics). For the mixture boundary of Definition 3, for any $q \in (1, 2]$, any
$\lambda_{\max} \in \left(0, 1 + W_0\left(-e^{-2}\right)\right] \subseteq (0, 0.841]$, and any $F$ absolutely continuous w.r.t. Lebesgue measure in a
neighborhood of zero with $\frac{dF}{d\lambda}(\lambda) = f(0) \lambda^{q/2 - 1} + O\left(\lambda^{q/2}\right)$ and $f(0) > 0$, the mixture boundary is at
most
\[
M_\alpha\left(X_t, \hat{X}_t\right) \leq v^{1/q} \left(\log\left(\frac{v}{\alpha f(0) a(q)}\right)\left(1 + o(1)\right)\right)^{\frac{q-1}{q}},
\]
where
\[
v \doteq \sum_{s \leq t} \left(1_{X_s \leq \hat{X}_s}\left(X_s - \hat{X}_s\right)^2 + 1_{X_s > \hat{X}_s} \min\left(\left(X_s - \hat{X}_s\right)^2, c(q)\left(X_s - \hat{X}_s\right)^q\right)\right),
\]
\[
a(q) \doteq \sqrt{\frac{2 \pi q}{4 (q - 1) v}} \exp\left(\frac{1}{q}\left(\frac{1}{q - 1}\right)^{\frac{1}{q}} \frac{q}{q - 1}\right),
\]
with $c(q)$ as in Lemma 1.
Proof. See Appendix A.
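The constant $c(q)$ of Lemma 1 is defined only implicitly, but the implicit equation is easy to solve numerically: the ratio $x^2 / ((1+x)(x - \log(1+x)))$ decreases monotonically from $2$ (as $x \to 0$) to $1$ (as $x \to \infty$), so bisection applies. A sketch (names illustrative):

```python
import math

def c_of_q(q, lo=1e-6, hi=1e6, iters=200):
    """Solve q = x^2 / ((1 + x)(x - log(1 + x))) for x(q) > 0 and return
    c(q) = x(q)^(2 - q); g below decreases from 2 (x -> 0) to 1 (x -> inf)."""
    assert 1.0 < q < 2.0
    def g(x):
        return x * x / ((1.0 + x) * (x - math.log1p(x)))
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric bisection over (0, inf)
        if g(mid) > q:
            lo = mid
        else:
            hi = mid
    x_q = math.sqrt(lo * hi)
    return x_q ** (2.0 - q)
```

This reproduces the example above, $c(3/2) \approx 1.35$, and approaches the limiting value $c(2) = 1$ as $q \to 2$.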
Theorem 2 gives a rate of $O\left(v^{1/q} (\log v)^{(q-1)/q}\right)$; for comparison, the $q$-th-moment law of the iterated
logarithm is $O\left(v^{1/q} (\log \log v)^{(q-1)/q}\right)$ [11]. Thus, like the finite variance case, the mixture method
achieves the LIL rate to within a logarithmic factor. Note the quantity $v$ appearing in Theorem 2 is
for analysis only and need not be explicitly computed; rather, Definition 3 is used directly. However, Theorem 2
cannot directly adapt to an unknown moment bound, as it requires a specification of the moment being
bounded in order to construct the mixture distribution with the appropriate integrable singularity at
the origin. Given Lemma 1, it is reasonable to seek adaptation to an unknown moment bound,$^1$ which
we achieve via a discrete mixture over Theorem 2.
Corollary 1 (q-adaptive). For $\lambda_{\max} \in \left(0, 1 + W_0\left(-e^{-2}\right)\right] \subseteq (0, 0.841]$, let
\[
\frac{dF}{d\lambda}(\lambda) = \sum_{k=0}^{\infty} w_k\, \frac{q(k)}{2}\, \lambda_{\max}^{-q(k)/2}\, \lambda^{q(k)/2 - 1},
\]
where $q(k) = 1 + \eta^k$, $\eta \in (0, 1)$, and $1 = \sum_{k=0}^{\infty} w_k$. Then for any $q \in (1, 2]$, the mixture of Definition 3
guarantees
\[
M_\alpha\left(X_t, \hat{X}_t\right) \leq v^{1/\tilde{q}} \left(\log\left(\frac{v}{\alpha\, w_{k(q)}\, f(0)\, a(\tilde{q})}\right)\left(1 + o(1)\right)\right)^{\frac{\tilde{q}-1}{\tilde{q}}},
\]
where
\[
v \doteq \sum_{s \leq t} \left(1_{X_s \leq \hat{X}_s}\left(X_s - \hat{X}_s\right)^2 + 1_{X_s > \hat{X}_s} \min\left(\left(X_s - \hat{X}_s\right)^2, c(\tilde{q})\left(X_s - \hat{X}_s\right)^{\tilde{q}}\right)\right),
\]
\[
k(q) \doteq \left\lceil \log_\eta(q - 1) \right\rceil, \qquad \tilde{q} \doteq 1 + \eta^{k(q)} = q - (q - 1)\left(1 - \eta^{\Delta(q)}\right) \geq q - (q - 1)(1 - \eta),
\]
\[
\Delta(q) \doteq \left\lceil \log_\eta(q - 1) \right\rceil - \log_\eta(q - 1) \in [0, 1),
\]
with $a(q)$ as in Theorem 2 and $c(q)$ as in Lemma 1.
$^1$Note the impossibility result of Bahadur and Savage [1] does not apply because the mean is bounded.
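The bookkeeping behind the adaptive corollary, the geometric grid $q(k) = 1 + \eta^k$, the index $k(q)$, and the effective moment $\tilde{q}$, is simple to compute. A sketch with illustrative names, checking the stated bound $\tilde{q} \geq q - (q - 1)(1 - \eta)$:

```python
import math

def k_of_q(q, eta):
    """k(q) = ceil(log_eta(q - 1)) for eta in (0, 1) and q in (1, 2]."""
    return math.ceil(math.log(q - 1.0) / math.log(eta))

def q_tilde(q, eta):
    """Effective moment 1 + eta^{k(q)}: never exceeds the true q, and loses
    at most (q - 1)(1 - eta) relative to q, as in the corollary."""
    return 1.0 + eta ** k_of_q(q, eta)

eta = 0.5
for q in (1.1, 1.5, 2.0):
    qt = q_tilde(q, eta)
    assert 1.0 < qt <= q
    assert qt >= q - (q - 1.0) * (1.0 - eta)
```

For instance, with $\eta = 1/2$ and a true moment $q = 1.1$, the grid snaps to $\tilde{q} = 1 + 2^{-4} = 1.0625$, which still satisfies $\tilde{q} \geq 1.05 = q - (q-1)(1-\eta)$.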