A lower confidence sequence for the changing mean of
non-negative right heavy-tailed observations with bounded
mean
Paul Mineiro
October 21, 2022
Abstract
A confidence sequence (CS) is an anytime-valid sequential inference primitive which produces
an adapted sequence of sets for a predictable parameter sequence with a time-uniform coverage
guarantee. This work constructs a non-parametric non-asymptotic lower CS for the running average
conditional expectation whose slack converges to zero given non-negative right heavy-tailed
observations with bounded mean. Specifically, when the variance is finite, the approach dominates
the empirical Bernstein supermartingale of Howard et al. [5]; with infinite variance, it can
adapt to a known or unknown (1 + δ)-th moment bound; and it can be efficiently approximated
using a sublinear number of sufficient statistics. In certain cases this lower CS can be converted
into a closed-interval CS whose width converges to zero, e.g., any bounded realization, or post
contextual-bandit inference with bounded rewards and unbounded importance weights. A reference
implementation and example simulations demonstrate the technique.
1 Introduction
Recently the A/B testing and contextual bandit communities have embraced anytime-valid strategies
to facilitate composition of arbitrary decision logic into online experimental procedures [6, 8]. A
confidence sequence (CS) is an anytime-valid sequential inference primitive which, for any $\alpha \in (0, 1)$,
produces an adapted sequence $CI_t$ of sets for a predictable parameter sequence of interest $\theta_t$ with the
guarantee $\Pr\left(\forall t \geq 1 : \theta_t \in CI_t\right) \geq 1 - \alpha$; non-parametric non-asymptotic variants have broad utility.
This work constructs a robust lower CS for the average conditional expectation of an adapted sequence
of non-negative right heavy-tailed observations. The basic approach assumes a non-negative scalar
observation with bounded mean and produces a lower CS for the running mean, i.e., a CS of the
form $CI_t = [L_t, \infty)$. A lower CS has immediate utility for pessimistic decision scenarios such as
gated deployment, i.e., requiring changes to production environments to certify improvement with
high probability [8]. Furthermore, even with unbounded observations, this lower CS can sometimes be
converted into a closed-interval CS, e.g., post contextual-bandit inference with bounded rewards and
unbounded importance weights [15]. Bounded observations always admit a closed-interval CS; applying
the proposed approach remains sensible despite all moments being finite, as it is both theoretically and
empirically superior to the empirical Bernstein supermartingale of Howard et al. [5].
Contributions
In Section 3, we introduce a novel test supermartingale for the running mean. We prove the following:
the supermartingale dominates the empirical Bernstein supermartingale of Howard et al. [5]; does not
require finite variance to converge, but can instead adapt to a known or unknown (1 + δ)-th moment
bound; and can be efficiently approximated using a sublinear number of sufficient statistics. The result
is the eminently practical doubly-discrete robust mixture of Theorem 4. We provide simulations in
Section 4; code to reproduce all results is available at https://github.com/microsoft/csrobust.

arXiv:2210.11133v1 [stat.ML] 20 Oct 2022
2 Related Work
Anytime-valid sequential inference is an active research area with a rich history dating back to Wald
[12]. Here we only discuss aspects relevant to the current work, and we refer the interested reader to
the excellent overview contained in Waudby-Smith and Ramdas [14].
Waudby-Smith and Ramdas [14] derive time-uniform confidence sequences for a fixed parameter
and bounded observations. Their constructions can produce a lower CS when observations are
unbounded above with bounded mean. Unfortunately their techniques are only applicable to a fixed
parameter: when the parameter is changing, their techniques cover a data-dependent weighted
mixture, providing limited utility. The fixed parameter case is not restricted to stationary processes,
e.g., sampling without replacement is governed by a fixed parameter despite the data distributions
predictably changing. Nonetheless, the fixed parameter restriction limits the applicability, e.g., when
the conditional mean is changing over time.
Wang and Ramdas [13] improve upon Waudby-Smith and Ramdas [14] by leveraging Catoni-
style influence functions, including an infinite variance result for a known (1 + δ)-th moment bound.
Adapting to an unknown bound is provably impossible in their setting due to Bahadur and Savage [1],
whereas our more restrictive assumption of bounded mean allows us to adapt. The approach shares
the limitations of Waudby-Smith and Ramdas [14] with respect to a changing parameter.
Howard et al. [5] describe multiple non-asymptotic boundaries for the running average conditional
expectation. In particular, they propose the empirical Bernstein boundary, which has zero asymptotic
slack for lower-bounded right heavy-tailed observations with bounded mean and finite variance, e.g.,
log-normally distributed observations. However, all of their one-sided discrete-time boundaries require
finite variance for asymptotically zero slack.
3 Derivations
The following novel construction is a test supermartingale qua Shafer et al. [10].
Definition 1 (Heavy NSM). Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \in \mathbb{N}}, P)$ be a filtered probability space, and let $\{X_t\}_{t \in \mathbb{N}}$ be
an adapted non-negative $\mathbb{R}$-valued random process with bounded mean $E_{t-1}[X_t] \leq 1$. Let $\{\hat{X}_t\}_{t \in \mathbb{N}}$ be a
predictable $[0, 1]$-valued random process, and let $\lambda \in [0, 1)$ be a constant bet. Then
\[
E_t(\lambda) \doteq \exp\left(\lambda \left(\sum_{s \leq t} \hat{X}_s - \sum_{s \leq t} E_{s-1}[X_s]\right)\right) \prod_{s \leq t} \left(1 + \lambda \left(X_s - \hat{X}_s\right)\right). \tag{1}
\]
Definition 1 is a non-negative supermartingale: letting $Y_t \doteq X_t - E_{t-1}[X_t]$ and $\delta_t \doteq \hat{X}_t - E_{t-1}[X_t]$,
\[
E_{t-1}\left[\frac{E_t(\lambda)}{E_{t-1}(\lambda)}\right] = E_{t-1}\left[\exp\left(\lambda \delta_t\right)\left(1 + \lambda \left(Y_t - \delta_t\right)\right)\right] \overset{(a)}{=} \exp\left(\lambda \delta_t\right)\left(1 - \lambda \delta_t\right) \overset{(b)}{\leq} 1,
\]
where (a) is due to $E_{t-1}[Y_t] = 0$ and $\delta_t$ predictable, and (b) is due to $e^x(1 - x) \leq 1$.
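The wealth process of Definition 1 is straightforward to compute and to sanity-check by simulation. The following sketch (function and variable names are illustrative, not the paper's reference implementation) accumulates the log-wealth of Eq. (1) and Monte-Carlo-checks that the expected one-step wealth is at most one:

```python
import math
import random

def heavy_nsm_log_wealth(xs, xhats, cond_means, lam):
    """log E_t(lam) of Eq. (1) for non-negative observations xs with
    conditional means cond_means <= 1, predictions xhats in [0, 1],
    and a constant bet lam in [0, 1)."""
    assert 0.0 <= lam < 1.0
    log_w = 0.0
    for x, xhat, mu in zip(xs, xhats, cond_means):
        # per-round factor exp(lam * (xhat - mu)) * (1 + lam * (x - xhat));
        # the product term is positive since x >= 0, xhat <= 1, lam < 1
        log_w += lam * (xhat - mu) + math.log1p(lam * (x - xhat))
    return log_w

# Monte-Carlo check of the supermartingale property at t = 1:
# E[wealth] = exp(lam * delta) * (1 - lam * delta) <= 1
random.seed(0)
mu, xhat, lam = 0.5, 0.3, 0.5
draws = [heavy_nsm_log_wealth([random.expovariate(1.0 / mu)], [xhat], [mu], lam)
         for _ in range(100000)]
avg_wealth = sum(math.exp(lw) for lw in draws) / len(draws)
```

Here the observations are exponential with mean $1/2 \leq 1$; the average realized wealth concentrates near $e^{\lambda \delta}(1 - \lambda \delta) \approx 0.995 \leq 1$, consistent with the derivation above.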
Statistical Considerations
Empirical Bernstein Dominance Definition 1 is essentially the empirical Bernstein supermartingale
of Howard et al. [5, Section A.8] without the slack from the Fan et al. [3] inequality. Consequently
it is straightforward to show dominance.
Definition 2 (Empirical Bernstein Supermartingale).
\[
B_t(\lambda) = \exp\left(\lambda \sum_{s \leq t} Y_s - \psi_E(\lambda) \sum_{s \leq t} \left(X_s - \hat{X}_s\right)^2\right), \tag{2}
\]
where $\psi_E(x) \doteq -x - \log(1 - x)$.
Theorem 1. For the same $\lambda$ and $\hat{X}$, Eq. (1) is at least as wealthy as Eq. (2).
Proof. The log wealth is
\[
\begin{aligned}
\log\left(E_t(\lambda)\right) &= \lambda \sum_{s \leq t} Y_s - \sum_{s \leq t} \left(\lambda \left(X_s - \hat{X}_s\right) - \log\left(1 + \lambda \left(X_s - \hat{X}_s\right)\right)\right) \\
&= \lambda \sum_{s \leq t} Y_s - \sum_{s \leq t} \psi_E\left(\lambda \left(\hat{X}_s - X_s\right)\right) && \left(\psi_E(x) \doteq -x - \log(1 - x)\right) \\
&\geq \lambda \sum_{s \leq t} Y_s - \psi_E(\lambda) \sum_{s \leq t} \left(X_s - \hat{X}_s\right)^2 && \text{(Fan et al. [3])}.
\end{aligned}
\]
Thus Eq. (1) inherits the asymptotic guarantee of the mixed empirical Bernstein martingale.
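The pathwise dominance of Theorem 1 can be checked numerically. The sketch below (helper names are illustrative) computes both log-wealths on a simulated path; the difference is non-negative for any $\lambda \in [0, 1)$ regardless of the data:

```python
import math
import random

def log_wealth_heavy(xs, xhats, mus, lam):
    """log E_t(lam) of Eq. (1): per-round lam*(xhat - mu) + log(1 + lam*(x - xhat))."""
    return sum(lam * (xh - mu) + math.log1p(lam * (x - xh))
               for x, xh, mu in zip(xs, xhats, mus))

def log_wealth_eb(xs, xhats, mus, lam):
    """log B_t(lam) of Eq. (2), with psi_E(x) = -x - log(1 - x)."""
    psi_e = -lam - math.log1p(-lam)
    return sum(lam * (x - mu) - psi_e * (x - xh) ** 2
               for x, xh, mu in zip(xs, xhats, mus))

random.seed(1)
xs = [random.expovariate(2.0) for _ in range(1000)]  # non-negative, mean 1/2
xhats = [0.5] * len(xs)
mus = [0.5] * len(xs)
for lam in (0.05, 0.25, 0.5, 0.8):
    assert log_wealth_heavy(xs, xhats, mus, lam) >= log_wealth_eb(xs, xhats, mus, lam)
```

Note the gap between the two log-wealths does not depend on the conditional means, so dominance holds on every realization, not merely in expectation.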
Definition 3 (Mixture boundary). For any probability distribution $F$ on $[0, \lambda_{\max})$ and $\alpha \in (0, 1]$,
\[
M_\alpha\left(X_{s \leq t}, \hat{X}_{s \leq t}\right) \doteq \sup\left\{y \in \mathbb{R} : \int_0^{\lambda_{\max}} \exp\left(\lambda y - \sum_{s \leq t} \psi_E\left(\lambda \left(\hat{X}_s - X_s\right)\right)\right) dF(\lambda) \leq \frac{1}{\alpha}\right\}
\]
is a time-uniform crossing boundary with probability at most $\alpha$ for $Y_t = \sum_{s \leq t}\left(X_s - E_{s-1}[X_s]\right)$.
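A mixture boundary of this form can be approximated numerically: using the log-wealth identity $\log E_t(\lambda) = \lambda Y_t - \sum_{s \leq t} \psi_E(\lambda(\hat{X}_s - X_s))$, substitute a candidate value $y$ for $Y_t$, replace the integral $dF$ by a discrete grid, and bisect over $y$. The sketch below is one such numerical stand-in (the uniform grid mixture and helper names are illustrative, not the paper's choices):

```python
import math

def mixture_log_wealth(xs, xhats, y, lambdas, weights):
    """log of sum_j w_j * exp(lam_j * y - sum_s psi_E(lam_j * (xhat_s - x_s))),
    i.e. the mixture wealth with candidate value y in place of Y_t."""
    terms = []
    for lam, w in zip(lambdas, weights):
        s = sum(math.log1p(lam * (x - xh)) - lam * (x - xh)
                for x, xh in zip(xs, xhats))
        terms.append(math.log(w) + lam * y + s)
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def boundary(xs, xhats, alpha, lam_max=0.8, n_grid=50, n_bisect=80):
    """sup{ y : mixture wealth <= 1/alpha }, by bisection. The wealth is
    increasing in y and at most 1 at y = 0, so 0 is a valid lower bracket."""
    lambdas = [lam_max * (j + 0.5) / n_grid for j in range(n_grid)]
    weights = [1.0 / n_grid] * n_grid  # uniform discrete stand-in for dF
    target = math.log(1.0 / alpha)
    lo, hi = 0.0, 10.0 * (len(xs) + 1.0)
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if mixture_log_wealth(xs, xhats, mid, lambdas, weights) <= target:
            lo = mid
        else:
            hi = mid
    return lo

# lower CS for the running mean via Y_t >= -M_alpha rearrangement:
xs = [0.5] * 100
lower = (sum(xs) - boundary(xs, [0.5] * 100, alpha=0.05)) / len(xs)
```

On this toy path of constant observations $X_s = \hat{X}_s = 1/2$, the resulting lower bound on the running mean sits below $1/2$, as it must.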
Proposition 2 (Howard et al. [5]). If $F$ is absolutely continuous w.r.t. Lebesgue measure in
a neighborhood around zero, the mixture boundary is upper bounded by $\sqrt{v \log\left(\frac{v}{2 \pi \alpha^2 f^2(0)}\right)} + o(1)$ as
$v \to \infty$, where $v = \sum_{s \leq t}\left(X_s - \hat{X}_s\right)^2$, and $f(0) = \frac{dF}{d\lambda}(0)$.
Computationally this is less felicitous, as closed-form conjugate mixtures are not available for Eq. (1).
We revisit computational issues later in this section.
Heavy Tailed Results When the conditional second moment is not bounded, Proposition 2
provides no guarantee because the variance process grows superlinearly. However, unlike the empirical
Bernstein process, Definition 1 can induce confidence sequences that shrink to zero asymptotically even
if the conditional second moment is unbounded, and can adapt to an unknown $(1 + \delta)$-th moment
bound. This is essentially because the function $x - \log(1 + x)$ asymptotically grows more slowly
than $x^q$ for any $q > 1$.
Lemma 1 (q-growth). For any $q \in (1, 2]$ and $\lambda \in \left[0, 1 + W_0\left(-e^{-2}\right)\right] \subseteq [0, 0.841]$, where $W_0(z)$ is the
principal branch of the Lambert W function,
\[
\lambda x - \log(1 + \lambda x) \leq \lambda^q \left(1_{x \leq 0}\, x^2 + 1_{x > 0} \min\left(x^2, c(q)\, x^q\right)\right),
\]
where for $q < 2$, $c(q) \doteq x(q)^{2 - q}$ where $x(q) > 0$ uniquely solves
\[
q = \frac{x(q)^2}{\left(1 + x(q)\right)\left(x(q) - \log\left(1 + x(q)\right)\right)},
\]
and $\lim_{q \to 2} c(q) = 1$ defines $c(2)$.
Proof. See Appendix A.
For example, when $q = \frac{3}{2}$, $c(q) \approx 1.35$. Combining the q-growth lemma with a modified version of
Laplace's method yields Theorem 2.
Theorem 2 (q-asymptotics). For the mixture boundary of Definition 3, for any $q \in (1, 2]$, any
$\lambda_{\max} \in \left(0, 1 + W_0\left(-e^{-2}\right)\right] \subseteq (0, 0.841]$, and any $F$ absolutely continuous w.r.t. Lebesgue measure in a
neighborhood of zero with $\frac{dF}{d\lambda}(\lambda) = f(0) \lambda^{q/2 - 1} + O\left(\lambda^{q/2}\right)$ and $f(0) > 0$, the mixture boundary is at
most
\[
M_\alpha\left(X_t, \hat{X}_t\right) \leq v^{1/q} \left(\log\left(\frac{v}{\alpha f(0) a(q)}\right)\left(1 + o(1)\right)\right)^{\frac{q-1}{q}},
\]
where
\[
v \doteq \sum_{s \leq t} \left(1_{X_s \leq \hat{X}_s}\left(X_s - \hat{X}_s\right)^2 + 1_{X_s > \hat{X}_s} \min\left(\left(X_s - \hat{X}_s\right)^2, c(q)\left(X_s - \hat{X}_s\right)^q\right)\right),
\]
\[
a(q) \doteq \sqrt{\frac{2 \pi q}{4 (q - 1) v}} \exp\left(\frac{1}{q}\left(\frac{1}{q - 1}\right)^{\frac{1}{q}} \frac{q}{q - 1}\right),
\]
with $c(q)$ as in Lemma 1.
Proof. See Appendix A.
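The constant $c(q)$ of Lemma 1 is defined only implicitly, but the implicit equation is easy to solve numerically: the ratio $x^2 / ((1+x)(x - \log(1+x)))$ decreases monotonically from $2$ (as $x \to 0$) to $1$ (as $x \to \infty$), so bisection applies. A sketch (names illustrative):

```python
import math

def c_of_q(q, lo=1e-6, hi=1e6, iters=200):
    """Solve q = x^2 / ((1 + x)(x - log(1 + x))) for x(q) > 0 and return
    c(q) = x(q)^(2 - q); g below decreases from 2 (x -> 0) to 1 (x -> inf)."""
    assert 1.0 < q < 2.0
    def g(x):
        return x * x / ((1.0 + x) * (x - math.log1p(x)))
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric bisection over (0, inf)
        if g(mid) > q:
            lo = mid
        else:
            hi = mid
    x_q = math.sqrt(lo * hi)
    return x_q ** (2.0 - q)
```

This reproduces the example above, $c(3/2) \approx 1.35$, and approaches the limiting value $c(2) = 1$ as $q \to 2$.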
Theorem 2 gives a rate of $O\left(v^{1/q} (\log v)^{(q-1)/q}\right)$; for comparison, the $q$-th-moment law of the iterated
logarithm is $O\left(v^{1/q} (\log \log v)^{(q-1)/q}\right)$ [11]. Thus, like the finite variance case, the mixture method
achieves the LIL rate to within a logarithmic factor. Note the quantity $v$ appearing in Theorem 2 is
for analysis only and need not be explicitly computed; rather, Definition 3 is used directly. However, Theorem 2
cannot directly adapt to an unknown moment bound, as it requires a specification of the moment being
bounded in order to construct the mixture distribution with the appropriate integrable singularity at
the origin. Given Lemma 1, it is reasonable to seek adaptation to an unknown moment bound,$^1$ which
we achieve via a discrete mixture over Theorem 2.
Corollary 1 (q-adaptive). For $\lambda_{\max} \in \left(0, 1 + W_0\left(-e^{-2}\right)\right] \subseteq (0, 0.841]$, let
\[
\frac{dF}{d\lambda}(\lambda) = \sum_{k=0}^{\infty} w_k\, \frac{q(k)}{2}\, \lambda_{\max}^{-q(k)/2}\, \lambda^{q(k)/2 - 1},
\]
where $q(k) = 1 + \eta^k$, $\eta \in (0, 1)$, and $1 = \sum_{k=0}^{\infty} w_k$. Then for any $q \in (1, 2]$, the mixture of Definition 3
guarantees
\[
M_\alpha\left(X_t, \hat{X}_t\right) \leq v^{1/\tilde{q}} \left(\log\left(\frac{v}{\alpha\, w_{k(q)}\, f(0)\, a(\tilde{q})}\right)\left(1 + o(1)\right)\right)^{\frac{\tilde{q}-1}{\tilde{q}}},
\]
where
\[
v \doteq \sum_{s \leq t} \left(1_{X_s \leq \hat{X}_s}\left(X_s - \hat{X}_s\right)^2 + 1_{X_s > \hat{X}_s} \min\left(\left(X_s - \hat{X}_s\right)^2, c(\tilde{q})\left(X_s - \hat{X}_s\right)^{\tilde{q}}\right)\right),
\]
\[
k(q) \doteq \left\lceil \log_\eta(q - 1) \right\rceil, \qquad \tilde{q} \doteq 1 + \eta^{k(q)} = q - (q - 1)\left(1 - \eta^{\Delta(q)}\right) \geq q - (q - 1)(1 - \eta),
\]
\[
\Delta(q) \doteq \left\lceil \log_\eta(q - 1) \right\rceil - \log_\eta(q - 1) \in [0, 1),
\]
with $a(q)$ as in Theorem 2 and $c(q)$ as in Lemma 1.
$^1$Note the impossibility result of Bahadur and Savage [1] does not apply because the mean is bounded.
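The bookkeeping behind the adaptive corollary, the geometric grid $q(k) = 1 + \eta^k$, the index $k(q)$, and the effective moment $\tilde{q}$, is simple to compute. A sketch with illustrative names, checking the stated bound $\tilde{q} \geq q - (q - 1)(1 - \eta)$:

```python
import math

def k_of_q(q, eta):
    """k(q) = ceil(log_eta(q - 1)) for eta in (0, 1) and q in (1, 2]."""
    return math.ceil(math.log(q - 1.0) / math.log(eta))

def q_tilde(q, eta):
    """Effective moment 1 + eta^{k(q)}: never exceeds the true q, and loses
    at most (q - 1)(1 - eta) relative to q, as in the corollary."""
    return 1.0 + eta ** k_of_q(q, eta)

eta = 0.5
for q in (1.1, 1.5, 2.0):
    qt = q_tilde(q, eta)
    assert 1.0 < qt <= q
    assert qt >= q - (q - 1.0) * (1.0 - eta)
```

For instance, with $\eta = 1/2$ and a true moment $q = 1.1$, the grid snaps to $\tilde{q} = 1 + 2^{-4} = 1.0625$, which still satisfies $\tilde{q} \geq 1.05 = q - (q-1)(1-\eta)$.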