
2 Discussion of related work
Games with limited feedback and $O(\sqrt{T})$ regret: In the standard setting of the multi-armed bandit problem, the learner repeatedly obtains rewards (or incurs losses) by choosing from a fixed set of $K$ actions and only gets to see the reward of the chosen action. Algorithms such as EXP3-IX [Neu, 2015] or EXP3.P [Auer et al., 2002] achieve the optimal regret of order $O(\sqrt{KT})$, up to a logarithmic factor, with high probability. A more general setting, closer to ours, was introduced by Seldin et al. [2014]. Given a budget $m \in \llbracket K \rrbracket$, in each round $t$ the learner plays the prediction of one expert $I_t$, then gets to choose a subset of experts $C_t$ with $I_t \in C_t$ whose predictions it observes. A careful adaptation of the EXP3 algorithm to this setting leads to an expected regret of order $O(\sqrt{(K/m)\,T})$, which is optimal up to a logarithmic factor in $K$.
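For concreteness, denote by $R_T$ the expected regret after $T$ rounds (this notation is used here only to compare the two settings). Up to logarithmic factors, the two bounds above read
\[
  R_T^{\mathrm{bandit}} = O\bigl(\sqrt{KT}\bigr)
  \qquad \text{and} \qquad
  R_T^{\text{$m$ observations}} = O\bigl(\sqrt{(K/m)\,T}\bigr),
\]
so that observing $m$ experts per round improves the dependence on $K$ by a factor $\sqrt{m}$, while the case $m = 1$ recovers the standard bandit rate.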
There are two significant differences between our framework and the setting presented by Seldin et al. [2014]. First, we allow the player to combine up to $p$ experts out of $K$ in each round for prediction. Second, we make an additional exp-concavity-type assumption (Assumption 1) on the loss function. These two differences allow us to achieve constant regret bounds (independent of $T$).
Playing multiple arms per round was considered in the literature on multiple-play multi-armed bandits. This problem was investigated under a budget constraint $C$ by Zhou and Tomlin [2018] and Xia et al. [2016]. In each round, the player picks $m$ out of $K$ arms and incurs the sum of their losses. In addition to observing the losses of the played arms, the learner observes a vector of costs which has to be covered by a pre-defined budget $C$. Once the budget is consumed, the game ends. An extension of the EXP3 algorithm yields a strategy for the adversarial setting with regret of order $O(\sqrt{KC \log(K/m)})$. The cost of each arm is assumed to lie in an interval $[c_{\min}, 1]$ for a positive constant $c_{\min}$, so each round consumes between $m\,c_{\min}$ and $m$ units of budget; hence the total number of rounds $T$ in this game satisfies $T = \Theta(C/m)$. Another online problem aims at minimizing the
cumulative regret in an adversarial setting with a small effective range of losses. Gerchinovitz
and Lattimore [2016] have shown the impossibility of regret scaling with the effective range
of losses in the bandit setting, while Thune and Seldin [2018] showed that it is possible to
circumvent this impossibility result if the player is allowed one additional observation per
round. However, it is impossible to achieve a regret dependence on $T$ better than the rate of order $O(\sqrt{T})$ in this setting.
Decoupling exploration and exploitation was considered by Avner et al. [2012]. In each round, the player plays one arm, then chooses one arm out of $K$ whose reward it observes (not necessarily the played arm, as would be required in the canonical multi-armed bandit problem). They devised algorithms for this setting and showed that the dependence on the number of arms $K$ can be improved. However, it is impossible to achieve a regret dependence on $T$ better than $O(\sqrt{T})$.
Prediction with limited expert advice was also investigated by Helmbold and Panizza [1997], Cesa-Bianchi and Lugosi [2006, Chap. 6], and Cesa-Bianchi et al. [2005]. However, in these problems, known as label efficient prediction, the forecaster has full access to the experts' advice but limited information about the past outcomes of the sequence to be predicted. More precisely, the outcome $y_t$ is not necessarily revealed to the learner. In such a framework, the optimal regret is of order $O(\sqrt{T})$.