Constant regret for sequence prediction with limited advice
El Mehdi Saad1, Gilles Blanchard1,2
1Laboratoire de Mathématiques d’Orsay, CNRS, Université Paris-Saclay; 2Inria
Abstract

We investigate the problem of cumulative regret minimization for individual sequence prediction with respect to the best expert in a finite family of size $K$, under limited access to information. We assume that in each round, the learner can predict using a convex combination of at most $p$ experts, and can then observe a posteriori the losses of at most $m$ experts. We assume that the loss function is range-bounded and exp-concave. In the standard multi-armed bandits setting, when the learner is allowed to play only one expert per round and observe only its feedback, known optimal regret bounds are of order $O(\sqrt{KT})$. We show that allowing the learner to play one additional expert per round and observe one additional feedback improves substantially the guarantees on regret. We provide a strategy combining only $p = 2$ experts per round for prediction and observing the losses of $m \ge 2$ experts. Its randomized regret (with respect to the internal randomization of the learner's strategy) is of order $O((K/m)\log(K\delta^{-1}))$ with probability $1 - \delta$, i.e., it is independent of the horizon $T$ ("constant" or "fast rate" regret) if $p \ge 2$ and $m \ge 3$. We prove that this rate is optimal up to a logarithmic factor in $K$. In the case $p = m = 2$, we provide an upper bound of order $O(K^2 \log(K\delta^{-1}))$, with probability $1 - \delta$. Our strategies do not require any prior knowledge of the horizon $T$ nor of the confidence parameter $\delta$. Finally, we show that if the learner is constrained to observe only one expert feedback per round, the worst-case regret is the "slow rate" $\Omega(\sqrt{KT})$, suggesting that synchronous observation of at least two experts per round is necessary to achieve constant regret.

Keywords: Online Learning, Prediction with expert advice, Frugal Learning, Bandit feedback, Partial monitoring.
1 Introduction
We study the problem of online individual sequence prediction with expert advice, based on the setting presented by Cesa-Bianchi and Lugosi [2006, Chap. 2], under limited access to information. In this game, the learner's aim is to predict an unknown sequence $(y_1, y_2, \ldots)$ taking values in an outcome space $\mathcal{Y}$. The mismatch between the learner's predictions $(z_1, z_2, \ldots)$, taking values in a closed convex subset $\mathcal{X}$ of a real vector space, and the target sequence is measured via a loss function $\ell(z, y)$. The learner's predictions may only depend on past observations. Following standard terminology used in prediction games, we will use the word "play" to mean the prediction output by the learner.
In each round $t \in \llbracket T \rrbracket$ (for a non-negative integer $n$, we denote $\llbracket n \rrbracket = \{1, \ldots, n\}$), the learner has access to $K$ experts' predictions $(F_{1,t}, \ldots, F_{K,t})$. The performance of the learner is compared to that of the best single expert. More precisely, the objective is to have a cumulative regret as small as possible, where the regret is defined by
\[
R_T = \sum_{t=1}^{T} \ell(z_t, y_t) - \min_{i \in \llbracket K \rrbracket} \sum_{t=1}^{T} \ell(F_{i,t}, y_t).
\]
Expert aggregation is a standard problem in machine learning, where the learner observes the predictions of all experts in each round and plays a convex combination of those. However, in many practical situations, querying the advice of every expert is unrealistic. Natural constraints arise, such as the financial cost of consultancy, time limitations in online systems, or computational budget constraints if each expert is actually the output of a complex prediction model. One might hope to make predictions in these scenarios while minimizing the underlying cost. Furthermore, we will distinguish between the constraint on the number of experts whose advice is used for prediction, and the number of feedbacks (losses of individual experts) observed a posteriori. This distinction naturally arises in online settings where expert advice is costly prior to the prediction task, but merely observing the experts' reported losses after prediction can be cheaper. If the learner picks one single expert per round, plays the prediction of that expert and observes the resulting loss, the game is the standard multi-armed bandit problem.
In this paper, we investigate intermediate settings, where in each round the player is subject to a constraint $p \le K$ on the number of experts used for prediction (via convex combination) and may observe the losses of $m \le K$ actively chosen experts as feedback. In the standard multi-armed bandit problem, the played arm is necessarily the observed arm; this restriction is known as the coupling between exploitation and exploration. In our protocol, we consider a generalization of that restriction through the Inclusion Condition (IC): when $m \ge p$, if IC = True, we require that the set of experts played for prediction at round $t$, denoted $S_t$, is included in the set of observed experts, denoted $C_t$. More precisely, if IC = True, in each round $t$ the player first chooses $p$ experts out of $K$ and plays a convex combination of their predictions, then observes the feedback (loss) of the individual selected experts, then picks $m - p$ additional experts to observe their losses. When IC = False, the choice of played and observed experts is decoupled; this means that the loss incurred by the $p$ experts used for prediction is not necessarily observed.
A closely related question was considered by Seldin et al. [2014], obtaining $O(\sqrt{T})$ regret bounds for a general loss function (see the extended discussion in the next section). Our emphasis here is on obtaining constant guarantees on regret (i.e., independent of the time horizon $T$). Such "fast" rates, linked to assumptions related to strong convexity of the loss function $\ell$, have been the subject of many works in learning (batch and online, in the stochastic setting) and optimization, but are comparatively under-explored in fixed sequence prediction.
In the literature on the prediction of fixed individual sequences, no assumptions are made about the distribution of the sequences. The attainability of fast rates (or constant regret) is nevertheless possible under certain assumptions on the loss function $\ell$: the full information setting was studied mainly by Vovk [1990], Vovk [1998], Vovk [2001], where it was shown that fast rates are attainable under the mixability assumption on the loss function. The reader can find an extensive discussion of the different assumptions considered in the literature for this problem in van Erven et al. [2015]. In the present paper, we make the following assumption on the loss function:
Protocol 1 The Game Protocol $(p, m, \mathrm{IC})$.
Parameters:
  $p$, the number of experts allowed for prediction.
  $m$, the number of experts allowed for observation as feedback.
  $\mathrm{IC} \in \{\mathrm{False}, \mathrm{True}\}$, inclusion condition (if $\mathrm{IC} = \mathrm{True}$, we must have $p \le m$).
for each round $t = 1, 2, \ldots, T$ do
  Choose a subset $S_t \subseteq \llbracket K \rrbracket$ such that $|S_t| = p$, and convex combination weights $(\alpha_{i,t})_{i \in S_t}$.
  Play the convex combination $\sum_{i \in S_t} \alpha_{i,t} F_{i,t}$ and incur its loss.
  if $\mathrm{IC} = \mathrm{True}$ then
    Choose a subset $C_t \subseteq \llbracket K \rrbracket$ such that $|C_t| = m$ and $S_t \subseteq C_t$.
  else if $\mathrm{IC} = \mathrm{False}$ then
    Choose a subset $C_t \subseteq \llbracket K \rrbracket$ such that $|C_t| = m$.
  end if
  The environment reveals the losses $(\ell(F_{i,t}, y_t))_{i \in C_t}$.
end for
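To fix ideas, here is a minimal Python simulation of Protocol 1. Everything in it (the `UniformLearner`, the squared loss, the synthetic data) is an illustrative placeholder of ours rather than one of the strategies analyzed in this paper; the sketch only makes the order of play, observation, and feedback concrete.

```python
import numpy as np

def sq_loss(z, y):
    """Squared loss; satisfies Assumption 1 on a bounded domain."""
    return (z - y) ** 2

class UniformLearner:
    """Placeholder strategy: plays p uniformly random experts with equal
    weights and observes m uniformly chosen experts. It is NOT one of the
    paper's algorithms; it only exercises the protocol interface."""
    def __init__(self, K, p, m):
        self.K, self.p, self.m = K, p, m
    def play(self, rng):
        S = rng.choice(self.K, size=self.p, replace=False)
        return S, np.full(self.p, 1.0 / self.p)      # S_t and weights alpha
    def observe(self, rng, S):
        if S is None:                                # IC = False: free choice
            return rng.choice(self.K, size=self.m, replace=False)
        others = rng.choice(np.setdiff1d(np.arange(self.K), S),
                            size=self.m - len(S), replace=False)
        return np.concatenate([S, others])           # IC = True: S_t ⊆ C_t
    def update(self, feedback):
        pass  # a real strategy would update its internal state here

def run_protocol(learner, expert_preds, targets, ic=True, seed=0):
    """Run Protocol 1 on given expert predictions; return the regret R_T."""
    rng = np.random.default_rng(seed)
    T, K = expert_preds.shape
    total, per_expert = 0.0, np.zeros(K)
    for t in range(T):
        S, alpha = learner.play(rng)                 # |S_t| = p
        z = float(alpha @ expert_preds[t, S])        # convex combination
        total += sq_loss(z, targets[t])              # loss is incurred first,
        C = learner.observe(rng, S if ic else None)  # then C_t is chosen
        learner.update({i: sq_loss(expert_preds[t, i], targets[t]) for i in C})
        per_expert += sq_loss(expert_preds[t], targets[t])
    return total - per_expert.min()                  # R_T

# Usage: K = 5 experts, p = m = 2, a synthetic sequence of length T = 1000.
rng = np.random.default_rng(1)
preds, ys = rng.uniform(size=(1000, 5)), rng.uniform(size=1000)
print(run_protocol(UniformLearner(K=5, p=2, m=2), preds, ys, ic=True))
```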
Assumption 1. There exist $B, \eta > 0$ such that:
Exp-concavity: for all $y \in \mathcal{Y}$, $\ell(\cdot, y)$ is $\eta$-exp-concave over the domain $\mathcal{X}$.
Range-boundedness: for all $y \in \mathcal{Y}$: $\sup_{x, x' \in \mathcal{X}} |\ell(x, y) - \ell(x', y)| \le B$.

Remarks. This assumption is satisfied in some usual settings of learning theory, such as the least squares loss with bounded outputs: $\mathcal{X} = \mathcal{Y} = [x_{\min}, x_{\max}]$ and $\ell(x, x') = (x - x')^2$. Then $\ell$ satisfies Assumption 1, with $B = (x_{\max} - x_{\min})^2$ and $\eta = 1/(2B)$.
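As a quick sanity check of these constants (a standard computation of ours, not quoted from the paper): writing $g(x) = e^{-\eta (x - x')^2}$ for a fixed $x' \in \mathcal{Y}$,
\[
g''(x) = \bigl(4 \eta^2 (x - x')^2 - 2\eta\bigr)\, e^{-\eta (x - x')^2} \le 0
\quad \Longleftrightarrow \quad
\eta \le \frac{1}{2 (x - x')^2},
\]
and since $(x - x')^2 \le (x_{\max} - x_{\min})^2 = B$ on the domain, $g$ is concave (i.e., $\ell(\cdot, x')$ is $\eta$-exp-concave) for every $\eta \le 1/(2B)$; range-boundedness with the same $B$ is immediate since $\ell$ takes values in $[0, B]$.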
Remarks. The regret, as well as all the algorithms to follow, remain unchanged if we replace $\ell$ by $\tilde{\ell}(\cdot, y) : \mathcal{X} \to [0, B]$ defined by $\tilde{\ell}(x, y) := \ell(x, y) - \min_{x' \in \mathcal{X}} \ell(x', y)$, so we can assume without loss of generality that $\ell$ takes values in $[0, B]$ instead of assuming range-boundedness; the results obtained still hold in the latter, more general case.
Assumption 1 was considered in several previous works tracking fast rates both in batch and online learning (Koren and Levy, 2015; Mehta, 2017; Gonen and Shalev-Shwartz, 2016; Mahdavi et al., 2015; van Erven et al., 2015). We introduce a new characterization for the class of functions satisfying Assumption 1. Let $c > 0$; define $\mathcal{E}(c)$ as the class of functions $f : \mathcal{X} \to \mathbb{R}$ such that
\[
\forall x, x' \in \mathcal{X} : \quad f\!\left(\frac{x + x'}{2}\right) \le \frac{1}{2} f(x) + \frac{1}{2} f(x') - \frac{1}{2c} \bigl(f(x) - f(x')\bigr)^2. \qquad (1)
\]
We introduce this class to highlight the sufficient and minimal property of $\ell$ required for the proofs in this paper to work; namely, we will only make use of (1) in the proofs of the results to come.
Lemma 1.1 below relates the class of functions $\mathcal{E}(\cdot)$ to the set of functions satisfying Assumption 1, as well as to a sufficient condition (Lipschitz and Strongly convex, or LIST, condition).
Lemma 1.1. Let $y \in \mathcal{Y}$ be fixed.
If $\ell(\cdot, y)$ is $B$-range-bounded and $\eta$-exp-concave, then $\ell(\cdot, y) \in \mathcal{E}\!\left(\frac{\eta B^2}{4 \log\left(1 + \frac{\eta^2 B^2}{2}\right)}\right)$.
If $\ell(\cdot, y) \in \mathcal{E}(c)$ and is continuous, then $\ell(\cdot, y)$ is $c$-range-bounded and $(4/c)$-exp-concave.
If $\ell(\cdot, y)$ is $L$-Lipschitz and $\rho$-strongly convex, then $\ell(\cdot, y) \in \mathcal{E}(4L^2/\rho)$.
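As a numerical illustration of condition (1) (the grid, the domain $[0, 1]$, and the target value below are arbitrary choices of ours), the following snippet searches for the smallest admissible $c$ for the squared loss:

```python
import numpy as np

# Numerically find the smallest c such that condition (1),
#   f((x + x')/2) <= f(x)/2 + f(x')/2 - (f(x) - f(x'))^2 / (2c),
# holds on a grid, for the squared loss f(x) = (x - y)^2 on X = [0, 1]
# (so B = 1 and, by the remark above, eta = 1/(2B) = 1/2).
y = 0.0
f = lambda x: (x - y) ** 2
xs = np.linspace(0.0, 1.0, 401)
x, xp = np.meshgrid(xs, xs)

gap = 0.5 * f(x) + 0.5 * f(xp) - f((x + xp) / 2)  # midpoint convexity gap
diff_sq = (f(x) - f(xp)) ** 2
mask = gap > 1e-12                                # skip the diagonal x == x'
c_min = (diff_sq[mask] / (2 * gap[mask])).max()   # (1) <=> c >= diff_sq/(2*gap)
print(f"smallest admissible c on the grid: {c_min:.3f}")  # about 8.0
```

The value found is close to $8 = 4/\eta$ for $\eta = 1/2$, consistent with the calibration in the second statement of Lemma 1.1: a function in $\mathcal{E}(c)$ is $(4/c)$-exp-concave, and the squared loss on $[0, 1]$ is $\tfrac{1}{2}$-exp-concave.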
Figure 1 summarizes bounds on regret for bounded and exp-concave loss functions. We
only consider fixed individual sequences, which corresponds to fully oblivious adversaries (see
Audibert and Bubeck, 2010 for a definition of different types of adversaries).
|           | $p = 1$: Lower bound | $p = 1$: Upper bound | $p \ge 2$: Lower bound | $p \ge 2$: Upper bound ($p = 2$) |
|-----------|----------------------|----------------------|------------------------|----------------------------------|
| $m = 1$   | $\sqrt{KT}$ [1] | $\sqrt{KT}$ [2] | $\sqrt{KT}$ [Thm 5.2] | $\sqrt{KT}$ [2] |
| $m = 2$   | $\sqrt{KT}$ [3] | $\sqrt{KT}$ [2] | $K$ [Thm 5.1] | IC = True: $K^2 \log(K)$; IC = False: $K \log(K)$ [Thm 4.2 and 4.1] |
| $m \ge 3$ | $\sqrt{\frac{K}{m} T}$ [3] | $\sqrt{\frac{K}{m} T \log(K)}$ [3] | $\frac{K}{m}$ [Thm 5.1] | $\frac{K}{m} \log(K)$ [Thm 4.1] |

Figure 1: Existing bounds from the literature ([1] = Auer et al., 2002; [2] = Audibert and Bubeck, 2010; [3] = Seldin et al., 2014) and new bounds presented in this paper. All bounds hold up to numerical constant factors. Under Assumption 1, all new upper bounds hold with high probability if we replace the factor $\log(K)$ with $\log(K\delta^{-1})$, $\delta$ being the confidence parameter. Lower bounds are in expectation. When bounds are the same, we omit the distinction between the settings IC = True and IC = False (coupling between exploration and exploitation, see Protocol 1).
The remainder of this paper is organized as follows. Section 2 presents some results from the literature relevant to the studied problem. Section 3 introduces algorithms achieving constant regret in expectation in the case $p = 2$ and $m \ge 3$; that section aims to present a preliminary view of the intuitions behind attaining our objective. Next, we present in Section 4 our main results, consisting of algorithms achieving constant regret with high probability for $p, m \ge 2$. Finally, in Section 5, we present lower bounds for all the possible settings.
2 Discussion of related work
Games with limited feedback and $O(\sqrt{T})$ regret: In the standard setting of the multi-armed bandit problem, the learner has to repeatedly obtain rewards (or incur losses) by choosing from a fixed set of $K$ actions, and gets to see only the reward of the chosen action. Algorithms such as EXP3-IX [Neu, 2015] or EXP3.P [Auer et al., 2002] achieve the optimal regret of order $O(\sqrt{KT})$, up to a logarithmic factor, with high probability. A more general setting, closer to ours, was introduced by Seldin et al. [2014]. Given a budget $m \in \llbracket K \rrbracket$, in each round $t$ the learner plays the prediction of one expert $I_t$, then gets to choose a subset of experts $C_t$ such that $I_t \in C_t$ in order to observe their losses. A careful adaptation of the EXP3 algorithm to this setting leads to an expected regret of order $O(\sqrt{(K/m)T})$, which is optimal up to a logarithmic factor in $K$.
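For reference, the following is a schematic EXP3-style strategy for this kind of limited-observation setting: the played expert is drawn from exponential weights, $m - 1$ further experts are observed uniformly at random, and each observed loss is importance-weighted by its observation probability. This is our illustrative reconstruction, with placeholder parameter choices, not the exact algorithm of Seldin et al. [2014].

```python
import numpy as np

def exp3_limited_observations(losses, m, eta, seed=0):
    """EXP3-style play with m observed losses per round (illustrative).

    losses: a (T, K) array of expert losses in [0, 1]; the full matrix is
    used only to simulate the environment, the learner sees m entries/round.
    """
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    cum_est = np.zeros(K)            # cumulative importance-weighted losses
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))  # stabilized weights
        p = w / w.sum()
        i_play = rng.choice(K, p=p)
        total += losses[t, i_play]
        # Observe the played expert plus m - 1 others, chosen uniformly.
        others = rng.choice(np.setdiff1d(np.arange(K), [i_play]),
                            size=m - 1, replace=False)
        # P(expert i is observed) = p_i + (1 - p_i) * (m - 1) / (K - 1).
        q = p + (1 - p) * (m - 1) / (K - 1)
        for i in np.concatenate(([i_play], others)).astype(int):
            cum_est[i] += losses[t, i] / q[i]         # unbiased loss estimate
    return total - losses.sum(axis=0).min()           # realized regret
```

Dividing each observed loss by its observation probability keeps the loss estimates unbiased, which is the key device behind $O(\sqrt{(K/m)T})$-type expected regret bounds in this setting; under Assumption 1, the present paper improves on this $\sqrt{T}$ dependence.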
There are two significant differences between our framework and the setting presented by Seldin et al. [2014]. First, we allow the player to combine up to $p$ experts out of $K$ in each round for prediction. Second, we make an additional exp-concavity-type assumption (Assumption 1) on the loss function. These two differences allow us to achieve constant regret bounds (independent of $T$).
Playing multiple arms per round was considered in the literature on multiple-play multi-armed bandits. This problem was investigated under a budget constraint $C$ by Zhou and Tomlin [2018] and Xia et al. [2016]. In each round, the player picks $m$ out of $K$ arms and incurs the sum of their losses. In addition to observing the losses of the played arms, the learner learns a vector of costs which has to be covered by a pre-defined budget $C$. Once the budget is consumed, the game finishes. An extension of the EXP3 algorithm allows deriving a strategy in the adversarial setting with regret of order $O\bigl(\sqrt{KC \log(K/m)}\bigr)$. The cost of each arm is supposed to lie in an interval $[c_{\min}, 1]$, for a positive constant $c_{\min}$; hence the total number of rounds $T$ in this game satisfies $T = \Theta(C/m)$. Another online problem aims at minimizing the cumulative regret in an adversarial setting with a small effective range of losses. Gerchinovitz and Lattimore [2016] have shown the impossibility of regret scaling with the effective range of losses in the bandit setting, while Thune and Seldin [2018] showed that it is possible to circumvent this impossibility result if the player is allowed one additional observation per round. However, it is impossible to achieve a regret dependence on $T$ better than the rate of order $O(\sqrt{T})$ in this setting.
Decoupling exploration and exploitation was considered by Avner et al. [2012]. In each round, the player plays one arm, then chooses one arm out of $K$ whose feedback to observe (not necessarily the played arm, as in the canonical multi-armed bandit problem). They devised algorithms for this setting and showed that the dependence on the number of arms $K$ can be improved. However, it is impossible to achieve a regret dependence on $T$ better than $O(\sqrt{T})$.
Prediction with limited expert advice was also investigated by Helmbold and Panizza [1997], Cesa-Bianchi and Lugosi [2006, Chap. 6], and Cesa-Bianchi et al. [2005]. However, in these problems, known as label efficient prediction, the forecaster has full access to the experts' advice but limited information about the past outcomes of the sequence to be predicted. More precisely, the outcome $y_t$ is not necessarily revealed to the learner. In such a framework, the optimal regret is of order $O(\sqrt{T})$.