Adaptive Oracle-Efficient Online Learning
Guanghui Wang, Zihao Hu, Vidya Muthukumar, Jacob Abernethy
College of Computing
School of Electrical and Computer Engineering
School of Industrial and Systems Engineering
Georgia Institute of Technology
Atlanta, GA 30339
{gwang369,zihaohu,vmuthukumar8,prof}@gatech.edu
Abstract
The classical algorithms for online learning and decision-making have the benefit of achieving optimal
performance guarantees, but suffer from computational complexity limitations when implemented at
scale. More recent sophisticated techniques, which we refer to as oracle-efficient methods, address this
problem by dispatching to an offline optimization oracle that can search through an exponentially-large
(or even infinite) space of decisions and select the one that performed best on any dataset. But despite
the benefits of computational feasibility, oracle-efficient algorithms exhibit one major limitation: while
performing well in worst-case settings, they do not adapt well to friendly environments. In this paper
we consider two such friendly scenarios, (a) "small-loss" problems and (b) IID data. We provide a new
framework for designing follow-the-perturbed-leader algorithms that are oracle-efficient and adapt well to
the small-loss environment, under a particular condition which we call approximability (which is spiritually
related to sufficient conditions provided in Dudík et al. (2020)). We identify a series of real-world settings,
including online auctions and transductive online classification, for which approximability holds. We
also extend the algorithm to an IID data setting and establish a "best-of-both-worlds" bound in the
oracle-efficient setting.
1 Introduction
Online learning is a fundamental paradigm for modeling sequential decision making problems (Cesa-Bianchi
& Lugosi, 2006; Shalev-Shwartz, 2011; Hazan, 2016). Online learning is usually formulated as a zero-sum
game between a learner and an adversary. In each round $t = 1, \ldots, T$, the learner first picks an action $x_t$ from
a (finite) set $\mathcal{X} = \{x^{(1)}, \ldots, x^{(K)}\}$ with cardinality $K$. In the meantime, an adversary reveals its
action $y_t \in \mathcal{Y}$. As a consequence, the learner observes $y_t$, and suffers a loss $f(x_t, y_t)$, where $f: \mathcal{X} \times \mathcal{Y} \mapsto [0, 1]$.
The goal is to minimize the regret, which is defined as the difference between the cumulative loss of the
learner, $\sum_{t=1}^T f(x_t, y_t)$, and the cumulative loss of the best action in hindsight, $L_T^* = \min_{x \in \mathcal{X}} \sum_{t=1}^T f(x, y_t)$.
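To make the protocol concrete, the following is a minimal sketch of the interaction and the regret computation (our own illustration, not from the paper; the action set, loss function, and adversary below are hypothetical placeholders):

```python
import numpy as np

# Hypothetical instantiation: K actions, integer adversary moves,
# and a generic bounded loss f: X x Y -> [0, 1].
actions = list(range(5))                 # X = {x^(1), ..., x^(K)}
f = lambda x, y: float(x != y)           # any f with values in [0, 1] works

def play(learner, ys):
    """Run the online protocol; return (learner's cumulative loss, regret)."""
    total = 0.0
    for t, y_t in enumerate(ys):         # rounds t = 1, ..., T
        x_t = learner(t)                 # learner commits to x_t first
        total += f(x_t, y_t)             # then observes y_t and pays f(x_t, y_t)
    # L*_T: cumulative loss of the best fixed action in hindsight
    best = min(sum(f(x, y) for y in ys) for x in actions)
    return total, total - best           # regret = learner's loss - L*_T

# Example: a learner that always plays action 0 against random adversary moves
rng = np.random.default_rng(0)
loss, regret = play(lambda t: 0, rng.integers(0, 5, size=100))
```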
A wide variety of algorithms have been proposed for the goal of minimizing worst-case regret (without any
consideration of computational complexity per iteration); see (Cesa-Bianchi & Lugosi, 2006; Shalev-Shwartz,
2011; Hazan, 2016) for representative surveys of this literature. These algorithms all obtain a worst-case
regret bound of the order $O(\sqrt{T \log K})$, which is known to be minimax-optimal (Cesa-Bianchi & Lugosi,
2006). Over the last two decades, sophisticated adaptive algorithms have been designed that additionally
enjoy problem-dependent performance guarantees, which can automatically lead to better results in friendly
environments. One of the most important examples of this kind of guarantee is the so-called "small-loss"
bound (Hutter & Poland, 2005; Cesa-Bianchi & Lugosi, 2006; Van Erven et al., 2014). Such a bound depends
on the best cumulative loss in hindsight (i.e., $L_T^*$) instead of the total number of rounds ($T$). Thus, this
bound is much tighter than the worst-case bound, especially when the best decision performs well in the sense
of incurring a very small loss. Another example is the "best-of-both-worlds" bound (Van Erven et al., 2014),
which yields an even tighter regret bound for independent and identically distributed (IID) loss functions.
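For concreteness, small-loss bounds in this literature (e.g., for Hedge-style algorithms with adaptive learning rates) typically take the form
$$R_T = O\Big(\sqrt{L_T^* \log K} + \log K\Big),$$
which recovers the worst-case $O(\sqrt{T \log K})$ rate since $L_T^* \leq T$, and collapses to $O(\log K)$ when the best action suffers nearly zero loss.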
However, all of these algorithms applied out-of-the-box suffer a linear dependence on the number of
decisions $K$. This is prohibitively expensive, especially in problems such as network routing (Awerbuch &
Kleinberg, 2008) and combinatorial market design (Cesa-Bianchi et al., 2014), where the cardinality of the
decision set grows exponentially with the natural expression of the problem. Several efficient algorithms do
exist, even for uncountably infinite decision sets, when the loss functions have certain special structure (such
as linearity (Kalai & Vempala, 2005) or convexity (Zinkevich, 2003)). However, such structure is often absent
in the above applications of interest.
Notice that the efficiency of the above specialized methods is usually made possible by assuming that the
corresponding offline optimization problem (i.e., minimizing the (averaged) loss) can be solved efficiently.
This observation motivates the oracle-efficient online learning problem (Hazan & Koren, 2016). In this setting,
the learner has access to a black-box offline oracle, which, given a real-weighted dataset $S = \{(w^{(j)}, y^{(j)})\}_{j=1}^{n}$,
can efficiently return the solution to the following problem:
$$\operatorname*{argmin}_{x \in \mathcal{X}} \sum_{j=1}^{n} w^{(j)} f(x, y^{(j)}). \qquad (1)$$
The goal is to design oracle-efficient algorithms, which query the offline oracle $O(1)$ times each round.
Concrete examples of such an oracle include algorithms for empirical risk minimization (Bishop, 2007),
data-driven market design (Nisan & Ronen, 2007), and dynamic programming (Bertsekas, 2019).
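As a point of reference, the oracle's input-output behavior can be captured by the following brute-force sketch (our own illustration: a genuine oracle would exploit problem structure, e.g., dynamic programming, rather than enumerate $\mathcal{X}$):

```python
def offline_oracle(S, actions, f):
    """Weighted ERM oracle of Eq. (1).

    S       -- real-weighted dataset [(w_1, y_1), ..., (w_n, y_n)]
    actions -- the decision set X (enumerable here only for illustration)
    f       -- loss function with values in [0, 1]
    Returns argmin_{x in X} sum_j w_j * f(x, y_j).
    """
    return min(actions, key=lambda x: sum(w * f(x, y) for (w, y) in S))
```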
As pointed out by Hazan & Koren (2016), the design of oracle-efficient algorithms is extremely challenging,
and such an algorithm does not exist in the worst case. Nevertheless, recent work (Daskalakis & Syrgkanis,
2016; Syrgkanis et al., 2016; Dudík et al., 2020) has introduced a series of algorithms which are oracle-efficient
when certain sufficient conditions are met. Among them, the state-of-the-art method is the generalized
follow-the-perturbed-leader algorithm (GFTPL, Dudík et al., 2020), which is a variant of the classical
follow-the-perturbed-leader (FTPL) algorithm (Kalai & Vempala, 2005). Similar to FTPL, GFTPL perturbs
the cumulative loss of each decision by adding a random variable, and chooses the decision with the smallest
perturbed loss as $x_t$. However, vanilla FTPL perturbs each decision independently, which requires
generating $K$ independent random variables in total. Moreover, the oracle in (1) cannot be applied here,
as it cannot handle the perturbation term. To address these limitations, GFTPL generates only a noise
vector of low dimension (in particular, of dimension much smaller than the size of the decision set) at the
beginning, and constructs $K$ dependent perturbations by multiplying the noise vector
with a perturbation translation matrix (PTM). The PTM therefore critically ensures that the computational
complexity of the noise generation itself is largely reduced. Furthermore, oracle-efficiency can be achieved
by setting the elements of the PTM to carefully designed synthetic losses. Dudík et al. (2020) show that
a worst-case optimal regret bound can be obtained when the PTM is admissible, i.e., every two rows are
substantially distinct. This serves as a sufficient condition for achieving oracle-efficiency.
While these results form a solid foundation for general worst-case oracle-efficient online learning, it
remains unclear whether problem-dependent, or data-adaptive bounds are achievable in conjunction with
oracle-efficiency. In other words, the design of a generally applicable oracle-efficient and adaptive online
learning algorithm has remained open. In this paper, we provide an affirmative answer to this problem, and
make the following contributions.
• We propose a variant of the GFTPL algorithm (Dudík et al., 2020), and derive a new sufficient condition
for ensuring oracle-efficiency while achieving the small-loss bound. Our key observation is that while
the admissibility condition on the PTM in GFTPL successfully stabilizes the algorithm (by ensuring
that $\mathbb{P}[x_t \neq x_{t+1}]$ is small), it does not always enable adaptation. We address this challenge via a new
condition on the PTM, called approximability. This condition ensures a stronger stability measure, i.e.,
the ratio of $\mathbb{P}[x_t = x^{(i)}]$ and $\mathbb{P}[x_{t+1} = x^{(i)}]$ is upper-bounded by a universal constant for any $i \in [K]$,
which is critical for proving the small-loss bound. In summary, we obtain the small-loss bound by
equipping GFTPL with an approximable PTM, a data-dependent step-size, and a Laplace distribution for
the perturbation noise. As a result of these changes, our analysis path differs significantly from that
of Dudík et al. (2020). Our new condition of approximability is simple and interpretable, and can be
easily verified for an arbitrary PTM. It shares both similarities and differences with the admissibility
condition proposed in Dudík et al. (2020). We demonstrate this through several examples where one of
the sufficient conditions holds, but not the other.
• We identify a series of real-world applications for which we can construct approximable PTMs: (a) a
series of online auction problems (Dudík et al., 2020); (b) problems with a small adversary action
space $|\mathcal{Y}|$ (Daskalakis & Syrgkanis, 2016); and (c) transductive online classification (Syrgkanis et al.,
2016; Dudík et al., 2020). This is the first time that the small-loss bound has been obtained in all of these
applications. To achieve this, we introduce novel PTMs and new analyses establishing the approximability
condition for these PTMs.
• We achieve the "best-of-both-worlds" bound, which enjoys even tighter results when the data is IID or
the number of leader changes is small. The main idea is to combine our proposed algorithm with vanilla
follow-the-leader (FTL), leveraging ideas from a meta-algorithm called FlipFlop introduced by Van Erven et al. (2014).
2 Related Work
Our work contributes to two bodies of work: oracle-efficient online learning and adaptive online learning. In
this section, we briefly review the related work in these areas.
2.1 Oracle-efficient online learning
For oracle-efficient online learning, the pioneering work of Hazan & Koren (2016) points out that oracle-efficient
methods do not exist when dealing with general hostile adversaries, which implies that additional assumptions
on the problem structure have to be made. Daskalakis & Syrgkanis (2016) consider the setting in which the
cardinality of the adversary's action set $\mathcal{Y}$ is finite and small, and propose to add a series of "fake" losses
to the learning history based on random samples from $\mathcal{Y}$. They prove that for this setting an $O(\sqrt{|\mathcal{Y}|T})$
regret bound can be obtained. Syrgkanis et al. (2016) study the contextual combinatorial online learning
problem, where each action is associated with a binary vector. They make the assumption that the loss
function set contains all linear functions as a sub-class. The approach in Syrgkanis et al. (2016) constructs a
set of synthetic losses for perturbation based on randomly-selected contexts, and achieves worst-case optimal
bounds when all the contextual information can be obtained beforehand, or when there exists a small set
of contexts that can tell each decision apart. Dudík et al. (2020) is the first work to focus on the general
non-contextual setting, and proposes the generalized FTPL algorithm. This algorithm generates a small
number of random variables at the beginning, and then perturbs the learning history via the inner product
between the PTM and the random variables. The algorithm can be implemented efficiently by setting
the entries of the PTM as carefully designed loss values. Niazadeh et al. (2021) consider a more complicated
combinatorial setting where the offline problem is NP-hard, but a robust approximation oracle exists. For
this case, they propose an online algorithm based on a multiplicative approximation oracle, and prove that it
has low approximate regret, a measure weaker than regret, since it only compares against a fraction of
the cumulative loss of the best decision in hindsight. Note that none of the aforementioned methods can be
easily shown to adapt to friendly structure in data. Recently, several concurrent works (Block et al., 2022;
Haghtalab et al., 2022a) investigate how to obtain tighter bounds oracle-efficiently in the smoothed-analysis
setting, where the distribution of data is close to the uniform distribution (Rakhlin et al., 2011; Haghtalab
et al., 2022b). The main focus there is to adapt to the VC dimension of the hypothesis class, rather than to
improve the dependence on the number of rounds $T$.
In this paper, we mainly focus on the so-called learning-with-expert-advice setting (Cesa-Bianchi &
Lugosi, 2006), where the action set is discrete and the loss can be highly non-convex. On the other hand,
efficient algorithms can be obtained even for continuous action sets when the loss functions have certain
properties, such as linearity (Kalai & Vempala, 2005; Hutter & Poland, 2005; Awerbuch & Kleinberg, 2008),
convexity (Zinkevich, 2003; Hazan et al., 2007), or submodularity (Hazan & Kale, 2012). Finally, we note
that in this paper we mainly focus on the full-information setting, where the learner can observe the whole
loss function after the action is submitted. Oracle-efficient online learning has also been widely studied in the
contextual bandit setting (Langford & Zhang, 2008; Dudik et al., 2011; Agarwal et al., 2014; Foster et al.,
2018; Foster & Rakhlin, 2020). The nature of the oracle-efficiency guarantees for the contextual bandit problem
is much weaker compared to full-information online learning: positive results either assume a stochastic
probability model on the responses given covariates (e.g., Foster et al. (2018); Foster & Rakhlin (2020)) or
significantly stronger oracles than Eq. (1) (e.g., Agarwal et al. (2014)).
2.2 Adaptive online learning
In this paper, we focus on designing oracle-efficient algorithms with problem-dependent regret guarantees.
Note that this kind of bound can be achieved by many inefficient algorithms in general, such as Hedge and
its variants (Cesa-Bianchi & Lugosi, 2006; De Rooij et al., 2014; Luo & Schapire, 2015),
follow-the-perturbed-leader (Kalai & Vempala, 2005; Van Erven et al., 2014), or follow-the-regularized-leader (Orabona, 2019).
Small-loss bounds can also be obtained efficiently when the loss functions are simply linear (Hutter & Poland,
2005; Syrgkanis et al., 2016). On the other hand, in online convex optimization, small-loss bounds can be
obtained when the loss functions are additionally smooth (Srebro et al., 2010; Orabona et al., 2012; Wang
et al., 2020). However, these algorithms heavily rely on the special structure of the loss functions. In this
paper, we take the first step toward extending these methods to support the more complicated (generally non-convex)
problems that appear in real-world applications.
Apart from the small-loss bound, there exist other types of problem-dependent bounds, such as second-order
bounds (Cesa-Bianchi et al., 2005; Gaillard et al., 2014), quantile bounds (Chaudhuri et al., 2009; Koolen
& Erven, 2015), and parameter-free bounds (Luo & Schapire, 2015; Cutkosky & Orabona, 2018). Moreover,
advanced adaptive results can also be obtained by minimizing performance measures other
than regret, such as adaptive regret (Hazan & Seshadhri, 2007; Zhang et al., 2019) or dynamic regret (Zhang
et al., 2018; Zhao et al., 2020). How to obtain these more refined theoretical guarantees in the oracle-efficient
setting remains an interesting open problem.
3 GFTPL with Small-Loss Bound
In this section, we ignore computational complexity for the moment and provide a new FTPL-type
algorithm that enjoys the small-loss bound. We then show, in Section 4, that the proposed algorithm can be implemented
efficiently via the offline oracle. Before diving into the details, we first briefly recall the definition
of online learning and regret.
Preliminaries. The online decision problem we consider can be described as follows. In each round $t$, a
learner picks an action $x_t \in \mathcal{X} = \{x^{(1)}, \ldots, x^{(K)}\}$. After observing the adversary's decision $y_t \in \mathcal{Y}$, the learner
suffers a loss $f(x_t, y_t)$, where the loss function $f: \mathcal{X} \times \mathcal{Y} \mapsto [0, 1]$ is known to the learner and adversary. The
regret of an online learning algorithm $\mathcal{A}$ is defined as
$$R_T^{\mathcal{A}} := \mathbb{E}\left[\sum_{t=1}^{T} f(x_t, y_t) - L_T^*\right],$$
where $L_T^* = \min_{k \in [K]} \sum_{j=1}^{T} f(x^{(k)}, y_j)$ is the cumulative loss of the best action in hindsight, and the expectation
is taken only with respect to the potentially randomized strategy of the learner.
Our proposed algorithm follows the framework of GFTPL (Dudík et al., 2020). We first briefly introduce
the intuition behind this method. Specifically, in each round $t$, GFTPL picks $x_t$ by solving the following
optimization problem:
$$x_t = \operatorname*{argmin}_{k \in [K]} \sum_{j=1}^{t-1} f(x^{(k)}, y_j) + \langle \Gamma^{(k)}, \alpha \rangle,$$
where $\alpha$ is an $N$-dimensional noise vector ($N \ll K$) generated from a uniform distribution, and $\Gamma^{(k)}$ is the $k$-th
row of a matrix $\Gamma \in [0, 1]^{K \times N}$, which is referred to as the perturbation translation matrix (PTM). Compared
to vanilla FTPL, which generates $K$ random variables (one for each expert), GFTPL only generates $N$ random
variables, where $N$ is much smaller than $K$. Each expert is perturbed by a different linear combination
of these random variables based on the PTM $\Gamma$; a minimal sketch of this selection rule is given below. The results
of Dudík et al. (2020) rely on the assumption on $\Gamma$ stated in Definition 1.
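The following sketch writes the selection rule as a direct argmin over all $K$ experts for clarity (our own illustration; the oracle-efficient implementation of Section 4 instead realizes the perturbation term $\langle \Gamma^{(k)}, \alpha \rangle$ as synthetic losses fed to the oracle of Eq. (1)):

```python
import numpy as np

def gftpl_pick(Gamma, alpha, f, actions, history):
    """One round of GFTPL: argmin_k sum_{j<t} f(x^(k), y_j) + <Gamma^(k), alpha>.

    Gamma   -- (K, N) perturbation translation matrix
    alpha   -- (N,) noise vector, drawn once before the first round
    history -- the adversary's past plays y_1, ..., y_{t-1}
    """
    cum_loss = np.array([sum(f(x, y) for y in history) for x in actions])
    perturbed = cum_loss + Gamma @ alpha   # translate N noises into K perturbations
    return actions[int(np.argmin(perturbed))]
```

Note that $\alpha$ is drawn once at the start of the game and held fixed, so the cost of noise generation is $O(N)$ rather than $O(K)$.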
Definition 1 (δ-admissibility; Dudík et al., 2020). Let $\Gamma \in [0, 1]^{K \times N}$ be a matrix, and denote by $\Gamma^{(k)}$
the $k$-th row of $\Gamma$, and by $\Gamma^{(k,i)}$ the $i$-th element of $\Gamma^{(k)}$. Then, $\Gamma$ is $\delta$-admissible if (a) $\forall k \neq k' \in [K]$, $\exists i \in [N]$
such that $\Gamma^{(k,i)} \neq \Gamma^{(k',i)}$; and (b) $\forall i \in [N]$ and $\forall k, k' \in [K]$ such that $\Gamma^{(k,i)} \neq \Gamma^{(k',i)}$, we have $|\Gamma^{(k,i)} - \Gamma^{(k',i)}| \geq \delta$.
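Since the condition is stated row-pairwise, it can be checked directly for a candidate PTM; here is a minimal sketch (our own illustration, not from the paper):

```python
import numpy as np

def is_delta_admissible(Gamma, delta):
    """Check Definition 1 for Gamma in [0,1]^{K x N}."""
    K, _ = Gamma.shape
    for k in range(K):
        for kp in range(k + 1, K):
            diff = np.abs(Gamma[k] - Gamma[kp])
            if not np.any(diff > 0):                 # (a) some column must distinguish the rows
                return False
            if np.any((diff > 0) & (diff < delta)):  # (b) every nonzero gap is at least delta
                return False
    return True
```

This brute-force check enumerates all $O(K^2)$ row pairs, so it is only practical for small $K$; the PTMs used in the applications of this paper are verified analytically.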
The $\delta$-admissibility condition guarantees that every two rows of $\Gamma$ are significantly distinct. As pointed out by
Dudík et al. (2020), this is the essential property required by GFTPL, and it is used to stabilize the algorithm
in the analysis, i.e., to ensure that $\mathbb{P}[x_t \neq x_{t+1}]$ is small. However, the adaptive analysis of inefficient FTPL
(Hutter & Poland, 2005) (i.e., using a noise vector of dimension equal to the size of the decision set) reveals
that this type of stability is insufficient. Instead, one needs to control, for all $t$ and $i \in [K]$, the quantity
$$\frac{\mathbb{P}[x_t = x^{(i)}]}{\mathbb{P}[x_{t+1} = x^{(i)}]}, \qquad (2)$$
the ratio of the probabilities of picking the $i$-th decision in two consecutive rounds. We note that $\delta$-admissibility
is not sufficient to ensure this quantity is bounded, as we establish in the following counterexample lemma
(see Appendix A.1 for the proof).
Lemma 1. There is an instance of a $\delta$-admissible $\Gamma$ and a sequence $\{y_t : t = 1, 2, \ldots\}$ such that, if we run
GFTPL, we can have $\frac{\mathbb{P}[x_t = x^{(i)}]}{\mathbb{P}[x_{t+1} = x^{(i)}]} = \infty$ for some $i \in [K]$ and some $t > 0$.
To address this problem, we propose a new property for $\Gamma$. Define $\mathcal{B}^1_\gamma := \{s \in \mathbb{R}^N : \|s\|_1 \leq \gamma\}$ as the
$\ell_1$-ball of size $\gamma$.

Definition 2 (γ-approximability). Let $\Gamma \in [0, 1]^{K \times N}$. We say that $\Gamma$ is $\gamma$-approximable if
$$\forall k \in [K],\ \forall y \in \mathcal{Y},\ \exists s \in \mathcal{B}^1_\gamma \text{ such that } \forall j \in [K]: \quad \langle \Gamma^{(k)} - \Gamma^{(j)}, s \rangle \geq f(x^{(k)}, y) - f(x^{(j)}, y).$$
It may not be immediately obvious how we arrived at this condition, so let us provide some intuition. The
goal of perturbation methods in sequential decision problems, going back to the early work of Hannan (1957),
is to ensure that the algorithm is "hedging" across all available alternative decisions. A newly observed
data point $y$ may make expert $j$ suddenly look more attractive than expert $k$, as we have now introduced a
new gap $f(x^{(k)}, y) - f(x^{(j)}, y)$ in their measured loss values. With this in mind, we say that $\Gamma$ is a "good"
(i.e., approximable) choice for the PTM if this gap can be overcome (hedged) by some small (i.e., likely)
perturbation $s$, so that $\langle \Gamma^{(k)} - \Gamma^{(j)}, s \rangle$ makes up the difference. The inequality makes this property flexible
and much easier to satisfy in real-world applications: we only need to approximate the gap from above. Later,
we will show that $\gamma$-approximability guarantees the required stability measure in (2), and is thus critical for
the small-loss bound.
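Although the condition only needs to hold existentially, for a small, explicitly enumerable instance it can be verified numerically: for each pair $(k, y)$, finding $s \in \mathcal{B}^1_\gamma$ satisfying Definition 2 is a linear feasibility problem (split $s = u - v$ with $u, v \geq 0$ to linearize the $\ell_1$ constraint). The sketch below is our own illustration under these assumptions and serves verification only; as emphasized next, the algorithm itself never computes $s$.

```python
import numpy as np
from scipy.optimize import linprog

def is_gamma_approximable(Gamma, gamma, losses):
    """Check Definition 2 numerically for a finite instance.

    Gamma  -- (K, N) candidate PTM
    losses -- (K, M) matrix with losses[k, m] = f(x^(k), y^(m)) for an
              enumeration y^(1), ..., y^(M) of the adversary's actions
    """
    K, N = Gamma.shape
    for k in range(K):
        for m in range(losses.shape[1]):
            # Need s = u - v with ||s||_1 <= gamma and, for all j,
            # <Gamma[k] - Gamma[j], s> >= f(x^(k), y) - f(x^(j), y).
            D = Gamma - Gamma[k]                 # row j: Gamma[j] - Gamma[k]
            gaps = losses[k, m] - losses[:, m]   # entry j: f(x^(k),y) - f(x^(j),y)
            A_ub = np.hstack([D, -D])            # <Gamma[j]-Gamma[k], u-v> <= -gap_j
            A_l1 = np.ones((1, 2 * N))           # sum(u) + sum(v) <= gamma
            res = linprog(c=np.zeros(2 * N),
                          A_ub=np.vstack([A_ub, A_l1]),
                          b_ub=np.append(-gaps, gamma),
                          bounds=(0, None), method="highs")
            if res.status != 0:                  # infeasible for this (k, y)
                return False
    return True
```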
We want to emphasize two final points. First, the $\gamma$-approximability condition is purely for analysis
purposes: we never need to compute the quantity $s$ in response to $y$ and $k$. Second, much of the computational
and decision-theoretic challenge rests heavily on the careful design of $\Gamma$. The PTM allows the algorithm to
perform the appropriate hedging across an exponentially-sized set of $K$ experts with only $N \ll K$ dimensions
of perturbation. As we demonstrate in the following example, we can always construct a $\gamma$-approximable $\Gamma$
with $N = O(\log K)$, but at the expense of computational efficiency. The proposed $\Gamma$ will not generally be
compatible with the given oracle, in the sense that the optimization problem underlying GFTPL cannot be
written in the form of Eq. (1). In the next section, we will show how to address this problem via another
condition on $\Gamma$ called implementability.
Simple Example. For any online learning problem we may construct $\Gamma$ as follows. Let $N := \lceil \log_2 K \rceil$, and
define the $k$-th row $\Gamma^{(k)}$ to be the binary representation of the index $k$, with $+1/-1$ values instead of $0/1$.
We claim that this $\Gamma$ is $\gamma$-approximable for $\gamma = \lceil \log_2 K \rceil$. We can satisfy the condition of Definition 2 by
setting $s = \Gamma^{(k)}$. It is easy to see that for any $j \neq k$ we have
$$\langle \Gamma^{(k)} - \Gamma^{(j)}, s \rangle = \langle \Gamma^{(k)} - \Gamma^{(j)}, \Gamma^{(k)} \rangle \geq 2 \geq f(x^{(k)}, y) - f(x^{(j)}, y),$$
where the first inequality holds because the distinct $\pm 1$ rows $\Gamma^{(k)}$ and $\Gamma^{(j)}$ differ in at least one coordinate, which
contributes $2$ to the inner product, and the last inequality holds because $|f(x^{(i)}, y)| \leq 1$ for any $i \in [K]$.
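This construction is easy to check numerically; below is a minimal sketch (our own illustration) that builds the $\pm 1$ binary-representation PTM and verifies $\langle \Gamma^{(k)} - \Gamma^{(j)}, \Gamma^{(k)} \rangle \geq 2$ for all pairs:

```python
import numpy as np

def binary_ptm(K):
    """Rows are +1/-1 binary representations of the indices 0..K-1."""
    N = max(1, int(np.ceil(np.log2(K))))
    bits = (np.arange(K)[:, None] >> np.arange(N)[None, :]) & 1
    return 2.0 * bits - 1.0                 # map {0, 1} -> {-1, +1}

K = 10
Gamma = binary_ptm(K)
for k in range(K):
    s = Gamma[k]                            # the witness from the text
    inner = (Gamma[k] - Gamma) @ s          # entry j: <Gamma[k] - Gamma[j], s>
    assert np.all(np.delete(inner, k) >= 2) # distinct rows differ in some coordinate,
                                            # contributing at least 2 to the inner product
```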