
use the idea of the Upper Confidence Bound as in the unconstrained setting (Auer et al., 2002). Ferreira et al. (2018) apply the BwK framework to the network revenue management problem and develop a Thompson-sampling counterpart for BwK. Their assumption on the demand function is further generalized in recent works (Chen et al., 2022; Miao and Wang, 2021).
Online Optimization with Knapsacks: One closely related area is online optimization with knapsacks, which can be seen as a full-information variant of the BwK problem: after a decision is made, the feedback for all actions is available to the learner. Such problems often lead to online linear/convex programming, which has been studied by Balseiro et al. (2022), Jenatton et al. (2016), Agrawal and Devanur (2014b), Mahdavi et al. (2012), Liu and Grigas (2022), and Castiglioni et al. (2022). Liu and Grigas (2022) study online contextual decision-making with knapsack constraints. Specifically, they consider the continuous-action setting by developing an online primal-dual algorithm based on the “Smart Predict-then-Optimize” framework (Elmachtoub and Grigas, 2022) from the unconstrained setting, and they leave the problem with bandit feedback open. Our work provides a partial answer to this open problem in the finite-armed setting.
Concurrent works in CBwK: During the review period of our paper, two very recent works (Ai et al., 2022; Slivkins and Foster, 2022) appeared on arXiv that also consider the CBwK problem with general function classes. In independent and concurrent work, Slivkins and Foster (2022) consider the CBwK problem via regression oracles, similar to our work. They propose an algorithm based on a Lagrangian-game perspective that results in a slightly different arm-selection strategy than ours. They also attain the same regret bound, up to a constant factor, as our Theorem 3.1. Nevertheless, the focus of their work is limited to the setting described in Section 3 (i.e., the $B = \Omega(T)$ regime), where they provide only an upper bound on the regret. In contrast, our work demonstrates optimality by proving lower bound results for the $B = \Omega(T)$ regime. Moreover, we derive a two-stage algorithm and corresponding regret guarantees for the $B = o(T)$ regime that require less stringent budget assumptions.
Another recent work (Ai et al., 2022) studies the CBwK problem with general function classes by combining the re-solving heuristic with distribution estimation techniques. Their result is interesting in that it may apply to some function classes without an efficient online oracle. They also achieve logarithmic regret under suitable conditions, and their guarantees are thus more problem-dependent, in contrast to the oracle-based approach, as discussed in Section 4 of Foster and Rakhlin (2020). However, their framework involves a distribution estimation procedure that requires additional assumptions on the regularity of the underlying distribution. Consequently, their method’s regret depends on the regularity parameters, potentially leading to suboptimal results.
2 PRELIMINARIES
Notations Throughout this paper, $a \lesssim b$ and $a = O(b)$ ($a \gtrsim b$ and $a = \Omega(b)$) mean $a \leq Cb$ ($a \geq Cb$) for some absolute constant $C$; $a = \tilde{O}(b)$ ($a = \tilde{\Omega}(b)$) means $a = O(b \max\{1, \mathrm{polylog}(b)\})$ ($a = \Omega(b \max\{1, \mathrm{polylog}(b)\})$); and $a \asymp b$ means $a = O(b)$ and $b = O(a)$.
2.1 Basic setup
We consider the stochastic CBwK setting. Given the budget $B \in \mathbb{R}^d$ for $d$ different resources and the time horizon $T$, at each step the decision maker needs to select an arm $a_t \in [K]$ upon observing a context $x_t \in \mathcal{X}$ drawn i.i.d. from some unknown distribution $P_X$. Then a reward $r_{t,a_t} \in [0,1]$ and a consumption vector $c_{t,a_t} \in [0,1]^d$ are observed. We assume that $r_{t,a}$ and $c_{t,a}$ are generated i.i.d. from a fixed distribution parameterized by the given $x_t$ and $a$. The goal is to learn a policy, i.e., a mapping from contexts to actions, that maximizes the total reward while ensuring that the consumption of each resource does not exceed its budget.
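To make the interaction protocol concrete, the following is a minimal sketch of the round-by-round loop under an assumed environment/policy interface; the names `sample_context`, `sample_outcome`, `select`, and `update` are illustrative placeholders, not part of our algorithm, and the stopping rule (halt once any remaining budget drops below 1) is one common convention that guarantees the budget is never exceeded.

```python
import numpy as np

def run_cbwk(env, policy, T, B, d):
    """Sketch of the stochastic CBwK protocol (hypothetical interface, not our algorithm)."""
    remaining = np.full(d, float(B))          # equal budget B per resource (w.l.o.g.)
    total_reward = 0.0
    for t in range(T):
        x = env.sample_context()              # context x_t ~ P_X, drawn i.i.d.
        a = policy.select(x, remaining)       # arm a_t in {0, ..., K-1}
        r, c = env.sample_outcome(x, a)       # bandit feedback for the chosen arm only
        total_reward += r
        remaining -= c
        policy.update(x, a, r, c)
        if np.any(remaining < 1.0):           # per-round consumption is at most 1 per resource,
            break                             # so stopping here never exceeds the budget
    return total_reward
```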
Without loss of generality, we assume $B_1 = B_2 = \cdots = B_d = B$. We focus on the regime $B = \Omega(T)$, that is, the budget scales linearly with the time horizon. (We relax this assumption on the budget in Section 4.) Moreover, we make the standard assumption that the $K$-th arm is a null arm that generates no reward and consumes no resources when pulled.
Similar to the unconstrained setting (Foster et al., 2018; Foster and Rakhlin, 2020; Simchi-Levi and Xu, 2021), we assume access to two function classes $\mathcal{F} \subset (\mathcal{X} \times [K] \to [0,1])$ and $\mathcal{G} \subset (\mathcal{X} \times [K] \to [0,1]^d)$ that characterize the expectations of the reward and consumption distributions, respectively. Note that only one function class $\mathcal{F}$ for the reward distribution is considered in the unconstrained setting; to fit CBwK, we add another regression class $\mathcal{G}$ to model the consumption distribution. We assume the following realizability condition.
Assumption 2.1. There exist $f^* \in \mathcal{F}$ and $g^* \in \mathcal{G}$ such that $f^*(x, a) = \mathbb{E}[r_{t,a} \mid x_t = x]$ and $g^*(x, a) = \mathbb{E}[c_{t,a} \mid x_t = x]$ for all $a \in [K]$.
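As a toy illustration of Assumption 2.1, one may take $\mathcal{F}$ and $\mathcal{G}$ to be (clipped) linear classes over the context vector, so that the true conditional means lie in the classes by construction; the parameters, dimensions, and Bernoulli noise below are hypothetical choices for illustration only, not constructions used in the paper.

```python
import numpy as np

# Hypothetical linear classes: F = {(x, a) -> clip(<theta_a, x>)}, G = {(x, a) -> clip(W_a x)}.
rng = np.random.default_rng(0)
K, d, dim = 3, 2, 5                           # arms, resources, context dimension (illustrative)
theta = rng.uniform(0, 1 / dim, (K, dim))     # true reward parameters, so f* lies in F
W = rng.uniform(0, 1 / dim, (K, d, dim))      # true consumption parameters, so g* lies in G
theta[K - 1] = 0.0                            # null arm: zero expected reward ...
W[K - 1] = 0.0                                # ... and zero expected consumption

def f_star(x, a):          # E[r_{t,a} | x_t = x], valued in [0, 1]
    return float(np.clip(theta[a] @ x, 0.0, 1.0))

def g_star(x, a):          # E[c_{t,a} | x_t = x], valued in [0, 1]^d
    return np.clip(W[a] @ x, 0.0, 1.0)

def sample_outcome(x, a):  # bandit feedback: Bernoulli draws whose means satisfy Assumption 2.1
    r = float(rng.binomial(1, f_star(x, a)))
    c = rng.binomial(1, g_star(x, a)).astype(float)
    return r, c

x_example = rng.uniform(0.0, 1.0, dim)        # example context with entries in [0, 1]
r_example, c_example = sample_outcome(x_example, a=0)
```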
The benchmark for our algorithm is the best dynamic policy, which knows the contextual distribution $P_X$, $f^*$, and $g^*$ and can dynamically maximize the total reward given the historical information and the current context. We denote by $\mathrm{OPT}_{\mathrm{DP}}$ the expected total reward of the best dynamic policy. The goal of the decision-maker is to minimize the following regret: