
et al. (2015) leverages both MDPs and auctions to better analyze resource allocation in IaaS cloud computing,
and Zhao et al. (2018) uses deep Reinforcement Learning (RL) to study sponsored search auctions. We refer
interested readers to Athey and Segal (2013) for more motivating examples. A question naturally arises: is
it possible to optimize reserve prices when bidders’ preferences evolve according to MDPs?
In this article, we provide an affirmative answer. Our work assumes that the state of the auction is
affected by the state and the seller’s action in the preceding step. To facilitate interpretation, we refer to the
seller’s action in this context as “item choice”: bidders’ later preferences could be affected by the types of
items sold in previous rounds, a phenomenon well-documented by empirical works in auctions (Lusht, 1994;
Jones et al., 2004; Lange et al., 2010; Ginsburgh and Van Ours, 2007). As is the case in many real-world
problems, we assume that the underlying transition dynamics and the bidders’ valuations are both unknown.
We further emphasize that we do not make any truthfulness assumption on the bidders, allowing them to
be strategic with their reporting. Under such a challenging setting, our goal is to learn the optimal policy
of the seller in the unknown environment, in the presence of nontruthful bidders.
Our Contributions. We begin by summarizing the three key challenges we face. First, bidders have the
incentive to report their valuation untruthfully, in hopes of manipulating the seller’s learned policy, through
either overbidding or underbidding, making it difficult to estimate their true preferences and the underlying
MDP dynamics. Existing works such as Amin et al. (2014); Golrezaei et al. (2019); Deng et al. (2020)
do not apply due to technical challenges unique to MDPs. Second, when the market noise distribution is
unknown, even in the bandit setting the existing literature often obtains only an $\widetilde{O}(K^{2/3})$ guarantee (Amin
et al., 2014; Golrezaei et al., 2019), and an $\Omega(K^{2/3})$ revenue regret lower bound exists (Kleinberg and Leighton, 2003).
Third, the seller’s reward function, namely revenue, is unknown, nonlinear, and cannot be directly observed
from the bidders’ submitted bids, so LSVI-UCB cannot be directly applied.
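To illustrate the third challenge, consider as a simplified example (not necessarily the exact mechanism analyzed in this paper) a single round of a second-price auction with reserve price $r$. Writing $b_{(1)} \ge b_{(2)}$ for the two highest submitted bids, the realized revenue is
\[
\mathrm{rev}(r) \;=\; \max\{b_{(2)},\, r\}\cdot \mathbf{1}\{b_{(1)} \ge r\},
\]
which is piecewise linear, discontinuous at $b_{(1)}$, and in general non-concave in $r$. Moreover, strategic bidders may shade their bids away from their true valuations, so the expected revenue as a function of the reserve price cannot simply be read off from the submitted bids.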
We are able to address all three challenges with the CLUB algorithm. Motivated by the ever-increasing
learning periods in existing works (Amin et al., 2014; Golrezaei et al., 2019; Deng et al., 2020), our work
further draws inspiration from RL with low switching cost (Wang et al., 2021) and proposes a novel concept
dubbed “buffer periods” to ensure that the bidders are sufficiently truthful. Additionally, we feature an
algorithm we dub “simulation” which, combined with a novel proof technique leveraging the
Dvoretzky–Kiefer–Wolfowitz inequality (Dvoretzky et al., 1956), yields $\widetilde{O}(\sqrt{K})$ revenue regret under only
mild additional assumptions. Finally, by exploiting the mathematical properties of the revenue function, our
work provides a provably efficient RL algorithm for the case where the reward function is nonlinear.
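As a brief reminder of the standard tool mentioned above (stated here in its textbook form, independent of our specific construction), the Dvoretzky–Kiefer–Wolfowitz inequality controls the empirical CDF $\widehat{F}_n$ of $n$ i.i.d. samples from an unknown distribution $F$ uniformly over the real line:
\[
\mathbb{P}\Big(\sup_{x \in \mathbb{R}} \big|\widehat{F}_n(x) - F(x)\big| > \varepsilon\Big) \;\le\; 2e^{-2n\varepsilon^2},
\qquad
\widehat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}.
\]
In particular, with probability at least $1-\delta$, $\widehat{F}_n$ lies within $\sqrt{\ln(2/\delta)/(2n)}$ of $F$ simultaneously at every point, the kind of distribution-free, uniform control that is useful when the market noise distribution must be estimated nonparametrically.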
1.1 Related Works
We summarize below two lines of existing literature pertinent to our work.
Reserve Price Optimization. There is a vast amount of literature on price estimation (Cesa-Bianchi
et al., 2014; Qiang and Bayati, 2016; Shah et al., 2019; Drutsa, 2020; Kanoria and Nazerzadeh, 2020; Keskin
et al., 2021; Guo et al., 2022). Deng et al. (2020) considers a model where buyers and sellers are equipped
with different discount rates, proposing a robust mechanism for revenue maximization in contextual auctions.
Javanmard et al. (2020) proposes an algorithm with $\widetilde{O}(\sqrt{T})$ regret, while Fan et al. (2021) achieves sublinear
regret in a more complex setting. Cesa-Bianchi et al. (2014) studies reserve price optimization in non-
contextual second price auctions, obtaining an $\widetilde{O}(\sqrt{T})$ revenue regret bound. Drutsa (2017, 2020) studies
revenue maximization in repeated second-price auctions with one or multiple bidders, proposing an algorithm
with an $O(\log\log T)$ worst-case regret bound. However, their setting is non-contextual and their methods
cannot be applied to our setting.
Among this line of research, Golrezaei et al. (2019) is possibly the closest to our work. The work assumes
a linear stochastic contextual bandit setting, where the contexts are independent and identically distributed,
achieving $\widetilde{O}(1)$ regret when the market noise distribution is known and $\widetilde{O}(K^{2/3})$ regret when it is unknown and
nonparametric. While the $\widetilde{O}(1)$ regret under a known market noise distribution seems better than our
bound, we emphasize that their stochastic bandit setting does not require the exploration over the action space
that our work requires and, even in generic linear MDPs, an $\Omega(\sqrt{K})$ regret lower bound exists (Jin et al., 2020).
Moreover, our algorithm achieves $\widetilde{O}(\sqrt{K})$ regret when the market noise distribution is not known, only with