A Reinforcement Learning Approach in Multi-Phase Second-Price
Auction Design
Rui Ai*, Boxiang Lyu†, Zhaoran Wang‡, Zhuoran Yang§, Michael I. Jordan¶
*Peking University; ruiai@pku.edu.cn
†The University of Chicago; blyu@chicagobooth.edu
‡Northwestern University; zhaoranwang@gmail.com
§Yale University; zhuoran.yang@yale.edu
¶University of California, Berkeley; jordan@cs.berkeley.edu
October 20, 2022
Abstract
We study reserve price optimization in multi-phase second price auctions, where the seller's prior
actions affect the bidders' later valuations through a Markov Decision Process (MDP). Compared
to the bandit setting in existing works, our setting involves three challenges. First,
from the seller's perspective, we need to efficiently explore the environment in the presence of
potentially nontruthful bidders who aim to manipulate the seller's policy. Second, we want to
minimize the seller’s revenue regret when the market noise distribution is unknown. Third, the
seller’s per-step revenue is unknown, nonlinear, and cannot even be directly observed from the
environment.
We propose a mechanism addressing all three challenges. To address the first challenge, we use
a combination of a new technique named “buffer periods” and inspirations from Reinforcement
Learning (RL) with low switching cost to limit bidders’ surplus from untruthful bidding, thereby
incentivizing approximately truthful bidding. The second one is tackled by a novel algorithm
that removes the need for pure exploration when the market noise distribution is unknown. The
third challenge is resolved by an extension of LSVI-UCB, where we use the auction’s underlying
structure to control the uncertainty of the revenue function. The three techniques culminate in
the Contextual-LSVI-UCB-Buffer (CLUB) algorithm, which achieves $\widetilde{O}(H^{5/2}\sqrt{K})$ revenue regret
when the market noise is known and $\widetilde{O}(H^3\sqrt{K})$ revenue regret when the noise is unknown, with
no assumptions on bidders' truthfulness.
1 Introduction
The second price auction with reserve prices is one of the most popular auction formats both in theory (Nisan et al., 2007)
and in practice (Roth and Ockenfels, 2002). While a closed-form expression for the optimal reserve price
has been known ever since the seminal work of Myerson (1981), directly applying the result requires that population
information, such as the distribution of bidders' valuations, be known a priori. Various attempts have been
made to weaken the assumption, with one of the most prominent lines of literature being reserve price
optimization for repeated auctions in the contextual bandit setting (Amin et al., 2014; Golrezaei et al., 2019;
Javanmard and Nazerzadeh, 2019; Deng et al., 2020).
A limitation of existing works lies in the bandit assumption. Indeed, while reserve price optimization is
already challenging as-is, allowing the auction to be both contextual and temporally dependent, in particular
incorporating Markov Decision Process (MDP) induced dynamics in the evolution of bidders' preferences,
opens up a wider range of problems to study. For example, Dolgov and Durfee (2006) studies optimal auctions
under this setting and develops novel resource allocation mechanisms, Jiang
et al. (2015) leverages both MDP and auctions to better analyze resource allocation in IaaS cloud computing,
and Zhao et al. (2018) uses deep Reinforcement Learning (RL) to study sponsored search auctions. We refer
interested readers to Athey and Segal (2013) for more motivating examples. A question naturally arises: is
it possible to optimize reserve prices when bidders’ preferences evolve according to MDPs?
In this article, we provide an affirmative answer. Our work assumes that the state of the auction is
affected by the state and the seller’s action in the preceding step. To facilitate interpretation, we refer to the
seller’s action in this context as “item choice”: bidders’ later preferences could be affected by the types of
items sold in previous rounds, a phenomenon well-documented by empirical works in auctions (Lusht,1994;
Jones et al.,2004;Lange et al.,2010;Ginsburgh and Van Ours,2007). As is the case in many real-world
problems, we assume that the underlying transition dynamics and the bidders' valuations are both unknown.
We further emphasize that we do not make any truthfulness assumption on the bidders, allowing them to
be strategic with their reporting. Under such a challenging setting, our goal is to learn the optimal policy
of the seller in the unknown environment, in the presence of nontruthful bidders.
Our Contributions. We begin by summarizing the three key challenges we face. First, bidders have the
incentive to report their valuation untruthfully, in hopes of manipulating the seller’s learned policy, through
either overbidding or underbidding, making it difficult to estimate their true preferences and the underlying
MDP dynamics. Existing works such as Amin et al. (2014); Golrezaei et al. (2019); Deng et al. (2020)
do not apply due to technical challenges unique to MDPs. Second, when the market noise distribution is
unknown, even in the bandit setting existing literature often obtains only an $\widetilde{O}(K^{2/3})$ guarantee (Amin et al.,
2014; Golrezaei et al., 2019), and an $\Omega(K^{2/3})$ revenue regret lower bound exists (Kleinberg and Leighton, 2003).
Third, the seller's reward function, namely revenue, is unknown, nonlinear, and cannot be directly observed
from the bidders' submitted bids, so LSVI-UCB cannot be directly applied.
We are able to address all three challenges with the CLUB algorithm. Motivated by the ever-increasing
learning periods in existing works (Amin et al., 2014; Golrezaei et al., 2019; Deng et al., 2020), our
work further draws inspiration from RL with low switching cost (Wang et al., 2021) and proposes a novel
concept dubbed "buffer periods" to ensure that the bidders are sufficiently truthful. Additionally, we feature
a novel algorithm we dub "simulation" which, combined with a novel proof technique leveraging the
Dvoretzky–Kiefer–Wolfowitz inequality (Dvoretzky et al., 1956), yields $\widetilde{O}(\sqrt{K})$ revenue regret under only
mild additional assumptions. Finally, by exploiting the mathematical properties of the revenue function, our
work provides a provably efficient RL algorithm for settings where the reward function is nonlinear.
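To make the role of the Dvoretzky–Kiefer–Wolfowitz inequality concrete, the Python sketch below (an illustration only, not a subroutine of CLUB; the function name and the uniform toy distribution are ours) estimates an unknown noise CDF from i.i.d. samples and forms the uniform confidence band the inequality guarantees.

```python
import numpy as np

def empirical_cdf_with_dkw_band(noise_samples, delta=0.05):
    """Empirical CDF of i.i.d. noise samples plus a uniform confidence band:
    by the Dvoretzky-Kiefer-Wolfowitz inequality, with probability >= 1 - delta,
    sup_z |F_hat(z) - F(z)| <= sqrt(log(2/delta) / (2n))."""
    z = np.sort(np.asarray(noise_samples, dtype=float))
    n = len(z)
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))

    def F_hat(t):
        # Right-continuous empirical CDF: fraction of samples <= t.
        return np.searchsorted(z, t, side="right") / n

    return F_hat, eps

# The band width shrinks at rate O(1/sqrt(n)); this is the kind of uniform
# control over the unknown noise CDF that such an analysis relies on.
F_hat, eps = empirical_cdf_with_dkw_band(np.random.uniform(-1.0, 1.0, size=1000))
print(F_hat(0.0), eps)
```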
1.1 Related Works
We summarize below two lines of existing literature pertinent to our work.
Reserve Price Optimization. There is a vast amount of literature on price estimation (Cesa-Bianchi
et al.,2014;Qiang and Bayati,2016;Shah et al.,2019;Drutsa,2020;Kanoria and Nazerzadeh,2020;Keskin
et al.,2021;Guo et al.,2022). Deng et al. (2020) considers a model where buyers and sellers are equipped
with different discount rates, proposing a robust mechanism for revenue maximization in contextual auctions.
Javanmard et al. (2020) proposes an algorithm with $\widetilde{O}(\sqrt{T})$ regret while Fan et al. (2021) achieves sublinear
regret in a more complex setting. Cesa-Bianchi et al. (2014) studies reserve price optimization in non-contextual
second price auctions, obtaining an $\widetilde{O}(\sqrt{T})$ revenue regret bound. Drutsa (2017, 2020) studies revenue
maximization in repeated second-price auctions with one or multiple bidders, proposing an algorithm with an
$O(\log\log T)$ worst-case regret bound. However, their setting is non-contextual and their methods cannot be applied to
our setting.
Among this line of research, Golrezaei et al. (2019) is possibly the closest to our work. The work assumes
a linear stochastic contextual bandit setting, where the contexts are independent and identically distributed,
achieving $\widetilde{O}(1)$ regret when the market noise distribution is known and $\widetilde{O}(K^{2/3})$ regret when it is unknown and
nonparametric. While the $\widetilde{O}(1)$ regret under a known market noise distribution seems to be better than our
bound, we emphasize that their stochastic bandit setting does not require the exploration over the action space
required in our work and, even in generic linear MDPs, an $\Omega(\sqrt{K})$ regret lower bound exists (Jin et al., 2020).
Moreover, our algorithm achieves $\widetilde{O}(\sqrt{K})$ regret when the market noise distribution is not known, with only
mild additional assumptions. Lastly, as we discussed previously, the approaches in Golrezaei et al. (2019)
cannot be directly applied in the MDP setting, necessitating our novel algorithmic structure.
RL with Linear Function Approximation. Linear contextual bandit is a popular model for online
decision making (Rusmevichientong and Tsitsiklis,2010;Abbasi-Yadkori et al.,2011;Chu et al.,2011;Li
et al., 2019; Lattimore and Szepesvári, 2020) that has also been extensively studied from the auction design
perspective (Amin et al.,2014;Golrezaei et al.,2019). Its dynamic counterpart, Linear MDP, remains
popular in the analysis of provably efficient RL (Yang and Wang,2019;Jin et al.,2020,2021b;Yang et al.,
2020;Zanette et al.,2020;Jin et al.,2021a;Uehara et al.,2021;Yu et al.,2022;Wang et al.,2021;Gao
et al., 2021). In particular, Jin et al. (2020) is one of the first papers to introduce the concept, proposing a
provably efficient RL algorithm with $\widetilde{O}(\sqrt{K})$ regret. Jin et al. (2021b) generalizes the idea to offline RL.
While we use linear function approximation, the seller’s per-step reward function, revenue, is non-linear.
Our work also features novel per-step optimization problems to combat effects from untruthful reporting.
While our work draws inspirations from Wang et al. (2020) and Gao et al. (2021), as we discussed previously,
these inspirations are needed for obtaining high-quality estimates when the bidders are untruthful. Thus,
our work differs significantly from prior works on linear MDPs.
Notations. For any positive integer $n$ we let $[n]$ denote the set $\{1, \ldots, n\}$. For any set $A$ we let $\Delta(A)$ denote
the set of probability measures over $A$. For sets $A, B$, we let $A \times B$ be the Cartesian product of the two.
2 Preliminaries
We consider a repeated (lazy) multi-phase second-price auction with personalized reserve prices. Particularly,
we assume that there are $N$ rational bidders, indexed by $[N]$, and one seller participating in the auction.
For ease of presentation, we use "he" to refer to a specific bidder and "she" to refer to the seller.
Second Price Auction with Personalized Reserve Prices. We begin by describing a single round of
the auction. Each bidder $i \in [N]$ submits some bid $b_i \in \mathbb{R}_{\geq 0}$ and the seller determines the personalized
reserve prices for the bidders in the form of a reserve price vector $\rho \in \mathbb{R}^N_{\geq 0}$, with $\rho_i$ denoting bidder $i$'s reserve
price. The bidder with the highest bid receives the item only if he also clears his personal reserve price, i.e., $b_i \geq \rho_i$. If
bidder $i$ receives the item, he pays the seller the maximum of his personalized reserve and the second highest
bid, namely $\max\{\rho_i, \max_{j \neq i} b_j\}$, which we dub $m_i$ for simplicity. When the bidder with the highest bid fails
to clear his personalized reserve price, the auction fails, the seller gains zero, and the item remains unsold.
In summary, bidder $i$ receives the item if and only if $b_i \geq m_i$, and the price he pays is $m_i$. For any round
of the auction, we let $q_i = \mathbb{1}(\text{bidder } i \text{ receives the item})$ indicate whether bidder $i$ received the item or not. For
the sake of convenience, throughout the paper we assume that there are no ties in the submitted bids.
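As a concrete illustration of the allocation and payment rule just described, the following Python sketch simulates one round; it is a minimal sketch for intuition, not code from the paper, and the function name is hypothetical.

```python
import numpy as np

def second_price_with_reserves(bids, reserves):
    """One round of the second-price auction with personalized reserves
    described above. Returns (winner_index, payment); winner_index is None
    when the highest bidder fails to clear his reserve and the item is unsold.
    Ties are assumed away, as in the text."""
    bids = np.asarray(bids, dtype=float)
    reserves = np.asarray(reserves, dtype=float)
    i = int(np.argmax(bids))                                  # highest bidder
    second = np.max(np.delete(bids, i)) if len(bids) > 1 else 0.0
    m_i = max(reserves[i], second)                            # m_i = max{rho_i, max_{j != i} b_j}
    if bids[i] >= m_i:                                        # sells iff b_i >= m_i
        return i, float(m_i)
    return None, 0.0

# Bidder 0 bids 2.0 against reserve 1.5 and a competing bid of 1.2,
# so he wins and pays max(1.5, 1.2) = 1.5.
print(second_price_with_reserves([2.0, 1.2], [1.5, 0.5]))
```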
A Multi-Phase Second Price Auction. We now characterize the dynamics of the multi-phase auction
setting we study. Assume that the transition dynamics between rounds can be modeled as an episodic Markov
Decision Process (MDP).¹ A multi-phase second price auction with personalized reserves is parameterized as
$(\mathcal{S}, \Upsilon, H, \mathbb{P}, \{r_i\}_{i=1}^N)$, with the state space denoted by $\mathcal{S}$, the seller's item choice space $\Upsilon$,² horizon $H$, transition
kernel $\mathbb{P} = \{\mathbb{P}_h\}_{h=1}^H$ where $\mathbb{P}_h: \mathcal{S} \times \Upsilon \to \Delta(\mathcal{S})$, and the individual bidders' reward functions $r_i = \{r_{ih}\}_{h=1}^H$
for all $i \in [N]$. The choice of item $\upsilon \in \Upsilon$ affects the bidders' rewards as well as the transition.
The interaction between the bidders and the seller is then defined as follows. We assume without loss
of generality that the state at the initial step is fixed at some $x_1 \in \mathcal{S}$. For each $h \in [H]$, the seller and the
bidders engage in a single round of the second price auction. Given the seller's item choice at step $h$, $\upsilon_h$, nature
transitions to the next state according to the transition kernel $\mathbb{P}_h$.
Bidder Rewards. We assume that for each bidder $i \in [N]$ at time $h \in [H]$, his reward³ depends on both
the state $x$ and the item being auctioned off at that round $\upsilon \in \Upsilon$, which we formalize as
$$r_{ih}(x, \upsilon) = 1 + \mu_{ih}(x, \upsilon) + z_{ih}, \quad \text{where } z_{ih} \overset{\text{i.i.d.}}{\sim} F.$$
Here, $z_{ih}$ denotes the randomness within bidders' rewards, drawn i.i.d. from the market noise
distribution $F(\cdot)$. We assume that $F(\cdot)$ is supported on $[-1, 1]$ and has mean $0$. Let $\mu_{ih}: \mathcal{S} \times \Upsilon \to [0, 1]$
denote the conditional expectation of the reward less one, where the constant is added to ensure $r_{ih}(x, \upsilon) \in [0, 3]$.
¹We can easily extend our setting to that of an infinite-horizon MDP by improperly learning the process as an episodic one.
Here we focus on the finite-horizon case purely for simplicity of presentation.
²Here we use "item choice" to better illustrate what $\Upsilon$ intuitively represents. The term can be extended to more generic
notions of the seller's action.
³We use the term "reward" to maintain consistency with existing RL literature.
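A minimal sketch of this reward model (illustrative only; Unif[−1, 1] stands in for the unknown noise distribution F, and the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_reward(mu_ih, draw_noise=lambda: rng.uniform(-1.0, 1.0)):
    """Draw r_ih(x, upsilon) = 1 + mu_ih(x, upsilon) + z_ih with z_ih ~ F.
    Unif[-1, 1] is used purely for illustration (any mean-zero distribution
    on [-1, 1] fits the model); since mu_ih lies in [0, 1], the realized
    reward lies in [0, 3]."""
    return 1.0 + mu_ih + draw_noise()

print(sample_reward(0.4))
```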
Policies and Value Functions. Before we describe the seller's policy, we first discuss the action space
$\mathcal{A} = \Upsilon \times \mathbb{R}^N_{\geq 0}$. At each $h \in [H]$, the seller chooses some action $a_h = (\upsilon_h, \rho_h)$, comprising an item choice
$\upsilon \in \Upsilon$ and a reserve price vector $\rho \in \mathbb{R}^N_{\geq 0}$. The seller's policy is then $\pi = \{\pi_h\}_{h=1}^H$, where $\pi_h: \mathcal{S} \to \Delta(\mathcal{A})$. We
let $\pi^{\upsilon}$ and $\pi^{\rho}$ denote the marginal item choice and reserve price policies, respectively. Recall that the seller
garners revenue only when the item is sold to a bidder. At each $h \in [H]$, her per-step expected revenue is
then
$$R_h = \mathbb{E}_{\{z_{ih}\}_{i=1}^N}\left[\sum_{i=1}^N m_{ih}\, \mathbb{1}(m_{ih} \leq b_{ih})\right] \tag{2.1}$$
as we recall that $m_{ih} = \max\{\rho_{ih}, \max_{j \neq i} b_{jh}\}$ and bidder $i$ pays the seller $m_{ih}$ if and only if $b_{ih} \geq m_{ih}$. The
value function (V-function) and the action-value function (Q-function) $Q^{\pi}_h: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ of the seller's revenue for any policy $\pi$ are then
$$V^{\pi}_h(x) = \mathbb{E}_{\pi}\left[\sum_{h'=h}^H R_{h'}(x_{h'}, a_{h'}) \,\Big|\, s_h = x\right]$$
and
$$Q^{\pi}_h(x, a) = \mathbb{E}_{\pi}\left[\sum_{h'=h}^H R_{h'}(x_{h'}, a_{h'}) \,\Big|\, s_h = x, a_h = a\right],$$
respectively.
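To connect the revenue formula (2.1) with these value functions, the sketch below estimates the per-step expected revenue by Monte Carlo, under the simplifying and purely illustrative assumption that bidders bid their realized rewards truthfully; all names are hypothetical.

```python
import numpy as np

def expected_revenue_mc(mu, reserves, draw_noise, n_mc=10_000, seed=0):
    """Monte Carlo estimate of the per-step expected revenue R_h in (2.1),
    assuming (for illustration only) truthful bids b_ih = 1 + mu_ih + z_ih.
    mu, reserves: length-N arrays of mu_ih and rho_ih; draw_noise(rng, n)
    returns n i.i.d. draws from the market noise distribution F."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    N = len(mu)
    total = 0.0
    for _ in range(n_mc):
        bids = 1.0 + mu + draw_noise(rng, N)
        i = int(np.argmax(bids))
        second = np.max(np.delete(bids, i)) if N > 1 else 0.0
        m_i = max(reserves[i], second)
        total += m_i if bids[i] >= m_i else 0.0   # revenue only when the item sells
    return total / n_mc

uniform_noise = lambda rng, n: rng.uniform(-1.0, 1.0, size=n)
print(expected_revenue_mc([0.6, 0.3], [1.4, 1.0], uniform_noise))
```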
Since the bidder reward depends only on the state $x$ and the choice of item $\upsilon$, not on the reserve $\rho$, we have
a family of mappings from $\mathcal{S} \times \Upsilon$ to $\mathbb{R}^N_{\geq 0}$ that determines $\rho$. Therefore, with a slight abuse of notation, we
can rewrite our Q-function as $Q(x, a) = Q(x, (\upsilon, \rho(x, \upsilon)))$, restricting the role of setting reserve prices to
such mappings without loss of generality. From now on, we use $Q(s, \upsilon)$ to denote the Q-function for simplicity. For
any function $f: \mathcal{S} \to \mathbb{R}$, we define the transition operator $\mathbb{P}$ and the Bellman operator $\mathbb{B}$ as
$$(\mathbb{P}_h f)(x, a) = \mathbb{E}[f(s_{h+1}) \,|\, s_h = x, a_h = a], \qquad (\mathbb{B}_h f)(x, a) = \mathbb{E}[R_h(s_h, a_h)] + (\mathbb{P}_h f)(x, a),$$
respectively. Finally, we let $\pi^\star$ denote the optimal policy when the bidders' reward functions, the MDP's
underlying transition, and the market noise distribution are all known to the seller. We remark that when
these parameters are known, second price auctions with personalized reserve prices are inherently incentive
compatible and rational bidders will bid truthfully.
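On a finite toy MDP, the transition and Bellman operators defined above reduce to simple array operations; the following sketch is only meant to make the notation concrete and is not part of the paper's algorithm.

```python
import numpy as np

def bellman_backup(R_h, P_h, f):
    """(B_h f)(x, a) = E[R_h(s_h, a_h)] + (P_h f)(x, a) on a finite toy MDP.
    R_h: (S, A) array of expected per-step revenue; P_h: (S, A, S) array of
    transition probabilities; f: (S,) array giving a value for each next state."""
    return R_h + P_h @ f            # broadcasting yields an (S, A) array

R = np.array([[1.0, 0.5]])          # one state, two actions
P = np.array([[[1.0], [1.0]]])      # both actions self-loop
f = np.array([2.0])
print(bellman_backup(R, P, f))      # [[3.  2.5]]
```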
Performance Metric. The revenue suboptimality for each episode $k \in [K]$ is
$$\mathrm{SubOpt}_k(\pi_k) = V^{\pi^\star}_1(x_1) - V^{\pi_k}_1(x_1),$$
with $\pi_k$ being the strategy used in episode $k$. Our evaluation metric is then the revenue regret attained over
$K$ episodes, namely
$$\mathrm{Regret}(K) = \sum_{k=1}^K \mathrm{SubOpt}_k(\pi_k). \tag{2.2}$$
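In code, this metric is a straightforward accumulation of per-episode suboptimalities (an illustrative sketch with hypothetical names):

```python
def revenue_regret(v_star, v_pi_k):
    """Regret(K) = sum_{k=1}^K [V*_1(x_1) - V^{pi_k}_1(x_1)], as in (2.2).
    v_star: the optimal policy's value at the initial state; v_pi_k: the
    per-episode values attained by the deployed policies pi_1, ..., pi_K."""
    return sum(v_star - v for v in v_pi_k)

print(revenue_regret(10.0, [8.0, 9.0, 9.5]))   # 3.5
```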
Impatient Utility-Maximizing Bidders. We assume the bidders are equipped with some discount rate
$\gamma \in (0, 1)$ while the seller's reward is not discounted. For the sake of simplicity, we assume $\gamma$ is common
knowledge. Bidder $i$'s utility at step $h$ is given by $(r_{ih}(s_h, \upsilon_h) - m_{ih})\,\mathbb{1}(b_{ih} \geq m_{ih})$, as we note that he only
receives nonzero utility upon winning the auction. His objective is to maximize his discounted cumulative
utility
$$\mathrm{Utility}_i = \sum_{k=1}^K \gamma^k\, \mathbb{E}_{\pi_k}\left[\sum_{h=1}^H \left(r_{ih}(s^k_h, \upsilon^k_h) - m^k_{ih}\right) \mathbb{1}(b^k_{ih} \geq m^k_{ih}) \,\Big|\, s^k_1 = x_1\right].$$
Note that in practical applications, sellers are usually more patient than bidders and discount their future
rewards less. Consider sponsored search auctions, where the seller usually auctions off large numbers of ad
slots every day. Bidders usually urgently need advertising and value future rewards less. On the other
hand, the seller is not especially concerned with slight decreases in immediate rewards. We refer the readers
to Drutsa (2017); Golrezaei et al. (2019) for a more detailed discussion on the economic justifications of the
assumption and emphasize that the assumption is necessary, as Amin et al. (2013) shows that when the
bidders are as patient as the seller, achieving sub-linear revenue regret is impossible.
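The asymmetry between the two sides is visible in the bidder's objective itself; a sketch of the discounted sum (illustrative only, with a hypothetical function name) is given below.

```python
def discounted_utility(per_episode_utils, gamma):
    """Bidder i's objective: sum_{k=1}^K gamma^k * (expected utility in episode k).
    per_episode_utils holds the bidder's undiscounted expected utility for
    episodes 1..K; gamma in (0, 1) makes early episodes weigh more than later ones."""
    return sum(gamma ** k * u for k, u in enumerate(per_episode_utils, start=1))

# Three episodes with unit expected utility each, discounted at gamma = 0.9:
print(discounted_utility([1.0, 1.0, 1.0], gamma=0.9))   # 0.9 + 0.81 + 0.729 = 2.439
```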
Linear Markov Decision Process. As a concrete setting, we study linear function approximation.
Assumption 2.1. Assume that there exists a known feature mapping $\phi: \mathcal{S} \times \Upsilon \to \mathbb{R}^d$ such that there exist
$d$-dimensional unknown (signed) measures $\mathcal{M}_h$ over $\mathcal{S}$ and unknown vectors $\{\theta_{ih}\}_{i=1}^N \subseteq \mathbb{R}^d$ that satisfy
$$\mathbb{P}_h(x' \,|\, x, \upsilon) = \langle \phi(x, \upsilon), \mathcal{M}_h(x') \rangle, \qquad \mu_{ih}(x, \upsilon) = \langle \phi(x, \upsilon), \theta_{ih} \rangle$$
for all $(x, \upsilon, x') \in \mathcal{S} \times \Upsilon \times \mathcal{S}$, $i \in [N]$, and $h \in [H]$. Without loss of generality, we assume that $\|\phi(x, \upsilon)\| \leq 1$
for all $(x, \upsilon) \in \mathcal{S} \times \Upsilon$, $\|\mathcal{M}_h(\mathcal{S})\| \leq \sqrt{d}$, and $\|\theta_{ih}\| \leq \sqrt{d}$ for all $h \in [H]$ and $i \in [N]$.
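Under this assumption, a bidder's expected reward is simply an inner product between a known feature and an unknown parameter; the toy sketch below (illustrative, with made-up numbers) spells this out.

```python
import numpy as np

def expected_bidder_reward(phi, theta_ih):
    """Under Assumption 2.1, mu_ih(x, upsilon) = <phi(x, upsilon), theta_ih>,
    where phi is a known d-dimensional feature of the (state, item) pair with
    ||phi|| <= 1 and theta_ih is an unknown parameter vector."""
    return float(np.dot(phi, theta_ih))

# Toy check with d = 3; all numbers are illustrative, not from the paper.
phi = np.array([0.5, 0.5, 0.0])
theta = np.array([0.2, 0.4, 0.1])
print(expected_bidder_reward(phi, theta))   # approximately 0.3
```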
We close off the section by remarking that while the transition kernel $\mathbb{P}_h$ and the bidders' individual
expected reward functions $\{\mu_i\}_{i=1}^N$ are linear, the seller's objective, revenue, is not linear, differentiating our
work from typical linear MDP literature (see Yang and Wang (2019); Jin et al. (2020) for representative
works).
3 Known Market Noise Distribution
We remind the readers of our three main challenges, with the first challenge being exploring the environment
even when the bidders submit their bids potentially untruthfully. The second challenge emerges only when
the market noise distribution is unknown and we defer its resolution to Section 4. The third challenge is
performing provably efficient RL even when the seller’s per-step revenue, detailed in (2.1), is nonlinear and
not directly observable.
In this section, we present a version of CLUB when the market noise distribution is known. We assume
for convenience that $K$ is known, as we can use the doubling trick (see Auer et al. (2002) and Besson and
Kaufmann (2018) for discussions) to achieve the same order of regret when $K$ is unknown or infinite.
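For completeness, a generic sketch of the doubling trick (not the paper's pseudocode; the names and the toy regret function are hypothetical) is given below.

```python
def run_with_doubling(run_alg_for, total_episodes):
    """Doubling trick sketch: when K is unknown, run the fixed-horizon algorithm
    on guesses of 1, 2, 4, ... episodes until total_episodes have been played.
    run_alg_for(k) should run k episodes and return the regret incurred in them."""
    regret, played, guess = 0.0, 0, 1
    while played < total_episodes:
        block = min(guess, total_episodes - played)
        regret += run_alg_for(block)
        played += block
        guess *= 2
    return regret

# Toy usage with a sqrt(k)-regret-per-block stand-in.
print(run_with_doubling(lambda k: k ** 0.5, 10))
```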
3.1 CLUB Algorithm When $F(\cdot)$ is Known
We start with the first challenge, which we address by a collection of algorithms that successfully induce
approximately truthful bids from the bidders.
Addressing Challenge 1: Untruthfulness. To curb the bidders' untruthfulness, we need to punish such
behavior, which we achieve through a random pricing policy in the form of Algorithm 1. For each $h \in [H]$, $\pi_{\mathrm{rand}}$
randomly chooses an item and a bidder, offering him the item with a reserve price drawn uniformly at
random. The bidder's utility decreases whenever he reports untruthfully, risking either not receiving the
item when he underbids, or overpaying for an item when he overbids.
Algorithm 1 Definition of $\pi_{\mathrm{rand}}$
1: for $h = 1, \ldots, H$ do
2:     Randomly choose an item $\upsilon_h \in \Upsilon$.
3:     Choose a bidder $i \in [N]$ uniformly at random and offer him the item with reserve price $\rho_{ih} \sim \mathrm{Unif}([0, 3])$. Set the other bidders' reserve prices to infinity.
4: end for
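A Python rendering of Algorithm 1 could look as follows (a sketch only; the function name and data layout are ours, not the paper's):

```python
import numpy as np

def pi_rand(items, n_bidders, H, seed=None):
    """A sketch of the random pricing policy pi_rand (Algorithm 1): at each
    step, pick an item and a single bidder uniformly at random, offer that
    bidder a reserve drawn from Unif([0, 3]), and shut out everyone else by
    setting their reserves to infinity."""
    rng = np.random.default_rng(seed)
    actions = []
    for _ in range(H):
        item = items[rng.integers(len(items))]
        reserves = np.full(n_bidders, np.inf)
        reserves[rng.integers(n_bidders)] = rng.uniform(0.0, 3.0)
        actions.append((item, reserves))
    return actions

for item, reserves in pi_rand(items=["a", "b", "c"], n_bidders=3, H=2, seed=0):
    print(item, reserves)
```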
A typical algorithm in the bandit setting features $\pi_{\mathrm{rand}}$ and a sequence of learning periods that double in
length (Amin et al., 2014; Golrezaei et al., 2019; Deng et al., 2020). Data collected in all previous periods is