
et al. (2015) leverages both MDPs and auctions to better analyze resource allocation in IaaS cloud computing,
and Zhao et al. (2018) uses deep Reinforcement Learning (RL) to study sponsored search auctions. We refer
interested readers to Athey and Segal (2013) for more motivating examples. A question naturally arises: is
it possible to optimize reserve prices when bidders’ preferences evolve according to MDPs?
In this article, we provide an affirmative answer. Our work assumes that the state of the auction is
affected by the state and the seller’s action in the preceding step. To facilitate interpretation, we refer to the
seller’s action in this context as “item choice”: bidders’ later preferences could be affected by the types of
items sold in previous rounds, a phenomenon well-documented by empirical works in auctions (Lusht, 1994;
Jones et al., 2004; Lange et al., 2010; Ginsburgh and Van Ours, 2007). As is the case in many real-world
problems, we assume that the underlying transition dynamics and the bidders’ valuations are both unknown.
We further emphasize that we do not make any truthfulness assumption on the bidders, allowing them to
be strategic with their reporting. Under such a challenging setting, our goal is to learn the optimal policy
of the seller in the unknown environment, in the presence of nontruthful bidders.
Our Contributions. We begin by summarizing the three key challenges we face. First, bidders have the
incentive to report their valuation untruthfully, in hopes of manipulating the seller’s learned policy, through
either overbidding or underbidding, making it difficult to estimate their true preferences and the underlying
MDP dynamics. Existing works such as Amin et al. (2014); Golrezaei et al. (2019); Deng et al. (2020)
do not apply due to technical challenges unique to MDPs. Second, when the market noise distribution is
unknown, even in the bandit setting the existing literature often obtains only an $\widetilde{O}(K^{2/3})$ guarantee (Amin
et al., 2014; Golrezaei et al., 2019), and an $\Omega(K^{2/3})$ revenue regret lower bound exists (Kleinberg and Leighton, 2003).
Third, the seller’s reward function, namely revenue, is unknown, nonlinear, and cannot be directly observed
from the bidders’ submitted bids, so LSVI-UCB cannot be directly applied.
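To illustrate the third challenge, consider as a simplified example (not necessarily the exact mechanism analyzed in this paper) a single round of a second-price auction with reserve price $r$. Writing $b_{(1)} \ge b_{(2)}$ for the two highest submitted bids, the realized revenue is
\[
\mathrm{rev}(r) \;=\; \max\{b_{(2)},\, r\}\cdot \mathbf{1}\{b_{(1)} \ge r\},
\]
which is piecewise linear, discontinuous at $b_{(1)}$, and in general non-concave in $r$. Moreover, strategic bidders may shade their bids away from their true valuations, so the expected revenue as a function of the reserve price cannot simply be read off from the submitted bids.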
We are able to address all three challenges with the CLUB algorithm. Motivated by the ever-increasing
learning periods in existing works (Amin et al., 2014; Golrezaei et al., 2019; Deng et al., 2020), our work
further draws inspiration from RL with low switching cost (Wang et al., 2021) and proposes a novel concept
dubbed “buffer periods” to ensure that the bidders are sufficiently truthful. Additionally, we feature an
algorithm we dub “simulation” which, combined with a novel proof technique leveraging the
Dvoretzky–Kiefer–Wolfowitz inequality (Dvoretzky et al., 1956), yields $\widetilde{O}(\sqrt{K})$ revenue regret under only
mild additional assumptions. Finally, by exploiting the mathematical properties of the revenue function, our
work provides a provably efficient RL algorithm for the case where the reward function is nonlinear.
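As a brief reminder of the standard tool mentioned above (stated here in its textbook form, independent of our specific construction), the Dvoretzky–Kiefer–Wolfowitz inequality controls the empirical CDF $\widehat{F}_n$ of $n$ i.i.d. samples from an unknown distribution $F$ uniformly over the real line:
\[
\mathbb{P}\Big(\sup_{x \in \mathbb{R}} \big|\widehat{F}_n(x) - F(x)\big| > \varepsilon\Big) \;\le\; 2e^{-2n\varepsilon^2},
\qquad
\widehat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}.
\]
In particular, with probability at least $1-\delta$, $\widehat{F}_n$ lies within $\sqrt{\ln(2/\delta)/(2n)}$ of $F$ simultaneously at every point, the kind of distribution-free, uniform control that is useful when the market noise distribution must be estimated nonparametrically.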
1.1 Related Works
We summarize below two lines of existing literature pertinent to our work.
Reserve Price Optimization. There is a vast amount of literature on price estimation (Cesa-Bianchi
et al., 2014; Qiang and Bayati, 2016; Shah et al., 2019; Drutsa, 2020; Kanoria and Nazerzadeh, 2020; Keskin
et al., 2021; Guo et al., 2022). Deng et al. (2020) considers a model where buyers and sellers are equipped
with different discount rates, proposing a robust mechanism for revenue maximization in contextual auctions.
Javanmard et al. (2020) proposes an algorithm with $\widetilde{O}(\sqrt{T})$ regret, while Fan et al. (2021) achieves sublinear
regret in a more complex setting. Cesa-Bianchi et al. (2014) studies reserve price optimization in non-
contextual second price auctions, obtaining an $\widetilde{O}(\sqrt{T})$ revenue regret bound. Drutsa (2017, 2020) studies
revenue maximization in repeated second-price auctions with one or multiple bidders, proposing an algorithm
with an $O(\log\log T)$ worst-case regret bound. However, their setting is non-contextual and their methods
cannot be applied to our setting.
Among this line of research, Golrezaei et al. (2019) is possibly the closest to our work. The work assumes
a linear stochastic contextual bandit setting, where the contexts are independent and identically distributed,
achieving $\widetilde{O}(1)$ regret when the market noise distribution is known and $\widetilde{O}(K^{2/3})$ regret when it is unknown and
nonparametric. While the $\widetilde{O}(1)$ regret under a known market noise distribution seems better than our
bound, we emphasize that their stochastic bandit setting does not require the exploration over the action space
that our work requires and, even in generic linear MDPs, an $\Omega(\sqrt{K})$ regret lower bound exists (Jin et al., 2020).
Moreover, our algorithm achieves $\widetilde{O}(\sqrt{K})$ regret when the market noise distribution is not known, only with