Offline congestion games: How feedback type affects data coverage
requirement
Haozhe Jiang∗
jianghz20@mails.tsinghua.edu.cn
Qiwen Cui∗
qwcui@cs.washington.edu
Zhihan Xiong
zhihanx@cs.washington.edu
Maryam Fazel
mfazel@uw.edu
Simon S. Du
ssdu@cs.washington.edu
∗Equal contribution.
Abstract
This paper investigates when one can efficiently recover an approximate Nash Equilibrium (NE)
in offline congestion games. The existing dataset coverage assumption in offline general-sum games
inevitably incurs a dependency on the number of actions, which can be exponentially large in congestion
games. We consider three different types of feedback with decreasing revealed information. Starting from
the facility-level (a.k.a., semi-bandit) feedback, we propose a novel one-unit deviation coverage condition
and give a pessimism-type algorithm that can recover an approximate NE. For the agent-level (a.k.a.,
bandit) feedback setting, interestingly, we show the one-unit deviation coverage condition is not sufficient.
On the other hand, we convert the game to multi-agent linear bandits and show that with a generalized
data coverage assumption in offline linear bandits, we can efficiently recover the approximate NE. Lastly,
we consider a novel type of feedback, the game-level feedback where only the total reward from all agents
is revealed. Again, we show the coverage assumption for the agent-level feedback setting is insufficient in
the game-level feedback setting, and with a stronger version of the data coverage assumption for linear
bandits, we can recover an approximate NE. Together, our results constitute the first study of offline
congestion games and imply formal separations between different types of feedback.
1 Introduction
The congestion game is a special class of general-sum matrix games that models the interaction of players with
shared facilities (Rosenthal, 1973). Each player chooses some facilities to utilize, and each facility yields a
reward that depends on how congested it is. For instance, in the routing game (Koutsoupias &
Papadimitriou, 1999), each player chooses a path to travel from a starting point to a destination in
a traffic graph. The facilities are the edges, and the joint decision of all the players determines the congestion
in the graph. The more players utilize an edge, the longer the travel time on that edge will be. As
one of the most well-known classes of games, congestion games have been successfully deployed in numerous
real-world applications such as resource allocation (Johari & Tsitsiklis, 2003), electrical grids (Ibars et al.,
2010) and the cryptocurrency ecosystem (Altman et al., 2019).
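To make the model concrete, the following minimal Python sketch (purely illustrative; the facilities, reward function, and actions are hypothetical, not taken from the paper) simulates one round of such a game: each player picks a set of facilities, congestion is the number of players on each facility, and a player's reward is the sum of its facilities' rewards, which shrink as congestion grows.

# Toy congestion game: 3 facilities, 2 players (hypothetical numbers).
num_facilities = 3

def facility_reward(f, n):
    # Mean reward of facility f when n players use it; decreasing in n
    # models congestion (more users, less reward per user).
    base = [1.0, 0.8, 0.6][f]
    return base / n if n > 0 else 0.0

# Each player's action is a subset of facilities (e.g., a path in a routing game).
actions = [{0, 1}, {1, 2}]  # joint action: player 0 uses {0,1}, player 1 uses {1,2}

# Congestion level of each facility under this joint action.
counts = [sum(f in a for a in actions) for f in range(num_facilities)]

# Each player's reward: sum of its chosen facilities' rewards at the current congestion.
player_rewards = [sum(facility_reward(f, counts[f]) for f in a) for a in actions]

print(counts)          # [1, 2, 1]
print(player_rewards)  # [1.4, 1.0]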
Nash equilibrium (NE), one of the most important concepts in game theory (Nash Jr, 1950), characterizes
the emergent behavior in a multi-agent system with selfish players. It is commonly known that solving for the
NE is computationally efficient in congestion games as they are isomorphic to potential games (Monderer &
Shapley, 1996). Assuming full information access, classic dynamics such as best response dynamics (Fanelli
et al., 2008), replicator dynamics (Drighes et al., 2014) and no-regret dynamics (Kleinberg et al., 2009)
provably converge to NE in congestion games. Recently, Heliou et al. (2017) and Cui et al. (2022) relaxed
the full-information setting to the online (semi-)bandit feedback setting, achieving asymptotic and non-asymptotic
convergence, respectively. It is worth noting that Cui et al. (2022) proposed the first algorithm
that has sample complexity independent of the number of actions.
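As a rough illustration of the full-information dynamics mentioned above (a toy sketch with hypothetical rewards, not the algorithms of the cited works), best-response dynamics let players take turns switching to their best action given the others; in a finite congestion game every strict switch increases Rosenthal's potential, so the process terminates at a pure-strategy NE.

from itertools import combinations

# Toy congestion game (hypothetical rewards) for illustrating best-response dynamics.
facilities = [0, 1, 2]
base = {0: 1.0, 1: 0.9, 2: 0.5}

def facility_reward(f, n):
    return base[f] / n if n > 0 else 0.0

# Each player picks exactly two facilities; the action set is small enough to enumerate.
action_set = [set(c) for c in combinations(facilities, 2)]

def player_reward(i, joint):
    counts = {f: sum(f in a for a in joint) for f in facilities}
    return sum(facility_reward(f, counts[f]) for f in joint[i])

joint = [action_set[0], action_set[0]]  # arbitrary initial joint action
changed = True
while changed:  # terminates: each strict switch increases Rosenthal's potential
    changed = False
    for i in range(len(joint)):
        deviations = [joint[:i] + [a] + joint[i + 1:] for a in action_set]
        best = max(deviations, key=lambda j: player_reward(i, j))
        if player_reward(i, best) > player_reward(i, joint) + 1e-12:
            joint, changed = best, True

print(joint)  # a pure-strategy Nash equilibrium of this toy game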
Offline reinforcement learning has been studied in many real-world applications (Levine et al., 2020).
From the theoretical perspective, a line of work provides an understanding of offline single-agent decision
making, including bandits and Markov Decision Processes (MDPs), where researchers derived favorable
sample complexity guarantees under single-policy coverage (Rashidinejad et al., 2021; Xie et al., 2021b). However,
how to learn in multi-agent games from offline data is still far from clear. Recently, the unilateral
coverage assumption has been proposed as the minimal assumption for offline zero-sum games and offline
general-sum games with corresponding algorithms to learn the NE (Cui & Du, 2022a,b; Zhong et al., 2022).
Though their coverage assumption and the algorithms apply to the most general class of normal-form games,
when specialized to congestion games, the sample complexity will scale with the number of actions, which can
be exponentially large. Since congestion games admit specific structures, one may hope to find specialized
data coverage assumptions that permit sample-efficient offline learning.
In different applications, the types of feedback, i.e., the revealed reward information, can be different in
the offline dataset. For instance, the dataset may include the reward of each facility, the reward of each
player, or the total reward of the game. With decreasing information contained in the dataset, different
coverage assumptions and algorithms are necessary. In addition, the main challenge in solving congestion
games lies in the curse of an exponentially large action set, as the number of actions can be exponential in
the number of facilities. In this work, we aim to answer the following question:
When can we find an approximate NE in offline congestion games with different types of feedback, without
suffering from the curse of the large action set?
We provide an answer that reveals striking differences between different types of feedback.
1.1 Main Contributions
We provide both positive and negative results for each type of feedback. See Table 1 for a summary.
                   One-Unit      Weak Covariance    Strong Covariance
                   Deviation     Domination         Domination
Facility-Level     ✔             ✔                  ✔
Agent-Level        ✘             ✔                  ✔
Game-Level         ✘             ✘                  ✔

Table 1: A summary of how data coverage assumptions
affect offline learnability. In particular, ✔ indicates that
under this pair of feedback type and assumption, an
NE can be learned with a sufficient amount of data;
on the other hand, ✘ indicates that there exist
instances in which an NE cannot be learned no matter
how much data is collected.
1. Three types of feedback and corresponding data coverage assumptions. We consider three types of feedback: facility-level feedback, agent-level feedback, and game-level feedback, which model different real-world applications, and we study what dataset coverage assumptions permit finding an approximate NE under each. In offline general-sum games, Cui & Du (2022b) proposes the unilateral coverage assumption. Although their result can be applied to offline congestion games with agent-level feedback, their unilateral coverage coefficient is at least as large as the number of actions and thus has an exponential dependence on the number of facilities. Therefore, for each type of feedback, we propose a corresponding data coverage assumption to escape the curse of the large action set. Specifically:
Facility-Level Feedback: For facility-level
feedback, the reward incurred in each facility is provided in the offline dataset. This type of feedback
has the strongest signal. We propose the One-Unit Deviation coverage assumption (cf. Assumption 2) for
this feedback.
Agent-Level Feedback: For agent-level feedback, only the sum of the facilities’ rewards for each agent
is observed. This type of feedback has weaker signals than the facility-level feedback does, and therefore we
require a stronger data coverage assumption (cf. Assumption 3).
Game-Level Feedback: For the game-level feedback, only the sum of the agent rewards is obtained.
This type of feedback has the weakest signals, and we require the strongest data coverage assumption
(Assumption 4).
Notably, for the latter two types of feedback, we leverage the connections between congestion games and
linear bandits.
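To make the three feedback types concrete, the sketch below (hypothetical facilities and rewards, for illustration only) shows how a single logged play of a congestion game yields progressively coarser observations.

# One logged play of a toy congestion game, viewed under the three feedback types.
actions = [{0, 1}, {1, 2}]  # joint action: one facility subset per agent
facilities = (0, 1, 2)
counts = {f: sum(f in a for a in actions) for f in facilities}

def mean_reward(f, n):
    # Hypothetical mean reward of facility f under congestion level n
    # (in a real dataset these would be noisy samples of the means).
    return [1.0, 0.8, 0.6][f] / n

# Facility-level (semi-bandit): the reward of every facility each agent used is recorded.
facility_feedback = [{f: mean_reward(f, counts[f]) for f in a} for a in actions]

# Agent-level (bandit): only each agent's summed reward is recorded.
agent_feedback = [sum(obs.values()) for obs in facility_feedback]

# Game-level: only the total reward over all agents is recorded.
game_feedback = sum(agent_feedback)

print(facility_feedback)  # richest signal
print(agent_feedback)     # weaker signal
print(game_feedback)      # weakest signal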
2. Sample complexity analyses for different types of feedback. We adopt the surrogate minimization
idea in Cui & Du (2022b) and show that a unified algorithm (cf. Algorithm 1) with carefully designed bonus
terms tailored to different types of feedback can efficiently find an approximate NE, thereby showing that our
proposed data coverage assumptions are sufficient. For each type of feedback, we give a polynomial sample
complexity upper bound under its corresponding dataset coverage assumption.
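To give a sense of the pessimism principle behind such bonus terms, here is a minimal sketch (not a reproduction of the paper's Algorithm 1; the dataset format and Hoeffding-style bonus are illustrative assumptions) of lower-confidence-bound reward estimation from facility-level data, the building block that a pessimism-type surrogate objective would be computed from.

import math
from collections import defaultdict

def pessimistic_estimates(dataset, delta=0.05, r_max=1.0):
    # Schematic sketch only: compute "empirical mean minus bonus" (a lower
    # confidence bound) for each observed (facility, congestion_level) pair.
    # dataset: list of plays; each play maps (facility, congestion_level) -> observed reward.
    sums, counts = defaultdict(float), defaultdict(int)
    for play in dataset:
        for key, reward in play.items():
            sums[key] += reward
            counts[key] += 1
    lcb = {}
    for key, n in counts.items():
        mean = sums[key] / n
        bonus = r_max * math.sqrt(math.log(2.0 / delta) / (2.0 * n))  # Hoeffding-style width
        lcb[key] = max(0.0, mean - bonus)  # pessimistic estimate of the mean reward
    return lcb

# A learner would then evaluate candidate policies with these pessimistic estimates and
# output the policy whose estimated (surrogate) best-response gap is smallest.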
3. Separations between different types of feedback. To rigorously quantify the signal strengths
in the three types of feedback, we provide concrete hard instances. Specifically, we show there exists a
problem instance that satisfies Assumption 2, but with only agent-level feedback, we provably cannot find
an approximate NE, yielding a separation between the facility-level feedback and the agent-level feedback.
Furthermore, we also show there exists a problem instance that satisfies Assumption 3, but with only game-
level feedback, we provably cannot find an approximate NE, yielding a separation between the agent-level
feedback and game-level feedback.
1.2 Motivating Examples
Here, we provide several concrete scenarios to exemplify and motivate the aforementioned three types of
feedback. Formal definitions of the problem can be found in Section 2.
Example 1 (Facility-level feedback). Suppose Google Maps is trying to improve its route assigning
algorithm through historical data based on certain regions. Then, each edge (road) on the traffic graph
of this region can be considered as a facility and the action that a user will take is a path that connects a
certain origin and destination. In this setting, the cost of each facility is the waiting time on that road, which
may increase as the number of users choosing this facility increases. In the historical data, each data point
contains the path chosen by each user and his/her waiting time on each road, which is an offline dataset
with facility-level feedback.
Example 2 (Agent-level feedback). Suppose a company is trying to learn a policy to advertise its products
from historical data. We can consider a certain set of websites as the facility set, and the products as
the players. The action chosen for each product is a subset of websites where the company will place
advertisements for that product. The reward for each product is measured by its sales. In the historical
data, each data point contains the websites chosen for each product advertisement and the total amount of
sales within a certain range of time. This offline dataset inherently has only agent-level feedback since the
company cannot measure each website’s contribution to sales.
Example 3 (Game-level feedback). Under the same setting above, suppose now another company (called
B) is also trying to learn such a policy but lacks internal historical data. Therefore, B decides to use the
data from the company mentioned in the above example (called A). However, since company B does not
have internal access to company A's database, the precise sales of each product are not visible to company B.
As a result, company B can only record the total amount of sales of all concerning products from company
A’s public financial reports, making its offline dataset have only the game-level feedback.
1.3 Related Work
Potential Games and Congestion Games. Potential games are a special class of normal-form games
that admit a potential function quantifying the change in each player's payoff, and a deterministic (pure-strategy) NE is proven
to exist (Monderer & Shapley, 1996). Asymptotic convergence to the NE can be achieved by classic game
theory dynamics such as best response dynamics (Durand, 2018; Swenson et al., 2018), replicator dynamics
(Sandholm et al., 2008; Panageas & Piliouras, 2016) and no-regret dynamics (Heliou et al., 2017). Recently,
Cen et al. (2021) proved that natural policy gradient has a convergence rate independent of the number
of actions in entropy regularized potential games. Anagnostides et al. (2022) provided the non-asymptotic
convergence rate for mirror descent and O(1) individual regret for optimistic mirror descent.
Congestion games were proposed in the seminal work (Rosenthal, 1973) and their equivalence with potential
games was proven in (Monderer & Shapley, 1996). Note that congestion games can have exponentially large
action sets, so efficient algorithms for potential games are not necessarily efficient for congestion games.
Non-atomic congestion games consider separable players, which enjoy a convex potential function if the
cost function is non-decreasing (Roughgarden & Tardos, 2004). For atomic congestion games, the potential
function is usually non-convex, making the problem more difficult. (Kleinberg et al., 2009; Krichene et al.,
2014) show that no-regret algorithms asymptotically converge to NE with full information feedback and
(Krichene et al., 2015) provide averaged iterate convergence for bandit feedback. (Chen & Lu, 2015, 2016)
provide non-asymptotic convergence by assuming the atomic congestion game is close to a non-atomic one,
and thus approximately enjoys the convex potential. Recently, Cui et al. (2022) proposed two Nash-regret
minimizing algorithms, an upper-confidence-bound-type algorithm and a Frank-Wolfe-type algorithm for
semi-bandit feedback and bandit feedback settings respectively, and showed both algorithms converge at a
rate that does not depend on the number of actions. To the best of our knowledge, all of these works
consider either the full-information setting or the online feedback setting, rather than the offline setting studied in this paper.
Offline Bandits and Reinforcement Learning. For related works in empirical offline reinforcement
learning, interested readers can refer to (Levine et al., 2020). From the theoretical perspective, researchers
have been putting effort into understanding what dataset coverage assumptions allow for learning the optimal
policy. The most basic assumption is the uniform coverage, i.e., every state-action pair is covered by the
dataset (Szepesvári & Munos, 2005). Provably efficient algorithms have been proposed for both single-agent
and multi-agent reinforcement learning (Yin et al., 2020, 2021; Ren et al., 2021; Sidford et al., 2020; Cui
& Yang, 2021; Zhang et al., 2020; Subramanian et al., 2021). In single-agent bandits and reinforcement
learning, with the help of pessimism, only single policy coverage is required, i.e., the dataset only needs
to cover the optimal policy (Jin et al., 2021a; Rashidinejad et al., 2021; Xie et al., 2021b,a). For offline
multi-agent Markov games, Cui & Du (2022a) first show that it is impossible to learn with the single policy
coverage and they identify the unilateral coverage assumption as the minimal coverage assumption while
Zhong et al. (2022); Xiong et al. (2022) provide similar results with linear function approximation. Recently,
Yan et al. (2022); Cui & Du (2022b) give minimax sample complexity for offline zero-sum Markov games. In
addition, Cui & Du (2022b) proposes an algorithm for offline multi-player general-sum Markov games that
does not suffer from the curse of multiagents.
2 Preliminary
2.1 Congestion Game
General-Sum Matrix Game. A general-sum matrix game is defined by a tuple $G = (\{\mathcal{A}_i\}_{i=1}^m, R)$, where
$m$ is the number of players, $\mathcal{A}_i$ is the action space for player $i$, and $R(\cdot \mid a)$ is a distribution over $[0, r_{\max}]^m$
with mean $r(a)$. When playing the game, all players simultaneously select actions, constituting a joint action
$a$, and the reward is sampled as $r \sim R(\cdot \mid a)$, where player $i$ gets reward $r_i$.
Let $\mathcal{A} = \prod_{i=1}^m \mathcal{A}_i$. A joint policy is a distribution $\pi \in \Delta(\mathcal{A})$, while a product policy is $\pi = \bigotimes_{i=1}^m \pi_i$ with
$\pi_i \in \Delta(\mathcal{A}_i)$, where $\Delta(X)$ denotes the probability simplex over $X$. If the players follow a policy $\pi$, their
actions are sampled from the distribution $a \sim \pi$. The expected return of player $i$ under some policy $\pi$ is
defined as the value $V_i^{\pi} = \mathbb{E}_{a \sim \pi}[r_i(a)]$.
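As a small numerical illustration of these definitions (a hypothetical two-player game; player indices start at 0 in the code), the sketch below evaluates $V_i^{\pi}$ for a product policy by enumerating the joint action space.

from itertools import product

# Toy two-player general-sum matrix game with hypothetical mean rewards.
A1, A2 = ["a", "b"], ["x", "y"]
r = {  # r[(a1, a2)] = (mean reward of player 0, mean reward of player 1)
    ("a", "x"): (0.6, 0.4), ("a", "y"): (0.2, 0.8),
    ("b", "x"): (0.9, 0.1), ("b", "y"): (0.5, 0.5),
}

# A product policy: independent distributions over each player's actions.
pi1 = {"a": 0.3, "b": 0.7}
pi2 = {"x": 0.5, "y": 0.5}

def value(i):
    # V_i^pi = E_{a ~ pi}[ r_i(a) ] under the product policy (pi1, pi2).
    return sum(pi1[a1] * pi2[a2] * r[(a1, a2)][i] for a1, a2 in product(A1, A2))

print(value(0), value(1))  # expected returns of the two players: 0.61 and 0.39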
Let $\pi_{-i}$ be the joint policy of all the players except for player $i$. The best response of player $i$ to policy
$\pi_{-i}$ is defined as $\pi_i^{\dagger} = \arg\max_{\mu \in \Delta(\mathcal{A}_i)} V_i^{\mu, \pi_{-i}}$. Here $\mu$ is the policy for player $i$ and $(\mu, \pi_{-i})$ constitutes a
joint policy for all players. We can always set the best response to be a pure strategy since the value function
is linear in $\mu$. We also denote $V_i^{\dagger, \pi_{-i}} := V_i^{\pi_i^{\dagger}, \pi_{-i}}$ as the best response value. To evaluate a policy $\pi$, we