
provably converge to NE in congestion games. Recently, Heliou et al. (2017) and Cui et al. (2022) relaxed
the full-information setting to the online (semi-)bandit feedback setting, achieving asymptotic and non-asymptotic convergence, respectively. Notably, Cui et al. (2022) proposed the first algorithm
whose sample complexity is independent of the number of actions.
Offline reinforcement learning has been studied in many real-world applications (Levine et al., 2020).
From a theoretical perspective, a line of work provides an understanding of offline single-agent decision
making, including bandits and Markov Decision Processes (MDPs), where researchers derived favorable
sample complexity under single-policy coverage (Rashidinejad et al., 2021; Xie et al., 2021b). However,
how to learn in multi-agent games from offline data is still far from clear. Recently, the unilateral
coverage assumption has been proposed as the minimal assumption for offline zero-sum games and offline
general-sum games, with corresponding algorithms to learn the NE (Cui & Du, 2022a,b; Zhong et al., 2022).
Although their coverage assumption and algorithms apply to the most general class of normal-form games,
when specialized to congestion games the sample complexity scales with the number of actions, which can
be exponentially large. Since congestion games admit specific structures, one may hope to find specialized
data coverage assumptions that permit sample-efficient offline learning.
In different applications, the type of feedback, i.e., the revealed reward information, contained in the
offline dataset can differ. For instance, the dataset may include the reward of each facility, the reward of each
player, or the total reward of the game. As the dataset contains less information, different
coverage assumptions and algorithms become necessary. In addition, the main challenge in solving congestion
games lies in the curse of an exponentially large action set: each action is a subset of facilities, so the number
of actions can be exponential in the number of facilities. In this work, we aim to answer the following question:
When can we find an approximate NE in offline congestion games with different types of feedback, without
suffering from the curse of the large action set?
We provide an answer that reveals striking differences between different types of feedback.
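To make the three feedback types concrete, the following Python sketch (the congestion-game setup and all variable names are hypothetical, chosen only for illustration) derives facility-level, agent-level, and game-level feedback from per-facility rewards that depend on each facility's load.

```python
import numpy as np

# Illustrative sketch (not the paper's setup): a congestion game with a few
# agents and facilities. Each agent's action is a subset of facilities, and a
# facility's reward depends only on how many agents use it (its load).
rng = np.random.default_rng(0)
n_agents, n_facilities = 3, 4

# facility_reward[f, k]: reward of facility f when k agents use it (random values here).
facility_reward = rng.normal(size=(n_facilities, n_agents + 1))

def simulate(joint_action):
    """joint_action[i] is the set of facilities chosen by agent i."""
    load = np.zeros(n_facilities, dtype=int)
    for chosen in joint_action:
        for f in chosen:
            load[f] += 1

    # Facility-level feedback: the reward of every used facility is observed.
    facility_feedback = {f: facility_reward[f, load[f]]
                         for f in range(n_facilities) if load[f] > 0}

    # Agent-level feedback: only each agent's total reward (sum over its facilities).
    agent_feedback = [sum(facility_reward[f, load[f]] for f in chosen)
                      for chosen in joint_action]

    # Game-level feedback: only the sum of all agents' rewards.
    game_feedback = sum(agent_feedback)

    return facility_feedback, agent_feedback, game_feedback

print(simulate([{0, 1}, {1, 2}, {2, 3}]))
```

Each successive level discards information: agent-level feedback aggregates over a player's facilities, and game-level feedback further aggregates over players.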
1.1 Main Contributions
We provide both positive and negative results for each type of feedback. See Table 1 for a summary.
                 One-Unit     Weak Covariance    Strong Covariance
                 Deviation    Domination         Domination
Facility-Level   ✔            ✔                  ✔
Agent-Level      ✘            ✔                  ✔
Game-Level       ✘            ✘                  ✔
Table 1: A summary of how data coverage assumptions affect offline learnability. A ✔ indicates that,
under this pair of feedback type and assumption, an NE can be learned with a sufficient amount of data;
a ✘ indicates that there exist instances in which an NE cannot be learned no matter how much data is collected.
1. Three types of feedback and corresponding data coverage assumptions. We consider
three types of feedback, namely facility-level, agent-level, and game-level feedback, to model different
real-world applications, and we study which dataset coverage assumptions permit finding an
approximate NE under each. For offline general-sum games, Cui & Du (2022b) propose the unilateral coverage
assumption. Although their result can be applied to offline congestion games with agent-level feedback,
their unilateral coverage coefficient is at least as large as the number of actions and thus has an
exponential dependence on the number of facilities. Therefore, for each type of feedback, we propose a
corresponding data coverage assumption to escape the curse of the large action set. Specifically:
• Facility-Level Feedback: For facility-level feedback, the reward incurred at each facility is provided
in the offline dataset. This type of feedback carries the strongest signal. We propose the One-Unit Deviation
coverage assumption (cf. Assumption 2) for this feedback.
• Agent-Level Feedback: For agent-level feedback, only the sum of the facilities' rewards for each agent
is observed. This type of feedback carries a weaker signal than facility-level feedback does, and therefore we
require a stronger data coverage assumption (cf. Assumption 3).