
provably converge to NE in congestion games. Recently, Heliou et al. (2017) and Cui et al. (2022) relaxed
the full-information setting to the online (semi-)bandit feedback setting, achieving asymptotic and non-asymptotic convergence, respectively. Notably, Cui et al. (2022) proposed the first algorithm
whose sample complexity is independent of the number of actions.
Offline reinforcement learning has been studied in many real-world applications (Levine et al., 2020).
From a theoretical perspective, a line of work provides an understanding of offline single-agent decision
making, including bandits and Markov Decision Processes (MDPs), where researchers derived favorable
sample complexity under single-policy coverage (Rashidinejad et al., 2021; Xie et al., 2021b). However,
how to learn in multi-agent games from offline data is still far from clear. Recently, the unilateral
coverage assumption has been proposed as the minimal assumption for offline zero-sum games and offline
general-sum games, with corresponding algorithms to learn the NE (Cui & Du, 2022a,b; Zhong et al., 2022).
Although their coverage assumption and algorithms apply to the most general class of normal-form games,
when specialized to congestion games the sample complexity scales with the number of actions, which can
be exponentially large. Since congestion games admit specific structures, one may hope to find specialized
data coverage assumptions that permit sample-efficient offline learning.
In different applications, the type of feedback, i.e., the revealed reward information, contained in the
offline dataset can differ. For instance, the dataset may include the reward of each facility, the reward of each
player, or the total reward of the game. As the dataset contains less information, different
coverage assumptions and algorithms become necessary. In addition, the main challenge in solving congestion
games lies in the curse of an exponentially large action set: each action is a subset of facilities, so the number
of actions can be exponential in the number of facilities. In this work, we aim to answer the following question:
When can we find an approximate NE in offline congestion games with different types of feedback, without
suffering from the curse of the large action set?
We provide an answer that reveals striking differences between different types of feedback.
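To make the three feedback types concrete, the following Python sketch (the congestion-game setup and all variable names are hypothetical, chosen only for illustration) derives facility-level, agent-level, and game-level feedback from per-facility rewards that depend on each facility's load.

```python
import numpy as np

# Illustrative sketch (not the paper's setup): a congestion game with a few
# agents and facilities. Each agent's action is a subset of facilities, and a
# facility's reward depends only on how many agents use it (its load).
rng = np.random.default_rng(0)
n_agents, n_facilities = 3, 4

# facility_reward[f, k]: reward of facility f when k agents use it (random values here).
facility_reward = rng.normal(size=(n_facilities, n_agents + 1))

def simulate(joint_action):
    """joint_action[i] is the set of facilities chosen by agent i."""
    load = np.zeros(n_facilities, dtype=int)
    for chosen in joint_action:
        for f in chosen:
            load[f] += 1

    # Facility-level feedback: the reward of every used facility is observed.
    facility_feedback = {f: facility_reward[f, load[f]]
                         for f in range(n_facilities) if load[f] > 0}

    # Agent-level feedback: only each agent's total reward (sum over its facilities).
    agent_feedback = [sum(facility_reward[f, load[f]] for f in chosen)
                      for chosen in joint_action]

    # Game-level feedback: only the sum of all agents' rewards.
    game_feedback = sum(agent_feedback)

    return facility_feedback, agent_feedback, game_feedback

print(simulate([{0, 1}, {1, 2}, {2, 3}]))
```

Each successive level discards information: agent-level feedback aggregates over a player's facilities, and game-level feedback further aggregates over players.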
1.1 Main Contributions
We provide both positive and negative results for each type of feedback. See Table 1 for a summary.
                 One-Unit     Weak Covariance    Strong Covariance
                 Deviation    Domination         Domination
Facility-Level   ✔            ✔                  ✔
Agent-Level      ✘            ✔                  ✔
Game-Level       ✘            ✘                  ✔
Table 1: A summary of how data coverage assumptions affect offline learnability. A ✔ indicates that,
under this pair of feedback type and assumption, an NE can be learned with a sufficient amount of data;
a ✘ indicates that there exist instances in which an NE cannot be learned no matter how much data is collected.
1. Three types of feedback and corresponding data coverage assumptions. We consider
three types of feedback, namely facility-level, agent-level, and game-level feedback, to model different
real-world applications, and we study which dataset coverage assumptions permit finding an
approximate NE under each. For offline general-sum games, Cui & Du (2022b) propose the unilateral coverage
assumption. Although their result can be applied to offline congestion games with agent-level feedback,
their unilateral coverage coefficient is at least as large as the number of actions and thus has an
exponential dependence on the number of facilities. Therefore, for each type of feedback, we propose a
corresponding data coverage assumption to escape the curse of the large action set. Specifically:
• Facility-Level Feedback: For facility-level feedback, the reward incurred at each facility is provided
in the offline dataset. This type of feedback carries the strongest signal. We propose the One-Unit Deviation
coverage assumption (cf. Assumption 2) for this feedback.
• Agent-Level Feedback: For agent-level feedback, only the sum of the facilities' rewards for each agent
is observed. This type of feedback carries a weaker signal than facility-level feedback does, and therefore we
require a stronger data coverage assumption (cf. Assumption 3).