Multi-agent Deep Covering Skill Discovery
Jiayu Chen, Marina Haliem, Tian Lan, and Vaneet Aggarwal
Abstract—The use of skills (a.k.a. options) can greatly accelerate exploration in reinforcement learning, especially when
only sparse reward signals are available. While option discovery
methods have been proposed for individual agents, in multi-
agent reinforcement learning settings, discovering collaborative
options that can coordinate the behavior of multiple agents and
encourage them to visit the under-explored regions of their joint
state space has not been considered. To this end, we propose Multi-agent Deep Covering Option Discovery, which constructs multi-agent options by minimizing the expected cover time of the agents' joint state space.
Also, we propose a novel framework for adopting these multi-agent options in the MARL process. In practice, a multi-agent task can usually be divided into several sub-tasks, each of which can be completed by a sub-group of the agents. Therefore, our
algorithm framework first leverages an attention mechanism
to find collaborative agent sub-groups that would benefit most
from coordinated actions. Then, a hierarchical algorithm, namely
HA-MSAC, is developed to learn the multi-agent options for each
sub-group to complete their sub-tasks first, and then to integrate
them through a high-level policy as the solution of the whole task.
This hierarchical option construction allows our framework to
strike a balance between scalability and effective collaboration
among the agents.
The evaluation on multi-agent collaborative tasks shows that the proposed algorithm can effectively capture the agent interactions with the attention mechanism, successfully identify multi-agent options, and significantly outperform prior works that use single-agent options or no options, in terms of both faster exploration and higher task rewards.
Index Terms—Multi-agent Reinforcement Learning, Skill Dis-
covery, Deep Covering Options
I. INTRODUCTION
Option discovery [1] enables temporally-abstract actions to be constructed in the reinforcement learning process. It can greatly improve the performance of reinforcement learning agents by representing actions at different time scales. Among recent developments on the topic, Covering Option Discovery [2] has been shown to be a promising approach. It leverages the Laplacian matrix extracted from the state-transition graph induced by the dynamics of the environment. To be specific, the second smallest eigenvalue of the Laplacian matrix, known as the algebraic connectivity of the graph, is considered a measure of how well-connected the graph is [3]. Accordingly, it uses the algebraic connectivity as an intrinsic reward to train the option policy, with the goal of connecting the states that are not well-connected, encouraging the agent to explore infrequently-visited regions, and thus minimizing the agent's expected cover time of the state space. Recently, deep learning techniques have been developed to extend the use of covering options to large/infinite state spaces, e.g., Deep Covering Option Discovery [4]. However, these efforts focus on discovering options for individual agents. Discovering collaborative options that encourage multiple agents to visit the under-explored regions of their joint state space has not been considered.

J. Chen, M. Haliem, and V. Aggarwal are with Purdue University, West Lafayette, IN 47907, USA, email: {chen3686,mwadea,vaneet}@purdue.edu. T. Lan is with the George Washington University, Washington, DC 20052, USA, email: tlan@gwu.edu. This paper was presented in part at the ICML workshop, July 2021 (no proceedings).
In this paper, we propose a novel framework – Multi-agent Deep Covering Option Discovery. For multi-agent scenarios, recent works [5], [6], [7] compute options with exploratory behaviors for each individual agent by considering only its own state transitions, and then learn to collaboratively leverage these individual options. In contrast, our proposed framework directly recognizes joint options composed of multiple agents' temporally-abstract action sequences to encourage joint exploration. Also, we note that in practical scenarios, multi-agent tasks can often be divided into a series of sub-tasks, and each sub-task can be completed by a sub-group of the agents. Thus, our proposed algorithm leverages an attention mechanism [8] in the option discovery process to quantify the strength of agent interactions and find collaborative agent sub-groups. After that,
we can train a set of multi-agent options for each sub-group
to complete their sub-tasks, and then integrate them through
a high-level policy as the solution for completing the whole
task. This sub-group partitioning and hierarchical learning
structure can effectively construct collaborative options that
jointly coordinate the exploration behavior of multiple agents,
while keeping the algorithm scalable in practice.
The main contributions of our work are as follows: (1) We
extend the deep covering option discovery to a multi-agent
scenario, namely Multi-agent Deep Covering Option Discovery,
and demonstrate that the use of multi-agent options can further
improve the performance of MARL agents compared with
single-agent options. (2) We propose to leverage an attention
mechanism in the discovery process to enable each agent to find the peer agents that it should closely interact and form a sub-group with. (3) We propose HA-MSAC, a hierarchical MARL
algorithm, which integrates the training of intra-option policies
(for the option construction) and the high-level policy (for
integrating the options). The proposed algorithm, evaluated
on MARL collaborative tasks, significantly outperforms prior
works in terms of faster exploration and higher task rewards.
The rest of this paper is organized as follows. Section II introduces related work and highlights the innovation of this paper. Section III presents the background knowledge on option discovery and the attention mechanism. Sections IV and V explain the proposed approach in detail, including its overall framework, network structure, and the objective functions to optimize. Section VI describes the simulation setup and presents comparisons of our algorithm with two baselines: MARL without option discovery and MARL with single-agent option discovery. Section VII concludes this paper.
II. RELATED WORK
Option Discovery. The option framework was proposed in [1], which extends the usual notion of actions to include options — closed-loop policies for taking actions over a period of time. Formally, a set of options defined over an MDP constitutes a semi-MDP (SMDP), where the SMDP actions (options) are no longer black boxes, but policies in the base MDP which can be learned in their own right. In the literature, many option discovery algorithms utilize the task-dependent reward signals generated by the environment, such as [9], [10], [11], [12]. Specifically, they directly define, or learn through gradient descent, options that can lead the agent to the rewarding states in the environment, and then utilize these trajectory segments (options) to compose the complete trajectory toward the goal state. However, these methods rely on dense reward signals, which are usually hard to acquire in real-life tasks. Therefore, the authors in [13] proposed an approach that generates options by maximizing an information-theoretic objective so that each option produces diverse behaviors. It learns useful skills/options without reward signals and thus can be applied in environments where only sparse rewards are available.
On the other hand, the works in [14], [2] focused on Covering Option Discovery, a method that is also not based on task-dependent reward signals but on the Laplacian matrix of the environment's state-transition graph. This method aims at minimizing the agent's expected cover time of the state space under a uniformly random policy. To realize this, it augments the agent's action set with options obtained from the eigenvector associated with the second smallest eigenvalue (algebraic connectivity) of the Laplacian matrix. However, this Laplacian-based framework can only be applied in tabular settings. To mitigate this issue, the authors in [4] proposed Deep Covering Option Discovery, which combines covering options with modern representation learning techniques for the eigenfunction estimation and thus can be applied in domains with infinite state spaces. In [4], the authors compared their approach with the one proposed in [13] (mentioned above): both approaches are sample-based and scalable to large state spaces, but RL agents with deep covering options perform better on the same benchmarks. Thus, in the evaluation part, we use Deep Covering Option Discovery as one of the baselines.
Note that all the approaches mentioned above are for single-
agent scenarios and the goal of this paper is to extend the
adoption of deep covering options to multi-agent reinforcement
learning.
Options in multi-agent scenarios. As mentioned in Section I, most of the research on adopting options in MARL tried to define or learn the options for each individual agent first, and then learn the collaborative behaviors among the agents based on their extended action sets – {primitive actions, individual options}. Therefore, the options they use are still single-agent options, and the coordination in the multi-agent system can only be exploited in the option-choosing process rather than in the option discovery process. We can classify these works by the option discovery methods they used: the algorithms in [15], [16] directly defined the options based on their task without a learning process; the algorithms in [17], [5], [6] learned the options based on the task-related reward signals generated by the environment; the algorithm in [7] trained the options based on a reward function that is a weighted sum of the environment reward and the information-theoretic reward term proposed in [13].
In this paper, we propose to construct multi-agent deep covering options using the Laplacian-based framework mentioned above. Also, in an N-agent system, there may be not only N-agent options, but also one-agent options, two-agent options, etc. Therefore, we first divide the agents into sub-groups based on their interaction relationships, which is realized through the attention mechanism [8], and then construct the interaction patterns (multi-agent options) for each sub-group accordingly. Through these improvements, the coordination among agents is considered in the option discovery process, which has the potential to further improve the performance of MARL agents.
Hierarchical Multi-agent Reinforcement Learning. Multi-agent reinforcement learning (MARL) methods hold great potential to solve a variety of real-world problems. Specific to the multi-agent cooperative setting (our focus), there are many related works: VDN [18], QMIX [19], QTRAN [20], MAVEN [21] and MSAC [22]. Among them, MSAC introduces the attention mechanism [8] to MARL, which is also what our framework relies on.
On the other hand, when adopting options in reinforcement learning, agents need to simultaneously learn the internal policy of each option (low-level policy) and the policy for choosing among options (high-level policy). Also, the high-level policy is used to select a new option only when the previous option terminates, so the termination signals should be considered when updating the high-level policy. The update rules of reinforcement learning with options are discussed in works such as [23], [24], [12] as the option-critic framework. In this paper, we introduce multi-agent options to MARL, so we extend the option-critic framework to multi-agent scenarios and combine it with MSAC to propose HA-MSAC, a hierarchical multi-agent reinforcement learning algorithm, which is described in Section V.
III. BACKGROUND
Before extending deep covering options to the multi-agent
setting, we will introduce the formal definition of the option
framework and some key issues of deep covering options. Also,
we will introduce the soft-attention mechanism leveraged for
sub-group division in this section.
A. Formal Definition of Option
In this paper, we use the term options for the generalization
of primitive actions to include temporally-extended courses
of actions. As defined in [1], an option $\omega$ consists of three components: an intra-option policy $\pi_\omega: \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$, a termination condition $\beta_\omega: \mathcal{S} \rightarrow \{0,1\}$, and an initiation set $I_\omega \subseteq \mathcal{S}$. An option $\langle I_\omega, \pi_\omega, \beta_\omega \rangle$ is available in state $s$ if and only if $s \in I_\omega$. If the option $\omega$ is taken, actions are selected according to $\pi_\omega$ until $\omega$ terminates stochastically according to $\beta_\omega$. Therefore, in order to get an option, we need to train/define the intra-option policy, and to define the termination condition and initiation set.
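As a concrete illustration (not part of the original formulation), the option triple can be sketched in Python as a small container plus a rollout helper; the Option class, the run_option function, and the gym-style env.step interface below are illustrative assumptions rather than the paper's implementation.

from dataclasses import dataclass
from typing import Any, Callable

State = Any    # environment-specific (joint) state or observation type
Action = Any   # environment-specific action type

@dataclass
class Option:
    """Mirrors the triple <I_omega, pi_omega, beta_omega>."""
    initiation: Callable[[State], bool]    # I_omega: can the option start in state s?
    policy: Callable[[State], Action]      # pi_omega: intra-option policy
    termination: Callable[[State], float]  # beta_omega: termination probability in s

def run_option(env, state, option, rng, max_steps=200):
    """Execute an option: follow pi_omega until beta_omega fires (or a step cap is hit)."""
    assert option.initiation(state), "option not available in this state"
    visited = [state]
    for _ in range(max_steps):
        action = option.policy(state)
        state, _, done, *_ = env.step(action)  # gym-style step, assumed interface
        visited.append(state)
        if done or rng.random() < option.termination(state):
            break
    return visited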
B. Deep Covering Option Discovery
As described in [4], deep covering options can be constructed by greedily maximizing the state-space graph's algebraic connectivity (i.e., its second smallest eigenvalue), so as to minimize the expected cover time of the state space. To realize this, they first compute the eigenfunction $f$ associated with the algebraic connectivity by minimizing $G(f)$:

$$G(f) = \frac{1}{2}\,\mathbb{E}_{(s,s')\sim \mathcal{H}}\!\left[(f(s)-f(s'))^2\right] + \eta\,\mathbb{E}_{s\sim\rho,\, s'\sim\rho}\!\left[(f(s)^2-1)(f(s')^2-1) + 2f(s)f(s')\right] \qquad (1)$$

where $\mathcal{H}$ is the set of sampled state-transitions and $\rho$ is the distribution of the states in $\mathcal{H}$. Note that this is a sample-based approach and thus can scale to infinite state-space domains. Then, based on the computed $f$, they define the termination set as the set of states where the $f$ value is smaller than the $k$-th percentile of the $f$ values on $\mathcal{H}$. Accordingly, the initiation set is defined as the complement of the termination set. As for the intra-option policy, they train it by maximizing the reward $r(s, a, s') = f(s) - f(s')$, which encourages the agent to explore the states with lower $f$ values, i.e., the less-explored states in the termination set.
In this paper, we will compute $f$ over the joint observation space of each collaborative group of agents, and then learn the multi-agent options based on $f$, so as to encourage the joint exploration of the agents within the same group and to increase the connectivity of their joint observation space.
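To ground Eq. (1) and the intrinsic reward, the following PyTorch sketch shows one way to estimate them from a batch of sampled transitions. The network architecture, the random-pairing estimate of the second expectation, and the names EigenfunctionF and covering_loss are our assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class EigenfunctionF(nn.Module):
    """Small MLP approximating the scalar eigenfunction f over (joint) observations."""
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def covering_loss(f, s, s_next, eta=1.0):
    """Sample-based estimate of G(f) in Eq. (1).

    s, s_next: batches of consecutive states drawn from the transition buffer H.
    The second expectation is estimated by randomly pairing states within the batch,
    which approximates drawing two independent samples from rho.
    """
    fs, fs_next = f(s), f(s_next)
    smoothness = 0.5 * ((fs - fs_next) ** 2).mean()

    perm = torch.randperm(s.shape[0])
    fu, fv = fs, fs[perm]
    ortho = ((fu ** 2 - 1.0) * (fv ** 2 - 1.0) + 2.0 * fu * fv).mean()
    return smoothness + eta * ortho

def intrinsic_reward(f, s, s_next):
    """r(s, a, s') = f(s) - f(s'): positive when moving toward low-f (under-explored) states."""
    with torch.no_grad():
        return f(s) - f(s_next)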
C. Soft Attention Mechanism
The soft attention mechanism functions in a manner similar to a differentiable key-value memory model [8], [25], [26]. The soft attention weight of agent $i$ to agent $j$ is defined in Equation (2):

$$W^S_{i,j} = \frac{\exp\!\left[h_j^T W_k^T W_q h_i\right]}{\sum_{n \neq i} \exp\!\left[h_n^T W_k^T W_q h_i\right]} \qquad (2)$$

where $h_i = g_i(o_i)$ is the embedding of agent $i$'s input (we take the observation as the input and a neural network defined in Section IV-B as the embedding function), $W_q$ transforms $h_i$ into a "query" and $W_k$ transforms $h_j$ into a "key". In this case, the attention weight $W^S_{i,j}$ is acquired by comparing the embedding $h_j$ with $h_i$ through a bilinear mapping (i.e., the query-key system) and passing the similarity value between these two embeddings into a softmax function. Note that the weight matrices for extracting queries ($W_q$) and keys ($W_k$) are shared across all agents, which encourages a common embedding space.
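A minimal PyTorch sketch of Eq. (2) is given below; the embedding network, its dimensions, and the class name SoftAttentionWeights are illustrative choices (the paper's actual network structure is defined in Section IV-B).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionWeights(nn.Module):
    """Computes the agent-to-agent attention weights of Eq. (2).

    The query/key matrices W_q and W_k are shared across all agents,
    which encourages a common embedding space.
    """
    def __init__(self, obs_dim, embed_dim=64, attend_dim=32):
        super().__init__()
        # Embedding of each agent's observation; shared here for brevity,
        # whereas the paper allows per-agent embedding functions g_i.
        self.embed = nn.Sequential(nn.Linear(obs_dim, embed_dim), nn.ReLU())
        self.W_q = nn.Linear(embed_dim, attend_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, attend_dim, bias=False)

    def forward(self, obs):                      # obs: [n_agents, obs_dim]
        h = self.embed(obs)                      # h_i = g(o_i), [n_agents, embed_dim]
        q = self.W_q(h)                          # queries,      [n_agents, attend_dim]
        k = self.W_k(h)                          # keys,         [n_agents, attend_dim]
        scores = q @ k.t()                       # scores[i, j] = h_j^T W_k^T W_q h_i
        mask = torch.eye(scores.shape[0], dtype=torch.bool)
        scores = scores.masked_fill(mask, float("-inf"))  # exclude j = i (n != i)
        return F.softmax(scores, dim=-1)         # row i holds W^S_{i,j} over the other agents

# usage: weights = SoftAttentionWeights(obs_dim=10)(torch.randn(4, 10))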
IV. ALGORITHM FRAMEWORK AND NETWORK STRUCTURE
In this section, we will introduce Multi-agent Deep Covering
Option Discovery and how to adopt it in a MARL setting.
First, we will provide the key objective functions of Multi-agent Deep Covering Option Discovery and the hierarchical algorithm framework to adopt it in MARL. Then, as an important algorithm module, we will show how to integrate the attention mechanism in the network design and how to adopt it for the sub-group division.
Fig. 1: An agent first decides on which option $\omega$ to use according to the high-level policy, and then decides on the (primitive) action to take based on the corresponding intra-option policy $\pi_\omega$. Primitive option: typically, we train an RL agent to select among the primitive actions; we view this agent as a special option whose intra-option policy lasts for only one step. Options $1 \sim N$: based on the attention mechanism, each agent can figure out which agents to collaborate closely and form a sub-group with, so there are at most $N$ sub-groups (duplicate ones need to be eliminated), and we need to train a multi-agent option for each sub-group.
Algorithm 1 Main Framework
1: Input: primitive option $A$, high-level policies for each agent $\pi_{1:N}$ and corresponding Q-functions $Q_{1:N}$, generation times of options $N_\omega$, generation frequency $N_{int}$
2: Initialize the set of options $\Omega \leftarrow \{A\}$
3: Create an empty replay buffer $B$
4: Set $n_\omega \leftarrow 0$
5: for episode $i = 1$ to $N_{epi}$ do
6:    Collect a trajectory $\tau_i$ by repeating this process until done: choose an available option from $\Omega$ according to $\pi_{1:N}$, and then execute the corresponding intra-option policy until it terminates
7:    Update $B$ with $\tau_i$
8:    if $i \bmod N_{int} == 0$ and $n_\omega < N_\omega$ then
9:        Generate a set of multi-agent options $\Omega'$ using Algorithm 2 based on trajectories in $B$
10:       Update $\Omega$: $\Omega \leftarrow \{A\} \cup \Omega'$
11:       Update $n_\omega$: $n_\omega \leftarrow n_\omega + 1$
12:   end if
13:   Sample trajectories $\tau_{1:batchsize}$ from $B$
14:   Update $\pi_{1:N}$, $Q_{1:N}$ using HA-MSAC (defined in Section V) based on $\tau_{1:batchsize}$
15: end for
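For readers who prefer code, the control flow of Algorithm 1 can be sketched in Python as follows. Every callable here (agents.choose_option, option.rollout, discover_options, ha_msac_update) is a placeholder for components the paper defines elsewhere (Algorithm 2 and the HA-MSAC update of Section V), not an existing API, and the environment interface is assumed to be gym-style.

import random

def main_framework(env, agents, primitive_option, discover_options, ha_msac_update,
                   n_episodes=1000, n_int=50, n_omega=4, batch_size=32):
    """A minimal sketch of Algorithm 1 with placeholder callables."""
    options = [primitive_option]          # Omega <- {A}
    buffer = []                           # replay buffer B
    n_generated = 0                       # n_omega <- 0

    for episode in range(1, n_episodes + 1):
        # Collect one trajectory by repeatedly picking an available option with the
        # high-level policies and rolling out its intra-option policy until it terminates.
        trajectory = []
        state, done = env.reset(), False
        while not done:
            option = agents.choose_option(state, options)
            state, done, segment = option.rollout(env, state)
            trajectory.extend(segment)
        buffer.append(trajectory)

        # Periodically generate new multi-agent options from the buffered trajectories.
        if episode % n_int == 0 and n_generated < n_omega:
            new_options = discover_options(buffer)        # stands in for Algorithm 2
            options = [primitive_option] + new_options    # Omega <- {A} U Omega'
            n_generated += 1

        # Off-policy update of the high-level policies and critics.
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        ha_msac_update(agents, batch)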
A. Main Framework
In order to take advantage of options in the learning process, we adopt a hierarchical RL framework, shown as Algorithm 1. Typically, we train an RL agent to select among the primitive actions; we view this agent as a special option whose intra-option policy lasts for only one step.