Multi-agent Deep Covering Skill Discovery
Jiayu Chen, Marina Haliem, Tian Lan, and Vaneet Aggarwal
Abstract—The use of skills (a.k.a. options) can greatly accelerate exploration in reinforcement learning, especially when
only sparse reward signals are available. While option discovery
methods have been proposed for individual agents, in multi-
agent reinforcement learning settings, discovering collaborative
options that can coordinate the behavior of multiple agents and
encourage them to visit the under-explored regions of their joint
state space has not been considered. To this end, we propose Multi-agent Deep Covering Option Discovery, which constructs multi-agent options by minimizing the expected cover time of the agents' joint state space.
Also, we propose a novel framework for adopting these multi-agent options in the MARL process. In practice, a multi-agent task can usually be divided into several sub-tasks, each of which can be completed by a sub-group of the agents. Therefore, our
algorithm framework first leverages an attention mechanism
to find collaborative agent sub-groups that would benefit most
from coordinated actions. Then, a hierarchical algorithm, namely
HA-MSAC, is developed to learn the multi-agent options for each
sub-group to complete their sub-tasks first, and then to integrate
them through a high-level policy as the solution of the whole task.
This hierarchical option construction allows our framework to
strike a balance between scalability and effective collaboration
among the agents.
The evaluation on multi-agent collaborative tasks shows that the proposed algorithm can effectively capture the agent interactions with the attention mechanism, successfully identify multi-agent options, and significantly outperform prior works that use single-agent options or no options, in terms of both faster exploration and higher task rewards.
Index Terms—Multi-agent Reinforcement Learning, Skill Dis-
covery, Deep Covering Options
I. INTRODUCTION
Option discovery [1] enables temporally-abstract actions to be constructed in the reinforcement learning process. It can greatly improve the performance of reinforcement learning agents by representing actions at different time scales. Among recent developments on the topic, Covering Option Discovery [2] has been shown to be a promising approach. It leverages the Laplacian matrix extracted from the state-transition graph induced by the dynamics of the environment. To be specific, the second smallest eigenvalue of the Laplacian matrix, known as the algebraic connectivity of the graph, is considered a measure of how well-connected the graph is [3]. Accordingly, it uses the algebraic connectivity as an intrinsic reward to train the option policy, with the goal of connecting the states that are not well-connected, encouraging the agent to explore infrequently-visited regions, and thus minimizing the agent's expected cover time of the state space. Recently, deep learning techniques have been developed to extend the use of covering options to large/infinite state spaces, e.g., Deep Covering Option Discovery [4]. However, these efforts focus on discovering options for individual agents. Discovering collaborative options that encourage multiple agents to visit the under-explored regions of their joint state space has not been considered.

J. Chen, M. Haliem, and V. Aggarwal are with Purdue University, West Lafayette, IN 47907, USA, email: {chen3686,mwadea,vaneet}@purdue.edu. T. Lan is with the George Washington University, Washington, DC 20052, USA, email: tlan@gwu.edu. This paper was presented in part at the ICML workshop, July 2021 (no proceedings).
In this paper, we propose a novel framework – Multi-agent Deep Covering Option Discovery. For multi-agent scenarios, recent works [5], [6], [7] compute options with exploratory behaviors for each individual agent by considering only its own state transitions, and then learn to collaboratively leverage these individual options. In contrast, our proposed framework directly recognizes joint options composed of multiple agents' temporally-abstract action sequences to encourage joint exploration. Also, we note that in practical scenarios, multi-agent tasks can often be divided into a series of sub-tasks, and each sub-task can be completed by a sub-group of the agents. Thus, our proposed algorithm leverages an attention mechanism [8] in the option discovery process to quantify the strength of agent interactions and find collaborative agent sub-groups. After that,
we can train a set of multi-agent options for each sub-group
to complete their sub-tasks, and then integrate them through
a high-level policy as the solution for completing the whole
task. This sub-group partitioning and hierarchical learning
structure can effectively construct collaborative options that
jointly coordinate the exploration behavior of multiple agents,
while keeping the algorithm scalable in practice.
The main contributions of our work are as follows: (1) We
extend the deep covering option discovery to a multi-agent
scenario, namely Multi-agent Deep Covering Option Discovery,
and demonstrate that the use of multi-agent options can further
improve the performance of MARL agents compared with
single-agent options. (2) We propose to leverage an attention
mechanism in the discovery process to enable each agent to find the peer agents that it should closely interact and form a sub-group with. (3) We propose HA-MSAC, a hierarchical MARL
algorithm, which integrates the training of intra-option policies
(for the option construction) and the high-level policy (for
integrating the options). The proposed algorithm, evaluated
on MARL collaborative tasks, significantly outperforms prior
works in terms of faster exploration and higher task rewards.
The rest of this paper is organized as follows. Section II introduces related work and highlights the innovation of this paper. Section III presents the background knowledge on option discovery and the attention mechanism. Sections IV and V explain the proposed approach in detail, including its overall framework, network structure, and the objective functions to optimize. Section VI describes the simulation setup and presents comparisons of our algorithm with two baselines: MARL without option discovery and MARL with single-agent option discovery. Section VII concludes this paper.
II. RELATED WORK
Option Discovery. The option framework was proposed in [1], which extends the usual notion of actions to include options — closed-loop policies for taking actions over a period of time. Formally, a set of options defined over an MDP constitutes a semi-MDP (SMDP), where the SMDP actions (options) are no longer black boxes, but policies in the base MDP which can be learned in their own right. In the literature, many option discovery algorithms utilize the task-dependent reward signals generated by the environment, such as [9], [10], [11], [12]. Specifically, they directly define, or learn through gradient descent, options that can lead the agent to the rewarding states in the environment, and then utilize these trajectory segments (options) to compose the complete trajectory toward the goal state. However, these methods rely on dense reward signals, which are usually hard to acquire in real-life tasks. Therefore, the authors in [13] proposed an approach that generates options by maximizing an information-theoretic objective so that each option produces diverse behaviors. It learns useful skills/options without reward signals and thus can be applied in environments where only sparse rewards are available.
On the other hand, the works in [14], [2] focused on Covering Option Discovery, a method that is also not based on task-dependent reward signals but on the Laplacian matrix of the environment's state-transition graph. This method aims at minimizing the agent's expected cover time of the state space under a uniformly random policy. To realize this, it augments the agent's action set with options obtained from the eigenvector associated with the second smallest eigenvalue (algebraic connectivity) of the Laplacian matrix. However, this Laplacian-based framework can only be applied in tabular settings. To mitigate this issue, the authors in [4] proposed Deep Covering Option Discovery, which combines covering options with modern representation learning techniques for the eigenfunction estimation and thus can be applied in domains with infinite state spaces. In [4], the authors compared their approach with the one proposed in [13] (mentioned above): both approaches are sample-based and scalable to large state spaces, but RL agents with deep covering options perform better on the same benchmarks. Thus, in the evaluation part, we use Deep Covering Option Discovery as one of the baselines.
Note that all the approaches mentioned above are for single-
agent scenarios and the goal of this paper is to extend the
adoption of deep covering options to multi-agent reinforcement
learning.
Options in multi-agent scenarios. As mentioned in Section I, most of the research on adopting options in MARL tried to define or learn the options for each individual agent first, and then learn the collaborative behaviors among the agents based on their extended action sets – {primitive actions, individual options}. Therefore, the options they use are still single-agent options, and the coordination in the multi-agent system can only be exploited in the option-choosing process rather than in the option discovery process. We can classify these works by the option discovery methods they used: the algorithms in [15], [16] directly defined the options based on their task without a learning process; the algorithms in [17], [5], [6] learned the options based on the task-related reward signals generated by the environment; the algorithm in [7] trained the options based on a reward function that is a weighted sum of the environment reward and the information-theoretic reward term proposed in [13].
In this paper, we propose to construct multi-agent deep covering options using the Laplacian-based framework mentioned above. Also, in an N-agent system, there may be not only N-agent options, but also one-agent options, two-agent options, etc. Therefore, we first divide the agents into sub-groups based on their interaction relationships, which is realized through the attention mechanism [8], and then construct the interaction patterns (multi-agent options) for each sub-group accordingly. Through these improvements, the coordination among agents is considered in the option discovery process, which has the potential to further improve the performance of MARL agents.
Hierarchical Multi-agent Reinforcement Learning. Multi-agent reinforcement learning (MARL) methods hold great potential to solve a variety of real-world problems. Specific to the multi-agent cooperative setting (our focus), there are many related works: VDN [18], QMIX [19], QTRAN [20], MAVEN [21] and MSAC [22]. Among them, MSAC introduces the attention mechanism [8] to MARL, which is also what our framework relies on.
On the other hand, when adopting options in reinforcement learning, agents need to simultaneously learn the internal policy of each option (low-level policy) and the policy for choosing among options (high-level policy). Also, the high-level policy is used to select a new option only when the previous option terminates, so the termination signals should be considered when updating the high-level policy. The update rules of reinforcement learning with options are discussed in works such as [23], [24], [12] as the option-critic framework. In this paper, we introduce multi-agent options to MARL, so we extend the option-critic framework to multi-agent scenarios and combine it with MSAC to propose HA-MSAC, a hierarchical multi-agent reinforcement learning algorithm, which is described in Section V.
III. BACKGROUND
Before extending deep covering options to the multi-agent
setting, we will introduce the formal definition of the option
framework and some key issues of deep covering options. Also,
we will introduce the soft-attention mechanism leveraged for
sub-group division in this section.
A. Formal Definition of Option
In this paper, we use the term options for the generalization
of primitive actions to include temporally-extended courses
of actions. As defined in [1], an option $\omega$ consists of three components: an intra-option policy $\pi_\omega: \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$, a termination condition $\beta_\omega: \mathcal{S} \rightarrow \{0,1\}$, and an initiation set $I_\omega \subseteq \mathcal{S}$. An option $\langle I_\omega, \pi_\omega, \beta_\omega \rangle$ is available in state $s$ if and only if $s \in I_\omega$. If the option $\omega$ is taken, actions are selected according to $\pi_\omega$ until $\omega$ terminates stochastically according to $\beta_\omega$. Therefore, in order to get an option, we need to train/define the intra-option policy, and to define the termination condition and initiation set.
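As a concrete illustration (not part of the original formulation), the option triple can be sketched in Python as a small container plus a rollout helper; the Option class, the run_option function, and the gym-style env.step interface below are illustrative assumptions rather than the paper's implementation.

from dataclasses import dataclass
from typing import Any, Callable

State = Any    # environment-specific (joint) state or observation type
Action = Any   # environment-specific action type

@dataclass
class Option:
    """Mirrors the triple <I_omega, pi_omega, beta_omega>."""
    initiation: Callable[[State], bool]    # I_omega: can the option start in state s?
    policy: Callable[[State], Action]      # pi_omega: intra-option policy
    termination: Callable[[State], float]  # beta_omega: termination probability in s

def run_option(env, state, option, rng, max_steps=200):
    """Execute an option: follow pi_omega until beta_omega fires (or a step cap is hit)."""
    assert option.initiation(state), "option not available in this state"
    visited = [state]
    for _ in range(max_steps):
        action = option.policy(state)
        state, _, done, *_ = env.step(action)  # gym-style step, assumed interface
        visited.append(state)
        if done or rng.random() < option.termination(state):
            break
    return visited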
B. Deep Covering Option Discovery
As described in [4], deep covering options can be constructed by greedily maximizing the state-space graph's algebraic connectivity (i.e., its second smallest eigenvalue), so as to minimize the expected cover time of the state space. To realize this, they first compute the eigenfunction $f$ associated with the algebraic connectivity by minimizing $G(f)$:

$$G(f) = \frac{1}{2}\,\mathbb{E}_{(s,s')\sim \mathcal{H}}\!\left[(f(s)-f(s'))^2\right] + \eta\,\mathbb{E}_{s\sim\rho,\, s'\sim\rho}\!\left[(f(s)^2-1)(f(s')^2-1) + 2f(s)f(s')\right] \qquad (1)$$

where $\mathcal{H}$ is the set of sampled state-transitions and $\rho$ is the distribution of the states in $\mathcal{H}$. Note that this is a sample-based approach and thus can scale to infinite state-space domains. Then, based on the computed $f$, they define the termination set as the set of states where the $f$ value is smaller than the $k$-th percentile of the $f$ values on $\mathcal{H}$. Accordingly, the initiation set is defined as the complement of the termination set. As for the intra-option policy, they train it by maximizing the reward $r(s, a, s') = f(s) - f(s')$, which encourages the agent to explore the states with lower $f$ values, i.e., the less-explored states in the termination set.
In this paper, we will compute $f$ over the joint observation space of each collaborative group of agents, and then learn the multi-agent options based on $f$, so as to encourage the joint exploration of the agents within the same group and to increase the connectivity of their joint observation space.
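To ground Eq. (1) and the intrinsic reward, the following PyTorch sketch shows one way to estimate them from a batch of sampled transitions. The network architecture, the random-pairing estimate of the second expectation, and the names EigenfunctionF and covering_loss are our assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class EigenfunctionF(nn.Module):
    """Small MLP approximating the scalar eigenfunction f over (joint) observations."""
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def covering_loss(f, s, s_next, eta=1.0):
    """Sample-based estimate of G(f) in Eq. (1).

    s, s_next: batches of consecutive states drawn from the transition buffer H.
    The second expectation is estimated by randomly pairing states within the batch,
    which approximates drawing two independent samples from rho.
    """
    fs, fs_next = f(s), f(s_next)
    smoothness = 0.5 * ((fs - fs_next) ** 2).mean()

    perm = torch.randperm(s.shape[0])
    fu, fv = fs, fs[perm]
    ortho = ((fu ** 2 - 1.0) * (fv ** 2 - 1.0) + 2.0 * fu * fv).mean()
    return smoothness + eta * ortho

def intrinsic_reward(f, s, s_next):
    """r(s, a, s') = f(s) - f(s'): positive when moving toward low-f (under-explored) states."""
    with torch.no_grad():
        return f(s) - f(s_next)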
C. Soft Attention Mechanism
The soft attention mechanism functions in a manner similar to a differentiable key-value memory model [8], [25], [26]. The soft attention weight of agent $i$ to agent $j$ is defined in Equation (2):

$$W^S_{i,j} = \frac{\exp\!\left[h_j^T W_k^T W_q h_i\right]}{\sum_{n \neq i} \exp\!\left[h_n^T W_k^T W_q h_i\right]} \qquad (2)$$

where $h_i = g_i(o_i)$ is the embedding of agent $i$'s input (we take the observation as the input and a neural network defined in Section IV-B as the embedding function), $W_q$ transforms $h_i$ into a "query" and $W_k$ transforms $h_j$ into a "key". In this case, the attention weight $W^S_{i,j}$ is acquired by comparing the embedding $h_j$ with $h_i$ through a bilinear mapping (i.e., the query-key system) and passing the similarity value between these two embeddings into a softmax function. Note that the weight matrices for extracting queries ($W_q$) and keys ($W_k$) are shared across all agents, which encourages a common embedding space.
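A minimal PyTorch sketch of Eq. (2) is given below; the embedding network, its dimensions, and the class name SoftAttentionWeights are illustrative choices (the paper's actual network structure is defined in Section IV-B).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionWeights(nn.Module):
    """Computes the agent-to-agent attention weights of Eq. (2).

    The query/key matrices W_q and W_k are shared across all agents,
    which encourages a common embedding space.
    """
    def __init__(self, obs_dim, embed_dim=64, attend_dim=32):
        super().__init__()
        # Embedding of each agent's observation; shared here for brevity,
        # whereas the paper allows per-agent embedding functions g_i.
        self.embed = nn.Sequential(nn.Linear(obs_dim, embed_dim), nn.ReLU())
        self.W_q = nn.Linear(embed_dim, attend_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, attend_dim, bias=False)

    def forward(self, obs):                      # obs: [n_agents, obs_dim]
        h = self.embed(obs)                      # h_i = g(o_i), [n_agents, embed_dim]
        q = self.W_q(h)                          # queries,      [n_agents, attend_dim]
        k = self.W_k(h)                          # keys,         [n_agents, attend_dim]
        scores = q @ k.t()                       # scores[i, j] = h_j^T W_k^T W_q h_i
        mask = torch.eye(scores.shape[0], dtype=torch.bool)
        scores = scores.masked_fill(mask, float("-inf"))  # exclude j = i (n != i)
        return F.softmax(scores, dim=-1)         # row i holds W^S_{i,j} over the other agents

# usage: weights = SoftAttentionWeights(obs_dim=10)(torch.randn(4, 10))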
IV. ALGORITHM FRAMEWORK AND NETWORK STRUCTURE
In this section, we will introduce Multi-agent Deep Covering
Option Discovery and how to adopt it in a MARL setting.
First, we will provide the key objective functions of Multi-agent Deep Covering Option Discovery and the hierarchical algorithm framework to adopt it in MARL. Then, as an important algorithm module, we will show how to integrate the attention mechanism in the network design and how to adopt it for the sub-group division.
Fig. 1: An agent first decides on which option $\omega$ to use according to the high-level policy, and then decides on the (primitive) action to take based on the corresponding intra-option policy $\pi_\omega$. Primitive option: typically, we train an RL agent to select among the primitive actions; we view this agent as a special option whose intra-option policy lasts for only one step. Options $1 \sim N$: based on the attention mechanism, each agent can figure out which agents to collaborate closely and form a sub-group with, so there are at most $N$ sub-groups (duplicate ones need to be eliminated), and we need to train a multi-agent option for each sub-group.
Algorithm 1 Main Framework
1: Input: primitive option $A$, high-level policies for each agent $\pi_{1:N}$ and corresponding Q-functions $Q_{1:N}$, generation times of options $N_\omega$, generation frequency $N_{int}$
2: Initialize the set of options $\Omega \leftarrow \{A\}$
3: Create an empty replay buffer $B$
4: Set $n_\omega \leftarrow 0$
5: for episode $i = 1$ to $N_{epi}$ do
6:    Collect a trajectory $\tau_i$ by repeating this process until done: choose an available option from $\Omega$ according to $\pi_{1:N}$, and then execute the corresponding intra-option policy until it terminates
7:    Update $B$ with $\tau_i$
8:    if $i \bmod N_{int} == 0$ and $n_\omega < N_\omega$ then
9:        Generate a set of multi-agent options $\Omega'$ using Algorithm 2 based on trajectories in $B$
10:       Update $\Omega$: $\Omega \leftarrow \{A\} \cup \Omega'$
11:       Update $n_\omega$: $n_\omega \leftarrow n_\omega + 1$
12:   end if
13:   Sample trajectories $\tau_{1:batchsize}$ from $B$
14:   Update $\pi_{1:N}$, $Q_{1:N}$ using HA-MSAC (defined in Section V) based on $\tau_{1:batchsize}$
15: end for
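For readers who prefer code, the control flow of Algorithm 1 can be sketched in Python as follows. Every callable here (agents.choose_option, option.rollout, discover_options, ha_msac_update) is a placeholder for components the paper defines elsewhere (Algorithm 2 and the HA-MSAC update of Section V), not an existing API, and the environment interface is assumed to be gym-style.

import random

def main_framework(env, agents, primitive_option, discover_options, ha_msac_update,
                   n_episodes=1000, n_int=50, n_omega=4, batch_size=32):
    """A minimal sketch of Algorithm 1 with placeholder callables."""
    options = [primitive_option]          # Omega <- {A}
    buffer = []                           # replay buffer B
    n_generated = 0                       # n_omega <- 0

    for episode in range(1, n_episodes + 1):
        # Collect one trajectory by repeatedly picking an available option with the
        # high-level policies and rolling out its intra-option policy until it terminates.
        trajectory = []
        state, done = env.reset(), False
        while not done:
            option = agents.choose_option(state, options)
            state, done, segment = option.rollout(env, state)
            trajectory.extend(segment)
        buffer.append(trajectory)

        # Periodically generate new multi-agent options from the buffered trajectories.
        if episode % n_int == 0 and n_generated < n_omega:
            new_options = discover_options(buffer)        # stands in for Algorithm 2
            options = [primitive_option] + new_options    # Omega <- {A} U Omega'
            n_generated += 1

        # Off-policy update of the high-level policies and critics.
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        ha_msac_update(agents, batch)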
A. Main Framework
In order to take advantage of options in the learning process, we adopt a hierarchical RL framework, shown as Algorithm 1. Typically, we train an RL agent to select among the primitive actions; we view this agent as a special option whose intra-option policy lasts for only one step.