Multi-agent Deep Covering Skill Discovery
Jiayu Chen, Marina Haliem, Tian Lan, and Vaneet Aggarwal
Abstract—The use of skills (a.k.a. options) can greatly accelerate exploration in reinforcement learning, especially when only sparse reward signals are available. While option discovery methods have been proposed for individual agents, discovering collaborative options that coordinate the behavior of multiple agents and encourage them to visit the under-explored regions of their joint state space has not been considered in multi-agent reinforcement learning (MARL) settings. To this end, we propose Multi-agent Deep Covering Option Discovery, which constructs multi-agent options by minimizing the expected cover time of the agents' joint state space.
Also, we propose a novel framework for adopting these multi-agent options in the MARL process. In practice, a multi-agent task can usually be divided into several sub-tasks, each of which can be completed by a sub-group of the agents. Therefore, our framework first leverages an attention mechanism to find collaborative agent sub-groups that would benefit most from coordinated actions. Then, a hierarchical algorithm, HA-MSAC, is developed to learn multi-agent options for each sub-group to complete its sub-task, and then to integrate these options through a high-level policy as the solution to the whole task. This hierarchical option construction allows our framework to strike a balance between scalability and effective collaboration among the agents.
The evaluation on multi-agent collaborative tasks shows that the proposed algorithm can effectively capture agent interactions with the attention mechanism, successfully identify multi-agent options, and significantly outperform prior works that use single-agent options or no options, in terms of both faster exploration and higher task rewards.
Index Terms—Multi-agent Reinforcement Learning, Skill Discovery, Deep Covering Options
I. INTRODUCTION
Option discovery [1] enables temporally-abstract actions to be constructed in the reinforcement learning process. It can greatly improve the performance of reinforcement learning agents by representing actions at different time scales. Among recent developments on this topic, Covering Option Discovery [2] has been shown to be a promising approach. It leverages the Laplacian matrix of the state-transition graph induced by the dynamics of the environment. Specifically, the second-smallest eigenvalue of the Laplacian matrix, known as the algebraic connectivity of the graph, measures how well-connected the graph is [3]. Covering Option Discovery uses the algebraic connectivity as an intrinsic reward to train the option policy, with the goal of connecting states that are not well-connected, thereby encouraging the agent to explore infrequently-visited regions and minimizing the agent's expected cover time of the state space.
J. Chen, M. Haliem, and V. Aggarwal are with Purdue University, West Lafayette, IN 47907, USA, email: {chen3686,mwadea,vaneet}@purdue.edu. T. Lan is with the George Washington University, Washington, DC 20052, USA, email: tlan@gwu.edu.
This paper was presented in part at the ICML workshop, July 2021 (no proceedings).
Recently, deep learning techniques have been developed to extend the use of covering options to large or infinite state spaces, e.g., Deep Covering Option Discovery [4]. However, these efforts focus on discovering options for individual agents; discovering collaborative options that encourage multiple agents to visit the under-explored regions of their joint state space has not been considered.
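To make the role of the algebraic connectivity concrete, the following is a minimal, illustrative sketch (not the construction used in this paper) of how the second-smallest eigenvalue of the graph Laplacian can be computed for a small state-transition graph; the toy adjacency matrix and variable names are assumptions for illustration only.

```python
import numpy as np

# Illustrative sketch (assumptions only): estimate the algebraic connectivity
# of a state-transition graph. A low value indicates a poorly connected graph,
# which covering options aim to improve by linking weakly connected states.

def algebraic_connectivity(adjacency: np.ndarray) -> float:
    """Return the second-smallest eigenvalue of the graph Laplacian L = D - A."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    eigenvalues = np.linalg.eigvalsh(laplacian)  # ascending order for symmetric L
    return float(eigenvalues[1])

# Toy state-transition graph over 4 states forming a chain 0-1-2-3.
# States 0 and 3 are poorly connected; an option linking them would raise
# the algebraic connectivity and reduce the expected cover time.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(algebraic_connectivity(A))  # small value -> weakly connected chain
```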
In this paper, we propose a novel framework, Multi-agent Deep Covering Option Discovery. For multi-agent scenarios, recent works [5], [6], [7] compute options with exploratory behaviors for each individual agent by considering only its own state transitions, and then learn to collaboratively leverage these individual options. In contrast, our proposed framework directly recognizes joint options composed of multiple agents' temporally-abstract action sequences to encourage joint exploration. Also, we note that in practical scenarios, multi-agent tasks can often be divided into a series of sub-tasks, each of which can be completed by a sub-group of the agents. Thus, our proposed algorithm leverages an attention mechanism [8] in the option discovery process to quantify the strength of agent interactions and find collaborative agent sub-groups. After that, we can train a set of multi-agent options for each sub-group to complete its sub-task, and then integrate them through a high-level policy as the solution for completing the whole task. This sub-group partitioning and hierarchical learning structure can effectively construct collaborative options that jointly coordinate the exploration behavior of multiple agents, while keeping the algorithm scalable in practice.
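As a rough illustration of how attention weights can serve as a proxy for pairwise interaction strength, the sketch below computes scaled dot-product attention scores between agents' observation embeddings; the embeddings, threshold, and grouping rule here are assumptions for illustration only, and the actual network architecture is described in Sections IV and V.

```python
import numpy as np

# Illustrative sketch (assumptions only): use scaled dot-product attention
# scores between agents' observation embeddings as a proxy for interaction
# strength, then threshold them to suggest collaborative sub-groups.

def attention_scores(embeddings: np.ndarray) -> np.ndarray:
    """Row-normalized attention weights among n agents (n x d embeddings)."""
    d = embeddings.shape[1]
    logits = embeddings @ embeddings.T / np.sqrt(d)       # pairwise similarity
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
obs_embed = rng.normal(size=(4, 8))        # 4 agents, 8-dim embeddings (toy data)
W = attention_scores(obs_embed)
strength = (W + W.T) / 2                   # symmetrize pairwise attention
np.fill_diagonal(strength, 0.0)            # ignore self-attention
print(strength > 0.3)                      # candidate sub-group structure (toy threshold)
```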
The main contributions of our work are as follows: (1) We extend deep covering option discovery to the multi-agent setting, namely Multi-agent Deep Covering Option Discovery, and demonstrate that the use of multi-agent options can further improve the performance of MARL agents compared with single-agent options. (2) We propose to leverage an attention mechanism in the discovery process to enable each agent to find the peer agents with which it should closely interact and form sub-groups. (3) We propose HA-MSAC, a hierarchical MARL
(for the option construction) and the high-level policy (for
integrating the options). The proposed algorithm, evaluated
on MARL collaborative tasks, significantly outperforms prior
works in terms of faster exploration and higher task rewards.
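For intuition on how the high-level policy and the intra-option policies fit together, the following is a minimal, hypothetical sketch of call-and-return execution with a multi-agent option; every name below (the toy environment, policies, and termination rule) is an illustrative stand-in and does not reflect the HA-MSAC interfaces defined later in this paper.

```python
import numpy as np

# Hypothetical sketch: a high-level policy selects a joint option, and the
# agents follow that option's intra-option policy until its termination
# condition fires (call-and-return execution).

class GridEnv:
    """Two agents on a 1-D grid of length 10 with dummy dynamics."""
    def reset(self):
        self.pos = np.array([0, 9])
        return self.pos.copy()

    def step(self, joint_action):
        self.pos = np.clip(self.pos + np.asarray(joint_action), 0, 9)
        done = bool(self.pos[0] == self.pos[1])    # agents meet -> task done
        return self.pos.copy(), float(done), done

def meet_option_policy(joint_state):
    return [1, -1]                                  # both agents move toward each other

def meet_option_terminates(joint_state):
    return abs(int(joint_state[0]) - int(joint_state[1])) <= 1

options = [(meet_option_policy, meet_option_terminates)]

def high_level_policy(joint_state):
    return 0                                        # trivial: always pick option 0

env = GridEnv()
state = env.reset()
intra_policy, terminates = options[high_level_policy(state)]
while not terminates(state):                        # run the option until termination
    state, reward, done = env.step(intra_policy(state))
print("option terminated at joint state:", state)
```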
The rest of this paper is organized as follows. Section II introduces related works and highlights the innovations of this paper. Section III presents the background knowledge on option discovery and attention mechanisms. Sections IV and V explain the proposed approach in detail, including its overall framework, network structure, and objective functions to optimize. Section VI describes the simulation setup and presents comparisons of our algorithm with two baselines: MARL without option discovery and MARL with single-agent option discovery. Section VII concludes this paper.