Near-Optimal Multi-Agent Learning for
Safe Coverage Control
Manish Prajapat
ETH Zurich
manishp@ai.ethz.ch
Matteo Turchetta
ETH Zurich
matteotu@inf.ethz.ch
Melanie N. Zeilinger
ETH Zurich
mzeilinger@ethz.ch
Andreas Krause
ETH Zurich
krausea@ethz.ch
Abstract

In multi-agent coverage control problems, agents navigate their environment to reach locations that maximize the coverage of some density. In practice, the density is rarely known a priori, further complicating the original NP-hard problem. Moreover, in many applications, agents cannot visit arbitrary locations due to a priori unknown safety constraints. In this paper, we aim to efficiently learn the density to approximately solve the coverage problem while preserving the agents' safety. We first propose a conditionally linear submodular coverage function that facilitates theoretical analysis. Utilizing this structure, we develop MACOPT, a novel algorithm that efficiently trades off the exploration-exploitation dilemma due to partial observability, and show that it achieves sublinear regret. Next, we extend results on single-agent safe exploration to our multi-agent setting and propose SAFEMAC for safe coverage and exploration. We analyze SAFEMAC and give first-of-its-kind results: near-optimal coverage in finite time while provably guaranteeing safety. We extensively evaluate our algorithms on synthetic and real problems, including a biodiversity monitoring task under safety constraints, where SAFEMAC outperforms competing methods.
1 Introduction

In multi-agent coverage control (MAC) problems, multiple agents coordinate to maximize coverage over some spatially distributed events. Their applications abound, from collaborative mapping [1], environmental monitoring [2], and inspection robotics [3] to sensor networks [4]. In addition, the coverage formulation can address core challenges in cooperative multi-agent RL [5, 6], e.g., exploration [7], by providing high-level goals. In these applications, agents often encounter safety constraints that may lead to critical accidents when ignored, e.g., obstacles [8] or extreme weather conditions [9, 10].
Deploying coverage control solutions in the real world presents many challenges: (i) for a given density of relevant events, this is an NP-hard problem [11]; (ii) such density is rarely known in practice [2] and must be learned from data, which presents a complex active learning problem as the quantity we measure (the density) differs from the one we want to optimize (its coverage); (iii) agents often operate under safety-critical conditions [8-10] that may be unknown a priori. This requires cautious exploration of the environment to prevent catastrophic outcomes. While prior work addresses subsets of these challenges (see Section 7), we are not aware of methods that address them jointly.
Joint supervision. Code available at https://github.com/manish-pra/SafeMaC
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: (a) Safe initialization; (b) Coverage maximization; (c) Safe coverage maximization. The three drones aim to maximize the gorilla nests' coverage (grey shaded circle) while avoiding unsafe extreme weather zones (red cross pattern). The contours (yellow is high and purple is low) represent the density of gorilla nests (blue dots). The density and the constraint are a priori unknown. a) To be safe, drones apply a conservative strategy and do not explore, which results in poor coverage. In b), the drones maximize coverage but get destroyed in extreme weather. c) shows the SAFEMAC solution. The drones strike a balance, trading off between learning the density and the constraints, and thus achieve near-optimal coverage while always being safe.
This work makes the following contributions toward efficiently solving safe coverage control with a priori unknown objectives and constraints. Firstly, we model this multi-agent learning task as a conditionally linear coverage function. We use the monotonicity and the submodularity of this function to propose MACOPT, a new algorithm for the unconstrained setting that enjoys sublinear cumulative regret and efficiently recommends a near-optimal solution. Secondly, we extend GOOSE [12], an algorithm for single-agent safe exploration, to the multi-agent case. Combining our extension of GOOSE with MACOPT, we propose SAFEMAC, a novel algorithm for safe multi-agent coverage control. We analyze it and show it attains a near-optimal solution in finite time. Finally, we demonstrate our algorithms on a synthetic and two real-world applications: safe biodiversity monitoring and obstacle avoidance. We show SAFEMAC finds better solutions than algorithms that do not actively explore the feasible region and is more sample efficient than competing near-optimal safe algorithms.
2 Problem Statement
We present the safety-constrained multi-agent coverage control problem (Fig. 1) that we aim to solve.
Coverage control. Coverage control models situations where we want to deploy a swarm of dynamic agents to maximize the coverage of a quantity of interest, see Fig. 1. Formally, given a finite¹ set of possible locations $V$, the goal of coverage control is to maximize a function $F: 2^V \to \mathbb{R}$ that assigns to each subset $X \subseteq V$ the corresponding coverage value. For $N$ agents, the resulting problem is $\arg\max_{X:|X| \le N} F(X)$. The discrete domain $V$ can be represented by a graph, where nodes represent locations in the domain, and an edge connects node $v$ to $v'$ if the agent can go from $v$ to $v'$. This corresponds to a deterministic MDP where locations are states and edges represent transitions.

¹Continuous domains can be handled via discretization.
Sensing region. Depending on the application, we may use different definitions of $F$. Here, we model cases where agent $i$ at location $x^i$ covers a limited sensing region around it, $D^i$. While $D^i$ can be any connected subset of $V$, in practice it is often a ball centered at $x^i$. Given a function $\rho: V \to \mathbb{R}$ denoting the density of a quantity of interest at each $v \in V$, our coverage objective is
$$ F(X; \rho, V) = \sum_{x^i \in X} \; \sum_{v \in \bar{D}^i} \rho(v) / |V|, \qquad (1) $$
where $\bar{D}^i := D^i \setminus D^{1:i-1}$ indicates the elements in $V$ covered by agent $i$ but not by agents $1{:}i{-}1$, $D^{1:i-1} = \bigcup_{j=1}^{i-1} D^j$, and $|V|$ denotes the cardinality of the domain $V$.
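To make Eq. (1) concrete, below is a minimal sketch that evaluates the coverage of a set of agent locations on a discretized grid domain. The grid layout, the Euclidean disc, and the helper name coverage_value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def coverage_value(agent_locs, density, disc_radius):
    """Evaluate Eq. (1): sum the density over the union of sensing discs,
    counting each covered cell exactly once, normalized by |V|."""
    rows, cols = density.shape
    covered = set()                       # D^{1:i-1}: cells claimed by earlier agents
    total = 0.0
    for (r, c) in agent_locs:             # agents processed in order 1..N
        disc = {(rr, cc)
                for rr in range(rows) for cc in range(cols)
                if (rr - r) ** 2 + (cc - c) ** 2 <= disc_radius ** 2}
        new_cells = disc - covered        # \bar{D}^i = D^i \ D^{1:i-1}
        total += sum(density[v] for v in new_cells)
        covered |= new_cells
    return total / density.size           # divide by |V|

# toy usage: two agents on a 10 x 10 grid with a random non-negative density
rng = np.random.default_rng(0)
rho = rng.random((10, 10))
print(coverage_value([(2, 2), (7, 6)], rho, disc_radius=2))
```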
Safety. In many real-world problems, agents cannot go to arbitrary locations due to safety concerns. To model this, we introduce a constraint function $q: V \to \mathbb{R}$ and we consider safe all locations $v$ satisfying $q(v) \ge 0$. Such a constraint restricts the space of possible solutions of our problem in two ways. First, it prevents agents from monitoring from unsafe locations. Second, depending on its dynamics, agent $i$ may be unable to safely reach a disconnected safe area starting from $x^i_0$, see Appendix A.3. We denote with $\bar{R}_q(\{x^i_0\})$ the largest safely reachable region starting from $x^i_0$ and with $\mathcal{B}$ a collection of batches of agents such that all agents in the same batch $B$ share the same safely reachable set, $\forall i, j \in B: \bar{R}_q(\{x^i_0\}) \cap \bar{R}_q(\{x^j_0\}) \neq \emptyset$, see Appendix A for formal definitions. Based on this, we define the safely reachable control problem
$$ \sum_{B \in \mathcal{B}} \; \max_{X^B \subseteq \bar{R}_q(X^B_0)} F\big(X^B; \rho, \bar{R}_q(X^B_0)\big), \qquad (2) $$
where $X^B_0 = \{x^i_0\}_{i \in B}$ are the starting locations of all agents in $B$ and $\bar{R}_q(X^B_0) = \bigcup_{i \in B} \bar{R}_q(\{x^i_0\})$ indicates the largest safely reachable region from any point $x^i_0$ for all $i$ in $B$ (since the agents have the same dynamics, $\bar{R}_q(X^B_0) = \bar{R}_q(\{x^i_0\}), \forall i \in B$). In safety-critical monitoring, there may be unreachable safe regions. However, since agents should be able to collect measurements if required, we focus only on covering the safely reachable region.
Unknown density and constraint. In practice, the density $\rho$ and the constraint $q$ are often unknown a priori. However, the agents can iteratively obtain noisy measurements of their values at target locations. We consider synchronous measurements, i.e., we wait until all agents have collected the desired measurement for the current iteration before moving to the next one. Here, we focus on the high-level problem of choosing informative locations, rather than the design of low-level motion planning². Therefore, our goal is to find an approximate solution to the problem in Eq. (2) while preserving safety throughout exploration, i.e., at every location visited by the agents, and while taking as few measurements as possible, in the case where the dynamics of the agents are deterministic and known as in [12].

²Agents can use their transition graph to find a path between two goals. In a continuous domain, the path can be tracked with a controller (e.g., MPC).
3 Background
This section presents foundational ideas that our method builds on. In particular, it discusses (i)
monotone submodular functions and (ii) previous work on single-agent safe exploration.
Submodularity. Optimizing a function defined over the power set of a finite domain, $V$, scales combinatorially with the size of $V$ in general. In special cases, we can exploit the structure of the objective to find approximate solutions efficiently. Monotone submodular functions are one example of this. A set function $F: 2^V \to \mathbb{R}$ is monotone if for all $A \subseteq B \subseteq V$ we have $F(A) \le F(B)$. It is submodular if $\forall A \subseteq B \subseteq V, v \in V \setminus B$, we have $F(A \cup \{v\}) - F(A) \ge F(B \cup \{v\}) - F(B)$. In coverage control, this means that adding $v$ to $A$ yields at least as much increase in coverage as adding $v$ to $B$, if $A \subseteq B$. Crucially, [13] guarantees that the greedy algorithm produces a solution within a factor of $(1 - 1/e)$ of the optimal solution for problems of the type $\arg\max_{X:|X| \le N} F(X; \rho, V)$, when $F$ is monotone and submodular. In practice, the greedy algorithm often outperforms this worst-case guarantee [14], and guaranteeing a solution better than the $(1 - 1/e)$ factor is NP-hard [15]. The coverage function in Eq. (1) is a conditionally linear, monotone and submodular function (proof in Appendix B), which lets us use the results above to design our algorithm for safe coverage control.
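To illustrate how this structure is exploited algorithmically, here is a sketch of the standard greedy procedure for $\arg\max_{X:|X| \le N} F(X; \rho, V)$, which picks the location with the largest marginal coverage gain at each step. It reuses the hypothetical coverage_value helper from the sketch after Eq. (1) and is not the paper's Algorithm 1.

```python
def greedy_coverage(candidate_locs, density, disc_radius, n_agents):
    """Greedy maximization of the coverage objective under a cardinality
    constraint; for monotone submodular F the result is within a factor
    (1 - 1/e) of the optimum."""
    chosen = []
    for _ in range(n_agents):
        best_loc, best_val = None, float("-inf")
        for loc in candidate_locs:
            if loc in chosen:
                continue
            val = coverage_value(chosen + [loc], density, disc_radius)
            if val > best_val:
                best_loc, best_val = loc, val
        chosen.append(best_loc)
    return chosen

# usage: greedily place 3 agents over all cells of the toy grid from above
candidates = [(r, c) for r in range(10) for c in range(10)]
print(greedy_coverage(candidates, rho, disc_radius=2, n_agents=3))
```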
Goal-oriented safe exploration. GOOSE [12] is a single-agent safe exploration algorithm that extends unconstrained methods to safety-critical cases. Concretely, it maintains under- and over-approximations of the feasible set, called pessimistic and optimistic safe sets. It preserves safety by restricting the agent to the pessimistic safe set. It efficiently explores the objective by letting the original unconstrained algorithm recommend locations within the optimistic safe set. If such recommendations are provably safe, the agent evaluates the objective there. Otherwise, it evaluates the constraint at a sequence of safe locations to prove that such a recommendation is either safe, which allows it to evaluate the objective, or unsafe, which triggers the unconstrained algorithm to provide a new recommendation.
Assumptions. To guarantee safety, GOOSE makes two main assumptions. First, it assumes there is an initial set of safe locations, $X_0$, from where the agent can start exploring. Second, it assumes the constraint is sufficiently well-behaved, so that we can use data to infer the safety of unvisited locations. Formally, it assumes the domain $V$ is endowed with a positive definite kernel $k^q(\cdot, \cdot)$, and that the constraint's norm in the associated Reproducing Kernel Hilbert Space [16] is bounded, $\|q\|_{k^q} \le B_q$. This lets us use Gaussian Processes (GPs) [17] to construct high-probability confidence intervals for $q$. We specify the GP prior over $q$ through a mean function, which we assume to be zero everywhere w.l.o.g., $\mu(v) = 0, \forall v \in V$, and a kernel function, $k$, that captures the covariance between different locations. If we have access to $T$ measurements at $V_T = \{v_t\}_{t=1}^T$ perturbed by i.i.d. Gaussian noise, $y_T = \{q(v_t) + \eta_t\}_{t=1}^T$ with $\eta_t \sim \mathcal{N}(0, \sigma^2)$, we can compute the posterior mean and covariance over the constraint at unseen locations $v$, $v'$ as $\mu_T(v) = k_T^\top(v)(K_T + \sigma^2 I)^{-1} y_T$ and $k_T(v, v') = k(v, v') - k_T^\top(v)(K_T + \sigma^2 I)^{-1} k_T(v')$, where $k_T(v) = [k(v_1, v), \ldots, k(v_T, v)]^\top$, $K_T$ is the positive definite kernel matrix $[k(v, v')]_{v, v' \in V_T}$ and $I \in \mathbb{R}^{T \times T}$ denotes the identity matrix. In this work, we make the same assumptions about the safe seed and the regularity of $q$ and $\rho$.
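The posterior expressions above reduce to a few lines of linear algebra. The sketch below uses a squared-exponential kernel with fixed hyperparameters purely for illustration; the paper's kernel choices and hyperparameters may differ.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """k(v, v') = exp(-||v - v'||^2 / (2 * lengthscale^2)) for rows of A, B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_posterior(V_T, y_T, V_star, sigma=0.1):
    """Posterior mean and covariance of the constraint q at test locations
    V_star, given T noisy measurements y_T collected at V_T."""
    K_T = rbf_kernel(V_T, V_T)                    # [k(v, v')]_{v, v' in V_T}
    k_star = rbf_kernel(V_T, V_star)              # columns are k_T(v) per test v
    G = K_T + sigma ** 2 * np.eye(len(V_T))       # K_T + sigma^2 I
    mean = k_star.T @ np.linalg.solve(G, y_T)     # mu_T(v)
    cov = rbf_kernel(V_star, V_star) - k_star.T @ np.linalg.solve(G, k_star)
    return mean, cov

# toy usage: 5 noisy constraint measurements in a 2-D domain, 3 test points
rng = np.random.default_rng(1)
V_T = rng.uniform(0.0, 1.0, size=(5, 2))
y_T = np.sin(V_T.sum(axis=1)) + 0.1 * rng.standard_normal(5)
mu, Sigma = gp_posterior(V_T, y_T, V_star=rng.uniform(0.0, 1.0, size=(3, 2)))
print(mu, np.diag(Sigma))
```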
Approximations of the feasible set. Based on the GP posterior above, GOOSE builds monotonic confidence intervals for the constraint at each iteration $t$ as $l^q_t(v) := \max\{l^q_{t-1}(v), \mu^q_{t-1}(v) - \beta^q_t \sigma^q_{t-1}(v)\}$ and $u^q_t(v) := \min\{u^q_{t-1}(v), \mu^q_{t-1}(v) + \beta^q_t \sigma^q_{t-1}(v)\}$, which contain the true constraint function for every $v \in V$ and $t \ge 1$ with high probability if $\beta^q_t$ is selected as in [18] or Section 5. GOOSE uses these confidence intervals within a set $S \subseteq V$, together with the $L_q$-Lipschitz continuity of $q$, to define operators that determine which locations are safe in plausible worst- and best-case scenarios,
$$ p_t(S) = \{ v \in V \mid \exists z \in S : l^q_t(z) - L_q d(v, z) \ge 0 \}, \qquad (3) $$
$$ o^{\epsilon_q}_t(S) = \{ v \in V \mid \exists z \in S : u^q_t(z) - \epsilon_q - L_q d(v, z) \ge 0 \}. \qquad (4) $$
Notice that the pessimistic operator relies on the lower bound, $l^q$, while the optimistic one relies on the upper bound, $u^q$. Moreover, the optimistic one uses a margin $\epsilon_q$ to exclude "barely" safe locations, as the agent might get stuck learning about them. Finally, to disregard locations the agent could not safely reach or from where it could not safely return, GOOSE introduces the $R^{\mathrm{ergodic}}(\cdot, \cdot)$ operator. $R^{\mathrm{ergodic}}(p_t(S), S)$ indicates locations in $S$, or locations in $p_t(S)$ reachable from $S$ and from where the agent can return to $S$ along a path contained in $p_t(S)$. Combining $p_t(S)$ and $R^{\mathrm{ergodic}}(\cdot, \cdot)$, GOOSE defines the pessimistic and ergodic operator $\tilde{P}_t(\cdot)$, which it uses to update the pessimistic safe set. Similarly, it defines $\tilde{O}_t(\cdot)$ using $o^{\epsilon_q}_t(\cdot)$ to compute the optimistic safe set.
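For a discrete domain, the operators in Eqs. (3)-(4) are set comprehensions over the lower and upper confidence bounds. The sketch below assumes a 1-D chain of locations, a known Lipschitz constant, and placeholder confidence bounds; it is not GOOSE's implementation.

```python
import numpy as np

def pessimistic_operator(S, locs, l_q, L_q, dist):
    """Eq. (3): v is pessimistically safe if some z in S certifies
    l_t^q(z) - L_q * d(v, z) >= 0."""
    return {v for v in range(len(locs))
            if any(l_q[z] - L_q * dist(locs[v], locs[z]) >= 0 for z in S)}

def optimistic_operator(S, locs, u_q, L_q, dist, eps_q):
    """Eq. (4): the same test with the upper bound u_t^q and a margin eps_q."""
    return {v for v in range(len(locs))
            if any(u_q[z] - eps_q - L_q * dist(locs[v], locs[z]) >= 0 for z in S)}

# toy usage: a 1-D chain of 20 locations with one measured safe seed
locs = np.linspace(0.0, 1.0, 20)
dist = lambda a, b: abs(a - b)
l_q = np.full(20, -1.0)       # uninformative lower bound everywhere
u_q = np.full(20, 1.0)        # uninformative upper bound everywhere
l_q[0], u_q[0] = 0.3, 0.8     # constraint fairly well known at the seed
S0 = {0}
print(sorted(pessimistic_operator(S0, locs, l_q, L_q=1.0, dist=dist)))
print(sorted(optimistic_operator(S0, locs, u_q, L_q=1.0, dist=dist, eps_q=0.1)))
```

As expected, the pessimistic set is contained in the optimistic one; the reachability filter $R^{\mathrm{ergodic}}$ would further prune both.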
4 MACOPT and SAFEMAC
This section presents MACOPT and SAFEMAC, our algorithms for unconstrained and safety-constrained
multi-agent coverage control, which we then formally analyze in Section 5.
4.1 MACOPT: unconstrained multi-agent coverage control

Greedy sensing regions. In sequential optimization, it is crucial to balance exploration and exploitation. GP-UCB [19] is a theoretically sound strategy to strike such a trade-off that works well in practice. Agents evaluate the objective at locations that maximize an upper confidence bound over the objective given by the GP model, such that locations with either a high posterior mean (exploitation) or standard deviation (exploration) are visited. We construct a valid upper confidence bound for the coverage $F(X)$ starting from our confidence intervals on $\rho$, by replacing the true density $\rho$ with its upper bound $u^\rho_t$ in Eq. (1). Next, we apply the greedy algorithm to this upper bound (Line 3 of Algorithm 1) to select $N$ candidate locations for evaluating the density. However, this simple exploration strategy may perform poorly, due to the fact that in order to reduce the uncertainty over the coverage $F$ at $X$, we must learn the density $\rho$ at all locations inside the sensing region, $\bigcup_{i=1}^N D^i$, rather than simply at $X$. It is a form of partial monitoring [20], where the objective $F$ differs from the quantity we measure, i.e., the density $\rho$. Next, we explain how to choose locations where to observe the density for a given $X$.
Uncertainty sampling. Given location assignments $X$ for the agents, we measure the density to efficiently learn the function $F(X)$. Intuitively, agent $i$ observes the density where it is most uncertain within the area it covers that is not covered by agents $\{1, \ldots, i-1\}$, i.e., $\bar{D}^i_t$ (Line 4 of Alg. 2, Fig. 2a).
Stopping criterion. The algorithm terminates when a near-optimal solution is achieved. Intuitively, this occurs when the uncertainty about the coverage value of the greedy recommendation is low. Formally, we require the sum of the uncertainties over the sampling targets to be below a threshold, i.e., $w_t = \sum_{i=1}^N u^\rho_{t-1}(x^{g,i}_t) - l^\rho_{t-1}(x^{g,i}_t) \le \epsilon_\rho$ (Line 3 of Algorithm 2). Importantly, this stopping criterion requires the confidence intervals to shrink only at regions that potentially maximize the coverage.
MACOPT. Now, we introduce MACOPT in Algorithm 2. At round $t$, we select the sensing locations for the agents, $X_t$, by greedily optimizing the upper confidence bound of the coverage. Then, each agent $i$ collects noisy density measurements at the points of highest uncertainty within $\bar{D}^i_t$. Finally, we update our GP over the density and, if the sum of maximum uncertainties within each sensing region is small, we stop the algorithm.

Figure 2: (a) Uncertainty sampling in MACOPT; (b) Illustration of multi-agent GOOSE. a) The contours represent the density uncertainty, and the red ×s correspond to the maximum coverage locations evaluated by the GREEDY Algorithm 1. While these locations maximize coverage, they may not be informative about the coverage since the uncertainty there can be low. Therefore, the agents collect measurements at the locations of maximum density uncertainty in their discs (green ×s, $x^{g,i}_t$), also known as uncertainty sampling. b) In a constrained environment, SAFEMAC evaluates $x^{g,i}_t$ for all agents in the optimistic set $S^{o,q}_t$ (violet) and sets it as the next goal. It forms an expander region (dark blue) to safely expand the pessimistic safe set $S^p_t$ (green) toward the goal.
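Putting the pieces together, the following is a rough sketch of a single MACOPT-style round: greedy disc selection on the density's upper confidence bound, uncertainty sampling inside each agent's uncovered disc, and the stopping test on the summed interval widths. It reuses the hypothetical greedy_coverage helper sketched in Section 3 and compresses Algorithm 2 considerably.

```python
def macopt_round(candidates, u_rho, l_rho, disc_radius, n_agents, eps_rho):
    """One unconstrained MACOPT-style round on a grid domain."""
    # 1. Greedy discs on the optimistic density u_rho (GREEDY on the UCB)
    X_t = greedy_coverage(candidates, u_rho, disc_radius, n_agents)
    # 2. Uncertainty sampling: each agent targets the most uncertain cell in
    #    the part of its disc not covered by earlier agents
    width = u_rho - l_rho
    rows, cols = u_rho.shape
    covered, goals = set(), []
    for (r, c) in X_t:
        disc = {(rr, cc) for rr in range(rows) for cc in range(cols)
                if (rr - r) ** 2 + (cc - c) ** 2 <= disc_radius ** 2}
        own = disc - covered
        goals.append(max(own, key=lambda v: width[v]) if own else (r, c))
        covered |= disc
    # 3. Stop once the summed uncertainty at the sampling targets is below
    #    the threshold, i.e. w_t <= eps_rho
    w_t = sum(width[g] for g in goals)
    return X_t, goals, w_t <= eps_rho
```

The measurements collected at the returned goals would then update the GP over $\rho$ (e.g., with a helper like gp_posterior above) before the next round.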
4.2 SAFEMAC: safety-constrained multi-agent coverage control

Intuition. We adopt a perspective similar to GOOSE as we separate the exploration of the safe set from the maximization of the coverage. Given an over- and under-approximation of the safe set (whose computation is discussed later), we want to explore optimistically optimal goals for each agent, similar to MACOPT. To this end, we find the maximizers of the density upper bound in the optimistic safe set with the GREEDY algorithm. Then, we define sampling goals to learn the coverage at those locations.
Phases of SAFEMAC. Coverage values depend both on the density and on the feasible region (Eq. (2)). Thus, there are two sensible sampling goals given a disk assignment: i) optimistic coverage: if we are uncertain about the density within the disks, we target locations with the highest density uncertainty (Line 6 of Algorithm 4); ii) optimistic exploration: if we know the density within the disks but there are locations under them that we cannot classify as either safe (in $S^p$) or unsafe (in $V \setminus S^{o,q}$), we target those with the highest constraint uncertainty among them (Line 8). If all the goal locations are safe with high probability, which can only happen during optimistic coverage, we safely evaluate the density there (Line 19). Otherwise, we explore the constraint with a goal-directed strategy that aims at classifying them as either safe or unsafe, similar to GOOSE (Lines 9-12). In case this changes the topological connectivity of the optimistic feasible set, we recompute the disks, as this may change GREEDY's output (Lines 15-17). We repeat this loop until we know the feasibility of all the points under the disks recommended by GREEDY and their density uncertainty is low (Line 4). Next, we explain how the multiple agents coordinate their individual safe regions to evaluate a goal (MACOPT in batches), how the agents progress toward their goals (safe expansion), and finally we describe SAFEMAC's convergence.
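The two phases can be summarized as a small piece of per-agent decision logic. The sketch below uses hypothetical names and collapses several steps of Algorithm 4 into a single branch; it is only meant to mirror the prose above, not to reproduce the algorithm.

```python
def choose_goal(disc_cells, width_rho, width_q, pess_safe, opt_safe, eps_rho):
    """Pick a sampling goal for one agent's recommended disc.

    disc_cells : cells inside the disc returned by GREEDY for this agent
    width_rho  : density confidence-interval width per cell
    width_q    : constraint confidence-interval width per cell
    pess_safe  : provably safe cells (S^p); opt_safe: optimistically safe (S^{o,q})
    """
    # (i) optimistic coverage: density under the disc is still uncertain
    if max(width_rho[v] for v in disc_cells) > eps_rho:
        return max(disc_cells, key=lambda v: width_rho[v]), "coverage"
    # (ii) optimistic exploration: some cells are neither provably safe nor
    #      provably unsafe, so reduce the constraint uncertainty there
    unresolved = [v for v in disc_cells if v not in pess_safe and v in opt_safe]
    if unresolved:
        return max(unresolved, key=lambda v: width_q[v]), "exploration"
    return None, "done"   # density known and safety of every cell settled
```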
MACOPT in batches. In the multi-agent setting of GOOSE (see Fig. 2b), each agent $i$ maintains $S^{p,i}_t$, a pessimistic (or $S^{o,q,i}_t$, an optimistic) belief of the safe locations, obtained by iteratively applying $\tilde{P}_t(\cdot)$, the pessimistic (or $\tilde{O}_t(\cdot)$, the optimistic) ergodic operator (see Section 3), to the previous pessimistic belief $S^{p,i}_{t-1}$ (Line 11 of Algorithm 4). Since the agents cannot navigate to an arbitrary location in the constrained case, SAFEMAC computes coverage maximizers on a restricted region, obtained by ignoring the known unsafe locations. To denote such a restricted region, we define a union set $S^{u,i}_t := S^{o,q,i}_t \cup S^{p,i}_t$, which is the largest set known to be optimistically or pessimistically safe up to time $t$. Moreover, if the agents are topologically disconnected, they cannot travel from one safe region to another and the best strategy for any batch of agents is to maximize coverage locally. For this, we form a collection of batches $\mathcal{B}_t$, such that any batch $B \in \mathcal{B}_t$ contains agents that lie in topologically connected regions determined by the union set (Lines 13-14). SAFEMAC computes a GREEDY solution for each $B \in \mathcal{B}_t$ in their corresponding $S^{u,B}_t := \bigcup_{i \in B} S^{u,i}_t$. This is the largest set where the agents can find an
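Forming the collection of batches $\mathcal{B}_t$ amounts to a connected-components computation over the union set. A rough sketch follows; for simplicity it treats the per-agent union sets as a single shared set of safe cells, an assumption made only for illustration.

```python
def form_batches(agent_cells, union_safe, neighbors):
    """Group agents whose current cells lie in the same connected component
    of the union set (cells known to be optimistically or pessimistically safe).

    agent_cells : current cell of each agent, indexed 0..N-1
    union_safe  : set of cells in S^u
    neighbors   : function mapping a cell to its adjacent cells in the graph V
    """
    component, label = {}, 0
    for start in union_safe:                 # label components via BFS
        if start in component:
            continue
        component[start] = label
        queue = [start]
        while queue:
            cell = queue.pop()
            for nxt in neighbors(cell):
                if nxt in union_safe and nxt not in component:
                    component[nxt] = label
                    queue.append(nxt)
        label += 1
    batches = {}                             # agents in one component -> one batch
    for i, cell in enumerate(agent_cells):
        batches.setdefault(component[cell], []).append(i)
    return list(batches.values())
```

GREEDY would then be run separately inside each batch's own union set $S^{u,B}_t$.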