Near-Optimal Multi-Agent Learning for
Safe Coverage Control
Manish Prajapat
ETH Zurich
manishp@ai.ethz.ch
Matteo Turchetta
ETH Zurich
matteotu@inf.ethz.ch
Melanie N. Zeilinger
ETH Zurich
mzeilinger@ethz.ch
Andreas Krause
ETH Zurich
krausea@ethz.ch
Abstract

In multi-agent coverage control problems, agents navigate their environment to reach locations that maximize the coverage of some density. In practice, the density is rarely known a priori, further complicating the original NP-hard problem. Moreover, in many applications, agents cannot visit arbitrary locations due to a priori unknown safety constraints. In this paper, we aim to efficiently learn the density to approximately solve the coverage problem while preserving the agents' safety. We first propose a conditionally linear submodular coverage function that facilitates theoretical analysis. Utilizing this structure, we develop MACOPT, a novel algorithm that efficiently trades off the exploration-exploitation dilemma due to partial observability, and show that it achieves sublinear regret. Next, we extend results on single-agent safe exploration to our multi-agent setting and propose SAFEMAC for safe coverage and exploration. We analyze SAFEMAC and give first-of-its-kind results: near-optimal coverage in finite time while provably guaranteeing safety. We extensively evaluate our algorithms on synthetic and real problems, including a biodiversity monitoring task under safety constraints, where SAFEMAC outperforms competing methods.
1 Introduction

In multi-agent coverage control (MAC) problems, multiple agents coordinate to maximize coverage over some spatially distributed events. Their applications abound, from collaborative mapping [1], environmental monitoring [2], and inspection robotics [3] to sensor networks [4]. In addition, the coverage formulation can address core challenges in cooperative multi-agent RL [5, 6], e.g., exploration [7], by providing high-level goals. In these applications, agents often encounter safety constraints that may lead to critical accidents when ignored, e.g., obstacles [8] or extreme weather conditions [9, 10].
Deploying coverage control solutions in the real world presents many challenges: (i) for a given density of relevant events, this is an NP-hard problem [11]; (ii) such density is rarely known in practice [2] and must be learned from data, which presents a complex active learning problem as the quantity we measure (the density) differs from the one we want to optimize (its coverage); (iii) agents often operate under safety-critical conditions [8-10] that may be unknown a priori. This requires cautious exploration of the environment to prevent catastrophic outcomes. While prior work addresses subsets of these challenges (see Section 7), we are not aware of methods that address them jointly.
Joint supervision. Code available at https://github.com/manish-pra/SafeMaC
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: (a) Safe initialization; (b) Coverage maximization; (c) Safe coverage maximization. The three drones aim to maximize the gorilla nests' coverage (grey shaded circle) while avoiding unsafe extreme weather zones (red cross pattern). The contours (yellow is high and purple is low) represent the density of gorilla nests (blue dots). The density and the constraint are a priori unknown. a) To be safe, drones apply a conservative strategy and do not explore, which results in poor coverage. In b), the drones maximize coverage but get destroyed in extreme weather. c) shows the SAFEMAC solution. The drones strike a balance, trading off between learning the density and the constraints, and thus achieve near-optimal coverage while always being safe.
This work makes the following contributions toward efficiently solving safe coverage control with a priori unknown objectives and constraints. Firstly, we model this multi-agent learning task as a conditionally linear coverage function. We use the monotonicity and the submodularity of this function to propose MACOPT, a new algorithm for the unconstrained setting that enjoys sublinear cumulative regret and efficiently recommends a near-optimal solution. Secondly, we extend GOOSE [12], an algorithm for single-agent safe exploration, to the multi-agent case. Combining our extension of GOOSE with MACOPT, we propose SAFEMAC, a novel algorithm for safe multi-agent coverage control. We analyze it and show it attains a near-optimal solution in finite time. Finally, we demonstrate our algorithms on a synthetic and two real-world applications: safe biodiversity monitoring and obstacle avoidance. We show SAFEMAC finds better solutions than algorithms that do not actively explore the feasible region and is more sample efficient than competing near-optimal safe algorithms.
2 Problem Statement
We present the safety-constrained multi-agent coverage control problem (Fig. 1) that we aim to solve.
Coverage control. Coverage control models situations where we want to deploy a swarm of dynamic agents to maximize the coverage of a quantity of interest, see Fig. 1. Formally, given a finite¹ set of possible locations $V$, the goal of coverage control is to maximize a function $F: 2^V \to \mathbb{R}$ that assigns to each subset $X \subseteq V$ the corresponding coverage value. For $N$ agents, the resulting problem is $\arg\max_{X:|X| \le N} F(X)$. The discrete domain $V$ can be represented by a graph, where nodes represent locations in the domain, and an edge connects node $v$ to $v'$ if the agent can go from $v$ to $v'$. This corresponds to a deterministic MDP where locations are states and edges represent transitions.

¹Continuous domains can be handled via discretization.
Sensing region. Depending on the application, we may use different definitions of $F$. Here, we model cases where agent $i$ at location $x^i$ covers a limited sensing region around it, $D^i$. While $D^i$ can be any connected subset of $V$, in practice it is often a ball centered at $x^i$. Given a function $\rho: V \to \mathbb{R}$ denoting the density of a quantity of interest at each $v \in V$, our coverage objective is
$$ F(X; \rho, V) = \sum_{x^i \in X} \; \sum_{v \in \bar{D}^i} \rho(v) / |V|, \qquad (1) $$
where $\bar{D}^i := D^i \setminus D^{1:i-1}$ indicates the elements in $V$ covered by agent $i$ but not by agents $1{:}i{-}1$, $D^{1:i-1} = \bigcup_{j=1}^{i-1} D^j$, and $|V|$ denotes the cardinality of the domain $V$.
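To make Eq. (1) concrete, below is a minimal sketch that evaluates the coverage of a set of agent locations on a discretized grid domain. The grid layout, the Euclidean disc, and the helper name coverage_value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def coverage_value(agent_locs, density, disc_radius):
    """Evaluate Eq. (1): sum the density over the union of sensing discs,
    counting each covered cell exactly once, normalized by |V|."""
    rows, cols = density.shape
    covered = set()                       # D^{1:i-1}: cells claimed by earlier agents
    total = 0.0
    for (r, c) in agent_locs:             # agents processed in order 1..N
        disc = {(rr, cc)
                for rr in range(rows) for cc in range(cols)
                if (rr - r) ** 2 + (cc - c) ** 2 <= disc_radius ** 2}
        new_cells = disc - covered        # \bar{D}^i = D^i \ D^{1:i-1}
        total += sum(density[v] for v in new_cells)
        covered |= new_cells
    return total / density.size           # divide by |V|

# toy usage: two agents on a 10 x 10 grid with a random non-negative density
rng = np.random.default_rng(0)
rho = rng.random((10, 10))
print(coverage_value([(2, 2), (7, 6)], rho, disc_radius=2))
```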
Safety. In many real-world problems, agents cannot go to arbitrary locations due to safety concerns. To model this, we introduce a constraint function $q: V \to \mathbb{R}$ and we consider safe all locations $v$ satisfying $q(v) \ge 0$. Such a constraint restricts the space of possible solutions of our problem in two ways. First, it prevents agents from monitoring from unsafe locations. Second, depending on its dynamics, agent $i$ may be unable to safely reach a disconnected safe area starting from $x^i_0$, see Appendix A.3. We denote with $\bar{R}_q(\{x^i_0\})$ the largest safely reachable region starting from $x^i_0$ and with $\mathcal{B}$ a collection of batches of agents such that all agents in the same batch $B$ share the same safely reachable set, $\forall i, j \in B: \bar{R}_q(\{x^i_0\}) \cap \bar{R}_q(\{x^j_0\}) \neq \emptyset$, see Appendix A for formal definitions. Based on this, we define the safely reachable control problem
$$ \sum_{B \in \mathcal{B}} \; \max_{X^B \subseteq \bar{R}_q(X^B_0)} F\big(X^B; \rho, \bar{R}_q(X^B_0)\big), \qquad (2) $$
where $X^B_0 = \{x^i_0\}_{i \in B}$ are the starting locations of all agents in $B$ and $\bar{R}_q(X^B_0) = \bigcup_{i \in B} \bar{R}_q(\{x^i_0\})$ indicates the largest safely reachable region from any point $x^i_0$ for all $i$ in $B$ (since the agents have the same dynamics, $\bar{R}_q(X^B_0) = \bar{R}_q(\{x^i_0\}), \forall i \in B$). In safety-critical monitoring, there may be unreachable safe regions. However, since agents should be able to collect measurements if required, we focus only on covering the safely reachable region.
Unknown density and constraint. In practice, the density $\rho$ and the constraint $q$ are often unknown a priori. However, the agents can iteratively obtain noisy measurements of their values at target locations. We consider synchronous measurements, i.e., we wait until all agents have collected the desired measurement for the current iteration before moving to the next one. Here, we focus on the high-level problem of choosing informative locations, rather than the design of low-level motion planning². Therefore, our goal is to find an approximate solution to the problem in Eq. (2) while preserving safety throughout exploration, i.e., at every location visited by the agents, and while taking as few measurements as possible, in the case where the dynamics of the agents are deterministic and known as in [12].

²Agents can use their transition graph to find a path between two goals. In a continuous domain, the path can be tracked with a controller (e.g., MPC).
3 Background
This section presents foundational ideas that our method builds on. In particular, it discusses (i)
monotone submodular functions and (ii) previous work on single-agent safe exploration.
Submodularity. Optimizing a function defined over the power set of a finite domain, $V$, scales combinatorially with the size of $V$ in general. In special cases, we can exploit the structure of the objective to find approximate solutions efficiently. Monotone submodular functions are one example of this. A set function $F: 2^V \to \mathbb{R}$ is monotone if for all $A \subseteq B \subseteq V$ we have $F(A) \le F(B)$. It is submodular if $\forall A \subseteq B \subseteq V, v \in V \setminus B$, we have $F(A \cup \{v\}) - F(A) \ge F(B \cup \{v\}) - F(B)$. In coverage control, this means that adding $v$ to $A$ yields at least as much increase in coverage as adding $v$ to $B$, if $A \subseteq B$. Crucially, [13] guarantees that the greedy algorithm produces a solution within a factor of $(1 - 1/e)$ of the optimal solution for problems of the type $\arg\max_{X:|X| \le N} F(X; \rho, V)$, when $F$ is monotone and submodular. In practice, the greedy algorithm often outperforms this worst-case guarantee [14], and guaranteeing a solution better than the $(1 - 1/e)$ factor is NP-hard [15]. The coverage function in Eq. (1) is a conditionally linear, monotone and submodular function (proof in Appendix B), which lets us use the results above to design our algorithm for safe coverage control.
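To illustrate how this structure is exploited algorithmically, here is a sketch of the standard greedy procedure for $\arg\max_{X:|X| \le N} F(X; \rho, V)$, which picks the location with the largest marginal coverage gain at each step. It reuses the hypothetical coverage_value helper from the sketch after Eq. (1) and is not the paper's Algorithm 1.

```python
def greedy_coverage(candidate_locs, density, disc_radius, n_agents):
    """Greedy maximization of the coverage objective under a cardinality
    constraint; for monotone submodular F the result is within a factor
    (1 - 1/e) of the optimum."""
    chosen = []
    for _ in range(n_agents):
        best_loc, best_val = None, float("-inf")
        for loc in candidate_locs:
            if loc in chosen:
                continue
            val = coverage_value(chosen + [loc], density, disc_radius)
            if val > best_val:
                best_loc, best_val = loc, val
        chosen.append(best_loc)
    return chosen

# usage: greedily place 3 agents over all cells of the toy grid from above
candidates = [(r, c) for r in range(10) for c in range(10)]
print(greedy_coverage(candidates, rho, disc_radius=2, n_agents=3))
```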
Goal-oriented safe exploration. GOOSE [12] is a single-agent safe exploration algorithm that extends unconstrained methods to safety-critical cases. Concretely, it maintains under- and over-approximations of the feasible set, called pessimistic and optimistic safe sets. It preserves safety by restricting the agent to the pessimistic safe set. It efficiently explores the objective by letting the original unconstrained algorithm recommend locations within the optimistic safe set. If such recommendations are provably safe, the agent evaluates the objective there. Otherwise, it evaluates the constraint at a sequence of safe locations to prove that such a recommendation is either safe, which allows it to evaluate the objective, or unsafe, which triggers the unconstrained algorithm to provide a new recommendation.
Assumptions. To guarantee safety, GOOSE makes two main assumptions. First, it assumes there is an initial set of safe locations, $X_0$, from where the agent can start exploring. Second, it assumes the constraint is sufficiently well-behaved, so that we can use data to infer the safety of unvisited locations. Formally, it assumes the domain $V$ is endowed with a positive definite kernel $k^q(\cdot, \cdot)$, and that the constraint's norm in the associated Reproducing Kernel Hilbert Space [16] is bounded, $\|q\|_{k^q} \le B_q$. This lets us use Gaussian Processes (GPs) [17] to construct high-probability confidence intervals for $q$. We specify the GP prior over $q$ through a mean function, which we assume to be zero everywhere w.l.o.g., $\mu(v) = 0, \forall v \in V$, and a kernel function, $k$, that captures the covariance between different locations. If we have access to $T$ measurements at $V_T = \{v_t\}_{t=1}^T$ perturbed by i.i.d. Gaussian noise, $y_T = \{q(v_t) + \eta_t\}_{t=1}^T$ with $\eta_t \sim \mathcal{N}(0, \sigma^2)$, we can compute the posterior mean and covariance over the constraint at unseen locations $v$, $v'$ as $\mu_T(v) = k_T^\top(v)(K_T + \sigma^2 I)^{-1} y_T$ and $k_T(v, v') = k(v, v') - k_T^\top(v)(K_T + \sigma^2 I)^{-1} k_T(v')$, where $k_T(v) = [k(v_1, v), \ldots, k(v_T, v)]^\top$, $K_T$ is the positive definite kernel matrix $[k(v, v')]_{v, v' \in V_T}$ and $I \in \mathbb{R}^{T \times T}$ denotes the identity matrix. In this work, we make the same assumptions about the safe seed and the regularity of $q$ and $\rho$.
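The posterior expressions above reduce to a few lines of linear algebra. The sketch below uses a squared-exponential kernel with fixed hyperparameters purely for illustration; the paper's kernel choices and hyperparameters may differ.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """k(v, v') = exp(-||v - v'||^2 / (2 * lengthscale^2)) for rows of A, B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_posterior(V_T, y_T, V_star, sigma=0.1):
    """Posterior mean and covariance of the constraint q at test locations
    V_star, given T noisy measurements y_T collected at V_T."""
    K_T = rbf_kernel(V_T, V_T)                    # [k(v, v')]_{v, v' in V_T}
    k_star = rbf_kernel(V_T, V_star)              # columns are k_T(v) per test v
    G = K_T + sigma ** 2 * np.eye(len(V_T))       # K_T + sigma^2 I
    mean = k_star.T @ np.linalg.solve(G, y_T)     # mu_T(v)
    cov = rbf_kernel(V_star, V_star) - k_star.T @ np.linalg.solve(G, k_star)
    return mean, cov

# toy usage: 5 noisy constraint measurements in a 2-D domain, 3 test points
rng = np.random.default_rng(1)
V_T = rng.uniform(0.0, 1.0, size=(5, 2))
y_T = np.sin(V_T.sum(axis=1)) + 0.1 * rng.standard_normal(5)
mu, Sigma = gp_posterior(V_T, y_T, V_star=rng.uniform(0.0, 1.0, size=(3, 2)))
print(mu, np.diag(Sigma))
```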
Approximations of the feasible set. Based on the GP posterior above, GOOSE builds monotonic confidence intervals for the constraint at each iteration $t$ as $l^q_t(v) := \max\{l^q_{t-1}(v), \mu^q_{t-1}(v) - \beta^q_t \sigma^q_{t-1}(v)\}$ and $u^q_t(v) := \min\{u^q_{t-1}(v), \mu^q_{t-1}(v) + \beta^q_t \sigma^q_{t-1}(v)\}$, which contain the true constraint function for every $v \in V$ and $t \ge 1$ with high probability if $\beta^q_t$ is selected as in [18] or Section 5. GOOSE uses these confidence intervals within a set $S \subseteq V$, together with the $L_q$-Lipschitz continuity of $q$, to define operators that determine which locations are safe in plausible worst- and best-case scenarios,
$$ p_t(S) = \{ v \in V \mid \exists z \in S : l^q_t(z) - L_q d(v, z) \ge 0 \}, \qquad (3) $$
$$ o^{\epsilon_q}_t(S) = \{ v \in V \mid \exists z \in S : u^q_t(z) - \epsilon_q - L_q d(v, z) \ge 0 \}. \qquad (4) $$
Notice that the pessimistic operator relies on the lower bound, $l^q$, while the optimistic one relies on the upper bound, $u^q$. Moreover, the optimistic one uses a margin $\epsilon_q$ to exclude "barely" safe locations, as the agent might get stuck learning about them. Finally, to disregard locations the agent could not safely reach or from where it could not safely return, GOOSE introduces the $R^{\mathrm{ergodic}}(\cdot, \cdot)$ operator. $R^{\mathrm{ergodic}}(p_t(S), S)$ indicates locations in $S$, or locations in $p_t(S)$ reachable from $S$ and from where the agent can return to $S$ along a path contained in $p_t(S)$. Combining $p_t(S)$ and $R^{\mathrm{ergodic}}(\cdot, \cdot)$, GOOSE defines the pessimistic and ergodic operator $\tilde{P}_t(\cdot)$, which it uses to update the pessimistic safe set. Similarly, it defines $\tilde{O}_t(\cdot)$ using $o^{\epsilon_q}_t(\cdot)$ to compute the optimistic safe set.
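For a discrete domain, the operators in Eqs. (3)-(4) are set comprehensions over the lower and upper confidence bounds. The sketch below assumes a 1-D chain of locations, a known Lipschitz constant, and placeholder confidence bounds; it is not GOOSE's implementation.

```python
import numpy as np

def pessimistic_operator(S, locs, l_q, L_q, dist):
    """Eq. (3): v is pessimistically safe if some z in S certifies
    l_t^q(z) - L_q * d(v, z) >= 0."""
    return {v for v in range(len(locs))
            if any(l_q[z] - L_q * dist(locs[v], locs[z]) >= 0 for z in S)}

def optimistic_operator(S, locs, u_q, L_q, dist, eps_q):
    """Eq. (4): the same test with the upper bound u_t^q and a margin eps_q."""
    return {v for v in range(len(locs))
            if any(u_q[z] - eps_q - L_q * dist(locs[v], locs[z]) >= 0 for z in S)}

# toy usage: a 1-D chain of 20 locations with one measured safe seed
locs = np.linspace(0.0, 1.0, 20)
dist = lambda a, b: abs(a - b)
l_q = np.full(20, -1.0)       # uninformative lower bound everywhere
u_q = np.full(20, 1.0)        # uninformative upper bound everywhere
l_q[0], u_q[0] = 0.3, 0.8     # constraint fairly well known at the seed
S0 = {0}
print(sorted(pessimistic_operator(S0, locs, l_q, L_q=1.0, dist=dist)))
print(sorted(optimistic_operator(S0, locs, u_q, L_q=1.0, dist=dist, eps_q=0.1)))
```

As expected, the pessimistic set is contained in the optimistic one; the reachability filter $R^{\mathrm{ergodic}}$ would further prune both.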
4 MACOPT and SAFEMAC
This section presents MACOPT and SAFEMAC, our algorithms for unconstrained and safety-constrained
multi-agent coverage control, which we then formally analyze in Section 5.
4.1 MACOPT: unconstrained multi-agent coverage control

Greedy sensing regions. In sequential optimization, it is crucial to balance exploration and exploitation. GP-UCB [19] is a theoretically sound strategy to strike such a trade-off that works well in practice. Agents evaluate the objective at locations that maximize an upper confidence bound over the objective given by the GP model, such that locations with either a high posterior mean (exploitation) or standard deviation (exploration) are visited. We construct a valid upper confidence bound for the coverage $F(X)$ starting from our confidence intervals on $\rho$, by replacing the true density $\rho$ with its upper bound $u^\rho_t$ in Eq. (1). Next, we apply the greedy algorithm to this upper bound (Line 3 of Algorithm 1) to select $N$ candidate locations for evaluating the density. However, this simple exploration strategy may perform poorly, due to the fact that in order to reduce the uncertainty over the coverage $F$ at $X$, we must learn the density $\rho$ at all locations inside the sensing region, $\bigcup_{i=1}^N D^i$, rather than simply at $X$. It is a form of partial monitoring [20], where the objective $F$ differs from the quantity we measure, i.e., the density $\rho$. Next, we explain how to choose locations where to observe the density for a given $X$.
Uncertainty sampling. Given location assignments $X$ for the agents, we measure the density to efficiently learn the function $F(X)$. Intuitively, agent $i$ observes the density where it is most uncertain within the area it covers that is not covered by agents $\{1, \ldots, i-1\}$, i.e., $\bar{D}^i_t$ (Line 4 of Alg. 2, Fig. 2a).
Stopping criterion. The algorithm terminates when a near-optimal solution is achieved. Intuitively, this occurs when the uncertainty about the coverage value of the greedy recommendation is low. Formally, we require the sum of the uncertainties over the sampling targets to be below a threshold, i.e., $w_t = \sum_{i=1}^N u^\rho_{t-1}(x^{g,i}_t) - l^\rho_{t-1}(x^{g,i}_t) \le \epsilon_\rho$ (Line 3 of Algorithm 2). Importantly, this stopping criterion requires the confidence intervals to shrink only at regions that potentially maximize the coverage.
MACOPT. Now, we introduce MACOPT in Algorithm 2. At round $t$, we select the sensing locations for the agents, $X_t$, by greedily optimizing the upper confidence bound of the coverage. Then, each agent $i$ collects noisy density measurements at the points of highest uncertainty within $\bar{D}^i_t$. Finally, we update our GP over the density and, if the sum of maximum uncertainties within each sensing region is small, we stop the algorithm.

Figure 2: (a) Uncertainty sampling in MACOPT; (b) Illustration of multi-agent GOOSE. a) The contours represent the density uncertainty, and the red ×s correspond to the maximum coverage locations evaluated by the GREEDY Algorithm 1. While these locations maximize coverage, they may not be informative about the coverage since the uncertainty there can be low. Therefore, the agents collect measurements at the locations of maximum density uncertainty in their discs (green ×s, $x^{g,i}_t$), also known as uncertainty sampling. b) In a constrained environment, SAFEMAC evaluates $x^{g,i}_t$ for all agents in the optimistic set $S^{o,q}_t$ (violet) and sets it as the next goal. It forms an expander region (dark blue) to safely expand the pessimistic safe set $S^p_t$ (green) toward the goal.
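Putting the pieces together, the following is a rough sketch of a single MACOPT-style round: greedy disc selection on the density's upper confidence bound, uncertainty sampling inside each agent's uncovered disc, and the stopping test on the summed interval widths. It reuses the hypothetical greedy_coverage helper sketched in Section 3 and compresses Algorithm 2 considerably.

```python
def macopt_round(candidates, u_rho, l_rho, disc_radius, n_agents, eps_rho):
    """One unconstrained MACOPT-style round on a grid domain."""
    # 1. Greedy discs on the optimistic density u_rho (GREEDY on the UCB)
    X_t = greedy_coverage(candidates, u_rho, disc_radius, n_agents)
    # 2. Uncertainty sampling: each agent targets the most uncertain cell in
    #    the part of its disc not covered by earlier agents
    width = u_rho - l_rho
    rows, cols = u_rho.shape
    covered, goals = set(), []
    for (r, c) in X_t:
        disc = {(rr, cc) for rr in range(rows) for cc in range(cols)
                if (rr - r) ** 2 + (cc - c) ** 2 <= disc_radius ** 2}
        own = disc - covered
        goals.append(max(own, key=lambda v: width[v]) if own else (r, c))
        covered |= disc
    # 3. Stop once the summed uncertainty at the sampling targets is below
    #    the threshold, i.e. w_t <= eps_rho
    w_t = sum(width[g] for g in goals)
    return X_t, goals, w_t <= eps_rho
```

The measurements collected at the returned goals would then update the GP over $\rho$ (e.g., with a helper like gp_posterior above) before the next round.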
4.2 SAFEMAC: safety-constrained multi-agent coverage control

Intuition. We adopt a perspective similar to GOOSE as we separate the exploration of the safe set from the maximization of the coverage. Given an over- and under-approximation of the safe set (whose computation is discussed later), we want to explore optimistically optimal goals for each agent, similar to MACOPT. To this end, we find the maximizers of the density upper bound in the optimistic safe set with the GREEDY algorithm. Then, we define sampling goals to learn the coverage at those locations.
Phases of SAFEMAC. Coverage values depend both on the density and on the feasible region (Eq. (2)). Thus, there are two sensible sampling goals given a disk assignment: i) optimistic coverage: if we are uncertain about the density within the disks, we target locations with the highest density uncertainty (Line 6 of Algorithm 4); ii) optimistic exploration: if we know the density within the disks but there are locations under them that we cannot classify as either safe (in $S^p$) or unsafe (in $V \setminus S^{o,q}$), we target those with the highest constraint uncertainty among them (Line 8). If all the goal locations are safe with high probability, which can only happen during optimistic coverage, we safely evaluate the density there (Line 19). Otherwise, we explore the constraint with a goal-directed strategy that aims at classifying them as either safe or unsafe, similar to GOOSE (Lines 9-12). In case this changes the topological connectivity of the optimistic feasible set, we recompute the disks, as this may change GREEDY's output (Lines 15-17). We repeat this loop until we know the feasibility of all the points under the disks recommended by GREEDY and their density uncertainty is low (Line 4). Next, we explain how the multiple agents coordinate their individual safe regions to evaluate a goal (MACOPT in batches), how the agents progress toward their goals (safe expansion), and finally we describe SAFEMAC's convergence.
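The two phases can be summarized as a small piece of per-agent decision logic. The sketch below uses hypothetical names and collapses several steps of Algorithm 4 into a single branch; it is only meant to mirror the prose above, not to reproduce the algorithm.

```python
def choose_goal(disc_cells, width_rho, width_q, pess_safe, opt_safe, eps_rho):
    """Pick a sampling goal for one agent's recommended disc.

    disc_cells : cells inside the disc returned by GREEDY for this agent
    width_rho  : density confidence-interval width per cell
    width_q    : constraint confidence-interval width per cell
    pess_safe  : provably safe cells (S^p); opt_safe: optimistically safe (S^{o,q})
    """
    # (i) optimistic coverage: density under the disc is still uncertain
    if max(width_rho[v] for v in disc_cells) > eps_rho:
        return max(disc_cells, key=lambda v: width_rho[v]), "coverage"
    # (ii) optimistic exploration: some cells are neither provably safe nor
    #      provably unsafe, so reduce the constraint uncertainty there
    unresolved = [v for v in disc_cells if v not in pess_safe and v in opt_safe]
    if unresolved:
        return max(unresolved, key=lambda v: width_q[v]), "exploration"
    return None, "done"   # density known and safety of every cell settled
```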
MACOPT in batches. In the multi-agent setting of GOOSE (see Fig. 2b), each agent $i$ maintains $S^{p,i}_t$, a pessimistic (or $S^{o,q,i}_t$, an optimistic) belief of the safe locations, obtained by iteratively applying $\tilde{P}_t(\cdot)$, the pessimistic (or $\tilde{O}_t(\cdot)$, the optimistic) ergodic operator (see Section 3), to the previous pessimistic belief $S^{p,i}_{t-1}$ (Line 11 of Algorithm 4). Since the agents cannot navigate to an arbitrary location in the constrained case, SAFEMAC computes coverage maximizers on a restricted region, obtained by ignoring the known unsafe locations. To denote such a restricted region, we define a union set $S^{u,i}_t := S^{o,q,i}_t \cup S^{p,i}_t$, which is the largest set known to be optimistically or pessimistically safe up to time $t$. Moreover, if the agents are topologically disconnected, they cannot travel from one safe region to another and the best strategy for any batch of agents is to maximize coverage locally. For this, we form a collection of batches $\mathcal{B}_t$, such that any batch $B \in \mathcal{B}_t$ contains agents that lie in topologically connected regions determined by the union set (Lines 13-14). SAFEMAC computes a GREEDY solution for each $B \in \mathcal{B}_t$ in their corresponding $S^{u,B}_t := \bigcup_{i \in B} S^{u,i}_t$. This is the largest set where the agents can find an
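Forming the collection of batches $\mathcal{B}_t$ amounts to a connected-components computation over the union set. A rough sketch follows; for simplicity it treats the per-agent union sets as a single shared set of safe cells, an assumption made only for illustration.

```python
def form_batches(agent_cells, union_safe, neighbors):
    """Group agents whose current cells lie in the same connected component
    of the union set (cells known to be optimistically or pessimistically safe).

    agent_cells : current cell of each agent, indexed 0..N-1
    union_safe  : set of cells in S^u
    neighbors   : function mapping a cell to its adjacent cells in the graph V
    """
    component, label = {}, 0
    for start in union_safe:                 # label components via BFS
        if start in component:
            continue
        component[start] = label
        queue = [start]
        while queue:
            cell = queue.pop()
            for nxt in neighbors(cell):
                if nxt in union_safe and nxt not in component:
                    component[nxt] = label
                    queue.append(nxt)
        label += 1
    batches = {}                             # agents in one component -> one batch
    for i, cell in enumerate(agent_cells):
        batches.setdefault(component[cell], []).append(i)
    return list(batches.values())
```

GREEDY would then be run separately inside each batch's own union set $S^{u,B}_t$.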