and $\epsilon^j_t$ is random, i.i.d. additive noise. $(x^j_t, y^j_t)$ is agent $j$'s measurement at time $t$. Note that agents using this region sensing model must trade off between sensing a wider area with lower accuracy and executing a highly accurate sensing action over a smaller region. The support of the vector $x_t$ is appropriately weighted so that $\|x_t\|_2 = 1$, ensuring each sensing action has constant power. This helps us model observation noise as a function of the agent's distance from the region [9]. Since each action has constant power and every observation has i.i.d. additive noise with constant variance, the signal-to-noise ratio in the unit squares comprising the rectangular sensing block decreases as the size of the sensing region increases.
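As a concrete illustration of the constant-power weighting, consider the following minimal sketch of a single region sensing action; the flattened grid representation of the environment and the Gaussian noise model are illustrative assumptions, not an exact specification of our implementation:

```python
import numpy as np

def region_sense(beta, cells, noise_std=1.0, rng=None):
    """One region sensing action over a rectangular block (a sketch).

    beta      : flattened grid of the environment (sparse target signal)
    cells     : indices of the grid cells inside the sensing block
    noise_std : std. deviation of the i.i.d. additive Gaussian noise
    """
    rng = rng or np.random.default_rng()
    x = np.zeros_like(beta, dtype=float)
    # Uniform weights over the sensed block, scaled so that ||x||_2 = 1.
    x[cells] = 1.0 / np.sqrt(len(cells))
    # Observation y_t = x_t^T beta + noise; the per-cell signal shrinks
    # as the block grows, so the effective SNR decreases with region size.
    y = x @ beta + rng.normal(0.0, noise_std)
    return x, y
```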
Cost model: We introduce the additional realistic setting that the sensing actions have different associated costs. First, we consider that the agent travelling from location $a$ to location $b$ incurs a travel time cost $c_d(a, b)$.¹ Second, we assume that executing each sensing action at location $b$ incurs a time cost $c_s(b)$. Therefore, $T$ time steps after starting from location $x_0$, an agent $j$ has executed actions $\{x^j_t\}_{t=1}^{T}$ and incurs a total cost defined by $C^j(T) = \sum_{t=1}^{T} \big( c_d(x^j_{t-1}, x^j_t) + c_s(x^j_t) \big)$.
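For concreteness, the total cost can be accumulated as in the sketch below, which assumes a constant travelling speed with Euclidean distances (per the footnote) and, purely for illustration, a constant per-action sensing cost:

```python
import numpy as np

def total_cost(locations, travel_speed=1.0, sense_cost=1.0):
    """Total cost C(T) of a trajectory x_0, x_1, ..., x_T (a sketch).

    locations : sequence of positions, starting at x_0
    Assumes constant travelling speed (Euclidean distance) and, for
    illustration only, a constant time cost per sensing action.
    """
    cost = 0.0
    for prev, curr in zip(locations[:-1], locations[1:]):
        c_d = np.linalg.norm(np.subtract(curr, prev)) / travel_speed  # travel time
        c_s = sense_cost                                              # sensing time
        cost += c_d + c_s
    return cost

# Example: three sensing actions after starting at (0, 0).
print(total_cost([(0, 0), (0, 3), (4, 3), (4, 0)]))  # travel 3+4+3, sensing 3 -> 13.0
```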
Communication setup:
We assume that communication,
although unreliable, will be available sometimes and the agents
should utilize it when possible. The agents share their own past
measurements asynchronously with teammates, but do not wait
for communication from their teammates at any time since this
wait could be arbitrarily long and thus cause a loss of valuable
sensing time. In the absence of synchronicity, we also do not
require that the set of available past measurements remain
consistent across agents since communication problems can
disrupt it. We will denote the set of measurements available to an agent $j$ at time $t$ by $D^j_t = \{(x_{t'}, y_{t'}) \mid \{t'\} \subseteq \{1, \ldots, t-1\}\}$, with $|D^j_t| \le t-1$, which includes its own past observations and those received from its teammates until time $t$.
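The non-blocking nature of this exchange can be sketched as follows; the mailbox abstraction and its methods are purely illustrative and not part of our formulation:

```python
import queue

class AgentInbox:
    """Non-blocking measurement exchange (a sketch, not our actual system).

    An agent merges whatever teammate measurements have already arrived,
    but never waits for delivery, so no sensing time is lost."""

    def __init__(self):
        self.mailbox = queue.Queue()  # filled asynchronously by teammates
        self.D = set()                # this agent's measurement set D_t

    def record_own(self, x, y):
        # (x, y) is kept abstract here; in our model x is the sensing vector.
        self.D.add((x, y))

    def poll_teammates(self):
        # Drain only messages already delivered; never block on an empty queue.
        while True:
            try:
                self.D.add(self.mailbox.get_nowait())
            except queue.Empty:
                break
```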
III. RELATED WORK
Autonomous target search has diverse applications in environment monitoring [10], wildlife protection [11], as well as search and rescue operations [12]. In robotics, informative path planning (IPP) problems focus on adaptive decision making to reach a specified goal state or region. In contrast to our setting, common IPP algorithms consider a known environment [13], are myopic or use non-adaptive lookahead [14], and assume weakly coupled sensing and movement actions [15].
Bayesian optimization and active learning methods are another approach to active search [16], [17], [18]. Unfortunately, they mostly address single-agent systems, or, if multi-agent, they assume central coordination [19], [20] and, except for [21], lack realistic assumptions on sensing actions. The multi-agent asynchronous active search algorithms proposed in [9], [22] tackle several of these challenges but are myopic in nature. Further, [23] introduced cost-effective non-myopic active search, but their simplified setting excludes simultaneous evaluation of multiple search points with different costs.
¹We assume a constant travelling speed and compute the Euclidean distance between locations $a$ and $b$.
Our active search formulation has close similarities with planning under uncertainty using a Partially Observable Markov Decision Process (POMDP) [24]. Monte Carlo Tree Search (MCTS) [25], [26] has found success as a generic online planning algorithm in large POMDPs [27], but is mostly limited to the single-agent setting [28], [29].
Decentralized POMDP (Dec-POMDP) [30], [31] is another framework for decentralized active information gathering using multiple agents; it is typically solved using offline, centralized planning followed by online, decentralized execution [32], [33]. Decentralized MCTS (Dec-MCTS) algorithms have also been proposed for multi-robot active perception under a cost budget [34], [35], but they typically rely on each agent maintaining a joint probability distribution over its own belief as well as those of the other agents.
Finally, cost-aware active search can be viewed as a multi-objective sequential decision making problem. [36] developed an MCTS algorithm for cost-budgeted POMDPs using a scalarized version of the reward-cost trade-off, whereas [37] introduced multi-objective MCTS (MO-MCTS) for discovering globally Pareto-optimal decision sequences in the search tree. Unfortunately, MO-MCTS is computationally expensive and unsuitable for online planning. [38] proposed the Pareto MCTS algorithm for multi-objective IPP, but they ignore uncertainty due to partial observability in the search space.
IV. OUR PROPOSED ALGORITHM: CAST
A. Background
We first briefly describe the concepts essential to the planning
and decision making components of our algorithm.
Monte Carlo Tree Search (MCTS) is an online algorithm that combines tree search with random sampling in a domain-independent manner. In our setup, a cost-aware agent would benefit from the ability to look ahead into the adaptive evolution of its belief about the target's distribution in the environment in response to possible observations from the actions it might execute. We therefore consider MCTS as the basis for developing our online planning method with finite-horizon lookahead. Unfortunately, the presence of uncertainty about the targets (number and location) in the unknown environment, together with the noisy observations, introduces additional challenges in our problem formulation.
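For reference, the sketch below shows the four canonical MCTS phases (selection via UCB1, expansion, random rollout, backpropagation) for a generic model; the `Model` interface (`actions`, `step`, `is_terminal`) and the constants are illustrative assumptions rather than our CAST planner:

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}          # action -> child Node
        self.visits, self.value = 0, 0.0

def uct_search(model, root_state, n_iters=1000, c=1.4, horizon=10):
    """Generic MCTS/UCT sketch with a finite-horizon random rollout."""
    root = Node(root_state)
    for _ in range(n_iters):
        node, reward_sum = root, 0.0
        # 1. Selection: descend via UCB1 while the node is fully expanded.
        while node.children and len(node.children) == len(model.actions(node.state)):
            node = max(node.children.values(),
                       key=lambda ch: ch.value / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: try one untried action, if any remain.
        untried = [a for a in model.actions(node.state) if a not in node.children]
        if untried and not model.is_terminal(node.state):
            a = random.choice(untried)
            s, r = model.step(node.state, a)
            child = Node(s, parent=node)
            node.children[a] = child
            node, reward_sum = child, reward_sum + r
        # 3. Rollout: random simulation up to the lookahead horizon.
        s = node.state
        for _ in range(horizon):
            if model.is_terminal(s):
                break
            s, r = model.step(s, random.choice(model.actions(s)))
            reward_sum += r
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward_sum
            node = node.parent
    # Recommend the action whose child was visited most often.
    return max(root.children, key=lambda a: root.children[a].visits)
```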
Pareto optimality: Our formulation of cost-aware active search described in Section II can be viewed as a multi-objective sequential decision making problem. A common approach to solving such multi-objective optimization problems is scalarization, i.e., considering a weighted combination that results in a single-objective problem trading off the different objectives. However, tuning the weight attributed to each objective is challenging, since the objectives may be quantities with different units and scales, and their relative importance may be context dependent. In contrast, Pareto optimization builds on the idea that some solutions to the multi-objective optimization problem are categorically worse than others and are dominated by a set of Pareto-optimal solution vectors forming a Pareto front.
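To illustrate dominance, the short sketch below filters a set of (reward, cost) outcome vectors down to their Pareto front; treating reward as maximized and cost as minimized is an illustrative convention:

```python
def dominates(u, v):
    """u dominates v if u is at least as good in both objectives and
    strictly better in one. Points are (reward, cost) tuples; reward is
    maximized and cost is minimized."""
    return (u[0] >= v[0] and u[1] <= v[1]) and (u[0] > v[0] or u[1] < v[1])

def pareto_front(points):
    """Keep only the points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Example: (reward, cost) outcomes of candidate action sequences.
print(pareto_front([(5, 3), (4, 1), (3, 2), (5, 4)]))
# -> [(5, 3), (4, 1)]   # (3, 2) is dominated by (4, 1); (5, 4) by (5, 3)
```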