the quality of the resulting motion [2]–[4].
To accelerate the computation of MI, several novel algorithms have been proposed. For instance, an efficient computation of Cauchy-Schwarz Quadratic Mutual Information (CSQMI) is introduced in [5], [6]. In addition, high-throughput computation of Shannon mutual information is proposed in the Fast Shannon Mutual Information (FSMI) algorithm [7]. More recently, the Fast Continuous Mutual Information (FCMI) algorithm [8] reformulates the computation of Shannon MI recursively and continuously along the range measurements, which leads to a significant reduction in the amount of computation needed to evaluate MI across an entire map.
Although these algorithms reduce the time complexity for MI
computation, their optimized implementations on CPUs and
GPUs still achieve a throughput that is orders of magnitude below what is required for autonomous exploration, especially in large environments. This motivates the design of
specialized hardware accelerators for computing MI.
Hardware accelerators have been proposed for robotics
applications such as motion-planning [9]–[12] and visual-
inertial odometry [13], [14]. To our best knowledge, the
first and only hardware accelerator for computing MI was proposed in [15]; it contains 16 high-throughput computation cores that compute FSMI. For moderately sized environments (i.e., map sizes of 200×200 to 512×512 cells), this accelerator computes MI for the entire map in near real time (i.e., at less than 10 Hz), which is far below the rate at which the robot makes range measurements (around 30–60 Hz).
The computation of mutual information may seem like a task that can be easily parallelized. One might be tempted to think, for example, that computation cores dedicated to different parts of the map could compute the mutual information for all cells in the map simultaneously. Indeed, under certain widely used assumptions, the computation decouples and cells can be evaluated separately. Unfortunately, the computation does not scale, not because it cannot be decoupled, but because it becomes impossible to feed all cores with the data they need for their computations. More specifically, all cores need to access different portions of the map, which resides in memory, at the same time, and the memory bandwidth does not scale to keep all cores busy. In fact, we show in this paper that naively scaling up the number of cores does not lead to a fast computation of the mutual information metric, as most cores end up underutilized due to insufficient memory bandwidth.
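The bottleneck can be illustrated with a back-of-the-envelope model (the function names and counts below are illustrative, not measurements of the accelerator): with a single shared memory port, map reads from all cores serialize and per-core utilization falls as 1/N, whereas with one bank per core and a conflict-free access pattern, reads proceed in parallel.

```python
# Illustrative sketch of the memory bottleneck, not the accelerator's
# actual numbers: compare cycle counts for a shared single-ported memory
# versus per-core banks with a conflict-free access pattern.

def cycles_single_port(num_cores, reads_per_core):
    # A single port serves one read per cycle, so all reads serialize.
    return num_cores * reads_per_core

def cycles_banked(num_cores, reads_per_core, num_banks):
    # If cores target pairwise-distinct banks each cycle, contention only
    # appears when cores outnumber banks.
    cores_per_bank = -(-num_cores // num_banks)  # ceiling division
    return cores_per_bank * reads_per_core

reads = 1000
for n in (1, 4, 16):
    util_shared = reads / cycles_single_port(n, reads)
    util_banked = reads / cycles_banked(n, reads, num_banks=n)
    print(f"{n:2d} cores: per-core utilization "
          f"{util_shared:.2f} shared vs {util_banked:.2f} banked")
```

Under this toy model, 16 cores sharing one port are each busy only 1/16 of the time, which is the underutilization effect described above.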
The main contribution of this paper is a scalable, multi-
core hardware architecture for computing mutual information
for a 2D occupancy grid map. This architecture has two
key components: (i) a memory management system that provides sufficient bandwidth to all cores, and (ii) a low-latency, energy-efficient core (i.e., processing element) design that fully utilizes all available bandwidth. Specifically, the
memory management system exploits the memory access
pattern of the cores and splits the storage of the occupancy
grid map into smaller memory banks so that all cores access
different banks every cycle. The core contains a feedback
architecture designed to handle the dependencies that are
introduced by the recursive nature of the FCMI algorithm [8],
so that all compute units within the core are fully utilized.
Each core is time-interleaved to compute the FCMI along
multiple rays in order to avoid stalls associated with the
feedback architecture.
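As a rough illustration of the banking idea (the modulo interleaving below is a hypothetical scheme for exposition, not necessarily the paper's actual address mapping): if map cells are interleaved across banks by their flattened index, cores that start at distinct offsets and advance in lockstep touch pairwise-distinct banks on every cycle.

```python
# Hypothetical bank-interleaving sketch (illustrative, not the paper's
# actual mapping): flatten each 2D cell coordinate and assign it to a
# bank by modulo, so lockstep cores never collide on a bank.

NUM_BANKS = 16   # one bank per core, matching the 16-core design
MAP_WIDTH = 201  # map width used in the paper's evaluation

def bank_of(row, col, num_banks=NUM_BANKS, width=MAP_WIDTH):
    # Flatten the 2D cell coordinate, then interleave across banks.
    return (row * width + col) % num_banks

# 16 cores each read one cell per cycle, starting at consecutive columns
# of the same row and advancing one column per cycle in lockstep.
for cycle in range(4):
    banks = [bank_of(0, core + cycle) for core in range(NUM_BANKS)]
    assert len(set(banks)) == NUM_BANKS  # no two cores share a bank
```

Real ray-casting access patterns are less regular than this lockstep scan, which is why the memory management system must be designed around the cores' actual access pattern rather than a fixed stride.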
The proposed hardware architecture is implemented with 16 cores on a Xilinx Zynq-7000 XC7Z045 FPGA. For a robot equipped with a 60-ray range sensor using an occupancy grid map of size 201×201, our FPGA implementation computes MI for the entire map in 1.55 ms (647 Hz, i.e., much faster than real time), which is 381× faster than an optimized C++ implementation running on an ARM Cortex-A57 (an embedded cellphone processor) and 71× faster than an optimized CUDA implementation running on an NVIDIA GTX 980 GPU. In addition, our FPGA implementation consumes 1.7 mJ of energy per MI map computation, which is 1314× lower than the ARM Cortex-A57 and 2650× lower than the NVIDIA GTX 980 GPU. A CPU on a robot can therefore offload mutual information computation to such an accelerator while interfacing with the sensors and actuators and performing motion planning. The extremely low latency and energy consumption of our accelerator allow robots to explore efficiently, without relying on heuristic approximations, even with an energy budget of a few Joules.
II. PRELIMINARIES
A. The FCMI Algorithm
The occupancy grid map [16] is a common probabilistic representation of a 2D environment and is used in several well-known autonomous exploration systems, such as the frontier-exploration algorithm [17] and Shannon MI-based mapping algorithms [18]–[22]. In an information-theoretic framework, MI is used to quantify the expected information gain from potential range measurements in an unknown environment. The unknown environment is represented by an occupancy grid map, which the robot updates continuously throughout an exploration trial. The FCMI algorithm [8] computes the mutual information between the map and potential range measurement rays emanating from the sensors (e.g., LiDAR) on such a robot.
Let $Z = \{Z_{\theta_1}, Z_{\theta_2}, \ldots\}$ represent range measurements made by sensor rays with angles $\Theta = \{\theta_1, \theta_2, \ldots\}$ emanating from the robot's location $x$. Let $\Delta\theta$ denote the angular resolution of the sensor rays, which is constant across all rays. The Shannon MI $I_x(M; Z)$ between the measurements $Z$ made at location $x$ and the map $M$ is defined as
$$I_x(M; Z) = \sum_{\theta \in \Theta} I_x(M; Z_\theta)\,\Delta\theta. \tag{1}$$
Note that the MI between a single range measurement $Z_\theta$ and the map $M$ can be written as
$$I_x(M; Z_\theta) = h_x(Z_\theta) - h_x(Z_\theta \mid M), \tag{2}$$
where $h_x(Z_\theta)$ is the entropy of the range measurement, and $h_x(Z_\theta \mid M)$ is the conditional entropy of the range measurement given the map.
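A toy numeric instance of Eq. (2) for a single binary map cell (the measurement model probabilities below are illustrative assumptions, not part of FCMI): the mutual information is the measurement entropy minus the conditional entropy given the map state.

```python
# Toy illustration of Eq. (2), not the FCMI algorithm itself: a binary
# cell m and a binary beam reading z ("hit" / "no hit") with assumed
# hit probabilities per map state.

from math import log2

def H(p):
    # Binary Shannon entropy in bits.
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

p_occ = 0.5       # prior P(m = occupied); 0.5 models a fully unknown cell
p_hit_occ = 0.9   # assumed P(z = hit | m = occupied)
p_hit_free = 0.1  # assumed P(z = hit | m = free)

p_hit = p_occ * p_hit_occ + (1 - p_occ) * p_hit_free  # marginal P(z = hit)
h_z = H(p_hit)                                                    # h(Z)
h_z_given_m = p_occ * H(p_hit_occ) + (1 - p_occ) * H(p_hit_free)  # h(Z|M)
mi = h_z - h_z_given_m
print(f"I(M; Z) = {mi:.3f} bits")
```

An unknown cell with an informative sensor model yields high MI; as the cell's occupancy becomes certain (or the sensor becomes uninformative), both terms converge and the MI drops toward zero, which is exactly why MI identifies promising sensing locations.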