2.2 Success Measurement
The success of a particular strategy can be measured by the time taken to fully explore a given environment. In simulation, this corresponds to the number of actions taken before all empty cells have been visited.
To avoid encoding a desired or complex behavior through the reward function, the reward function is kept simple:
R(s, a, s') =
\begin{cases}
1 & \text{if } s' \text{ is an unvisited cell,} \\
-1 & \text{if } s' \text{ is an obstacle,} \\
0 & \text{otherwise}
\end{cases}
\qquad (1)
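A minimal sketch of Equation (1) in code, assuming the grid is represented with a simple cell-state enumeration (the names CellState and reward below are illustrative, not part of the simulator described here):

from enum import Enum

class CellState(Enum):
    UNVISITED = 0   # empty cell not yet covered
    VISITED = 1     # empty cell already covered
    OBSTACLE = 2    # wall or furniture

def reward(next_cell: CellState) -> int:
    # Equation (1): +1 for reaching an unvisited cell,
    # -1 for hitting an obstacle, 0 otherwise.
    if next_cell is CellState.UNVISITED:
        return 1
    if next_cell is CellState.OBSTACLE:
        return -1
    return 0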
The environment also imposes a natural time limit, since each run is restricted to one hour of operation (1800 time steps).
One of the goals is to develop a strategy that can outperform Roomba's baseline algorithm, which is outlined in a subsequent section.
Agent performance can be compared using the following metrics: percentage of open space explored, number of collisions with obstacles, and number of time steps to completion.
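As a hedged illustration, these metrics could be computed from a per-episode summary along the following lines (the EpisodeLog fields are assumptions, not the actual logging format used here):

from dataclasses import dataclass

@dataclass
class EpisodeLog:
    # Hypothetical per-episode summary; field names are illustrative.
    visited_open_cells: int
    total_open_cells: int
    collisions: int
    steps_taken: int

def metrics(log: EpisodeLog) -> dict:
    # The three comparison metrics for a single run.
    return {
        "coverage_pct": 100.0 * log.visited_open_cells / log.total_open_cells,
        "collisions": log.collisions,
        "steps_to_completion": log.steps_taken,
    }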
3 Related Work
3.1 Path Planning using Discretized States
The coverage algorithm described in [9] involves discretizing a state space with obstacles and planning safe paths between states. Simulation results showed that a 4-connected grid representation of the environment produced paths that were no more than 2% less efficient than an 8-connected representation. This means that restricting the agent's turn angles to 90 degrees does not have a drastically negative effect on simulation results, in exchange for reducing the size of the state and action space.
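For concreteness, the two grid connectivities differ only in their neighbor offsets; the sketch below is illustrative and is not taken from [9]:

# Neighbor offsets: 4-connected restricts turns to 90 degrees,
# 8-connected also allows diagonal moves.
FOUR_CONNECTED = [(0, 1), (0, -1), (1, 0), (-1, 0)]
EIGHT_CONNECTED = FOUR_CONNECTED + [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def neighbors(cell, grid, offsets=FOUR_CONNECTED):
    # Yield adjacent cells that are inside the grid and not obstacles.
    # `grid[r][c]` is True where an obstacle sits.
    rows, cols = len(grid), len(grid[0])
    r, c = cell
    for dr, dc in offsets:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and not grid[nr][nc]:
            yield (nr, nc)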
3.2 Boustrophedon Cellular Decomposition
Boustrophedon, or S-shaped, pathing follows the same pattern that farmers use on fields of crops, or lawn mowers on a field of grass. The algorithm involves finding a perimeter, then traversing back and forth along the length of the perimeter across the entire space [5].
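A minimal sketch of this S-shaped sweep over an obstacle-free rectangular region (obstacle handling is omitted; the function name is illustrative):

def boustrophedon_order(rows, cols):
    # S-shaped sweep over an obstacle-free rows x cols region:
    # left-to-right on even rows, right-to-left on odd rows.
    path = []
    for r in range(rows):
        cols_in_order = range(cols) if r % 2 == 0 else reversed(range(cols))
        for c in cols_in_order:
            path.append((r, c))
    return path

# Example: boustrophedon_order(3, 4) visits
# (0,0)..(0,3), then (1,3)..(1,0), then (2,0)..(2,3).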
This pathing concept was extended in [2] by attempting to divide the region into segments, each of which can be completed by executing Boustrophedon pathing.
3.3 Combination of Basic Policies
A combination of policies was shown to outperform any individual strategy in [5]. The basic pathing strategies were:
- Outward Spiral
- Wall Walking
- Random Walk
- Boustrophedon Pathing
It was not stated how the combination policy was
determined; however, a possible approach would be to
learn a scoring function to evaluate which strategy to
use based on the current set of observations.
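One hedged sketch of that scoring-function idea, assuming a linear score over a feature vector of observations (the strategy names and the weight matrix are illustrative, not taken from [5]):

import numpy as np

STRATEGIES = ["outward_spiral", "wall_walking", "random_walk", "boustrophedon"]

def select_strategy(observation: np.ndarray, weights: np.ndarray) -> str:
    # Pick the strategy with the highest learned linear score for the
    # current observation. `weights` has shape (len(STRATEGIES), obs_dim).
    scores = weights @ observation
    return STRATEGIES[int(np.argmax(scores))]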
3.4 Deep Reinforcement Learning
One of the main challenges of this task is for the
agent to learn a policy that generalizes well so that
its policy is not specific to a particular environment.
A classic example of generalized reinforcement learning is presented in [8]. Mnih et al. introduced the Deep Q-Network (DQN), a neural network used for Q-Learning. They trained a convolutional neural network to play seven Atari 2600 games with no adjustment of the architecture or hyperparameters between games, outperforming previous approaches on six of the seven and surpassing a human expert on three.
We will apply Deep Q-Learning to the simulated en-
vironment to produce an agent that can explore well in
unseen environments.
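As a rough sketch, a DQN for this setting could map a stack of grid-shaped observation channels to one Q-value per movement action; the PyTorch architecture below is an assumption for illustration and is not the network from [8]:

import torch
import torch.nn as nn

class GridDQN(nn.Module):
    # Small convolutional Q-network: input is a stack of grid channels
    # (e.g. visited map, obstacle map, agent position), output is one
    # Q-value per movement action (up, down, left, right).
    def __init__(self, in_channels=3, num_actions=4, grid_size=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * grid_size * grid_size, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, x):
        return self.net(x)

def greedy_action(net, obs):
    # Act greedily with respect to the learned Q-values.
    with torch.no_grad():
        return int(net(obs.unsqueeze(0)).argmax(dim=1).item())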
4 Methods
4.1 Baseline Roomba Algorithm
The basic Roomba strategy consists of exploring outward in a spiral fashion until the nearest obstacle is