
The method of Pitis et al. (2020) does not directly take into account any information about the agent's
epistemic uncertainty or performance. It simply fits a kernel density estimate over the set of
previously visited goals, draws samples as goal candidates, and selects the sample with the
lowest likelihood under the kernel density estimate. This method has the benefit of reliably producing
increasingly difficult goals for the agent and of being immune to problems such as a goal-generator
network destabilizing. However, it has several drawbacks. First, if the agent suffers from catastrophic
forgetting, the curriculum cannot correct the problem effectively. Second, if the incremental
goals are insufficiently or overly challenging for the agent, the curriculum does not adapt accordingly.
Pitis et al. (2020) shares commonalities with pseudocount techniques in that it sets goals exclusively
based on visitation frequency.
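For concreteness, the following is a minimal sketch of this selection rule, assuming achieved goals are stored as a NumPy array and using scikit-learn's KernelDensity as a stand-in for the density model; the candidate-sampling scheme, bandwidth, and function name are illustrative choices, not the authors' exact implementation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def select_low_density_goal(achieved_goals, num_candidates=100, bandwidth=0.1):
    """Pick the candidate goal with the lowest likelihood under a KDE
    fit to previously achieved goals (illustrative sketch)."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(achieved_goals)                      # (N, goal_dim) array of visited goals
    idx = np.random.randint(len(achieved_goals), size=num_candidates)
    candidates = achieved_goals[idx]             # sample candidates from the buffer
    log_density = kde.score_samples(candidates)  # log-likelihood of each candidate
    return candidates[np.argmin(log_density)]    # least-visited region becomes the goal
```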
Similarly, count-based and pseudocount-based algorithms (Tang et al.; Bellemare et al.) rely on
counting the number of times that a state or a region of state space has been visited. In high-dimensional
or continuous environments, these algorithms rely on various compression or learned
similarity metrics (Machado et al.; Schmidhuber, 2008), or on basic statistics, to identify which
regions of the state space are considered similar and adjacent to one another. Intrinsic exploration
bonuses can then be awarded, and goals can be set, based on visitation frequencies to accelerate exploration.
Because these methods take visitation frequency into account exclusively, they do not directly account
for the agent's epistemic uncertainty or performance: they may set goals that are too ambitious or
too conservative, and they cannot correct for catastrophic forgetting.
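As a concrete illustration, the sketch below implements a common form of count-based bonus, r_int = β/√N(φ(s)), where φ is a coarse discretization of the state; the rounding-based hash and the coefficient β are assumptions standing in for the learned hashing and density models used in these works.

```python
from collections import defaultdict
import numpy as np

class CountBonus:
    """Count-based intrinsic bonus r_int = beta / sqrt(N(phi(s))),
    where phi coarsely discretizes the state (simplified sketch)."""

    def __init__(self, beta=0.1, precision=1):
        self.beta = beta
        self.precision = precision        # decimals kept when rounding states
        self.counts = defaultdict(int)    # visitation counts per discretized state

    def _key(self, state):
        # Coarse rounding stands in for learned hashing or density models.
        return tuple(np.round(np.asarray(state), self.precision))

    def bonus(self, state):
        key = self._key(state)
        self.counts[key] += 1
        return self.beta / np.sqrt(self.counts[key])
```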
Other techniques, such as Bharadhwaj et al. (2020); Portelas et al. (2019); Florensa et al. (2018),
take the agent's understanding of the state space into account only indirectly, through approximations
based on statistics about the agent's historical ability to reach goals. Since such statistics offer
minimal insight into the agent's understanding of the task, these methods may select goals that are not
maximally informative to the agent. Their major weakness is that the goal-generation networks can
destabilize and must set goals from only very abstract information about the agent's understanding of
the state space.
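As an example of the kind of statistic these methods condition on, the sketch below keeps only goals whose empirical success rate falls between two thresholds, in the spirit of the goals of intermediate difficulty of Florensa et al. (2018); the success-history bookkeeping and the threshold values are illustrative assumptions.

```python
def intermediate_difficulty_goals(success_history, r_min=0.1, r_max=0.9):
    """Keep goals whose empirical success rate lies in [r_min, r_max].

    success_history: dict mapping a goal (hashable) to a list of 0/1 outcomes.
    Returns the subset of goals considered neither too easy nor too hard.
    """
    selected = []
    for goal, outcomes in success_history.items():
        if not outcomes:
            continue
        rate = sum(outcomes) / len(outcomes)
        if r_min <= rate <= r_max:
            selected.append(goal)
    return selected
```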
Another class of algorithms, such as Pathak et al.; Zhang et al. (2020), relies on an external ensemble
of networks to estimate epistemic uncertainty throughout the state space. Pathak et al. maintains
an external ensemble of forward models, which observe the same data as the agent, and uses the
disagreement among the forward models as an epistemic uncertainty estimate. Zhang et al. (2020)
does the same, but with three external critics. The major problem with these algorithms is that they
estimate the epistemic uncertainty of other networks, not of the agent's network. Because these
external networks' understanding of the state space drifts away from the agent's over time, the
epistemic uncertainty estimate of the external networks is not representative of the agent's epistemic
uncertainty. Additionally, Zhang et al. (2020) requires privileged access to the entire state
space in order to stabilize.
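The following sketch illustrates disagreement-based uncertainty in the spirit of Pathak et al.: the variance across an ensemble of forward models' next-state predictions serves as the intrinsic signal. The PyTorch framework, MLP architecture, and hidden width are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def make_forward_model(state_dim, action_dim, hidden=128):
    """One member of the forward-model ensemble: predicts the next state."""
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, state_dim),
    )

def disagreement_bonus(ensemble, state, action):
    """Variance across ensemble next-state predictions, averaged over
    state dimensions, used as an epistemic-uncertainty proxy."""
    x = torch.cat([state, action], dim=-1)
    preds = torch.stack([model(x) for model in ensemble])  # (E, B, state_dim)
    return preds.var(dim=0).mean(dim=-1)                   # (B,) per-sample bonus
```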
Other curiosity-related methods, such as random network distillation (Burda et al.), use different
formulations of a similar mechanism: estimating the uncertainty of networks disconnected from the
agent's computation graph. Random network distillation uses the agent's experiences
to train an external predictor network to match the output of a frozen, randomly initialized network known
as the target network. As the agent visits a particular state more often, the predictor's error on that
state shrinks, so intrinsic reward bonuses proportional to the prediction error provide an effective
exploration signal. However, this approach estimates the agent's error on the random-network prediction
task in order to gauge its uncertainty on the Q-learning task. These are two very different tasks, with
different difficulties, that are learned at different speeds; consequently, this technique provides a poor
estimate of epistemic uncertainty.
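The sketch below illustrates the random network distillation bonus: a predictor network is trained to match a frozen, randomly initialized target network, and the squared prediction error on a state is used as the intrinsic reward (and as the predictor's training loss). The network widths and feature dimension are assumed values.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random network distillation: intrinsic reward is the predictor's error
    at matching a frozen, randomly initialized target network."""

    def __init__(self, state_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():   # target network stays fixed
            p.requires_grad_(False)

    def intrinsic_reward(self, state):
        with torch.no_grad():
            target_feat = self.target(state)
        error = (self.predictor(state) - target_feat).pow(2).mean(dim=-1)
        return error  # also serves as the predictor's training loss
```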
A number of techniques (Chane-Sane et al., 2021; Nachum et al., 2018; Christen et al., 2021; OpenAI
et al., 2021) have used multiple reinforcement learning agents to solve long-horizon tasks,
either by pitting the agents against each other adversarially or by arranging them in a hierarchy.
Using multiple reinforcement learning agents incurs a significant computational expense and introduces
significant technical challenges, such as nonstationarity between agents. These techniques are outside
the scope of this work, which focuses on maximizing the learning of a single agent.