during the update. A sparse update prior can also be motivated by the fact that in the real world, many
factors of variation are constant over extended periods of time. For instance, several objects in a
physical simulation may remain stationary until some force acts upon them. A sparse update is also useful in the partially observable setting, where the agent observes a constrained viewpoint and has to keep track of objects that stay out of view for many time steps. In this work, we introduce Variational Sparse
Gating (VSG), a stochastic gating mechanism that sparsely updates the latent states at each step.
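To make the idea concrete, the following is a minimal sketch of a sparsely gated recurrent update in PyTorch. It assumes a Bernoulli gate with a straight-through gradient estimator; the module and variable names are illustrative and do not correspond to the exact parameterization of VSG.

```python
import torch
import torch.nn as nn

class SparseGatedUpdate(nn.Module):
    """Illustrative sparse update: each latent dimension is either copied from
    the previous step or overwritten by a candidate, chosen by a stochastic gate."""

    def __init__(self, latent_dim, input_dim):
        super().__init__()
        self.candidate = nn.Linear(latent_dim + input_dim, latent_dim)
        self.gate_logits = nn.Linear(latent_dim + input_dim, latent_dim)

    def forward(self, prev_state, inputs):
        x = torch.cat([prev_state, inputs], dim=-1)
        candidate = torch.tanh(self.candidate(x))
        probs = torch.sigmoid(self.gate_logits(x))
        # Sample a hard binary gate; the straight-through trick keeps gradients flowing.
        gate = torch.bernoulli(probs)
        gate = gate + probs - probs.detach()
        # gate = 0 copies the previous state, gate = 1 writes the candidate.
        return (1.0 - gate) * prev_state + gate * candidate
```

When most gates sample zero, most latent dimensions are simply copied forward, matching the prior that many factors of variation stay constant over extended periods of time.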
The Recurrent State-Space Model (RSSM) (Hafner et al., 2019) was introduced in PlaNet, where the model state is composed of two paths, an image representation path and a recurrent path. DreamerV1 (Hafner et al., 2020) and DreamerV2 (Hafner et al., 2021) used the RSSM to achieve state-of-the-art results in continuous and discrete control tasks (Hafner et al., 2019). While the image representation path is stochastic to account for multiple possible future states, the recurrent path is deterministic to retain information over multiple time steps and facilitate gradient-based optimization. Hafner et al. (2019) showed that both components are important for solving tasks, with the stochastic part being more important for accounting for the partial observability of the initial states. By leveraging the proposed VSG mechanism, we demonstrate that a purely stochastic model with a single component can achieve competitive results, and call this model Simple Variational Sparse Gating (SVSG). To the best of our knowledge, this is the first work to show that purely stochastic models achieve competitive performance on continuous control tasks when compared to leading agents.
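For concreteness, the two-path state described above can be sketched as follows in PyTorch, assuming a GRU cell for the deterministic path and a diagonal Gaussian for the stochastic path; the dimensions and module names are illustrative rather than the exact architecture of the cited models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPathState(nn.Module):
    """Illustrative RSSM-style state: a deterministic recurrent path (GRU)
    combined with a stochastic path sampled from a learned Gaussian."""

    def __init__(self, stoch_dim=32, deter_dim=200, action_dim=6):
        super().__init__()
        self.cell = nn.GRUCell(stoch_dim + action_dim, deter_dim)
        self.prior_net = nn.Linear(deter_dim, 2 * stoch_dim)

    def prior_step(self, prev_stoch, prev_deter, action):
        # Deterministic path: retains information over many time steps.
        deter = self.cell(torch.cat([prev_stoch, action], dim=-1), prev_deter)
        # Stochastic path: accounts for multiple possible future states.
        mean, std = self.prior_net(deter).chunk(2, dim=-1)
        stoch = mean + F.softplus(std) * torch.randn_like(mean)
        return stoch, deter
```

SVSG, in contrast, removes the deterministic path and keeps only a sparsely gated stochastic state.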
Existing benchmarks for RL (Bellemare et al., 2013; Chevalier-Boisvert et al., 2018; Tassa et al., 2018) do not test the capability of agents under both partial observability and stochasticity. The Atari benchmark (Bellemare et al., 2013) comprises 55 games, but most of the games are deterministic and substantial compute is required to train on them. Some tasks in the Atari and Minigrid benchmarks are partially observable but either lack stochasticity or are hard exploration tasks. Moreover, these benchmarks do not allow for controlling the factors of variation. We developed a new partially observable and stochastic environment, called BringBackShapes (BBS), where the task is to push objects to a predefined goal area. Solving tasks in BBS requires agents to remember the states of previously observed objects and to avoid noisy distractor objects. Furthermore, VSG and SVSG outperformed leading model-based and model-free baselines. We also present studies with varying levels of partial observability and stochasticity to demonstrate that the proposed agents have better memory for tracking observed objects and are more robust to increasing levels of noise. Lastly, the proposed methods were also evaluated on existing benchmarks: DeepMind Control (DMC) (Tassa et al., 2018), DMC with Natural Background (Zhang et al., 2021; Nguyen et al., 2021b), and Atari (Bellemare et al., 2013). On these existing benchmarks, the proposed method performed better on tasks with changing viewpoints and sparse rewards.
Our key contributions are summarized as follows:
• Variational Sparse Gating: We introduce Variational Sparse Gating (VSG), where the recurrent states are sparsely updated through a stochastic gating mechanism. A comprehensive empirical evaluation shows that VSG outperforms baselines on tasks requiring long-term memory.
• Simple Variational Sparse Gating: We also propose Simple Variational Sparse Gating (SVSG), which has a purely stochastic state and achieves competitive results on continuous control tasks when compared with agents that also use a deterministic component.
• BringBackShapes: We developed the BringBackShapes (BBS) environment to evaluate agents in partially observable and stochastic settings where these variations can be controlled. Our experiments show that the proposed agents are more robust to such variations.
2 Variational Sparse Gating
Reinforcement Learning: The visual control task can be formulated as a Partially Observable Markov Decision Process (POMDP) with discrete time steps $t \in [1; T]$. At each time step, the agent selects an action $a_t \sim p(a_t \mid o_{\leq t}, a_{<t})$ to interact with the environment and receives the next observation and scalar reward $o_t, r_t \sim p(o_t, r_t \mid o_{<t}, a_{<t})$. The goal is to learn a policy that maximizes the expected discounted sum of rewards $\mathbb{E}_p\!\left[\sum_{t=1}^{T} \gamma^t r_t\right]$, where $\gamma$ is the discount factor.
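As a small worked example, the objective above can be evaluated for a finished episode with the following hypothetical helper (not part of any agent implementation):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=1}^{T} gamma^t * r_t for one episode."""
    total, discount = 0.0, gamma  # the sum starts at t = 1, so the first weight is gamma^1
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# Example: rewards [1, 0, 1] with gamma = 0.9 gives 0.9 * 1 + 0.81 * 0 + 0.729 * 1 = 1.629.
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))
```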
Agent: The agent is composed of a world model and a policy (Fig. 1). World models (Sec. 2.1) encode a sequence of observations and actions into latent representations. The agent's behavior (Appendix B) is derived to maximize expected returns on trajectories generated from the learned world model. During training, the world model is learned from the collected experience, the policy is improved on