
stochasticity). To achieve this separation, we condition the policy on a latent variable representa-
tion of the future while minimizing the mutual information between the latent variable and future
stochastic rewards and transitions in the environment. By only capturing the controllable factors in
the latent variable, DoC can maximize over each action step without also attempting to maximize
over environment transitions, as shown in Figure 1 (right). Theoretically, we show that DoC policies are
consistent with their conditioning inputs, ensuring that conditioning on a high-return future will cor-
rectly induce high-return behavior. Empirically, we show that DoC can outperform both RCSL and
naïve variational methods on highly stochastic environments.
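For intuition, the following minimal sketch illustrates the structure of such a future-conditioned policy: a behavior-cloning policy conditioned on a learned latent summary of the future. Module and variable names are illustrative, and the mutual-information constraint that distinguishes DoC is indicated only as a comment; its precise form is developed later in the paper.

```python
# Minimal sketch of a future-conditioned behavior-cloning policy (illustrative names;
# not the full DoC objective). DoC additionally minimizes the mutual information
# between the latent z and stochastic future rewards/transitions; that term is only
# indicated as a comment below.
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, latent_dim = 4, 3, 8

# Encodes the future trajectory (states and rewards) into a latent summary z.
future_encoder = nn.GRU(input_size=state_dim + 1, hidden_size=latent_dim, batch_first=True)
# Policy acts given the current state and the latent z.
policy = nn.Sequential(nn.Linear(state_dim + latent_dim, 64), nn.ReLU(),
                       nn.Linear(64, action_dim))

def conditional_bc_loss(states, first_actions, rewards):
    """states: [B, T, state_dim], first_actions: [B] (long), rewards: [B, T, 1]."""
    _, h = future_encoder(torch.cat([states, rewards], dim=-1))
    z = h[-1]                                              # [B, latent_dim]
    logits = policy(torch.cat([states[:, 0], z], dim=-1))  # condition on s_0 and z
    loss = F.cross_entropy(logits, first_actions)
    # + beta * MI(z; future stochastic rewards/transitions)  <- DoC's constraint (omitted here)
    return loss

# Smoke test on random data.
B, T = 16, 5
conditional_bc_loss(torch.randn(B, T, state_dim),
                    torch.randint(action_dim, (B,)),
                    torch.randn(B, T, 1)).backward()
```

At inference time, one would select a latent associated with desirable (controllable) outcomes and condition the policy on it.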
2 RELATED WORK
Return-Conditioned Supervised Learning. Since offline RL algorithms (Fujimoto et al., 2019;
Wu et al., 2019; Kumar et al., 2020) can be sensitive to hyper-parameters and difficult to apply in
practice (Emmons et al., 2021; Kumar et al., 2021), return-conditioned supervised learning (RCSL)
has become a popular alternative, particularly when the environment is deterministic and near-expert
demonstrations are available (Brandfonbrener et al., 2022). RCSL learns to predict behaviors (ac-
tions) by conditioning on desired returns (Schmidhuber, 2019; Kumar et al., 2019) using an MLP
policy (Emmons et al., 2021) or a transformer-based policy that encapsulates history (Chen et al.,
2021). Richer information than returns, such as goals (Codevilla et al., 2018; Ghosh et al.,
2019) or trajectory-level aggregates (Furuta et al., 2021), has also been used as input to a conditional
policy in practice. Our work also conditions policies on richer trajectory-level information in
the form of a latent variable representation of the future, with additional theoretical justifications of
such conditioning in stochastic environments.
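Concretely, the core of RCSL reduces to a supervised objective over (state, return-to-go, action) tuples; the sketch below uses illustrative names and is only in the spirit of an MLP-based RvS policy, omitting the architectural details of DT and RvS.

```python
# Minimal RCSL sketch: supervised action prediction conditioned on return-to-go
# (illustrative names; omits architecture/history details of DT and RvS).
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 4, 3
policy = nn.Sequential(nn.Linear(state_dim + 1, 64), nn.ReLU(),
                       nn.Linear(64, action_dim))

def rcsl_loss(states, returns_to_go, actions):
    """states: [B, state_dim], returns_to_go: [B, 1], actions: [B] (long)."""
    logits = policy(torch.cat([states, returns_to_go], dim=-1))
    return F.cross_entropy(logits, actions)

# At test time the policy is conditioned on a desired (typically high) return:
#   a = policy(torch.cat([s, desired_return], dim=-1)).argmax(-1)
```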
RCSL Failures in Stochastic Environments. Despite the empirical success of RCSL achieved by
DT and RvS, recent work has noted its failure modes in stochastic environments. Paster et al. (2020)
and Štrupl et al. (2022) presented counter-examples where online RvS can diverge in stochastic
environments. Brandfonbrener et al. (2022) identified near-determinism as a necessary condition for
RCSL to achieve optimality guarantees similar to other offline RL algorithms but did not propose a
solution for RCSL in stochastic settings. Paster et al. (2022) identified this same issue with stochastic
transitions and proposed to cluster offline trajectories and condition the policy on the average cluster
returns. However, the approach in Paster et al. (2022) has technical limitations (see Appendix C),
does not account for reward stochasticity, and still conditions the policy on (expected) returns, which
can lead to undesirable policy-averaging, i.e., a single policy covering two very different behaviors
(clusters) that happen to have the same return. Villaflor et al. (2022) also identified the overly
optimistic behavior of DT and proposed to use a discrete β-VAE to induce diverse future predictions
for a policy to condition on. However, this approach merely defers the issue from stochastic
environments to stochastic latent variables, i.e., the latent variables will still contain stochastic
environment information that the policy cannot reliably reproduce.
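To make the failure mode concrete, consider the following toy illustration (our own minimal example, not one of the cited counter-examples): in a one-step bandit, conditioning on the best return seen in the data commits the policy to the arm whose high return was pure luck, even though its expected reward is far lower.

```python
# Toy illustration of RCSL's failure under reward stochasticity (assumed example).
# Arm 0 yields reward 1 deterministically; arm 1 yields 2 with prob 0.1, else 0
# (expected reward 0.2). Conditioning on the highest observed return (2.0)
# deterministically selects the worse arm.
import random
random.seed(0)

def pull(arm):
    if arm == 0:
        return 1.0
    return 2.0 if random.random() < 0.1 else 0.0

# Offline dataset collected by a uniform behavior policy.
data = [(arm, pull(arm)) for arm in (random.randint(0, 1) for _ in range(100_000))]

# RCSL in this one-step case reduces to the empirical conditional P(arm | return).
target_return = 2.0
arms = [arm for arm, ret in data if ret == target_return]
print(sum(arms) / len(arms))  # -> 1.0: always arm 1, despite its lower expected reward
```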
Learning Latent Variables from Offline Data. Various works have explored learning a latent
variable representation of the future (or past) transitions in offline data via maximum likelihood and
using the latent variable to assist planning (Lynch et al., 2020), imitation learning (Kipf et al., 2019;
Ajay et al., 2020; Hakhamaneshi et al., 2021), offline RL (Ajay et al., 2020; Zhou et al., 2020),
or online RL (Fox et al., 2017; Krishnan et al., 2017; Goyal et al., 2019; Shankar & Gupta, 2020;
Singh et al., 2020; Wang et al., 2021; Venuto et al., 2021). These works generally focus on the
benefit of increased temporal abstraction afforded by using latent variables as higher-level actions in
a hierarchical policy. Villaflor et al. (2022) introduced latent variable models into RCSL, which
is one of the essential tools enabling our method, but they did not incorporate the appropriate
constraints that allow RCSL to effectively combat environment stochasticity, as we will show
in our work.
3 PRELIMINARIES
Environment Notation. We consider the problem of learning a decision-making agent that interacts
with a sequential, finite-horizon environment. At time t = 0, the agent observes an initial state s_0
determined by the environment. After observing s_t at a timestep 0 ≤ t ≤ H, the agent chooses an
action a_t. After the action is applied, the environment yields an immediate scalar reward r_t and, if