DICHOTOMY OF CONTROL: SEPARATING WHAT YOU
CAN CONTROL FROM WHAT YOU CANNOT
Mengjiao Yang
University of California, Berkeley
Google Research, Brain Team
sherryy@google.com
Dale Schuurmans
University of Alberta
Google Research, Brain Team
Pieter Abbeel
University of California, Berkeley
Ofir Nachum
Google Research, Brain Team
ABSTRACT
Future- or return-conditioned supervised learning is an emerging paradigm for offline reinforcement learning (RL), where the future outcome (i.e., return) associated with an observed action sequence is used as input to a policy trained to imitate those same actions. While return-conditioning is at the heart of popular algorithms such as decision transformer (DT), these methods tend to perform poorly in highly stochastic environments, where an occasional high return can arise from randomness in the environment rather than the actions themselves. Such situations can lead to a learned policy that is inconsistent with its conditioning inputs; i.e., using the policy to act in the environment, when conditioning on a specific desired return, leads to a distribution of real returns that is wildly different than desired. In this work, we propose the dichotomy of control (DoC), a future-conditioned supervised learning framework that separates mechanisms within a policy’s control (actions) from those beyond a policy’s control (environment stochasticity). We achieve this separation by conditioning the policy on a latent variable representation of the future, and designing a mutual information constraint that removes any information from the latent variable associated with randomness in the environment. Theoretically, we show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior. Empirically, we show that DoC is able to achieve significantly better performance than DT on environments that have highly stochastic rewards and transitions.¹
1 INTRODUCTION
Offline reinforcement learning (RL) aims to extract an optimal policy solely from an existing dataset of previous interactions (Fujimoto et al., 2019; Wu et al., 2019; Kumar et al., 2020). As researchers begin to scale offline RL to large image, text, and video datasets (Agarwal et al., 2020; Fan et al., 2022; Baker et al., 2022; Reed et al., 2022; Reid et al., 2022), a family of methods known as return-conditioned supervised learning (RCSL), including Decision Transformer (DT) (Chen et al., 2021; Lee et al., 2022) and RL via Supervised Learning (RvS) (Emmons et al., 2021), has gained popularity due to its algorithmic simplicity and ease of scaling. At the heart of RCSL is the idea of conditioning a policy on a specific future outcome, often a return (Srivastava et al., 2019; Kumar et al., 2019; Chen et al., 2021) but also sometimes a goal state or generic future event (Codevilla et al., 2018; Ghosh et al., 2019; Lynch et al., 2020). RCSL trains a policy to imitate actions associated with a conditioning input via supervised learning. During inference (i.e., at evaluation), the policy is conditioned on a desirable high-return or future outcome, with the hope of inducing behavior that can achieve this desirable outcome.
¹ Code available at https://github.com/google-research/google-research/tree/master/dichotomy_of_control.
[Figure 1 graphic: two panels, "RCSL / Decision Transformer" (left) and "Dichotomy of Control" (right), each a small tree of states S, S′ and actions a1, a2 annotated with transition probabilities T = 0.01, T = 1 and rewards r = 100, r = 10; the DoC panel additionally marks an expectation 𝔼 over environment transitions. See caption below.]
Figure 1: Illustration of DT (RCSL) and DoC. Circles and squares denote states and actions. Solid arrows denote policy decisions. Dotted arrows denote (stochastic) environment transitions. All arrows and nodes are present in the dataset, i.e., there are 4 trajectories, 2 of which achieve 0 reward. DT maximizes returns across an entire trajectory, leading to suboptimal policies when a large return (r = 100) is achieved only due to very low-probability environment transitions (T = 0.01). DoC separates policy stochasticity from that of the environment and only tries to control action decisions (solid arrows), achieving optimal control through maximizing expected returns at each timestep.
Despite the empirical advantages that come with supervised training (Emmons et al., 2021; Kumar et al., 2021), RCSL can be highly suboptimal in stochastic environments (Brandfonbrener et al., 2022), where the future an RCSL policy conditions on (e.g., return) can be primarily determined by randomness in the environment rather than the data-collecting policy itself. Figure 1 (left) illustrates an example, where conditioning an RCSL policy on the highest return observed in the dataset (r = 100) leads to a policy (a1) that relies on a stochastic transition of very low probability (T = 0.01) to achieve the desired return of r = 100; by comparison, the choice of a2 is much better in terms of average return, as it surely achieves r = 10. The crux of the issue is that the RCSL policy is inconsistent with its conditioning input. Conditioning the policy on a desired return (i.e., 100) to act in the environment leads to a distribution of real returns (i.e., r = 100 with probability only 0.01) that is wildly different from the return value being conditioned on. This issue would not have occurred if the policy could also maximize the transition probability that led to the high-return state, but this is not possible as transition probabilities are a part of the environment and not subject to the policy’s control.
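To make the gap concrete, the expected returns of the two actions can be computed directly from the quantities in Figure 1. This is a minimal calculation; the assumption that a1's other branch yields 0 reward follows the caption's note that two of the four trajectories achieve 0 reward:

```python
# Expected first-step returns in the Figure 1 example.
# a1 reaches the r = 100 state only through a T = 0.01 transition;
# the remaining probability mass is assumed to yield 0 reward.
expected_return_a1 = 0.01 * 100 + 0.99 * 0   # = 1.0

# a2 reaches the r = 10 state with certainty (T = 1).
expected_return_a2 = 1.0 * 10                # = 10.0

print(expected_return_a1, expected_return_a2)  # 1.0 10.0
```

Conditioning RCSL on the best return observed in the dataset (100) thus imitates the action whose expected return is 1, an order of magnitude worse than the alternative.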
A number of works propose a generalization of RCSL, known as future-conditioned supervised learning methods. These techniques have been shown to be effective in imitation learning (Singh et al., 2020; Pertsch et al., 2020), offline Q-learning (Ajay et al., 2020), and online policy gradient (Venuto et al., 2021). It is common in future-conditioned supervised learning to apply a KL divergence regularizer on the latent variable – inspired by variational auto-encoders (VAE) (Kingma & Welling, 2013) and measured with respect to a learned prior conditioned only on past information – to limit the amount of future information captured in the latent variable. It is natural to ask whether this regularizer could remedy the inconsistency of RCSL. Unfortunately, as the KL regularizer makes no distinction between future information that is controllable versus that which is not, such an approach will still exhibit inconsistency, in the sense that the latent variable representation may contain information about the future that is due only to environment stochasticity.
It is clear that the major issue with both RCSL and naïve variational methods is that they make no distinction between stochasticity of the policy (controllable) and stochasticity of the environment (uncontrollable). An optimal policy should maximize over the controllable (actions) and take expectations over the uncontrollable (e.g., transitions), as shown in Figure 1 (right). This implies that, under a variational approach, the latent variable representation that a policy conditions on should not incorporate any information that is solely due to randomness in the environment. In other words, while the latent representation can and should include information about future behavior (i.e., actions), it should not reveal any information about the rewards or transitions associated with this behavior.
To this end, we propose a future-conditioned supervised learning framework termed dichotomy of control (DoC), which, in Stoic terms (Shapiro, 2014), has “the serenity to accept the things it cannot change, courage to change the things it can, and wisdom to know the difference.” DoC separates mechanisms within a policy’s control (actions) from those beyond a policy’s control (environment stochasticity). To achieve this separation, we condition the policy on a latent variable representation of the future while minimizing the mutual information between the latent variable and future stochastic rewards and transitions in the environment. By only capturing the controllable factors in the latent variable, DoC can maximize over each action step without also attempting to maximize environment transitions, as shown in Figure 1 (right). Theoretically, we show that DoC policies are consistent with their conditioning inputs, ensuring that conditioning on a high-return future will correctly induce high-return behavior. Empirically, we show that DoC can outperform both RCSL and naïve variational methods on highly stochastic environments.
2 RELATED WORK
Return-Conditioned Supervised Learning. Since offline RL algorithms (Fujimoto et al., 2019; Wu et al., 2019; Kumar et al., 2020) can be sensitive to hyper-parameters and difficult to apply in practice (Emmons et al., 2021; Kumar et al., 2021), return-conditioned supervised learning (RCSL) has become a popular alternative, particularly when the environment is deterministic and near-expert demonstrations are available (Brandfonbrener et al., 2022). RCSL learns to predict behaviors (actions) by conditioning on desired returns (Schmidhuber, 2019; Kumar et al., 2019) using an MLP policy (Emmons et al., 2021) or a transformer-based policy that encapsulates history (Chen et al., 2021). Richer information than returns, such as goals (Codevilla et al., 2018; Ghosh et al., 2019) or trajectory-level aggregates (Furuta et al., 2021), has also been used as input to a conditional policy in practice. Our work also conditions policies on richer trajectory-level information in the form of a latent variable representation of the future, with additional theoretical justification of such conditioning in stochastic environments.
RCSL Failures in Stochastic Environments. Despite the empirical success of RCSL achieved by DT and RvS, recent work has noted its failure modes in stochastic environments. Paster et al. (2020) and Štrupl et al. (2022) presented counter-examples where online RvS can diverge in stochastic environments. Brandfonbrener et al. (2022) identified near-determinism as a necessary condition for RCSL to achieve optimality guarantees similar to other offline RL algorithms, but did not propose a solution for RCSL in stochastic settings. Paster et al. (2022) identified this same issue with stochastic transitions and proposed to cluster offline trajectories and condition the policy on the average cluster returns. However, the approach in Paster et al. (2022) has technical limitations (see Appendix C), does not account for reward stochasticity, and still conditions the policy on (expected) returns, which can lead to undesirable policy averaging, i.e., a single policy covering two very different behaviors (clusters) that happen to have the same return. Villaflor et al. (2022) also identified overly optimistic behavior of DT and proposed to use a discrete β-VAE to induce diverse future predictions a policy can condition on. However, this approach only defers the issue with stochastic environments to the stochastic latent variables, i.e., the latent variables will still contain stochastic environment information that the policy cannot reliably reproduce.
Learning Latent Variables from Offline Data. Various works have explored learning a latent variable representation of the future (or past) transitions in offline data via maximum likelihood, and using the latent variable to assist planning (Lynch et al., 2020), imitation learning (Kipf et al., 2019; Ajay et al., 2020; Hakhamaneshi et al., 2021), offline RL (Ajay et al., 2020; Zhou et al., 2020), or online RL (Fox et al., 2017; Krishnan et al., 2017; Goyal et al., 2019; Shankar & Gupta, 2020; Singh et al., 2020; Wang et al., 2021; Venuto et al., 2021). These works generally focus on the benefit of increased temporal abstraction afforded by using latent variables as higher-level actions in a hierarchical policy. Villaflor et al. (2022) introduced latent variable models into RCSL, which is one of the essential tools enabling our method, but did not incorporate the constraints that allow RCSL to effectively combat environment stochasticity, as we will see in our work.
3 PRELIMINARIES
Environment Notation. We consider the problem of learning a decision-making agent to interact with a sequential, finite-horizon environment. At time $t = 0$, the agent observes an initial state $s_0$ determined by the environment. After observing $s_t$ at a timestep $0 \le t \le H$, the agent chooses an action $a_t$. After the action is applied, the environment yields an immediate scalar reward $r_t$ and, if $t < H$, a next state $s_{t+1}$. We use $\tau := (s_t, a_t, r_t)_{t=0}^{H}$ to denote a generic episode generated from interactions with the environment, and use $\tau_{i:j} := (s_t, a_t, r_t)_{t=i}^{j}$ to denote a generic sub-episode, with the understanding that $\tau_{0:-1}$ refers to an empty sub-episode. The return associated with an episode $\tau$ is defined as $R(\tau) := \sum_{t=0}^{H} r_t$.

We will use $\mathcal{M}$ to denote the environment. We assume that $\mathcal{M}$ is determined by a stochastic reward function $\mathcal{R}$, stochastic transition function $\mathcal{T}$, and unique initial state $s_0$, so that $r_t \sim \mathcal{R}(\tau_{0:t-1}, s_t, a_t)$ and $s_{t+1} \sim \mathcal{T}(\tau_{0:t-1}, s_t, a_t)$ during interactions with the environment. Note that these definitions specify a history-dependent environment, as opposed to a less general Markovian environment.
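As a concrete companion to this notation, here is a minimal Python sketch (container and function names are illustrative, not from the paper's released code):

```python
from typing import NamedTuple, List, Any

class Step(NamedTuple):
    """One timestep (s_t, a_t, r_t) of an episode tau."""
    state: Any
    action: Any
    reward: float

Episode = List[Step]  # tau = (s_t, a_t, r_t) for t = 0..H

def episode_return(tau: Episode) -> float:
    """R(tau) := sum_{t=0}^{H} r_t."""
    return sum(step.reward for step in tau)

def sub_episode(tau: Episode, i: int, j: int) -> Episode:
    """tau_{i:j} := (s_t, a_t, r_t) for t = i..j; empty when j < i,
    matching the convention that tau_{0:-1} is the empty sub-episode."""
    return tau[i:j + 1]
```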
Learning a Policy in RCSL. In future- or return-conditioned supervised learning, one uses a fixed training data distribution $\mathcal{D}$ of episodes $\tau$ (collected by unknown and potentially multiple agents) to learn a policy $\pi$, where $\pi$ is trained to predict $a_t$ conditioned on the history $\tau_{0:t-1}$, the observation $s_t$, and an additional conditioning variable $z$ that may depend on both the past and future of the episode. For example, in return-conditioned supervised learning, policy training minimizes the following objective over $\pi$:
$$\mathcal{L}_{\mathrm{RCSL}}(\pi) := \mathbb{E}_{\tau \sim \mathcal{D}}\left[\sum_{t=0}^{H} -\log \pi(a_t \mid \tau_{0:t-1}, s_t, z(\tau))\right], \qquad (1)$$
where $z(\tau)$ is the return $R(\tau)$.
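A minimal PyTorch sketch of this objective may help ground the notation. The policy interface and batch layout below are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

def rcsl_loss(policy: nn.Module, batch: dict) -> torch.Tensor:
    """Negative log-likelihood of observed actions, conditioned on the
    history, current state, and conditioning variable z(tau), per Equation (1).

    Assumed (illustrative) batch layout:
      batch["histories"]  encoded tau_{0:t-1} per step, shape [B, T, D_h]
      batch["states"]     s_t,                          shape [B, T, D_s]
      batch["returns"]    z(tau) = R(tau) per step,     shape [B, T, 1]
      batch["actions"]    a_t as integer indices,       shape [B, T]
    `policy` maps (histories, states, returns) to action logits [B, T, A].
    """
    logits = policy(batch["histories"], batch["states"], batch["returns"])
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, batch["actions"].unsqueeze(-1)).squeeze(-1)
    nll = -chosen.sum(dim=1)   # sum over t = 0..H, as in Equation (1)
    return nll.mean()          # empirical expectation over tau ~ D
```

At evaluation time the same network is queried with a desired return in place of $R(\tau)$, which is exactly the step where inconsistency can arise.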
Inconsistency of RCSL. To apply an RCSL-trained policy $\pi$ during inference — i.e., interacting online with the environment — one must first choose a specific $z$.² For example, one might set $z$ to be the maximal return observed in the dataset, in the hopes of inducing a behavior policy which achieves this high return. Using $\pi_z$ as a shorthand to denote the policy $\pi$ conditioned on a specific $z$, we define the expected return $V_{\mathcal{M}}(\pi_z)$ of $\pi_z$ in $\mathcal{M}$ as
$$V_{\mathcal{M}}(\pi_z) := \mathbb{E}_{\tau \sim \Pr[\cdot \mid \pi_z, \mathcal{M}]}\left[R(\tau)\right]. \qquad (2)$$
Ideally the expected return induced by $\pi_z$ is close to $z$, i.e., $z \approx V_{\mathcal{M}}(\pi_z)$, so that acting according to $\pi$ conditioned on a high return induces behavior which actually achieves a high return. However, RCSL training according to Equation 1 will generally yield policies that are highly inconsistent in stochastic environments, meaning that the achieved returns may be significantly different than $z$ (i.e., $V_{\mathcal{M}}(\pi_z) \neq z$). This has been highlighted in various previous works (Brandfonbrener et al., 2022; Paster et al., 2022; Štrupl et al., 2022; Eysenbach et al., 2022; Villaflor et al., 2022), and we provide our own example in Figure 1.
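Consistency can be checked empirically by rolling out the conditioned policy and comparing the Monte Carlo average return against $z$. Below is a minimal sketch, assuming a gym-style environment whose step() returns (next_state, reward, done, info) and an illustrative policy_z(history, state) callable:

```python
def estimate_value(env, policy_z, num_episodes: int = 100) -> float:
    """Monte Carlo estimate of V_M(pi_z) = E[R(tau)] under the conditioned
    policy, per Equation (2). Consistency asks that this be close to z."""
    total = 0.0
    for _ in range(num_episodes):
        state, history, done = env.reset(), [], False
        while not done:
            action = policy_z(history, state)
            next_state, reward, done, _ = env.step(action)
            history.append((state, action, reward))
            total += reward
            state = next_state
    return total / num_episodes
```

In the Figure 1 example, conditioning on z = 100 would produce an estimate near 1, making the inconsistency visible as a number.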
Approaches to Mitigating Inconsistency. A number of future-conditioned supervised learning approaches propose to learn a stochastic latent variable embedding of the future, $q(z \mid \tau)$, while regularizing $q$ with a KL-divergence from a learnable prior conditioned only on the past, $p(z \mid s_0)$ (Ajay et al., 2020; Venuto et al., 2021; Lynch et al., 2020), thereby minimizing:
$$\mathcal{L}_{\mathrm{VAE}}(\pi, q, p) := \mathbb{E}_{\tau \sim \mathcal{D},\, z \sim q(z \mid \tau)}\left[\sum_{t=0}^{H} -\log \pi(a_t \mid \tau_{0:t-1}, s_t, z)\right] + \beta \cdot \mathbb{E}_{\tau \sim \mathcal{D}}\left[D_{\mathrm{KL}}(q(z \mid \tau) \,\|\, p(z \mid s_0))\right]. \qquad (3)$$
One could consider adopting such a future-conditioned objective in RCSL. However, since the KL regularizer makes no distinction between observations the agent can control (actions) and those it cannot (environment stochasticity), the choice of coefficient $\beta$ applied to the regularizer introduces a ‘lose-lose’ trade-off. Namely, as noted in Ajay et al. (2020), if the regularization coefficient is too large ($\beta \geq 1$), the policy will not learn diverse behavior (since the KL limits how much information about future actions is contained in $z$); while if the coefficient is too small ($\beta < 1$), the policy’s learned behavior will be inconsistent (in the sense that $z$ will contain information about environment stochasticity that the policy cannot reliably reproduce). The discrete $\beta$-VAE incorporated by Villaflor et al. (2022) with $\beta < 1$ corresponds to this second failure mode.
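A minimal PyTorch sketch of this objective follows; the module interfaces and batch fields are illustrative assumptions, and Gaussian $q$ and $p$ are a common choice rather than something mandated by the text:

```python
import torch
import torch.nn as nn

def future_vae_loss(policy: nn.Module,
                    q_encoder: nn.Module,
                    p_prior: nn.Module,
                    batch: dict,
                    beta: float) -> torch.Tensor:
    """Future-conditioned VAE objective of Equation (3): action negative
    log-likelihood under z ~ q(z | tau), plus beta * KL(q(z | tau) || p(z | s_0)).

    Assumed (illustrative) interfaces:
      q_encoder(batch["trajectories"])   -> (mu_q, log_std_q)  # q(z | tau)
      p_prior(batch["initial_states"])   -> (mu_p, log_std_p)  # p(z | s_0)
      policy(histories, states, z)       -> action logits [B, T, A]
    """
    mu_q, log_std_q = q_encoder(batch["trajectories"])
    mu_p, log_std_p = p_prior(batch["initial_states"])
    q = torch.distributions.Normal(mu_q, log_std_q.exp())
    p = torch.distributions.Normal(mu_p, log_std_p.exp())

    z = q.rsample()  # reparameterized sample, one z per trajectory, [B, Z]
    z_per_step = z.unsqueeze(1).expand(-1, batch["actions"].shape[1], -1)

    logits = policy(batch["histories"], batch["states"], z_per_step)
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, batch["actions"].unsqueeze(-1)).squeeze(-1)

    kl = torch.distributions.kl_divergence(q, p).sum(dim=-1)  # [B]
    return (nll.sum(dim=1) + beta * kl).mean()
```

The ‘lose-lose’ trade-off then lives entirely in the scalar beta: pushing it up collapses $q$ toward the past-only prior and removes diversity, while pushing it down lets $z$ absorb uncontrollable reward and transition information.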
² For simplicity, we assume $z$ is chosen at timestep $t = 0$ and held constant throughout an entire episode. As noted in Brandfonbrener et al. (2022), this protocol also encompasses instances like DT (Chen et al., 2021) in which $z$ at timestep $t$ is the (desired) return summed starting at $t$.