DICHOTOMY OF CONTROL: SEPARATING WHAT YOU
CAN CONTROL FROM WHAT YOU CANNOT
Mengjiao Yang
University of California, Berkeley
Google Research, Brain Team
sherryy@google.com
Dale Schuurmans
University of Alberta
Google Research, Brain Team
Pieter Abbeel
University of California, Berkeley
Ofir Nachum
Google Research, Brain Team
ABSTRACT
Future- or return-conditioned supervised learning is an emerging paradigm for offline reinforcement learning (RL), where the future outcome (i.e., return) associated with an observed action sequence is used as input to a policy trained to imitate those same actions. While return-conditioning is at the heart of popular algorithms such as decision transformer (DT), these methods tend to perform poorly in highly stochastic environments, where an occasional high return can arise from randomness in the environment rather than the actions themselves. Such situations can lead to a learned policy that is inconsistent with its conditioning inputs; i.e., using the policy to act in the environment, when conditioning on a specific desired return, leads to a distribution of real returns that is wildly different than desired. In this work, we propose the dichotomy of control (DoC), a future-conditioned supervised learning framework that separates mechanisms within a policy’s control (actions) from those beyond a policy’s control (environment stochasticity). We achieve this separation by conditioning the policy on a latent variable representation of the future, and designing a mutual information constraint that removes any information from the latent variable associated with randomness in the environment. Theoretically, we show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior. Empirically, we show that DoC is able to achieve significantly better performance than DT on environments that have highly stochastic rewards and transitions.¹
1 INTRODUCTION
Offline reinforcement learning (RL) aims to extract an optimal policy solely from an existing dataset of previous interactions (Fujimoto et al., 2019; Wu et al., 2019; Kumar et al., 2020). As researchers begin to scale offline RL to large image, text, and video datasets (Agarwal et al., 2020; Fan et al., 2022; Baker et al., 2022; Reed et al., 2022; Reid et al., 2022), a family of methods known as return-conditioned supervised learning (RCSL), including Decision Transformer (DT) (Chen et al., 2021; Lee et al., 2022) and RL via Supervised Learning (RvS) (Emmons et al., 2021), has gained popularity due to its algorithmic simplicity and ease of scaling. At the heart of RCSL is the idea of conditioning a policy on a specific future outcome, often a return (Srivastava et al., 2019; Kumar et al., 2019; Chen et al., 2021) but also sometimes a goal state or generic future event (Codevilla et al., 2018; Ghosh et al., 2019; Lynch et al., 2020). RCSL trains a policy to imitate actions associated with a conditioning input via supervised learning. During inference (i.e., at evaluation), the policy is conditioned on a desirable high-return or future outcome, with the hope of inducing behavior that can achieve this desirable outcome.
¹ Code available at https://github.com/google-research/google-research/tree/master/dichotomy_of_control.
[Figure 1 graphic: two panels, "RCSL / Decision Transformer" (left) and "Dichotomy of Control" (right), each a small tree of states S, S′ and actions a1, a2 annotated with transition probabilities T = 0.01, T = 1 and rewards r = 100, r = 10; the DoC panel additionally marks an expectation 𝔼 over environment transitions. See caption below.]
Figure 1: Illustration of DT (RCSL) and DoC. Circles and squares denote states and actions. Solid arrows denote policy decisions. Dotted arrows denote (stochastic) environment transitions. All arrows and nodes are present in the dataset, i.e., there are 4 trajectories, 2 of which achieve 0 reward. DT maximizes returns across an entire trajectory, leading to suboptimal policies when a large return (r = 100) is achieved only due to very low-probability environment transitions (T = 0.01). DoC separates policy stochasticity from that of the environment and only tries to control action decisions (solid arrows), achieving optimal control through maximizing expected returns at each timestep.
Despite the empirical advantages that come with supervised training (Emmons et al., 2021; Kumar et al., 2021), RCSL can be highly suboptimal in stochastic environments (Brandfonbrener et al., 2022), where the future an RCSL policy conditions on (e.g., return) can be primarily determined by randomness in the environment rather than the data-collecting policy itself. Figure 1 (left) illustrates an example, where conditioning an RCSL policy on the highest return observed in the dataset (r = 100) leads to a policy (a1) that relies on a stochastic transition of very low probability (T = 0.01) to achieve the desired return of r = 100; by comparison, the choice of a2 is much better in terms of average return, as it surely achieves r = 10. The crux of the issue is that the RCSL policy is inconsistent with its conditioning input. Conditioning the policy on a desired return (i.e., 100) to act in the environment leads to a distribution of real returns (i.e., r = 100 with probability only 0.01) that is wildly different from the return value being conditioned on. This issue would not have occurred if the policy could also maximize the transition probability that led to the high-return state, but this is not possible as transition probabilities are a part of the environment and not subject to the policy’s control.
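To make the gap concrete, the expected returns of the two actions can be computed directly from the quantities in Figure 1. This is a minimal calculation; the assumption that a1's other branch yields 0 reward follows the caption's note that two of the four trajectories achieve 0 reward:

```python
# Expected first-step returns in the Figure 1 example.
# a1 reaches the r = 100 state only through a T = 0.01 transition;
# the remaining probability mass is assumed to yield 0 reward.
expected_return_a1 = 0.01 * 100 + 0.99 * 0   # = 1.0

# a2 reaches the r = 10 state with certainty (T = 1).
expected_return_a2 = 1.0 * 10                # = 10.0

print(expected_return_a1, expected_return_a2)  # 1.0 10.0
```

Conditioning RCSL on the best return observed in the dataset (100) thus imitates the action whose expected return is 1, an order of magnitude worse than the alternative.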
A number of works propose a generalization of RCSL, known as future-conditioned supervised learning methods. These techniques have been shown to be effective in imitation learning (Singh et al., 2020; Pertsch et al., 2020), offline Q-learning (Ajay et al., 2020), and online policy gradient (Venuto et al., 2021). It is common in future-conditioned supervised learning to apply a KL divergence regularizer on the latent variable – inspired by variational auto-encoders (VAE) (Kingma & Welling, 2013) and measured with respect to a learned prior conditioned only on past information – to limit the amount of future information captured in the latent variable. It is natural to ask whether this regularizer could remedy the inconsistency of RCSL. Unfortunately, as the KL regularizer makes no distinction between future information that is controllable versus that which is not, such an approach will still exhibit inconsistency, in the sense that the latent variable representation may contain information about the future that is due only to environment stochasticity.
It is clear that the major issue with both RCSL and naïve variational methods is that they make no distinction between stochasticity of the policy (controllable) and stochasticity of the environment (uncontrollable). An optimal policy should maximize over the controllable (actions) and take expectations over the uncontrollable (e.g., transitions), as shown in Figure 1 (right). This implies that, under a variational approach, the latent variable representation that a policy conditions on should not incorporate any information that is solely due to randomness in the environment. In other words, while the latent representation can and should include information about future behavior (i.e., actions), it should not reveal any information about the rewards or transitions associated with this behavior.
To this end, we propose a future-conditioned supervised learning framework termed dichotomy of control (DoC), which, in Stoic terms (Shapiro, 2014), has “the serenity to accept the things it cannot change, courage to change the things it can, and wisdom to know the difference.” DoC separates mechanisms within a policy’s control (actions) from those beyond a policy’s control (environment stochasticity). To achieve this separation, we condition the policy on a latent variable representation of the future while minimizing the mutual information between the latent variable and future stochastic rewards and transitions in the environment. By only capturing the controllable factors in the latent variable, DoC can maximize over each action step without also attempting to maximize environment transitions, as shown in Figure 1 (right). Theoretically, we show that DoC policies are consistent with their conditioning inputs, ensuring that conditioning on a high-return future will correctly induce high-return behavior. Empirically, we show that DoC can outperform both RCSL and naïve variational methods on highly stochastic environments.
2 RELATED WORK
Return-Conditioned Supervised Learning. Since offline RL algorithms (Fujimoto et al., 2019; Wu et al., 2019; Kumar et al., 2020) can be sensitive to hyper-parameters and difficult to apply in practice (Emmons et al., 2021; Kumar et al., 2021), return-conditioned supervised learning (RCSL) has become a popular alternative, particularly when the environment is deterministic and near-expert demonstrations are available (Brandfonbrener et al., 2022). RCSL learns to predict behaviors (actions) by conditioning on desired returns (Schmidhuber, 2019; Kumar et al., 2019) using an MLP policy (Emmons et al., 2021) or a transformer-based policy that encapsulates history (Chen et al., 2021). Richer information than returns, such as goals (Codevilla et al., 2018; Ghosh et al., 2019) or trajectory-level aggregates (Furuta et al., 2021), has also been used as input to a conditional policy in practice. Our work also conditions policies on richer trajectory-level information in the form of a latent variable representation of the future, with additional theoretical justification of such conditioning in stochastic environments.
RCSL Failures in Stochastic Environments. Despite the empirical success of RCSL achieved by DT and RvS, recent work has noted its failure modes in stochastic environments. Paster et al. (2020) and Štrupl et al. (2022) presented counter-examples where online RvS can diverge in stochastic environments. Brandfonbrener et al. (2022) identified near-determinism as a necessary condition for RCSL to achieve optimality guarantees similar to other offline RL algorithms, but did not propose a solution for RCSL in stochastic settings. Paster et al. (2022) identified this same issue with stochastic transitions and proposed to cluster offline trajectories and condition the policy on the average cluster returns. However, the approach in Paster et al. (2022) has technical limitations (see Appendix C), does not account for reward stochasticity, and still conditions the policy on (expected) returns, which can lead to undesirable policy averaging, i.e., a single policy covering two very different behaviors (clusters) that happen to have the same return. Villaflor et al. (2022) also identified overly optimistic behavior of DT and proposed to use a discrete β-VAE to induce diverse future predictions a policy can condition on. However, this approach only defers the issue with stochastic environments to the stochastic latent variables, i.e., the latent variables will still contain stochastic environment information that the policy cannot reliably reproduce.
Learning Latent Variables from Offline Data. Various works have explored learning a latent variable representation of the future (or past) transitions in offline data via maximum likelihood, and using the latent variable to assist planning (Lynch et al., 2020), imitation learning (Kipf et al., 2019; Ajay et al., 2020; Hakhamaneshi et al., 2021), offline RL (Ajay et al., 2020; Zhou et al., 2020), or online RL (Fox et al., 2017; Krishnan et al., 2017; Goyal et al., 2019; Shankar & Gupta, 2020; Singh et al., 2020; Wang et al., 2021; Venuto et al., 2021). These works generally focus on the benefit of increased temporal abstraction afforded by using latent variables as higher-level actions in a hierarchical policy. Villaflor et al. (2022) introduced latent variable models into RCSL, which is one of the essential tools enabling our method, but did not incorporate the constraints that allow RCSL to effectively combat environment stochasticity, as we will see in our work.
3 PRELIMINARIES
Environment Notation. We consider the problem of learning a decision-making agent to interact with a sequential, finite-horizon environment. At time $t = 0$, the agent observes an initial state $s_0$ determined by the environment. After observing $s_t$ at a timestep $0 \le t \le H$, the agent chooses an action $a_t$. After the action is applied, the environment yields an immediate scalar reward $r_t$ and, if $t < H$, a next state $s_{t+1}$. We use $\tau := (s_t, a_t, r_t)_{t=0}^{H}$ to denote a generic episode generated from interactions with the environment, and use $\tau_{i:j} := (s_t, a_t, r_t)_{t=i}^{j}$ to denote a generic sub-episode, with the understanding that $\tau_{0:-1}$ refers to an empty sub-episode. The return associated with an episode $\tau$ is defined as $R(\tau) := \sum_{t=0}^{H} r_t$.

We will use $\mathcal{M}$ to denote the environment. We assume that $\mathcal{M}$ is determined by a stochastic reward function $\mathcal{R}$, stochastic transition function $\mathcal{T}$, and unique initial state $s_0$, so that $r_t \sim \mathcal{R}(\tau_{0:t-1}, s_t, a_t)$ and $s_{t+1} \sim \mathcal{T}(\tau_{0:t-1}, s_t, a_t)$ during interactions with the environment. Note that these definitions specify a history-dependent environment, as opposed to a less general Markovian environment.
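As a concrete companion to this notation, here is a minimal Python sketch (container and function names are illustrative, not from the paper's released code):

```python
from typing import NamedTuple, List, Any

class Step(NamedTuple):
    """One timestep (s_t, a_t, r_t) of an episode tau."""
    state: Any
    action: Any
    reward: float

Episode = List[Step]  # tau = (s_t, a_t, r_t) for t = 0..H

def episode_return(tau: Episode) -> float:
    """R(tau) := sum_{t=0}^{H} r_t."""
    return sum(step.reward for step in tau)

def sub_episode(tau: Episode, i: int, j: int) -> Episode:
    """tau_{i:j} := (s_t, a_t, r_t) for t = i..j; empty when j < i,
    matching the convention that tau_{0:-1} is the empty sub-episode."""
    return tau[i:j + 1]
```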
Learning a Policy in RCSL. In future- or return-conditioned supervised learning, one uses a fixed training data distribution $\mathcal{D}$ of episodes $\tau$ (collected by unknown and potentially multiple agents) to learn a policy $\pi$, where $\pi$ is trained to predict $a_t$ conditioned on the history $\tau_{0:t-1}$, the observation $s_t$, and an additional conditioning variable $z$ that may depend on both the past and future of the episode. For example, in return-conditioned supervised learning, policy training minimizes the following objective over $\pi$:
$$\mathcal{L}_{\mathrm{RCSL}}(\pi) := \mathbb{E}_{\tau \sim \mathcal{D}}\left[\sum_{t=0}^{H} -\log \pi(a_t \mid \tau_{0:t-1}, s_t, z(\tau))\right], \qquad (1)$$
where $z(\tau)$ is the return $R(\tau)$.
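A minimal PyTorch sketch of this objective may help ground the notation. The policy interface and batch layout below are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

def rcsl_loss(policy: nn.Module, batch: dict) -> torch.Tensor:
    """Negative log-likelihood of observed actions, conditioned on the
    history, current state, and conditioning variable z(tau), per Equation (1).

    Assumed (illustrative) batch layout:
      batch["histories"]  encoded tau_{0:t-1} per step, shape [B, T, D_h]
      batch["states"]     s_t,                          shape [B, T, D_s]
      batch["returns"]    z(tau) = R(tau) per step,     shape [B, T, 1]
      batch["actions"]    a_t as integer indices,       shape [B, T]
    `policy` maps (histories, states, returns) to action logits [B, T, A].
    """
    logits = policy(batch["histories"], batch["states"], batch["returns"])
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, batch["actions"].unsqueeze(-1)).squeeze(-1)
    nll = -chosen.sum(dim=1)   # sum over t = 0..H, as in Equation (1)
    return nll.mean()          # empirical expectation over tau ~ D
```

At evaluation time the same network is queried with a desired return in place of $R(\tau)$, which is exactly the step where inconsistency can arise.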
Inconsistency of RCSL. To apply an RCSL-trained policy $\pi$ during inference — i.e., interacting online with the environment — one must first choose a specific $z$.² For example, one might set $z$ to be the maximal return observed in the dataset, in the hopes of inducing a behavior policy which achieves this high return. Using $\pi_z$ as a shorthand to denote the policy $\pi$ conditioned on a specific $z$, we define the expected return $V_{\mathcal{M}}(\pi_z)$ of $\pi_z$ in $\mathcal{M}$ as
$$V_{\mathcal{M}}(\pi_z) := \mathbb{E}_{\tau \sim \Pr[\cdot \mid \pi_z, \mathcal{M}]}\left[R(\tau)\right]. \qquad (2)$$
Ideally the expected return induced by $\pi_z$ is close to $z$, i.e., $z \approx V_{\mathcal{M}}(\pi_z)$, so that acting according to $\pi$ conditioned on a high return induces behavior which actually achieves a high return. However, RCSL training according to Equation 1 will generally yield policies that are highly inconsistent in stochastic environments, meaning that the achieved returns may be significantly different than $z$ (i.e., $V_{\mathcal{M}}(\pi_z) \neq z$). This has been highlighted in various previous works (Brandfonbrener et al., 2022; Paster et al., 2022; Štrupl et al., 2022; Eysenbach et al., 2022; Villaflor et al., 2022), and we provide our own example in Figure 1.
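Consistency can be checked empirically by rolling out the conditioned policy and comparing the Monte Carlo average return against $z$. Below is a minimal sketch, assuming a gym-style environment whose step() returns (next_state, reward, done, info) and an illustrative policy_z(history, state) callable:

```python
def estimate_value(env, policy_z, num_episodes: int = 100) -> float:
    """Monte Carlo estimate of V_M(pi_z) = E[R(tau)] under the conditioned
    policy, per Equation (2). Consistency asks that this be close to z."""
    total = 0.0
    for _ in range(num_episodes):
        state, history, done = env.reset(), [], False
        while not done:
            action = policy_z(history, state)
            next_state, reward, done, _ = env.step(action)
            history.append((state, action, reward))
            total += reward
            state = next_state
    return total / num_episodes
```

In the Figure 1 example, conditioning on z = 100 would produce an estimate near 1, making the inconsistency visible as a number.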
Approaches to Mitigating Inconsistency. A number of future-conditioned supervised learning approaches propose to learn a stochastic latent variable embedding of the future, $q(z \mid \tau)$, while regularizing $q$ with a KL-divergence from a learnable prior conditioned only on the past, $p(z \mid s_0)$ (Ajay et al., 2020; Venuto et al., 2021; Lynch et al., 2020), thereby minimizing:
$$\mathcal{L}_{\mathrm{VAE}}(\pi, q, p) := \mathbb{E}_{\tau \sim \mathcal{D},\, z \sim q(z \mid \tau)}\left[\sum_{t=0}^{H} -\log \pi(a_t \mid \tau_{0:t-1}, s_t, z)\right] + \beta \cdot \mathbb{E}_{\tau \sim \mathcal{D}}\left[D_{\mathrm{KL}}(q(z \mid \tau) \,\|\, p(z \mid s_0))\right]. \qquad (3)$$
One could consider adopting such a future-conditioned objective in RCSL. However, since the KL regularizer makes no distinction between observations the agent can control (actions) and those it cannot (environment stochasticity), the choice of coefficient $\beta$ applied to the regularizer introduces a ‘lose-lose’ trade-off. Namely, as noted in Ajay et al. (2020), if the regularization coefficient is too large ($\beta \geq 1$), the policy will not learn diverse behavior (since the KL limits how much information about future actions is contained in $z$); while if the coefficient is too small ($\beta < 1$), the policy’s learned behavior will be inconsistent (in the sense that $z$ will contain information about environment stochasticity that the policy cannot reliably reproduce). The discrete $\beta$-VAE incorporated by Villaflor et al. (2022) with $\beta < 1$ corresponds to this second failure mode.
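A minimal PyTorch sketch of this objective follows; the module interfaces and batch fields are illustrative assumptions, and Gaussian $q$ and $p$ are a common choice rather than something mandated by the text:

```python
import torch
import torch.nn as nn

def future_vae_loss(policy: nn.Module,
                    q_encoder: nn.Module,
                    p_prior: nn.Module,
                    batch: dict,
                    beta: float) -> torch.Tensor:
    """Future-conditioned VAE objective of Equation (3): action negative
    log-likelihood under z ~ q(z | tau), plus beta * KL(q(z | tau) || p(z | s_0)).

    Assumed (illustrative) interfaces:
      q_encoder(batch["trajectories"])   -> (mu_q, log_std_q)  # q(z | tau)
      p_prior(batch["initial_states"])   -> (mu_p, log_std_p)  # p(z | s_0)
      policy(histories, states, z)       -> action logits [B, T, A]
    """
    mu_q, log_std_q = q_encoder(batch["trajectories"])
    mu_p, log_std_p = p_prior(batch["initial_states"])
    q = torch.distributions.Normal(mu_q, log_std_q.exp())
    p = torch.distributions.Normal(mu_p, log_std_p.exp())

    z = q.rsample()  # reparameterized sample, one z per trajectory, [B, Z]
    z_per_step = z.unsqueeze(1).expand(-1, batch["actions"].shape[1], -1)

    logits = policy(batch["histories"], batch["states"], z_per_step)
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, batch["actions"].unsqueeze(-1)).squeeze(-1)

    kl = torch.distributions.kl_divergence(q, p).sum(dim=-1)  # [B]
    return (nll.sum(dim=1) + beta * kl).mean()
```

The ‘lose-lose’ trade-off then lives entirely in the scalar beta: pushing it up collapses $q$ toward the past-only prior and removes diversity, while pushing it down lets $z$ absorb uncontrollable reward and transition information.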
² For simplicity, we assume $z$ is chosen at timestep $t = 0$ and held constant throughout an entire episode. As noted in Brandfonbrener et al. (2022), this protocol also encompasses instances like DT (Chen et al., 2021) in which $z$ at timestep $t$ is the (desired) return summed starting at $t$.