Improving Policy Learning via Language Dynamics Distillation
Victor Zhong1,2, Jesse Mu3, Luke Zettlemoyer1,2, Edward Grefenstette4,5 and Tim Rocktäschel4
1University of Washington
2Meta AI Research
3Stanford University
4University College London
5Cohere
Abstract
Recent work has shown that augmenting environments with language descriptions
improves policy learning. However, for environments with complex language
abstractions, learning how to ground language to observations is difficult due
to sparse, delayed rewards. We propose Language Dynamics Distillation (LDD),
which pretrains a model to predict environment dynamics given demonstrations
with language descriptions, and then fine-tunes these language-aware pretrained
representations via reinforcement learning (RL). In this way, the model is trained to
both maximize expected reward and retain knowledge about how language relates
to environment dynamics. On SILG, a benchmark of five tasks with language de-
scriptions that evaluate distinct generalization challenges on unseen environments
(NetHack, ALFWorld, RTFM, Messenger, and Touchdown), LDD outperforms
tabula-rasa RL, VAE pretraining, and methods that learn from unlabeled demon-
strations in inverse RL and reward shaping with pretrained experts. In our analyses,
we show that language descriptions in demonstrations improve sample-efficiency
and generalization across environments, and that dynamics modeling with expert
demonstrations is more effective than with non-experts.
1 Introduction
Language is a powerful medium that humans use to reason about abstractions—its compositionality
allows efficient descriptions that generalize across environments and tasks. Consider an agent that
follows instructions to clean the house (e.g. find the dirty dishes and wash them). In tabula-rasa
reinforcement learning (RL), the agent observes raw perceptual features of the environment, then
grounds these visual features to language cues to learn how to behave through trial and error. In
contrast, we can provide the agent with language descriptions that describe abstractions which are
present in the environment (e.g. there is a sink to your left and dishes on a table to your right), thereby
simplifying the grounding challenge. Language descriptions of observations occur naturally in many
environments such as text prompts in graphical user interfaces [Liu et al., 2018], dialogue [He et al.,
2018], and interactive games [Küttler et al., 2020]. Recent work has also shown improvements in
visual manipulation [Shridhar et al., 2021] and navigation [Zhong et al., 2021, Tam et al., 2022] by
captioning the observations with language descriptions. Despite these gains, learning how to interpret
language descriptions is difficult through RL, especially on environments with complex language
abstractions [Zhong et al., 2021].
Corresponding author: Victor Zhong (vzhong@cs.washington.edu).
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.00066v1 [cs.LG] 30 Sep 2022
Figure 1: Language Dynamics Distillation (LDD). LDD uses cheap unlabeled demonstrations to learn a
dynamics model of the environment, which is used to initialize and distill grounded representations
into the policy learner. During the dynamics modeling phase (purple), we train a teacher model to
predict the next observation given prior observations using unlabeled demonstrations. In the policy
learning phase (red), we initialize a model with the teacher and distill intermediate representations
from the teacher during reinforcement learning. The traditional policy learning loop is shown in
green. LDD-specific components are shown in blue.
We present Language Dynamics Distillation (LDD), a method that improves RL by learning a dynamics model on cheaply obtained unlabeled (i.e. no action labels) demonstrations with language
descriptions. When learning how to use language descriptions effectively, one central challenge
is how to disentangle language understanding from policy performance when learning from sparse, delayed re-
wards. Our motivation is to learn initial language grounding via dynamics modeling from an offline
dataset, away from the credit assignment and non-stationarity challenges posed by RL. While labeled
demonstrations that tell the agent how to act in each situation are expensive to collect, for many
environments one can cheaply obtain unlabeled demonstrations (e.g. videos of experts performing the
task) [Yang et al., 2019, Stadie et al., 2017]. Intuitively, LDD exploits these unlabeled demonstrations
to learn how to associate language descriptions with abstractions in the environment. This knowledge
is then used to bootstrap and more quickly learn policies that generalize to instructions and manuals
in new environments. Given unlabeled demonstrations with language descriptions (e.g. captions of
scene content), we first pretrain the model to predict the next observation given prior observations,
similar to language modeling. A copy of this model is stored as a fixed teacher that grounds language
descriptions to predict environment dynamics. We then train a model with RL, while distilling inter-
mediate representations from the teacher to avoid catastrophic forgetting of how to interpret language
descriptions for dynamics modeling. In this way, the model learns to both maximize expected reward
and retain knowledge about how language descriptions relate to environment dynamics.
We evaluate LDD on the recent SILG benchmark [Zhong et al., 2021], which consists of five diverse en-
vironments with language descriptions including NetHack [Küttler et al., 2020], ALFWorld [Shridhar
et al., 2021], RTFM [Zhong et al., 2020], Messenger [Hanjie et al., 2021], and Touchdown [Chen et al.,
2018]. These environments present unique challenges in language-grounded policy learning, spanning the
complexity of instructions, visual observations, action spaces, reasoning procedures, and generalization.
By learning a dynamics model from cheaply obtained unlabeled demonstrations, LDD consistently
outperforms reinforcement learning with language descriptions both in terms of sample efficiency and
generalization performance. Moreover, we compare LDD to other techniques that inject prior knowl-
edge via VAE pretraining [Kingma and Welling, 2013], inverse reinforcement learning [Hanna and
Stone, 2017, Torabi et al., 2018, Guo et al., 2019], and reward shaping with a pretrained expert [Merel
et al., 2017]. LDD achieves top performance on all environments in terms of task completion and
reward. In addition to comparing LDD to other methods, we ablate LDD to quantify the effect of
language observations in dynamics modeling, and the importance of dynamics modeling with expert
demonstrations. On two environments where we can control for the presence of language descriptions
(NetHack game messages and Touchdown panorama captions), we show that language descriptions
improve sample-efficiency and generalization. Finally, across all environments, we find that dynamics
modeling with expert demonstrations is more effective than with non-expert rollouts.
2 Related Work
Learning by observing language.
Recent work studies generalization to language instructions
and manuals that specify new tasks and environments. These settings range from photorealistic/3D
navigation [Anderson et al., 2018, Chen et al., 2018, Ku et al., 2020, Shridhar et al., 2020] to multi-
hop reference games [Narasimhan et al., 2015, Zhong et al., 2020, Hanjie et al., 2021]. We use a
collection of these tasks to evaluate LDD. There is also work where understanding language is not
necessary to achieve the task; however, its inclusion (e.g. via captions, scene descriptions) makes
learning more efficient. Shridhar et al. [2021] show that one can quickly learn policies in a simulated
kitchen environment described in text, then transfer this policy to the 3D visual environment. Zhong
et al. [2021] similarly transform photorealistic navigation to a symbolic form via image segmentation,
then learn a policy that transfers to the original photorealistic setting. In work concurrent to ours, Tam
et al. [2022] generate oracle captions of observations for simulated robotic control and city navigation,
which improve policy learning. LDD is complementary to these: in addition to incorporating language
descriptions as features, we show that learning a dynamics model from unlabeled demonstrations
with language descriptions improves sample efficiency and results in better policies.
Imitation learning from observations.
There is prior work on model-free as well as model-based
imitation learning from observations. Model-free methods encourage the imitator to produce state
distributions similar to those produced by the demonstrator, for example via generative adversarial
learning [Merel et al., 2017] and reward shaping [Kimura et al., 2018]. In contrast, LDD only requires
intermediate representations extracted from an expert dynamics model on states encountered by the
learner, which are cheaper to compute than rollouts from an expert policy. Model-based approaches
learn dynamics models that predict state-transitions given the current state and an action. Hanna and
Stone [2017] learn an inverse model to map state-transitions to actions, which is then used to annotate
unlabeled trajectories for imitation learning. Edwards et al. [2019] learn a forward dynamics model
that predicts future states given state and latent action pairs. In contrast, LDD does not assume priors
over the action space distribution. For instance, on ALFWorld, our method works even though it is
impossible to enumerate the action space. In our experiments, we extend model-free reward shaping
and model-based inverse dynamics modeling to account for language descriptions and compare LDD to these methods.
Representation learning in RL.
In representation learning for RL, the agent learns representations
of the environment using rewards and objectives based on the difference between the state and prior
states [Strehl and Littman, 2008], raw visual observations [Jaderberg et al., 2017], learned agent
representations [Raileanu and Rocktäschel, 2020], and random network observations [Burda et al.,
2019]. In intrinsic exploration methods [Raileanu and Rocktäschel, 2020, Burda et al., 2019], the
training objective encourages dissimilarity (e.g. in observation/state space) to prior agent experi-
ence so that the agent discovers novel states. Unlike intrinsic exploration, the distillation objective
in Language Dynamics Distillation encourages similarity to expert behaviour, as opposed to dissim-
ilarity to prior agent experience. In reconstruction based representation learning methods [Strehl
and Littman, 2008, Jaderberg et al., 2017], the training objective encourages the agent to learn
intermediate representations that also capture the dynamics and structure of the environment by
reconstructing the observations (e.g. predicting what objects are in scene). Language Dynamics
Distillation is similar to reconstruction methods for representation learning; however, unlike the latter,
the dynamics model in LDD is trained on trajectories obtained from an expert policy as opposed to the
agent policy. Language Dynamics Distillation is complementary to intrinsic exploration methods and
to reconstruction based representation learning methods.
3 Language Dynamics Distillation
Recent work improves policy learning by augmenting environment observations with language de-
scriptions [Shridhar et al., 2021, Zhong et al., 2021, Tam et al., 2022]. For environments with complex
language abstractions, however, learning how to associate language with environment observations is
difficult through RL due to sparse, delayed rewards. In Language Dynamics Distillation (LDD), we
pretrain the model on unlabeled demonstrations (i.e. no annotated actions) with language descrip-
tions to predict the dynamics of the environment, then fine-tune the language-aware model via RL.
LDD consists of two phases. In the first dynamics modeling phase, we pretrain the model to predict
future observations given unannotated demonstrations. We store a copy of the model as a fixed
teacher that has learned grounded representations useful for predicting how the environment behaves
under an expert policy. In the second reinforcement learning phase, we fine-tune the model through
policy learning, while distilling representations from the teacher. This way, the model is trained
to both maximize expected reward and retain knowledge about the dynamics of the environment.
Fig 1 illustrates the components of LDD.
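To make the two phases concrete, the following is a minimal PyTorch-style sketch of the LDD training flow. The toy encoder, placeholder tensors, loss coefficients, and the squared-error representation-distillation term are illustrative assumptions of this sketch, not the paper's exact architecture or released code.

```python
# Skeletal sketch of the two LDD phases on toy data. Module sizes, placeholder
# tensors, and the MSE distillation term are assumptions for illustration.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HID, N_ACT = 32, 64, 6          # toy symbol vocabulary, hidden size, action count

class Encoder(nn.Module):
    """Stand-in for the shared representation network over observations with
    language descriptions (symbol ids here for simplicity)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.proj = nn.Linear(HID, HID)
    def forward(self, obs):            # obs: (batch, time) symbol ids
        return torch.relu(self.proj(self.embed(obs).mean(dim=1)))

# Phase 1: dynamics modeling. Train a teacher to predict the next observation
# from prior observations using unlabeled demonstrations (no action labels).
teacher = Encoder()
dyn_head = nn.Linear(HID, VOCAB)
pre_opt = torch.optim.Adam(list(teacher.parameters()) + list(dyn_head.parameters()), lr=1e-3)
for _ in range(200):
    demo_obs = torch.randint(0, VOCAB, (16, 10))   # stand-in demonstration observation windows
    demo_next = torch.randint(0, VOCAB, (16,))     # stand-in next-observation targets
    loss = F.cross_entropy(dyn_head(teacher(demo_obs)), demo_next)
    pre_opt.zero_grad(); loss.backward(); pre_opt.step()
for p in teacher.parameters():
    p.requires_grad_(False)            # store the teacher as a fixed model

# Phase 2: initialize the policy learner from the teacher, then fine-tune with
# RL while distilling the frozen teacher's intermediate representations.
student = copy.deepcopy(teacher)
for p in student.parameters():
    p.requires_grad_(True)
policy_head, value_head = nn.Linear(HID, N_ACT), nn.Linear(HID, 1)
rl_params = list(student.parameters()) + list(policy_head.parameters()) + list(value_head.parameters())
rl_opt = torch.optim.Adam(rl_params, lr=1e-3)
alpha_v, distill_weight = 0.5, 1.0     # hypothetical loss coefficients

obs = torch.randint(0, VOCAB, (16, 10))        # stand-in observations from on-policy rollouts
actions = torch.randint(0, N_ACT, (16,))
returns = torch.randn(16)                      # stand-in bootstrapped returns G_t

rep = student(obs)
log_pi = torch.log_softmax(policy_head(rep), dim=-1)
log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
policy_loss = -(returns * log_pi_a).mean()     # negated so gradient descent maximizes reward
value_loss = 0.5 * (returns - value_head(rep).squeeze(1)).pow(2).mean()
distill_loss = F.mse_loss(rep, teacher(obs))   # stay close to the frozen teacher's representations
total = policy_loss + alpha_v * value_loss + distill_weight * distill_loss
rl_opt.zero_grad()
total.backward()
rl_opt.step()
```

The full model additionally keeps a dynamics head $f_\delta$ alongside the policy and value heads (Section 3.3); the sketch omits it for brevity.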
3.1 Background
Markov decision process.
Consider an MDP $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$. Here, $\mathcal{S}$ and $\mathcal{A}$ respectively are the discrete state (e.g. language goals, descriptions, visual observations) and action spaces of the problem. $P(s_{t+1} \mid s_t, a_t)$ is the probability of transitioning into state $s_{t+1}$ by taking action $a_t$ from state $s_t$. $r(s, a)$ is the reward function given some state and action pair. $\gamma$ is a discount factor to prioritize short-term rewards.
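As a quick illustration of the discount factor, the snippet below computes the discounted sum of a toy reward sequence; the reward values and choices of $\gamma$ are made up for the example.

```python
# Toy illustration of the discount factor gamma: the discounted sum of a
# reward sequence weights near-term rewards more heavily than distant ones.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]        # a sparse reward arriving after three steps
print(discounted_return(rewards, gamma=0.99))  # ~0.9801
print(discounted_return(rewards, gamma=0.50))  # 0.25
```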
Actor-critic methods for policy learning.
In RL, we learn a policy $\pi(s; \theta)$ that maps from observations to actions, $\pi : \mathcal{S} \to \mathcal{A}$. Let $R(\tau)$ denote the total discounted reward over the trajectory $\tau$. The objective is to maximize the expected reward $J_\pi(\theta) = \mathbb{E}_\pi[R(\tau)]$ following the policy $\pi$ by optimizing its parameters $\theta$. For trajectory length $T$, the policy gradient is
$$\nabla_\theta \mathbb{E}_\pi[R(\tau)] = \mathbb{E}_\pi\left[ R(\tau) \left( \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t, s_t) \right) \right] = \mathbb{E}_\pi\left[ \sum_{t=1}^{T} G_t \nabla_\theta \log \pi(a_t, s_t) \right] \qquad (1)$$
where $G_t = \sum_{k \geq 0} \gamma^k r_{t+k+1}$ is the return, or discounted future reward, at time $t$. We consider the actor-critic family of policy gradient methods, where a critic is learned to reduce variance in the gradient estimate. Let $V(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$ denote the state value, which corresponds to the expected return from following the policy $\pi$ from a state $s$. Actor-critic methods estimate the state value function by learning another parametrized function $V$ to bootstrap the estimation of the discounted return $G_t$. For instance, with one-step bootstrapping, we have $G_t \approx r_{t+1} + \gamma V(s_{t+1}; \phi)$. The critic objective is then $J_V(\phi) = \frac{1}{2}\left(r_{t+1} + \gamma V(s_{t+1}; \phi) - V(s_t; \phi)\right)^2$. We minimize a weighted sum of the policy objective and the critic objective, $J_{ac}(\theta, \phi) = J_\pi(\theta) + \alpha_V J_V(\phi)$.
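A minimal PyTorch sketch of this combined objective on one batch of transitions is shown below; the network sizes, loss coefficient, and random placeholder encodings stand in for the actual SILG models and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HID, N_ACTIONS = 64, 6            # toy sizes, not the paper's architectures
policy_head = nn.Linear(HID, N_ACTIONS)   # f_pi
value_head = nn.Linear(HID, 1)            # f_V
opt = torch.optim.Adam(list(policy_head.parameters()) + list(value_head.parameters()), lr=1e-3)
alpha_v, gamma = 0.5, 0.99        # hypothetical critic weight and discount

# Placeholder encodings of s_t and s_{t+1} (in practice, outputs of the shared
# representation network over language-annotated observations).
rep_t = torch.randn(8, HID)
rep_t1 = torch.randn(8, HID)
actions = torch.randint(0, N_ACTIONS, (8,))
rewards = torch.randn(8)

# One-step bootstrapped return: G_t ~= r_{t+1} + gamma * V(s_{t+1}; phi).
with torch.no_grad():
    g_t = rewards + gamma * value_head(rep_t1).squeeze(1)

log_pi = torch.log_softmax(policy_head(rep_t), dim=-1)
log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)

# Policy term follows Eq. (1); in practice an advantage G_t - V(s_t) is often
# used to further reduce variance.
policy_loss = -(g_t * log_pi_a).mean()
value_loss = 0.5 * (g_t - value_head(rep_t).squeeze(1)).pow(2).mean()   # J_V
loss = policy_loss + alpha_v * value_loss                               # J_ac

opt.zero_grad()
loss.backward()
opt.step()
```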
3.2 Dynamics modeling during pretraining
In addition to policy learning, Language Dynamics Distillation learns a dynamics model from unlabeled demonstrations to initialize and distill into the policy learner. Consider a set of demonstrations without labeled actions $\mathcal{T}_\sigma = \{\tau_1, \tau_2, \ldots, \tau_n\}$ obtained by rolling out some policy $\sigma(a_t, s_t)$, where each demonstration $\tau = [s_1, s_2, \ldots, s_T]$ consists of a sequence of observations. We learn a dynamics model $\delta(s_1 \ldots s_t; \zeta)$ to predict the next observation $s_{t+1}$ given the previous observations.
$$J_\delta(\zeta) = \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=1}^{T} \mathrm{sim}\left(s_{t+1}, \delta(s_1, \ldots, s_t; \zeta)\right) \qquad (2)$$
where $\mathrm{sim}$ is a differentiable similarity function between the predicted state $\delta(s_1, \ldots, s_t)$ and the observed state $s_{t+1}$, and $\zeta$ are the parameters of the dynamics model. In the environments we consider, $\mathrm{sim}$ is the cross-entropy loss across a grid of symbols denoting entities present in the scene.
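Below is a sketch of one way to implement the objective in Eq. 2 when each observation is a grid of symbol IDs and sim is a per-cell cross-entropy, as described above; the GRU encoder, grid size, and vocabulary size are assumptions of the sketch rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, H, W, HID = 32, 6, 6, 64   # toy symbol vocabulary and grid dimensions

class DynamicsModel(nn.Module):
    """delta(s_1..s_t; zeta): encode an observation history, then predict
    logits for every cell of the next symbolic observation grid."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, H * W * VOCAB)

    def forward(self, obs_history):             # (batch, time, H, W) symbol ids
        b, t, _, _ = obs_history.shape
        # Mean-pool each grid's symbol embeddings, then summarize the history.
        step = self.embed(obs_history.view(b, t, -1)).mean(dim=2)   # (b, t, HID)
        h, _ = self.rnn(step)
        logits = self.out(h[:, -1])              # predict the next grid
        return logits.view(b, VOCAB, H, W)       # channels-first for cross_entropy

model = DynamicsModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for unlabeled demonstration data: observation histories and the
# next observation, with no action annotations anywhere.
obs_history = torch.randint(0, VOCAB, (8, 5, H, W))
next_obs = torch.randint(0, VOCAB, (8, H, W))

logits = model(obs_history)
loss = F.cross_entropy(logits, next_obs)         # sim(s_{t+1}, delta(...)) as per-cell CE
opt.zero_grad()
loss.backward()
opt.step()
```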
3.3 Dynamics distillation during policy learning
Fig 1 shows the decomposition of the model into a representation network $f_{\mathrm{rep}}$, a policy head $f_\pi$, a value head $f_V$, and a dynamics head $f_\delta$. The three heads share parameters because their inputs are