Improving Policy Learning via Language Dynamics Distillation
Victor Zhong1,2, Jesse Mu3, Luke Zettlemoyer1,2, Edward Grefenstette4,5 and Tim Rocktäschel4
1University of Washington
2Meta AI Research
3Stanford University
4University College London
5Cohere
Abstract
Recent work has shown that augmenting environments with language descriptions
improves policy learning. However, for environments with complex language
abstractions, learning how to ground language to observations is difficult due
to sparse, delayed rewards. We propose Language Dynamics Distillation (LDD),
which pretrains a model to predict environment dynamics given demonstrations
with language descriptions, and then fine-tunes these language-aware pretrained
representations via reinforcement learning (RL). In this way, the model is trained to
both maximize expected reward and retain knowledge about how language relates
to environment dynamics. On SILG, a benchmark of five tasks with language de-
scriptions that evaluate distinct generalization challenges on unseen environments
(NetHack, ALFWorld, RTFM, Messenger, and Touchdown), LDD outperforms
tabula-rasa RL, VAE pretraining, and methods that learn from unlabeled demon-
strations in inverse RL and reward shaping with pretrained experts. In our analyses,
we show that language descriptions in demonstrations improve sample-efficiency
and generalization across environments, and that dynamics modeling with expert
demonstrations is more effective than with non-experts.
1 Introduction
Language is a powerful medium that humans use to reason about abstractions—its compositionality
allows efficient descriptions that generalize across environments and tasks. Consider an agent that
follows instructions to clean the house (e.g. find the dirty dishes and wash them). In tabula-rasa
reinforcement learning (RL), the agent observes raw perceptual features of the environment, then
grounds these visual features to language cues to learn how to behave through trial and error. In
contrast, we can provide the agent with language descriptions that describe abstractions which are
present in the environment (e.g. there is a sink to your left and dishes on a table to your right), thereby
simplifying the grounding challenge. Language descriptions of observations occur naturally in many
environments such as text prompts in graphical user interfaces [Liu et al., 2018], dialogue [He et al.,
2018], and interactive games [Küttler et al., 2020]. Recent work has also shown improvements in
visual manipulation [Shridhar et al., 2021] and navigation [Zhong et al., 2021, Tam et al., 2022] by
captioning the observations with language descriptions. Despite these gains, learning how to interpret
language descriptions is difficult through RL, especially on environments with complex language
abstractions [Zhong et al., 2021].
Corresponding author: Victor Zhong (vzhong@cs.washington.edu).
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.00066v1 [cs.LG] 30 Sep 2022
Figure 1: Language Dynamics Distillation (LDD). LDD uses cheap unlabeled demonstrations to learn a
dynamics model of the environment, which is used to initialize and distill grounded representations
into the policy learner. During the dynamics modeling phase (purple), we train a teacher model to
predict the next observation given prior observations using unlabeled demonstrations. In the policy
learning phase (red), we initialize a model with the teacher and distill intermediate representations
from the teacher during reinforcement learning. The traditional policy learning loop is shown in
green. LDD-specific components are shown in blue.
We present Language Dynamics Distillation (LDD), a method that improves RL by learning a dynamics model on cheaply obtained unlabeled (i.e. no action labels) demonstrations with language
descriptions. When learning how to use language descriptions effectively, one central challenge
is how to disentangle language understanding from policy performance when learning from sparse, delayed re-
wards. Our motivation is to learn initial language grounding via dynamics modeling from an offline
dataset, away from the credit assignment and non-stationarity challenges posed by RL. While labeled
demonstrations that tell the agent how to act in each situation are expensive to collect, for many
environments one can cheaply obtain unlabeled demonstrations (e.g. videos of experts performing the
task) [Yang et al., 2019, Stadie et al., 2017]. Intuitively, LDD exploits these unlabeled demonstrations
to learn how to associate language descriptions with abstractions in the environment. This knowledge
is then used to bootstrap and more quickly learn policies that generalize to instructions and manuals
in new environments. Given unlabeled demonstrations with language descriptions (e.g. captions of
scene content), we first pretrain the model to predict the next observation given prior observations,
similar to language modeling. A copy of this model is stored as a fixed teacher that grounds language
descriptions to predict environment dynamics. We then train a model with RL, while distilling inter-
mediate representations from the teacher to avoid catastrophic forgetting of how to interpret language
descriptions for dynamics modeling. In this way, the model learns to both maximize expected reward
and retain knowledge about how language descriptions relate to environment dynamics.
We evaluate LDD on the recent SILG benchmark [Zhong et al., 2021], which consists of five diverse en-
vironments with language descriptions including NetHack [Küttler et al., 2020], ALFWorld [Shridhar
et al., 2021], RTFM [Zhong et al., 2020], Messenger [Hanjie et al., 2021], and Touchdown [Chen et al.,
2018]. These environments present unique challenges in language-grounded policy learning, spanning the
complexity of instructions, visual observations, action spaces, reasoning procedures, and generalization.
By learning a dynamics model from cheaply obtained unlabeled demonstrations, LDD consistently
outperforms reinforcement learning with language descriptions both in terms of sample efficiency and
generalization performance. Moreover, we compare LDD to other techniques that inject prior knowl-
edge via VAE pretraining [Kingma and Welling, 2013], inverse reinforcement learning [Hanna and
Stone, 2017, Torabi et al., 2018, Guo et al., 2019], and reward shaping with a pretrained expert [Merel
et al., 2017]. LDD achieves top performance on all environments in terms of task completion and
reward. In addition to comparing LDD to other methods, we ablate LDD to quantify the effect of
language observations in dynamics modeling, and the importance of dynamics modeling with expert
demonstrations. On two environments where we can control for the presence of language descriptions
(NetHack game messages and Touchdown panorama captions), we show that language descriptions
improve sample-efficiency and generalization. Finally, across all environments, we find that dynamics
modeling with expert demonstrations is more effective than with non-expert rollouts.
2 Related Work
Learning by observing language.
Recent work studies generalization to language instructions
and manuals that specify new tasks and environments. These settings range from photorealistic/3D
navigation [Anderson et al., 2018, Chen et al., 2018, Ku et al., 2020, Shridhar et al., 2020] to multi-
hop reference games [Narasimhan et al., 2015, Zhong et al., 2020, Hanjie et al., 2021]. We use a
collection of these tasks to evaluate LDD. There is also work where understanding language is not
necessary to achieve the task; however, its inclusion (e.g. via captions, scene descriptions) makes
learning more efficient. Shridhar et al. [2021] show that one can quickly learn policies in a simulated
kitchen environment described in text, then transfer this policy to the 3D visual environment. Zhong
et al. [2021] similarly transform photorealistic navigation to a symbolic form via image segmentation,
then learn a policy that transfers to the original photorealistic setting. In work concurrent to ours, Tam
et al. [2022] generate oracle captions of observations for simulated robotic control and city navigation,
which improve policy learning. LDD is complementary to these: in addition to incorporating language
descriptions as features, we show that learning a dynamics model from unlabeled demonstrations
with language descriptions improves sample efficiency and results in better policies.
Imitation learning from observations.
There is prior work on model-free as well as model-based
imitation learning from observations. Model-free methods encourage the imitator to produce state
distributions similar to those produced by the demonstrator, for example via generative adversarial
learning [Merel et al., 2017] and reward shaping [Kimura et al., 2018]. In contrast, LDD only requires
intermediate representations extracted from an expert dynamics model on states encountered by the
learner, which are cheaper to compute than rollouts from an expert policy. Model-based approaches
learn dynamics models that predict state-transitions given the current state and an action. Hanna and
Stone [2017] learn an inverse model to map state-transitions to actions, which is then used to annotate
unlabeled trajectories for imitation learning. Edwards et al. [2019] learn a forward dynamics model
that predicts future states given state and latent action pairs. In contrast, LDD does not assume priors
over the action space distribution. For instance, on ALFWorld, our method works even though it is
impossible to enumerate the action space. In our experiments, we extend model-free reward shaping
and model-based inverse dynamics modeling to account for language descriptions and compare LDD to these methods.
Representation learning in RL.
In representation learning for RL, the agent learns representations
of the environment using rewards and objectives based on the difference between the state and prior
states [Strehl and Littman, 2008], raw visual observations [Jaderberg et al., 2017], learned agent
representations [Raileanu and Rocktäschel, 2020], and random network observations [Burda et al.,
2019]. In intrinsic exploration methods [Raileanu and Rocktäschel, 2020, Burda et al., 2019], the
training objective encourages dissimilarity (e.g. in observation/state space) to prior agent experi-
ence so that the agent discovers novel states. Unlike intrinsic exploration, the distillation objective
in Language Dynamics Distillation encourages similarity to expert behaviour, as opposed to dissim-
ilarity to prior agent experience. In reconstruction based representation learning methods [Strehl
and Littman, 2008, Jaderberg et al., 2017], the training objective encourages the agent to learn
intermediate representations that also capture the dynamics and structure of the environment by
reconstructing the observations (e.g. predicting what objects are in scene). Language Dynamics
Distillation is similar to reconstruction methods for representation learning; however, unlike the latter,
the dynamics model in LDD is trained on trajectories obtained from an expert policy as opposed to the
agent policy. Language Dynamics Distillation is complementary to intrinsic exploration methods and
to reconstruction based representation learning methods.
3 Language Dynamics Distillation
Recent work improves policy learning by augmenting environment observations with language de-
scriptions [Shridhar et al., 2021, Zhong et al., 2021, Tam et al., 2022]. For environments with complex
language abstractions, however, learning how to associate language with environment observations is
difficult through RL due to sparse, delayed rewards. In Language Dynamics Distillation (LDD), we
pretrain the model on unlabeled demonstrations (i.e. no annotated actions) with language descrip-
tions to predict the dynamics of the environment, then fine-tune the language-aware model via RL.
LDD consists of two phases. In the first dynamics modeling phase, we pretrain the model to predict
future observations given unannotated demonstrations. We store a copy of the model as a fixed
teacher that has learned grounded representations useful for predicting how the environment behaves
under an expert policy. In the second reinforcement learning phase, we fine-tune the model through
policy learning, while distilling representations from the teacher. This way, the model is trained
to both maximize expected reward and retain knowledge about the dynamics of the environment.
Fig 1 illustrates the components of LDD.
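To make the two phases concrete, the following is a minimal PyTorch-style sketch of the LDD training flow. The toy encoder, placeholder tensors, loss coefficients, and the squared-error representation-distillation term are illustrative assumptions of this sketch, not the paper's exact architecture or released code.

```python
# Skeletal sketch of the two LDD phases on toy data. Module sizes, placeholder
# tensors, and the MSE distillation term are assumptions for illustration.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HID, N_ACT = 32, 64, 6          # toy symbol vocabulary, hidden size, action count

class Encoder(nn.Module):
    """Stand-in for the shared representation network over observations with
    language descriptions (symbol ids here for simplicity)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.proj = nn.Linear(HID, HID)
    def forward(self, obs):            # obs: (batch, time) symbol ids
        return torch.relu(self.proj(self.embed(obs).mean(dim=1)))

# Phase 1: dynamics modeling. Train a teacher to predict the next observation
# from prior observations using unlabeled demonstrations (no action labels).
teacher = Encoder()
dyn_head = nn.Linear(HID, VOCAB)
pre_opt = torch.optim.Adam(list(teacher.parameters()) + list(dyn_head.parameters()), lr=1e-3)
for _ in range(200):
    demo_obs = torch.randint(0, VOCAB, (16, 10))   # stand-in demonstration observation windows
    demo_next = torch.randint(0, VOCAB, (16,))     # stand-in next-observation targets
    loss = F.cross_entropy(dyn_head(teacher(demo_obs)), demo_next)
    pre_opt.zero_grad(); loss.backward(); pre_opt.step()
for p in teacher.parameters():
    p.requires_grad_(False)            # store the teacher as a fixed model

# Phase 2: initialize the policy learner from the teacher, then fine-tune with
# RL while distilling the frozen teacher's intermediate representations.
student = copy.deepcopy(teacher)
for p in student.parameters():
    p.requires_grad_(True)
policy_head, value_head = nn.Linear(HID, N_ACT), nn.Linear(HID, 1)
rl_params = list(student.parameters()) + list(policy_head.parameters()) + list(value_head.parameters())
rl_opt = torch.optim.Adam(rl_params, lr=1e-3)
alpha_v, distill_weight = 0.5, 1.0     # hypothetical loss coefficients

obs = torch.randint(0, VOCAB, (16, 10))        # stand-in observations from on-policy rollouts
actions = torch.randint(0, N_ACT, (16,))
returns = torch.randn(16)                      # stand-in bootstrapped returns G_t

rep = student(obs)
log_pi = torch.log_softmax(policy_head(rep), dim=-1)
log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
policy_loss = -(returns * log_pi_a).mean()     # negated so gradient descent maximizes reward
value_loss = 0.5 * (returns - value_head(rep).squeeze(1)).pow(2).mean()
distill_loss = F.mse_loss(rep, teacher(obs))   # stay close to the frozen teacher's representations
total = policy_loss + alpha_v * value_loss + distill_weight * distill_loss
rl_opt.zero_grad()
total.backward()
rl_opt.step()
```

The full model additionally keeps a dynamics head $f_\delta$ alongside the policy and value heads (Section 3.3); the sketch omits it for brevity.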
3.1 Background
Markov decision process.
Consider an MDP $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$. Here, $\mathcal{S}$ and $\mathcal{A}$ respectively are the discrete state (e.g. language goals, descriptions, visual observations) and action spaces of the problem. $P(s_{t+1} \mid s_t, a_t)$ is the probability of transitioning into state $s_{t+1}$ by taking action $a_t$ from state $s_t$. $r(s, a)$ is the reward function given some state and action pair. $\gamma$ is a discount factor to prioritize short-term rewards.
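As a quick illustration of the discount factor, the snippet below computes the discounted sum of a toy reward sequence; the reward values and choices of $\gamma$ are made up for the example.

```python
# Toy illustration of the discount factor gamma: the discounted sum of a
# reward sequence weights near-term rewards more heavily than distant ones.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]        # a sparse reward arriving after three steps
print(discounted_return(rewards, gamma=0.99))  # ~0.9801
print(discounted_return(rewards, gamma=0.50))  # 0.25
```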
Actor-critic methods for policy learning.
In RL, we learn a policy $\pi(s; \theta)$ that maps from observations to actions, $\pi : \mathcal{S} \to \mathcal{A}$. Let $R(\tau)$ denote the total discounted reward over the trajectory $\tau$. The objective is to maximize the expected reward $J_\pi(\theta) = \mathbb{E}_\pi[R(\tau)]$ following the policy $\pi$ by optimizing its parameters $\theta$. For trajectory length $T$, the policy gradient is
$$\nabla_\theta \mathbb{E}_\pi[R(\tau)] = \mathbb{E}_\pi\left[ R(\tau) \left( \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t, s_t) \right) \right] = \mathbb{E}_\pi\left[ \sum_{t=1}^{T} G_t \nabla_\theta \log \pi(a_t, s_t) \right] \qquad (1)$$
where $G_t = \sum_{k \geq 0} \gamma^k r_{t+k+1}$ is the return, or discounted future reward, at time $t$. We consider the actor-critic family of policy gradient methods, where a critic is learned to reduce variance in the gradient estimate. Let $V(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$ denote the state value, which corresponds to the expected return from following the policy $\pi$ from a state $s$. Actor-critic methods estimate the state value function by learning another parametrized function $V$ to bootstrap the estimation of the discounted return $G_t$. For instance, with one-step bootstrapping, we have $G_t \approx r_{t+1} + \gamma V(s_{t+1}; \phi)$. The critic objective is then $J_V(\phi) = \frac{1}{2}\left(r_{t+1} + \gamma V(s_{t+1}; \phi) - V(s_t; \phi)\right)^2$. We minimize a weighted sum of the policy objective and the critic objective, $J_{ac}(\theta, \phi) = J_\pi(\theta) + \alpha_V J_V(\phi)$.
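A minimal PyTorch sketch of this combined objective on one batch of transitions is shown below; the network sizes, loss coefficient, and random placeholder encodings stand in for the actual SILG models and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HID, N_ACTIONS = 64, 6            # toy sizes, not the paper's architectures
policy_head = nn.Linear(HID, N_ACTIONS)   # f_pi
value_head = nn.Linear(HID, 1)            # f_V
opt = torch.optim.Adam(list(policy_head.parameters()) + list(value_head.parameters()), lr=1e-3)
alpha_v, gamma = 0.5, 0.99        # hypothetical critic weight and discount

# Placeholder encodings of s_t and s_{t+1} (in practice, outputs of the shared
# representation network over language-annotated observations).
rep_t = torch.randn(8, HID)
rep_t1 = torch.randn(8, HID)
actions = torch.randint(0, N_ACTIONS, (8,))
rewards = torch.randn(8)

# One-step bootstrapped return: G_t ~= r_{t+1} + gamma * V(s_{t+1}; phi).
with torch.no_grad():
    g_t = rewards + gamma * value_head(rep_t1).squeeze(1)

log_pi = torch.log_softmax(policy_head(rep_t), dim=-1)
log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)

# Policy term follows Eq. (1); in practice an advantage G_t - V(s_t) is often
# used to further reduce variance.
policy_loss = -(g_t * log_pi_a).mean()
value_loss = 0.5 * (g_t - value_head(rep_t).squeeze(1)).pow(2).mean()   # J_V
loss = policy_loss + alpha_v * value_loss                               # J_ac

opt.zero_grad()
loss.backward()
opt.step()
```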
3.2 Dynamics modeling during pretraining
In addition to policy learning, Language Dynamics Distillation learns a dynamics model from unlabeled demonstrations to initialize and distill into the policy learner. Consider a set of demonstrations without labeled actions $\mathcal{T}_\sigma = \{\tau_1, \tau_2, \ldots, \tau_n\}$ obtained by rolling out some policy $\sigma(a_t, s_t)$, where each demonstration $\tau = [s_1, s_2, \ldots, s_T]$ consists of a sequence of observations. We learn a dynamics model $\delta(s_1 \ldots s_t; \zeta)$ to predict the next observation $s_{t+1}$ given the previous observations.
$$J_\delta(\zeta) = \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=1}^{T} \mathrm{sim}\left(s_{t+1}, \delta(s_1, \ldots, s_t; \zeta)\right) \qquad (2)$$
where $\mathrm{sim}$ is a differentiable similarity function between the predicted state $\delta(s_1, \ldots, s_t)$ and the observed state $s_{t+1}$, and $\zeta$ are the parameters of the dynamics model. In the environments we consider, $\mathrm{sim}$ is the cross-entropy loss across a grid of symbols denoting entities present in the scene.
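Below is a sketch of one way to implement the objective in Eq. 2 when each observation is a grid of symbol IDs and sim is a per-cell cross-entropy, as described above; the GRU encoder, grid size, and vocabulary size are assumptions of the sketch rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, H, W, HID = 32, 6, 6, 64   # toy symbol vocabulary and grid dimensions

class DynamicsModel(nn.Module):
    """delta(s_1..s_t; zeta): encode an observation history, then predict
    logits for every cell of the next symbolic observation grid."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, H * W * VOCAB)

    def forward(self, obs_history):             # (batch, time, H, W) symbol ids
        b, t, _, _ = obs_history.shape
        # Mean-pool each grid's symbol embeddings, then summarize the history.
        step = self.embed(obs_history.view(b, t, -1)).mean(dim=2)   # (b, t, HID)
        h, _ = self.rnn(step)
        logits = self.out(h[:, -1])              # predict the next grid
        return logits.view(b, VOCAB, H, W)       # channels-first for cross_entropy

model = DynamicsModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for unlabeled demonstration data: observation histories and the
# next observation, with no action annotations anywhere.
obs_history = torch.randint(0, VOCAB, (8, 5, H, W))
next_obs = torch.randint(0, VOCAB, (8, H, W))

logits = model(obs_history)
loss = F.cross_entropy(logits, next_obs)         # sim(s_{t+1}, delta(...)) as per-cell CE
opt.zero_grad()
loss.backward()
opt.step()
```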
3.3 Dynamics distillation during policy learning
Fig 1 shows the decomposition of the model into a representation network $f_{\mathrm{rep}}$, a policy head $f_\pi$, a value head $f_V$, and a dynamics head $f_\delta$. The three heads share parameters because their inputs are