Generalization with Lossy Affordances: Leveraging
Broad Offline Data for Learning Visuomotor Tasks
Kuan Fang, Patrick Yin, Ashvin Nair, Homer Walke, Gengchen Yan, Sergey Levine
University of California, Berkeley
Abstract: The use of broad datasets has proven to be crucial for generalization in
a wide range of fields. However, how to effectively make use of diverse multi-task
data for novel downstream tasks remains a grand challenge in reinforcement
learning and robotics. To tackle this challenge, we introduce a framework that
acquires goal-conditioned policies for unseen temporally extended tasks via of-
fline reinforcement learning on broad data, in combination with online fine-tuning
guided by subgoals in a learned lossy representation space. When faced with a
novel task goal, our framework uses an affordance model to plan a sequence of
lossy representations as subgoals that decomposes the original task into easier
problems. Learned from the broad prior data, the lossy representation emphasizes
task-relevant information about states and goals while abstracting away redundant
contexts that hinder generalization. It thus enables subgoal planning for unseen
tasks, provides a compact input to the policy, and facilitates reward shaping dur-
ing fine-tuning. We show that our framework can be pre-trained on large-scale
datasets of robot experience from prior work and efficiently fine-tuned for novel
tasks, entirely from visual inputs without any manual reward engineering. 1
Keywords: Reinforcement Learning, Representation Learning, Planning
[Figure 1 diagram: a state encoder, goal-conditioned policy, and affordance model are pre-trained on broad offline data and fine-tuned on the target task; planned subgoals connect the initial state to the final goal in the lossy representation space.]
Figure 1: Fine-Tuning with Lossy Affordance Planner (FLAP). Our framework leverages broad offline
data for new temporally extended tasks using learned lossy representations of states and goals. We pre-train a
state encoder, an affordance model, and a goal-conditioned policy on the offline data collected from diverse
environments and fine-tune the policy to solve the target tasks without any explicit reward signals. To provide
guidance for the policy, subgoals are planned in the lossy representation space given visual inputs.
1 Introduction
Learning-based methods can enable robotic systems to automatically acquire large repertoires of
behaviors that can potentially generalize to diverse real-world scenarios. However, attaining such
generalizability requires the ability to leverage large-scale datasets, which presents a conundrum
when building general-purpose robotic systems: How can we tractably endow robots with the de-
sired skills if each behavior requires a laborious data collection and lengthy learning process to
master? In much the same way that humans use their past experience and expertise to acquire new
1 Project webpage: sites.google.com/view/project-flap
6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand.
arXiv:2210.06601v2 [cs.RO] 18 Apr 2023
skills more quickly, the answer for robots may be to effectively leverage prior data. But in the setting
of robotic learning, this raises a number of complex questions: What sort of knowledge should we
derive from this prior data? How do we break up diverse and uncurated prior datasets into distinct
skills? And how do we then use and adapt these skills for solving novel tasks?
To learn useful behaviors from data of a wide range of tasks, we can employ goal-conditioned
reinforcement learning (GCRL) [1,2], where a policy is trained to attain a specified goal (e.g.,
indicated as an image) from the current state. This makes it practical to learn from previously
collected large datasets without explicit reward annotations and to share knowledge across tasks
that exhibit similar physical phenomena. However, it is usually difficult to learn goal-conditioned
policies that can solve temporally extended tasks in zero shot, as such policies are typically only
effective for short-horizon goals [3]. To solve the target tasks, we would need to transfer the learned
knowledge to the testing environment and efficiently fine-tune the policy in an online manner.
Our proposed solution combines representation learning, planning, and online fine-tuning. The
key intuition is that, if we can learn a suitable representation of states and goals that generalizes
effectively across environments, then we can plan subgoals in this representation space to solve
long-horizon tasks, and also leverage it to help finetune the goal-conditioned policies on the new
task online. Core to this planning procedure is the use of affordance models [4,5], which predict
potentially reachable states that can serve as subgoal proposals for the planning process. Good state
representations are necessary for this process: (1) as inputs and outputs for the affordance model,
which must generalize effectively across tasks and domains (since if the affordance model doesn’t
generalize, it can’t provide guidance for policy fine-tuning); (2) as inputs into the policy, so as to
facilitate rapid adaptation; (3) as measures of state proximity for use as reward functions for fine-
tuning the policy, since well-shaped rewards are essential for rapid online training but notoriously
difficult to obtain without manual engineering.
To this end, we propose Fine-Tuning with Lossy Affordance Planner (FLAP), a framework that
leverages diverse offline data for learning representations, goal-conditioned policies, and affordance
models that enable rapid fine-tuning to new tasks in target scenes. As shown in Fig. 1, our lossy
representations are acquired during the offline pre-training process for the goal-conditioned policy
using a variational information bottleneck (VIB) [6]. Intuitively, this representation learns to capture
the minimal information necessary to determine if and how a particular goal can be reached from a
particular state, making it ideally suited for planning over subgoals and providing reward shaping
during fine-tuning. When a new task is specified to the robot with a goal image, we use an affordance
model that predicts reachable states in this learned lossy representation space to plan subgoals for
prospectively accomplishing this goal. During the fine-tuning process, the goal-conditioned policy is
fine-tuned on each subgoal given the informative reward signal computed from the learned represen-
tations. Both the offline and online stages of this process operate entirely from images and do not
require any hand-designed reward functions beyond those extracted automatically from the learned
representation. Building on components from prior work [5,7], we demonstrate that this par-
ticular combination of lossy representation learning, goal-conditioned policies, and planning-driven
fine-tuning can enable performance that significantly exceeds that of prior methods. We evaluate our
method in both real-world and simulated environments using previously collected offline data [8,7]
for learning novel robotic manipulation tasks. Compared to baselines, the proposed method achieves
higher success rates with fewer online fine-tuning iterations.
2 Preliminaries
Goal-conditioned reinforcement learning. We consider a goal-conditioned reinforcement learning
(GCRL) setting with states and goals as images. The goal-reaching tasks can be represented by a
Markov Decision Process (MDP) denoted by a tuple M = (S, A, ρ, P, G, γ) with state space S,
action space A, initial state probability ρ, transition probability P, a goal space G, and discount
factor γ. Each goal-reaching task can be defined by a pair of initial state s_0 ∈ S and desired goal
s_g ∈ G. We assume states and goals are defined in the same space, i.e., G = S. By selecting the
action a_t ∈ A at each time step t, the goal-conditioned policy π(a_t | s_t, s_g) aims to reach a state s
such that d(s, s_g) ≤ ε, where d is a distance metric and ε is the threshold for reaching the goal.
We use the sparse reward function r_t(s_{t+1}, s_g), which outputs 0 when the goal is reached and −1
otherwise, and the objective is defined as the average expected discounted return E[Σ_t γ^t r_t].
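To make the goal-reaching setup concrete, here is a minimal sketch (not the authors' code) of the sparse reward and the discounted-return objective; the Euclidean distance function and the threshold value below are placeholders for whatever metric d and threshold ε a particular instantiation uses.

```python
import numpy as np

def sparse_reward(next_state, goal, dist_fn, eps):
    """r_t(s_{t+1}, s_g): 0 when the goal is considered reached, -1 otherwise."""
    return 0.0 if dist_fn(next_state, goal) <= eps else -1.0

def discounted_return(rewards, gamma=0.99):
    """Discounted return sum_t gamma^t * r_t for a single trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Toy example on 2-D states; the trajectory reaches the goal at the last step.
dist = lambda s, g: float(np.linalg.norm(np.asarray(s) - np.asarray(g)))
goal = [1.0, 0.0]
next_states = [[0.4, 0.0], [0.8, 0.0], [1.0, 0.0]]
rewards = [sparse_reward(s, goal, dist, eps=0.05) for s in next_states]
print(rewards, discounted_return(rewards))  # [-1.0, -1.0, 0.0], -1.99
```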
(a) Pre-Training on Offline Data (b) Fine-Tuning with Lossy Affordance Plans
Figure 2: FLAP Architecture. (a) We pre-train the representation with RL on the offline data. We first
encode the initial state and the goal and then compute the RL loss with policy π, the value function V, and the
Q-function Q. (b) We use planned subgoals to guide the policy during online fine-tuning (right). Given the
encodings of the initial state and the final goal, subgoal sequences are recursively generated by the affordance
model in the learned representation space with latent codes sampled from the prior p(u). The optimal subgoal
sequence ẑ_{1:K} is then selected according to Eq. 6 for guiding the goal-conditioned policy.
Planning with affordance models. When learning to solve long-horizon tasks, subgoals can sig-
nificantly improve performance by breaking down the original problem into easier short-horizon
tasks. We can use sampling-based planning to find a suitable sequence of K subgoals ŝ_{1:K}, which
samples multiple candidate sequences and chooses the optimal plan based on a cost function. In
high-dimensional state spaces, such a planning process can be computationally intractable, since
most sampled candidates will be unlikely to form a reasonable plan. To facilitate planning, we
would need to focus on sampling subgoal candidates that are realistic and reachable from the cur-
rent state. Following prior work [5,7], we use affordance models to capture the distribution
of reachable future states and recursively propose subgoals conditioned on the initial state of each
task. The affordance model can be defined as a generative model m(s′ | s, u), where u is a latent
representation that captures the information about the transition from the current state s to the goal s′.
It can be trained in the conditional variational autoencoder (CVAE) [9] paradigm using an encoder
q(u | s, s′) to estimate the evidence lower bound (ELBO) [10].
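As an illustration of this CVAE paradigm, the sketch below shows one possible way to implement an affordance model with encoder q(u | s, s′) and decoder m(s′ | s, u) trained on a negative ELBO; the dimensions, the two-layer MLPs, and the Gaussian reconstruction term are assumptions for the example, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffordanceCVAE(nn.Module):
    """Conditional VAE affordance model: encoder q(u | s, s') and decoder m(s' | s, u)."""

    def __init__(self, state_dim=32, latent_dim=8, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),      # mean and log-variance of u
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),           # predicted reachable future state s'
        )
        self.latent_dim = latent_dim

    def forward(self, s, s_next):
        mu, logvar = self.encoder(torch.cat([s, s_next], dim=-1)).chunk(2, dim=-1)
        u = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        s_pred = self.decoder(torch.cat([s, u], dim=-1))
        # Negative ELBO: reconstruction term plus KL to the standard normal prior p(u).
        recon = F.mse_loss(s_pred, s_next)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl

    def sample(self, s, num_samples=1):
        """Propose reachable future states by decoding latents drawn from the prior p(u).
        `s` is a single state of shape (state_dim,)."""
        u = torch.randn(num_samples, self.latent_dim)
        return self.decoder(torch.cat([s.expand(num_samples, -1), u], dim=-1))
```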
Offline pre-training and online fine-tuning. Offline reinforcement learning pre-trains the
policy and value function on a dataset D_offline, consisting of previously collected experiences
(s^i_t, a^i_t, r^i_t, s^i_{t+1}), where i and t are the indices of the trajectory and time step. In this work, we assume
D_offline does not include data of the target task. To solve the target task, the pre-trained policy is
then fine-tuned by exploring the environment and collecting online data D_online. Following Fang
et al. [7], we first pre-train both the policy π and the affordance model m on D_offline, and then use
the subgoals planned by the affordance model to guide the online exploration of π. In the online
phase, we freeze the weights of m and fine-tune π on a combination of D_offline and D_online. While
our contribution is orthogonal to the specific choice of the offline RL algorithm, we use implicit
Q-learning [11], which is well-suited to online fine-tuning, using an expectile loss to construct value
estimates and advantage-weighted regression to extract a policy.
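For reference, the two ingredients of IQL mentioned here can be sketched as follows; the expectile and temperature values are illustrative defaults, not the hyperparameters used in this work.

```python
import torch

def expectile_loss(diff, expectile=0.7):
    """Asymmetric squared loss used by IQL to fit V toward an upper expectile of Q.
    `diff` is Q(s, a) - V(s)."""
    weight = torch.where(diff > 0, expectile, 1.0 - expectile)
    return (weight * diff.pow(2)).mean()

def awr_policy_loss(log_prob, q_value, v_value, beta=3.0):
    """Advantage-weighted regression: re-weight the log-likelihood of dataset actions
    by exp(advantage / beta), clipped for numerical stability."""
    advantage = q_value - v_value
    weights = torch.clamp(torch.exp(advantage / beta), max=100.0)
    return -(weights.detach() * log_prob).mean()
```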
3 Pre-Training and Fine-Tuning with a Lossy Affordance Planner
The objective of our method is to enable a robot to leverage diverse offline data to efficiently learn
how to achieve new goals, potentially in new scenes. While we do not require the offline dataset to
contain data of the target tasks, we make two assumptions to enable generalization. First, we will
consider tasks that require the robot to compose previously seen behaviors (e.g.,
pushing, grasping, and opening) in order to perform a temporally extended task that sequences these
behaviors in a new way (e.g., pushing an obstruction out of the way and then opening a drawer).
Second, we will consider tasks that require the robot to perform behaviors that resemble those in the
prior data, but with new objects that it had not seen before, which presents a generalization challenge
for the policies and the affordance models.
Our method is based on offline goal-conditioned reinforcement learning and subgoal planning with
affordance models. To solve temporally extended tasks with novel objects, we construct the goal-
conditioned policy and the affordance model in a learned representation space of states and goals
that picks up on task-relevant information (e.g., object identity and location), while abstracting away
unnecessary visual distractors (e.g., camera pose, illumination, background). In this section, we first
describe a paradigm for jointly pre-training the goal-conditioned policy and the lossy representation
through offline reinforcement learning. Then, we describe how to construct an affordance model
in this lossy representation space for guiding the goal-conditioned policy with planned subgoals.
Finally, we describe how we utilize the lossy affordance model in a complete system for vision-
based robotic control to set goals and finetune policies online.
3.1 Goal-Conditioned Reinforcement Learning with Lossy Representations
To capture the task-relevant information of states and goals, we learn a parametric state encoder
φ(z|s) to project both states and goals from the high-dimensional space S (e.g., RGB images) into a
learned representation space Z. We use z_t and z_g to denote the representations of the state s_t and the
goal g, respectively. In the learned representation space, we construct the goal-conditioned policy
π(a | z_t, z_g), as well as the value function V(z_t, z_g) and Q-function Q(z_t, z_g, a). Both V and Q are
used for training the policy via the offline RL algorithm, which in our prototype is IQL [11], and
V is also used for selecting subgoals in the planner, which will be discussed in Sec. 3.2. When
training φ, π, V, and Q to optimize the goal-reaching objective described in Sec. 2, φ would need
to extract sufficient information for selecting the actions to transition from s_t to g and for estimating the
required discounted number of steps.
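As a concrete (hypothetical) instantiation, the policy and critics over the lossy representations could be small MLP heads on the concatenated state and goal encodings; the layer widths and the Gaussian policy head below are our own illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GoalConditionedHeads(nn.Module):
    """Policy π(a | z_t, z_g), value V(z_t, z_g), and Q(z_t, z_g, a) over lossy representations."""

    def __init__(self, z_dim=16, action_dim=4, hidden=256):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(2 * z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),   # mean and log-std of a Gaussian action distribution
        )
        self.value = nn.Sequential(
            nn.Linear(2 * z_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.q = nn.Sequential(
            nn.Linear(2 * z_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def act(self, z_t, z_g):
        mean, log_std = self.policy(torch.cat([z_t, z_g], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp()).sample()

    def v(self, z_t, z_g):
        return self.value(torch.cat([z_t, z_g], dim=-1)).squeeze(-1)

    def q_value(self, z_t, z_g, a):
        return self.q(torch.cat([z_t, z_g, a], dim=-1)).squeeze(-1)
```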
We jointly pre-train the representation with reinforcement learning on the offline data as shown in
Fig. 2. The original IQL objective optimizes L_RL = L_π + L_Q + L_V as described in Kostrikov
et al. [11]. To facilitate generalization of the pre-trained policy and the affordance model, we would
like to abstract away redundant domain-specific information from the learned representation. For
this purpose, we add a variational information bottleneck (VIB) [6] to the reinforcement learning
objective, which constrains the mutual information I(s_t; z_t) and I(g; z_g) between the state space
and the representation space by a constant C. The joint training of φ, π, V, and Q can be written as
an optimization problem, using θ to denote the model parameters:

max_θ L_RL   s.t.   I(g; z_g) ≤ C,   I(s_t; z_t) ≤ C,   for t = 0, ..., T        (1)
Following Alemi et al. [6], we convert this into an unconstrained optimization problem by applying
the bottleneck on the Q-function representation and directly selecting the Lagrange multiplier α,
resulting in the full VIB objective:
L = L_RL − α D_KL(φ(z_t | s_t) ‖ p(z)) − α D_KL(φ(z_g | g) ‖ p(z))        (2)
where we use the normal distribution as p(z), the prior distribution of z.
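The sketch below illustrates how the bottleneck term in Eq. 2 can be implemented: a stochastic encoder outputs a Gaussian over z, and the analytic KL divergence to the unit-Gaussian prior p(z) is added to the RL losses. The convolutional backbone and the way the losses are combined are assumptions for the example, not the exact implementation used in FLAP.

```python
import torch
import torch.nn as nn

class LossyStateEncoder(nn.Module):
    """Stochastic state encoder φ(z|s); its KL divergence to a unit-Gaussian prior p(z)
    acts as the variational information bottleneck. The conv backbone is a placeholder."""

    def __init__(self, z_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 2 * z_dim)

    def forward(self, image):
        mu, logvar = self.head(self.backbone(image)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)        # reparameterized sample
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # D_KL(φ(z|s) || N(0, I))
        return z, kl

# Combined objective (written as a loss to minimize), with rl_loss standing in for L_π + L_Q + L_V:
#   z_t, kl_t = encoder(s_t);  z_g, kl_g = encoder(g)
#   loss = rl_loss(z_t, z_g) + alpha * (kl_t + kl_g)
```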
By optimizing this objective, we obtain lossy representations that disentangle relevant factors such
as object poses from irrelevant factors such as scene background. We exploit this property of the
learned representations in several ways: to learn policies and affordance models that generalize
from prior data, as a prior for optimization, and as a distance metric for rewards during
fine-tuning. Next, we will describe how we generate subgoals utilizing the bottleneck representation.
3.2 Composing Subgoals using Lossy Affordances
The effectiveness of using subgoals to guide the goal-conditioned policy relies on two conditions.
First, each pair of adjacent subgoals needs to be within the distribution of D_offline so that the pre-trained
goal-conditioned policy will have a sufficient success rate for the transition. Second, the subgoals
should break down the original task into subtasks of reasonable difficulty. As shown in Fig. 2, we
devise an affordance model and a cost function to plan subgoals that satisfy these conditions and use
the learned lossy representation to facilitate generalization to novel target tasks.
Instead of sampling goals from the original high-dimensional state space, we propose subgoals ẑ_{1:K}
in the learned lossy representation space. For this purpose, we learn a lossy affordance model to
capture the distribution p(z′ | z), where z is the representation of a state s and z′ corresponds to the
future state s′ that is reachable within t steps. The lossy affordance model can be constructed as
a parametric model m(z′ | z, u), where u is a latent code that represents the transition from z to z′.
Given a sequence of sampled latent codes u_{1:K}, we can recursively sample the k-th subgoal ẑ_k in
the sequence ẑ_{1:K} with m by taking ẑ_{k−1} and u_k as inputs, where we denote ẑ_0 = z_0.
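As a sketch of this recursive procedure combined with the sampling-based planner from Sec. 2, the code below rolls out a (hypothetical) affordance network m(z′ | z, u) with latent codes drawn from the prior and scores candidate plans with the learned value function; the scoring rule is only a stand-in for the paper's actual selection criterion (its Eq. 6), which is not reproduced in this excerpt.

```python
import torch

def propose_plans(z0, affordance, num_candidates=1024, num_subgoals=4, latent_dim=8):
    """Recursively sample candidate subgoal sequences ẑ_{1:K} in the lossy representation space.
    `affordance(z, u)` is assumed to map a batch of representations and latent codes to ẑ_k."""
    z = z0.expand(num_candidates, -1)                   # ẑ_0 = z_0 for every candidate
    plans = []
    for _ in range(num_subgoals):
        u = torch.randn(num_candidates, latent_dim)     # latent codes u_k ~ p(u) = N(0, I)
        z = affordance(z, u)                            # ẑ_k = m(ẑ_{k-1}, u_k)
        plans.append(z)
    return torch.stack(plans, dim=1)                    # shape: (num_candidates, K, z_dim)

def select_plan(plans, z_goal, value_fn):
    """Pick the candidate whose adjacent subgoals (and final hop to the goal) look most
    reachable under the learned value function V(z, z_g). A stand-in cost, not Eq. 6."""
    num_candidates, K, _ = plans.shape
    scores = torch.zeros(num_candidates)
    for k in range(K - 1):
        scores += value_fn(plans[:, k], plans[:, k + 1])
    scores += value_fn(plans[:, -1], z_goal.expand(num_candidates, -1))
    return plans[scores.argmax()]                       # the selected subgoal sequence
```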
The lossy affordance model is pre-trained on the offline dataset D_offline. Given each pair (s, s′) sam-
pled from D_offline, the pre-trained encoder φ produces a distribution of lossy representations φ(z′|s′).
Given the sampled representation z, we would like the affordance model to propose representations
that follow the distribution φ(z′|s′). Using p_m(z′|z) to denote the marginalized distribution of the