skills more quickly, the answer for robots may be to effectively leverage prior data. But in the setting
of robotic learning, this raises a number of complex questions: What sort of knowledge should we
derive from this prior data? How do we break up diverse and uncurated prior datasets into distinct
skills? And how do we then use and adapt these skills for solving novel tasks?
To learn useful behaviors from data spanning a wide range of tasks, we can employ goal-conditioned
reinforcement learning (GCRL) [1,2], where a policy is trained to attain a specified goal (e.g.,
indicated as an image) from the current state. This makes it practical to learn from large, previously
collected datasets without explicit reward annotations and to share knowledge across tasks
that exhibit similar physical phenomena. However, it is usually difficult to learn goal-conditioned
policies that can solve temporally extended tasks zero-shot, as such policies are typically only
effective for short-horizon goals [3]. To solve a target task, we therefore need to transfer the learned
knowledge to the test environment and efficiently fine-tune the policy online.
Our proposed solution combines representation learning, planning, and online fine-tuning. The
key intuition is that, if we can learn a suitable representation of states and goals that generalizes
effectively across environments, then we can plan subgoals in this representation space to solve
long-horizon tasks, and also leverage it to help fine-tune the goal-conditioned policy on the new
task online. Core to this planning procedure is the use of affordance models [4,5], which predict
potentially reachable states that can serve as subgoal proposals for the planning process. Good state
representations are necessary for this process: (1) as inputs and outputs for the affordance model,
which must generalize effectively across tasks and domains (since if the affordance model doesn’t
generalize, it can’t provide guidance for policy fine-tuning); (2) as inputs into the policy, so as to
facilitate rapid adaptation; (3) as measures of state proximity for use as reward functions for fine-
tuning the policy, since well-shaped rewards are essential for rapid online training but notoriously
difficult to obtain without manual engineering.
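As a concrete illustration of use (3), the sketch below shows how a learned encoder could supply a shaped reward as a negative distance in representation space. Here `encode` and `shaped_reward` are hypothetical placeholders for illustration, not the paper's actual networks or reward definition.

```python
import numpy as np

# Hypothetical stand-in for the learned state/goal encoder; in the actual
# method this would be a neural network trained on the offline data.
def encode(image: np.ndarray) -> np.ndarray:
    return image.astype(np.float32).reshape(-1)[:32]  # placeholder projection

def shaped_reward(state_image: np.ndarray, goal_image: np.ndarray) -> float:
    """Shaped reward as negative distance between the state and goal
    embeddings in the learned representation space (use (3) above)."""
    z_state, z_goal = encode(state_image), encode(goal_image)
    return -float(np.linalg.norm(z_state - z_goal))
```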
To this end, we propose Fine-Tuning with Lossy Affordance Planner (FLAP), a framework that
leverages diverse offline data for learning representations, goal-conditioned policies, and affordance
models that enable rapid fine-tuning to new tasks in target scenes. As shown in Fig. 1, our lossy
representations are acquired during the offline pre-training process for the goal-conditioned policy
using a variational information bottleneck (VIB) [6]. Intuitively, this representation learns to capture
the minimal information necessary to determine if and how a particular goal can be reached from a
particular state, making it ideally suited for planning over subgoals and providing reward shaping
during fine-tuning. When a new task is specified to the robot with a goal image, we use an affordance
model that predicts reachable states in this learned lossy representation space to plan subgoals for
prospectively accomplishing this goal. During the fine-tuning process, the goal-conditioned policy is
fine-tuned on each subgoal using the informative reward signal computed from the learned representations.
Both the offline and online stages of this process operate entirely from images and do not
require any hand-designed reward functions beyond those extracted automatically from the learned
representation. Building on components from prior work [5,7], we demonstrate that this particular
combination of lossy representation learning, goal-conditioned policies, and planning-driven
fine-tuning can enable performance that significantly exceeds that of prior methods. We evaluate our
method in both real-world and simulated environments using previously collected offline data [8,7]
for learning novel robotic manipulation tasks. Compared to baselines, the proposed method achieves
higher success rates with fewer online fine-tuning iterations.
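To make the planning step described above concrete, the following is a minimal sketch of greedy subgoal selection in a learned latent space. The function `affordance_sample` is a hypothetical stand-in for the affordance model (here a Gaussian perturbation assumed purely for illustration), and the greedy chaining is a simplification rather than the method's actual planner.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 32  # assumed size of the lossy representation

# Hypothetical affordance model: propose latent states plausibly reachable
# from z within a short horizon (Gaussian perturbation for illustration only).
def affordance_sample(z: np.ndarray, n: int) -> np.ndarray:
    return z + 0.1 * rng.standard_normal((n, LATENT_DIM))

def plan_subgoals(z_start: np.ndarray, z_goal: np.ndarray,
                  n_subgoals: int = 3, n_candidates: int = 256) -> list:
    """Greedily chain affordance proposals toward the goal representation.
    Each chosen latent becomes a subgoal the policy is fine-tuned to reach."""
    subgoals, z = [], z_start
    for _ in range(n_subgoals):
        candidates = affordance_sample(z, n_candidates)
        distances = np.linalg.norm(candidates - z_goal, axis=1)
        z = candidates[int(np.argmin(distances))]
        subgoals.append(z)
    return subgoals
```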
2 Preliminaries
Goal-conditioned reinforcement learning. We consider a goal-conditioned reinforcement learning
(GCRL) setting with states and goals as images. The goal-reaching tasks can be represented by a
Markov Decision Process (MDP) denoted by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \rho, P, \mathcal{G}, \gamma)$ with state space $\mathcal{S}$,
action space $\mathcal{A}$, initial state probability $\rho$, transition probability $P$, a goal space $\mathcal{G}$, and discount
factor $\gamma$. Each goal-reaching task can be defined by a pair of initial state $s_0 \in \mathcal{S}$ and desired goal
$s_g \in \mathcal{G}$. We assume states and goals are defined in the same space, i.e., $\mathcal{G} = \mathcal{S}$. By selecting the
action $a_t \in \mathcal{A}$ at each time step $t$, the goal-conditioned policy $\pi(a_t \mid s_t, s_g)$ aims to reach a state $s$
such that $d(s - s_g) \leq \epsilon$, where $d$ is a distance metric and $\epsilon$ is the threshold for reaching the goal.
We use the sparse reward function $r_t(s_{t+1}, s_g)$, which outputs $0$ when the goal is reached and $-1$
otherwise, and the objective is defined as the average expected discounted return $\mathbb{E}[\sum_t \gamma^t r_t]$.
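As a minimal sketch of this setup, the snippet below implements the sparse reward and the per-trajectory discounted return just defined; the Euclidean distance, threshold value, and array-valued states are illustrative assumptions rather than the paper's actual choices.

```python
import numpy as np

def sparse_reward(s_next: np.ndarray, s_goal: np.ndarray,
                  epsilon: float = 0.05) -> float:
    """r_t(s_{t+1}, s_g): 0 if d(s_{t+1} - s_g) <= epsilon, else -1.
    The Euclidean distance and threshold are illustrative choices."""
    return 0.0 if np.linalg.norm(s_next - s_goal) <= epsilon else -1.0

def discounted_return(rewards: list, gamma: float = 0.99) -> float:
    """Single-trajectory term of the objective E[sum_t gamma^t r_t]."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```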