skills more quickly, the answer for robots may be to effectively leverage prior data. But in the setting
of robotic learning, this raises a number of complex questions: What sort of knowledge should we
derive from this prior data? How do we break up diverse and uncurated prior datasets into distinct
skills? And how do we then use and adapt these skills for solving novel tasks?
To learn useful behaviors from data spanning a wide range of tasks, we can employ goal-conditioned
reinforcement learning (GCRL) [1,2], where a policy is trained to attain a specified goal (e.g.,
indicated as an image) from the current state. This makes it practical to learn from large, previously
collected datasets without explicit reward annotations and to share knowledge across tasks
that exhibit similar physical phenomena. However, it is usually difficult to learn goal-conditioned
policies that can solve temporally extended tasks zero-shot, as such policies are typically only
effective for short-horizon goals [3]. To solve a target task, we therefore need to transfer the learned
knowledge to the test environment and efficiently fine-tune the policy online.
Our proposed solution combines representation learning, planning, and online fine-tuning. The
key intuition is that, if we can learn a suitable representation of states and goals that generalizes
effectively across environments, then we can plan subgoals in this representation space to solve
long-horizon tasks, and also leverage it to help fine-tune the goal-conditioned policy on the new
task online. Core to this planning procedure is the use of affordance models [4,5], which predict
potentially reachable states that can serve as subgoal proposals for the planning process. Good state
representations are necessary for this process: (1) as inputs and outputs for the affordance model,
which must generalize effectively across tasks and domains (since if the affordance model doesn’t
generalize, it can’t provide guidance for policy fine-tuning); (2) as inputs into the policy, so as to
facilitate rapid adaptation; (3) as measures of state proximity for use as reward functions for fine-
tuning the policy, since well-shaped rewards are essential for rapid online training but notoriously
difficult to obtain without manual engineering.
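As a concrete illustration of use (3), the sketch below shows how a learned encoder could supply a shaped reward as a negative distance in representation space. Here `encode` and `shaped_reward` are hypothetical placeholders for illustration, not the paper's actual networks or reward definition.

```python
import numpy as np

# Hypothetical stand-in for the learned state/goal encoder; in the actual
# method this would be a neural network trained on the offline data.
def encode(image: np.ndarray) -> np.ndarray:
    return image.astype(np.float32).reshape(-1)[:32]  # placeholder projection

def shaped_reward(state_image: np.ndarray, goal_image: np.ndarray) -> float:
    """Shaped reward as negative distance between the state and goal
    embeddings in the learned representation space (use (3) above)."""
    z_state, z_goal = encode(state_image), encode(goal_image)
    return -float(np.linalg.norm(z_state - z_goal))
```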
To this end, we propose Fine-Tuning with Lossy Affordance Planner (FLAP), a framework that
leverages diverse offline data for learning representations, goal-conditioned policies, and affordance
models that enable rapid fine-tuning to new tasks in target scenes. As shown in Fig. 1, our lossy
representations are acquired during the offline pre-training process for the goal-conditioned policy
using a variational information bottleneck (VIB) [6]. Intuitively, this representation learns to capture
the minimal information necessary to determine if and how a particular goal can be reached from a
particular state, making it ideally suited for planning over subgoals and providing reward shaping
during fine-tuning. When a new task is specified to the robot with a goal image, we use an affordance
model that predicts reachable states in this learned lossy representation space to plan subgoals for
prospectively accomplishing this goal. During the fine-tuning process, the goal-conditioned policy is
fine-tuned on each subgoal using the informative reward signal computed from the learned representations.
Both the offline and online stages of this process operate entirely from images and do not
require any hand-designed reward functions beyond those extracted automatically from the learned
representation. Building on components from prior work [5,7], we demonstrate that this particular
combination of lossy representation learning, goal-conditioned policies, and planning-driven
fine-tuning can enable performance that significantly exceeds that of prior methods. We evaluate our
method in both real-world and simulated environments using previously collected offline data [8,7]
for learning novel robotic manipulation tasks. Compared to baselines, the proposed method achieves
higher success rates with fewer online fine-tuning iterations.
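To make the planning step described above concrete, the following is a minimal sketch of greedy subgoal selection in a learned latent space. The function `affordance_sample` is a hypothetical stand-in for the affordance model (here a Gaussian perturbation assumed purely for illustration), and the greedy chaining is a simplification rather than the method's actual planner.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 32  # assumed size of the lossy representation

# Hypothetical affordance model: propose latent states plausibly reachable
# from z within a short horizon (Gaussian perturbation for illustration only).
def affordance_sample(z: np.ndarray, n: int) -> np.ndarray:
    return z + 0.1 * rng.standard_normal((n, LATENT_DIM))

def plan_subgoals(z_start: np.ndarray, z_goal: np.ndarray,
                  n_subgoals: int = 3, n_candidates: int = 256) -> list:
    """Greedily chain affordance proposals toward the goal representation.
    Each chosen latent becomes a subgoal the policy is fine-tuned to reach."""
    subgoals, z = [], z_start
    for _ in range(n_subgoals):
        candidates = affordance_sample(z, n_candidates)
        distances = np.linalg.norm(candidates - z_goal, axis=1)
        z = candidates[int(np.argmin(distances))]
        subgoals.append(z)
    return subgoals
```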
2 Preliminaries
Goal-conditioned reinforcement learning. We consider a goal-conditioned reinforcement learning
(GCRL) setting with states and goals as images. The goal-reaching tasks can be represented by a
Markov Decision Process (MDP) denoted by a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \rho, P, \mathcal{G}, \gamma)$ with state space $\mathcal{S}$,
action space $\mathcal{A}$, initial state probability $\rho$, transition probability $P$, a goal space $\mathcal{G}$, and discount
factor $\gamma$. Each goal-reaching task can be defined by a pair of initial state $s_0 \in \mathcal{S}$ and desired goal
$s_g \in \mathcal{G}$. We assume states and goals are defined in the same space, i.e., $\mathcal{G} = \mathcal{S}$. By selecting the
action $a_t \in \mathcal{A}$ at each time step $t$, the goal-conditioned policy $\pi(a_t \mid s_t, s_g)$ aims to reach a state $s$
such that $d(s - s_g) \leq \epsilon$, where $d$ is a distance metric and $\epsilon$ is the threshold for reaching the goal.
We use the sparse reward function $r_t(s_{t+1}, s_g)$, which outputs $0$ when the goal is reached and $-1$
otherwise, and the objective is defined as the average expected discounted return $\mathbb{E}[\sum_t \gamma^t r_t]$.
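As a minimal sketch of this setup, the snippet below implements the sparse reward and the per-trajectory discounted return just defined; the Euclidean distance, threshold value, and array-valued states are illustrative assumptions rather than the paper's actual choices.

```python
import numpy as np

def sparse_reward(s_next: np.ndarray, s_goal: np.ndarray,
                  epsilon: float = 0.05) -> float:
    """r_t(s_{t+1}, s_g): 0 if d(s_{t+1} - s_g) <= epsilon, else -1.
    The Euclidean distance and threshold are illustrative choices."""
    return 0.0 if np.linalg.norm(s_next - s_goal) <= epsilon else -1.0

def discounted_return(rewards: list, gamma: float = 0.99) -> float:
    """Single-trajectory term of the objective E[sum_t gamma^t r_t]."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```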