Continual Vision-based Reinforcement Learning with
Group Symmetries

Shiqi Liu∗,1, Mengdi Xu∗,1, Peide Huang1, Xilun Zhang1, Yongkang Liu2, Kentaro Oguchi2, Ding Zhao1
Abstract: Continual reinforcement learning aims to sequentially learn a variety of tasks, retaining the ability to perform previously encountered tasks while simultaneously developing new policies for novel tasks. However, current continual RL approaches overlook the fact that certain tasks are identical under basic group operations like rotations or translations, especially with visual inputs. They may unnecessarily learn and maintain a new policy for each similar task, leading to poor sample efficiency and weak generalization capability. To address this, we introduce a unique Continual Vision-based Reinforcement Learning method that recognizes Group Symmetries, called COVERS, cultivating a policy for each group of equivalent tasks rather than an individual task. COVERS employs a proximal policy optimization (PPO)-based algorithm to train each policy, which contains an equivariant feature extractor and takes inputs with different modalities, including image observations and robot proprioceptive states. It also utilizes an unsupervised task clustering mechanism that relies on the 1-Wasserstein distance on the extracted invariant features. We evaluate COVERS on a sequence of table-top manipulation tasks in simulation and on a real robot platform. Our results show that COVERS accurately assigns tasks to their respective groups and significantly outperforms baselines by generalizing to unseen but equivariant tasks in seen task groups. Demos are available on our project page: https://sites.google.com/view/rl-covers/.
Keywords: Continual Learning, Symmetry, Manipulation
1 Introduction
Quick adaptation to unseen tasks has been a key objective in the field of reinforcement learning
(RL) [1,2,3]. RL algorithms are usually trained in simulated environments and then deployed
in the real world. However, pre-trained RL agents are likely to encounter new tasks during their
deployment due to the nonstationarity of the environment. Blindly reusing policies obtained during
training can result in substantial performance drops and even catastrophic failures [4,5].
Continual RL (CRL), also referred to as lifelong RL, addresses this issue by sequentially learning a series of tasks. It achieves this by generating task-specific policies for the current task, while simultaneously preserving the ability to solve previously encountered tasks [3,6,7,8,9]. Existing CRL works that rely on task delineations to handle non-stationary initial states, dynamics, or reward functions can greatly boost task performance, particularly when significant task changes occur [7]. However, in realistic task-agnostic settings, these delineations are unknown a priori and have to be identified by the agents. In this work, we explore how to define and detect task delineations to enhance robots' learning capabilities in task-agnostic CRL.

∗ indicates equal contribution.
1 Department of Mechanical Engineering, Carnegie Mellon University.
2 R&D, Toyota Motor North America.
[Figure 1 diagram: reflecting the task configuration and passing it through the equivariant policy network yields the correspondingly reflected action.]
Figure 1: This example illustrates how group symmetry enhances adaptability. The robot is instructed to close drawers situated in two distinct locations with top-down images as inputs. Considering the symmetry of the drawers' locations around the robot's position, the optimal control policies are equivalent but mirrored.
Our key insight is that robotic control tasks typically preserve certain desirable structures, such as group symmetries. Existing CRL approaches typically delineate task boundaries based on statistical measures, such as maximum a posteriori estimates and likelihoods [7,8]. However, these measures overlook the geometric information inherent in task representations, which naturally emerges in robotic control tasks, as demonstrated in Figure 1. Consider the drawer-closing example: conventional CRL works using image inputs would treat each mirrored configuration as a new task and learn the task from scratch. Yet we, as humans, understand that the mirrored task configuration can be easily resolved by correspondingly reflecting the actions. Learning the mirrored task from scratch hampers positive task interference and limits the agent's adaptivity. To address this issue, our goal is to exploit the geometric similarity among tasks in the task-agnostic CRL setting to facilitate rapid adaptation to unseen but geometrically equivalent tasks.
In this work, we propose COVERS, a task-agnostic vision-based CRL algorithm with strong sample
efficiency and generalization capability by encoding group symmetries in the state and action spaces.
We define a task group as the set that contains equivalent tasks under the same group operation, such
as rotations and reflections. We state our main contributions as follows:
1. COVERS grows a PPO-based [10] policy with an equivariant feature extractor for each task group, instead of a single task, to solve unseen tasks in seen groups in a zero-shot manner.
2. COVERS utilizes a novel unsupervised task grouping mechanism, which automatically detects group boundaries based on the 1-Wasserstein distance in the invariant feature space (see the sketch after this list).
3. In non-stationary table-top manipulation environments, COVERS performs better than baselines in terms of average rewards and success rates. Moreover, we show that (a) the group-symmetric information from the equivariant feature extractor promotes adaptivity by maximizing the positive interference within each group, and (b) the task grouping mechanism recovers the ground-truth group indexes, which helps minimize the negative interference among different groups.
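For concreteness, the following is a minimal sketch of how such a distance-based grouping mechanism can operate; the per-dimension (marginal) averaging of the 1-Wasserstein distance, the threshold value, and the names (`batch_w1`, `assign_group`) are illustrative assumptions rather than COVERS's exact implementation.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def batch_w1(feats_a, feats_b):
    """Marginal 1-Wasserstein distance between two feature batches,
    averaged over feature dimensions (a simple proxy for the full W1)."""
    return float(np.mean([wasserstein_distance(feats_a[:, d], feats_b[:, d])
                          for d in range(feats_a.shape[1])]))

def assign_group(new_feats, group_feats, threshold=0.5):
    """Assign a batch of invariant features to the nearest existing group,
    or spawn a new group when all distances exceed the threshold."""
    if group_feats:
        dists = [batch_w1(new_feats, f) for f in group_feats]
        best = int(np.argmin(dists))
        if dists[best] < threshold:
            return best                # reuse the policy of a seen group
    group_feats.append(new_feats)      # unseen group: allocate a new policy
    return len(group_feats) - 1
```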
2 Related Work
Task-Agnostic CRL. CRL has been a long-standing problem that aims to train RL agents adaptable to non-stationary environments with evolving world models [11,12,13,14,15,5,16,17,18,19]. In task-agnostic CRL, where task identities are not revealed, existing methods have addressed the problem through a range of techniques. These include hierarchical task modeling with stochastic processes [7,8], meta-learning [3,20], online system identification [21], learning a representation from experience [9,22], and experience replay [14,23]. Considering that in realistic situations the new task may not belong to the same task distribution as past tasks, we develop an ensemble of policy networks capable of handling diverse unseen tasks, rather than relying on a single network to model dynamics or latent representations. Moreover, prior work often depends on data distribution-wise similarity or distances between latent variables, implicitly modeling task relationships. In contrast, we aim to introduce beneficial inductive bias explicitly by developing policy networks with equivariant feature extractors to capture the geometric structures of tasks.
[Figure 2 diagram: the four task groups stream in sequentially over timesteps.]
Figure 2: The continual learning environment setup involves four task groups: Plate Slide, Button Press, Drawer Close, and Goal Reach. Groups arrive in a streaming fashion.
Symmetries in RL. There has been a surge of interest in modeling symmetries in components of Markov Decision Processes (MDPs) to improve generalization and efficiency [24,25,26,27,28,29,30,31,32,33,34,35]. The MDP homomorphic network [26] preserves equivariance under symmetries in the state-action space of an MDP by imposing an equivariance constraint on the policy and value networks. As a result, it reduces the RL agent's solution space and increases sample efficiency. This single-agent MDP homomorphic network was later extended to the multi-agent domain by factorizing global symmetries into local symmetries [27]. SO(2)-Equivariant RL [28] extends the discrete symmetry group to the group of continuous planar rotations, SO(2), to boost performance in robotic manipulation tasks. In contrast, we seek to exploit symmetric properties to improve the generalization capability of task-agnostic CRL algorithms and to handle inputs with multiple modalities.
3 Preliminary
Markov decision process. We consider a Markov decision process (MDP) as a 5-tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action space, respectively. $T: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function, $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma$ is the discount factor. We aim to find an optimal policy $\pi_\theta: \mathcal{S} \to \mathcal{A}$ parameterized by $\theta$ that maximizes the expected return $\mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{H-1} \gamma^t r(s_t, a_t)\right]$, where $H$ is the episode length.
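As a quick worked instance of this objective, the sketch below evaluates the discounted return of a single finite trajectory (reward values are made up for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=0}^{H-1} gamma^t * r_t by backward (Horner) accumulation."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # 0.99**2 * 1.0 ~= 0.9801
```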
Invariance and equivariance. Let $G$ be a mathematical group and $f: \mathcal{X} \to \mathcal{Y}$ a mapping function. For a transformation $L_g: \mathcal{X} \to \mathcal{X}$ that satisfies $f(x) = f(L_g[x]), \forall g \in G, x \in \mathcal{X}$, we say $f$ is invariant to $L_g$. Equivariance is closely related to invariance. If we can find another transformation $K_g: \mathcal{Y} \to \mathcal{Y}$ that fulfills $K_g[f(x)] = f(L_g[x]), \forall g \in G, x \in \mathcal{X}$, then we say $f$ is equivariant to the transformation $L_g$. It is worth noting that invariance is a special case of equivariance, where $K_g$ is the identity map.
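To make the two definitions concrete, here is a small numerical check of our own (not from the paper): under a horizontal flip $L_g$, the column-sum map is equivariant (with $K_g$ the same flip), while the total-sum map is invariant ($K_g$ is the identity).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))    # a toy "image"

L_g = lambda img: img[:, ::-1]     # group action on X: horizontal flip

# Equivariant: flipping the input flips the column sums (K_g = flip).
f1 = lambda img: img.sum(axis=0)
assert np.allclose(f1(L_g(x)), f1(x)[::-1])

# Invariant: the total sum is unchanged by the flip (K_g = identity).
f2 = lambda img: img.sum()
assert np.allclose(f2(L_g(x)), f2(x))
```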
MDP with group symmetries. In MDPs with symmetries [24,25,26], we can identify at least one mathematical group $G$ of state transformations $L_g: \mathcal{S} \to \mathcal{S}$ and state-dependent action transformations $K_g^s: \mathcal{A} \to \mathcal{A}$, such that
$$R(s, a) = R\left(L_g[s], K_g^s[a]\right), \qquad T(s, a, s') = T\left(L_g[s], K_g^s[a], L_g[s']\right)$$
hold for all $g \in G$, $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$.
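As an illustration only (not the paper's environment), a 1-D point mass with a distance-to-origin penalty satisfies these conditions under reflection, with $L_g[s] = -s$ and $K_g^s[a] = -a$:

```python
import numpy as np

step   = lambda s, a: s + a         # deterministic toy dynamics
reward = lambda s, a: -abs(s + a)   # closer to the origin is better

L_g = lambda s: -s                  # state reflection
K_g = lambda s, a: -a               # (state-dependent) action reflection

for s, a in [(1.0, -0.3), (-2.0, 0.5)]:
    # Reward invariance: R(s, a) = R(L_g[s], K_g^s[a])
    assert np.isclose(reward(s, a), reward(L_g(s), K_g(s, a)))
    # Transition equivariance: next states are related by the same reflection
    assert np.isclose(L_g(step(s, a)), step(L_g(s), K_g(s, a)))
```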
Equivariant convolutional layer. Let $G$ be a Euclidean group, with the special orthogonal group and the reflection group as subgroups. We use the equivariant convolutional layer developed by Weiler and Cesa [36], where each layer consists of $G$-steerable kernels $k: \mathbb{R}^2 \to \mathbb{R}^{c_{out} \times c_{in}}$ that satisfy
$$k(gx) = \rho_{out}(g)\, k(x)\, \rho_{in}(g^{-1}), \qquad \forall g \in G, x \in \mathbb{R}^2,$$
where $\rho_{in}$ and $\rho_{out}$ are the types of the input vector field $f_{in}: \mathbb{R}^2 \to \mathbb{R}^{c_{in}}$ and the output vector field $f_{out}: \mathbb{R}^2 \to \mathbb{R}^{c_{out}}$, respectively.
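For reference, a steerable convolution of this kind can be instantiated with Weiler and Cesa's e2cnn library [36]; the group (C4 rotations), channel counts, and kernel size below are arbitrary choices for illustration and need not match the architecture used in this paper.

```python
import torch
from e2cnn import gspaces
from e2cnn import nn as enn

# Symmetry group: the four planar rotations (C4) acting on 2-D feature maps.
r2_act = gspaces.Rot2dOnR2(N=4)

# Input: a 3-channel image whose channels transform as scalar fields.
in_type  = enn.FieldType(r2_act, 3 * [r2_act.trivial_repr])
# Output: 8 copies of the regular representation of C4.
out_type = enn.FieldType(r2_act, 8 * [r2_act.regular_repr])

conv = enn.R2Conv(in_type, out_type, kernel_size=5, padding=2)

x = enn.GeometricTensor(torch.randn(1, 3, 32, 32), in_type)
y = conv(x)
# Equivariance check: rotating the input rotates/permutes the output fields.
for g in r2_act.testing_elements:
    assert torch.allclose(conv(x.transform(g)).tensor,
                          y.transform(g).tensor, atol=1e-5)
```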
Equivariant MLP. An equivariant multi-layer perceptron (MLP) consists of both equivariant linear layers and equivariant nonlinearities. An equivariant linear layer is a linear map $W$ from one vector space $V_{in}$ with type $\rho_{in}$ to another vector space $V_{out}$ with type $\rho_{out}$ for a given group $G$. Formally, $\forall x \in V_{in}, g \in G: \rho_{out}(g) W x = W \rho_{in}(g) x$. Here we use the numerical method proposed by Finzi et al. [37] to parameterize MLPs that are equivariant to arbitrary groups.
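The constraint above can be solved numerically, which is the essence of the approach of Finzi et al. [37]: stack the linear constraints induced by the group generators and take the null space. Below is a minimal NumPy sketch of our own for a single generator of a finite group (stacking the constraint matrices of all generators extends it to the full group); the function name and tolerance are illustrative.

```python
import numpy as np

def equivariant_basis(rho_in, rho_out, tol=1e-10):
    """Numerically solve rho_out @ W = W @ rho_in for one group generator.
    Returns basis matrices spanning the space of equivariant linear maps."""
    n_out, n_in = rho_out.shape[0], rho_in.shape[0]
    # Column-major vec identities:
    #   vec(rho_out @ W) = (I kron rho_out) @ vec(W)
    #   vec(W @ rho_in)  = (rho_in.T kron I) @ vec(W)
    C = np.kron(np.eye(n_in), rho_out) - np.kron(rho_in.T, np.eye(n_out))
    _, s, Vt = np.linalg.svd(C)
    s = np.concatenate([s, np.zeros(Vt.shape[0] - s.size)])  # pad if needed
    return [Vt[i].reshape((n_out, n_in), order="F")
            for i in range(Vt.shape[0]) if s[i] < tol]

# Example: the reflection group C2 acting as diag(-1, 1) on input and output.
rho = np.diag([-1.0, 1.0])
for W in equivariant_basis(rho, rho):
    print(W)  # only diagonal matrices commute with diag(-1, 1)
```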