Continual Vision-based Reinforcement Learning with
Group Symmetries

Shiqi Liu∗,1, Mengdi Xu∗,1, Peide Huang1, Xilun Zhang1, Yongkang Liu2, Kentaro Oguchi2, Ding Zhao1
Abstract: Continual reinforcement learning aims to sequentially learn a variety of tasks, retaining the ability to perform previously encountered tasks while simultaneously developing new policies for novel tasks. However, current continual RL approaches overlook the fact that certain tasks are identical under basic group operations like rotations or translations, especially with visual inputs. They may unnecessarily learn and maintain a new policy for each similar task, leading to poor sample efficiency and weak generalization capability. To address this, we introduce a unique Continual Vision-based Reinforcement Learning method that recognizes Group Symmetries, called COVERS, cultivating a policy for each group of equivalent tasks rather than an individual task. COVERS employs a proximal policy optimization (PPO)-based algorithm to train each policy, which contains an equivariant feature extractor and takes inputs with different modalities, including image observations and robot proprioceptive states. It also utilizes an unsupervised task clustering mechanism that relies on the 1-Wasserstein distance on the extracted invariant features. We evaluate COVERS on a sequence of table-top manipulation tasks in simulation and on a real robot platform. Our results show that COVERS accurately assigns tasks to their respective groups and significantly outperforms baselines by generalizing to unseen but equivariant tasks in seen task groups. Demos are available on our project page: https://sites.google.com/view/rl-covers/.
Keywords: Continual Learning, Symmetry, Manipulation
1 Introduction
Quick adaptation to unseen tasks has been a key objective in the field of reinforcement learning
(RL) [1,2,3]. RL algorithms are usually trained in simulated environments and then deployed
in the real world. However, pre-trained RL agents are likely to encounter new tasks during their
deployment due to the nonstationarity of the environment. Blindly reusing policies obtained during
training can result in substantial performance drops and even catastrophic failures [4,5].
Continual RL (CRL), also referred to as lifelong RL, addresses this issue by sequentially learning a series of tasks. It achieves this by generating task-specific policies for the current task, while simultaneously preserving the ability to solve previously encountered tasks [3,6,7,8,9]. Existing CRL works that rely on task delineations to handle non-stationary initial states, dynamics, or reward functions can greatly boost task performance, particularly when significant task changes occur [7]. However, in realistic task-agnostic settings, these delineations are unknown a priori and have to be identified by the agents. In this work, we explore how to define and detect task delineations to enhance robots' learning capabilities in task-agnostic CRL.

∗ indicates equal contribution.
1 Department of Mechanical Engineering, Carnegie Mellon University.
2 R&D, Toyota Motor North America.
[Figure 1 diagram: reflecting the task configuration and passing it through the equivariant policy network yields the correspondingly reflected action.]
Figure 1: This example illustrates how group symmetry enhances adaptability. The robot is instructed to close drawers situated in two distinct locations with top-down images as inputs. Considering the symmetry of the drawers' locations around the robot's position, the optimal control policies are equivalent but mirrored.
Our key insight is that robotic control tasks typically preserve certain desirable structures, such as group symmetries. Existing CRL approaches typically delineate task boundaries based on statistical measures, such as maximum a posteriori estimates and likelihoods [7,8]. However, these measures overlook the geometric information inherent in task representations, which naturally emerges in robotic control tasks, as demonstrated in Figure 1. Consider the drawer-closing example: conventional CRL works using image inputs would treat each mirrored configuration as a new task and learn the task from scratch. Yet we, as humans, understand that the mirrored task configuration can be easily resolved by correspondingly reflecting the actions. Learning the mirrored task from scratch hampers positive task interference and limits the agent's adaptivity. To address this issue, our goal is to exploit the geometric similarity among tasks in the task-agnostic CRL setting to facilitate rapid adaptation to unseen but geometrically equivalent tasks.
In this work, we propose COVERS, a task-agnostic vision-based CRL algorithm with strong sample
efficiency and generalization capability by encoding group symmetries in the state and action spaces.
We define a task group as the set that contains equivalent tasks under the same group operation, such
as rotations and reflections. We state our main contributions as follows:
1. COVERS grows a PPO-based [10] policy with an equivariant feature extractor for each task group, instead of a single task, to solve unseen tasks in seen groups in a zero-shot manner.
2. COVERS utilizes a novel unsupervised task grouping mechanism, which automatically detects group boundaries based on the 1-Wasserstein distance in the invariant feature space (see the sketch after this list).
3. In non-stationary table-top manipulation environments, COVERS performs better than baselines in terms of average rewards and success rates. Moreover, we show that (a) the group-symmetric information from the equivariant feature extractor promotes adaptivity by maximizing the positive interference within each group, and (b) the task grouping mechanism recovers the ground-truth group indexes, which helps minimize the negative interference among different groups.
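For concreteness, the following is a minimal sketch of how such a distance-based grouping mechanism can operate; the per-dimension (marginal) averaging of the 1-Wasserstein distance, the threshold value, and the names (`batch_w1`, `assign_group`) are illustrative assumptions rather than COVERS's exact implementation.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def batch_w1(feats_a, feats_b):
    """Marginal 1-Wasserstein distance between two feature batches,
    averaged over feature dimensions (a simple proxy for the full W1)."""
    return float(np.mean([wasserstein_distance(feats_a[:, d], feats_b[:, d])
                          for d in range(feats_a.shape[1])]))

def assign_group(new_feats, group_feats, threshold=0.5):
    """Assign a batch of invariant features to the nearest existing group,
    or spawn a new group when all distances exceed the threshold."""
    if group_feats:
        dists = [batch_w1(new_feats, f) for f in group_feats]
        best = int(np.argmin(dists))
        if dists[best] < threshold:
            return best                # reuse the policy of a seen group
    group_feats.append(new_feats)      # unseen group: allocate a new policy
    return len(group_feats) - 1
```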
2 Related Work
Task-Agnostic CRL. CRL has been a long-standing problem that aims to train RL agents adaptable to non-stationary environments with evolving world models [11,12,13,14,15,5,16,17,18,19]. In task-agnostic CRL, where task identities are not revealed, existing methods have addressed the problem through a range of techniques. These include hierarchical task modeling with stochastic processes [7,8], meta-learning [3,20], online system identification [21], learning a representation from experience [9,22], and experience replay [14,23]. Considering that in realistic situations the new task may not belong to the same task distribution as past tasks, we develop an ensemble of policy networks capable of handling diverse unseen tasks, rather than relying on a single network to model dynamics or latent representations. Moreover, prior work often depends on data distribution-wise similarity or distances between latent variables, implicitly modeling task relationships. In contrast, we aim to introduce beneficial inductive bias explicitly by developing policy networks with equivariant feature extractors to capture the geometric structures of tasks.
[Figure 2 diagram: the four task groups stream in sequentially over timesteps.]
Figure 2: The continual learning environment setup involves four task groups: Plate Slide, Button Press, Drawer Close, and Goal Reach. Groups arrive in a streaming fashion.
Symmetries in RL. There has been a surge of interest in modeling symmetries in components of Markov Decision Processes (MDPs) to improve generalization and efficiency [24,25,26,27,28,29,30,31,32,33,34,35]. The MDP homomorphic network [26] preserves equivariance under symmetries in the state-action space of an MDP by imposing an equivariance constraint on the policy and value networks. As a result, it reduces the RL agent's solution space and increases sample efficiency. This single-agent MDP homomorphic network was later extended to the multi-agent domain by factorizing global symmetries into local symmetries [27]. SO(2)-Equivariant RL [28] extends the discrete symmetry group to the group of continuous planar rotations, SO(2), to boost performance in robotic manipulation tasks. In contrast, we seek to exploit symmetric properties to improve the generalization capability of task-agnostic CRL algorithms and to handle inputs with multiple modalities.
3 Preliminary
Markov decision process. We consider a Markov decision process (MDP) as a 5-tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action space, respectively. $T: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function, $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma$ is the discount factor. We aim to find an optimal policy $\pi_\theta: \mathcal{S} \to \mathcal{A}$ parameterized by $\theta$ that maximizes the expected return $\mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{H-1} \gamma^t r(s_t, a_t)\right]$, where $H$ is the episode length.
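As a quick worked instance of this objective, the sketch below evaluates the discounted return of a single finite trajectory (reward values are made up for illustration):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=0}^{H-1} gamma^t * r_t by backward (Horner) accumulation."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # 0.99**2 * 1.0 ~= 0.9801
```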
Invariance and equivariance. Let $G$ be a mathematical group and $f: \mathcal{X} \to \mathcal{Y}$ a mapping function. For a transformation $L_g: \mathcal{X} \to \mathcal{X}$ that satisfies $f(x) = f(L_g[x]), \forall g \in G, x \in \mathcal{X}$, we say $f$ is invariant to $L_g$. Equivariance is closely related to invariance. If we can find another transformation $K_g: \mathcal{Y} \to \mathcal{Y}$ that fulfills $K_g[f(x)] = f(L_g[x]), \forall g \in G, x \in \mathcal{X}$, then we say $f$ is equivariant to the transformation $L_g$. It is worth noting that invariance is a special case of equivariance, where $K_g$ is the identity map.
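To make the two definitions concrete, here is a small numerical check of our own (not from the paper): under a horizontal flip $L_g$, the column-sum map is equivariant (with $K_g$ the same flip), while the total-sum map is invariant ($K_g$ is the identity).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))    # a toy "image"

L_g = lambda img: img[:, ::-1]     # group action on X: horizontal flip

# Equivariant: flipping the input flips the column sums (K_g = flip).
f1 = lambda img: img.sum(axis=0)
assert np.allclose(f1(L_g(x)), f1(x)[::-1])

# Invariant: the total sum is unchanged by the flip (K_g = identity).
f2 = lambda img: img.sum()
assert np.allclose(f2(L_g(x)), f2(x))
```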
MDP with group symmetries. In MDPs with symmetries [24,25,26], we can identify at least one mathematical group $G$ of state transformations $L_g: \mathcal{S} \to \mathcal{S}$ and state-dependent action transformations $K_g^s: \mathcal{A} \to \mathcal{A}$, such that
$$R(s, a) = R\left(L_g[s], K_g^s[a]\right), \qquad T(s, a, s') = T\left(L_g[s], K_g^s[a], L_g[s']\right)$$
hold for all $g \in G$, $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$.
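As an illustration only (not the paper's environment), a 1-D point mass with a distance-to-origin penalty satisfies these conditions under reflection, with $L_g[s] = -s$ and $K_g^s[a] = -a$:

```python
import numpy as np

step   = lambda s, a: s + a         # deterministic toy dynamics
reward = lambda s, a: -abs(s + a)   # closer to the origin is better

L_g = lambda s: -s                  # state reflection
K_g = lambda s, a: -a               # (state-dependent) action reflection

for s, a in [(1.0, -0.3), (-2.0, 0.5)]:
    # Reward invariance: R(s, a) = R(L_g[s], K_g^s[a])
    assert np.isclose(reward(s, a), reward(L_g(s), K_g(s, a)))
    # Transition equivariance: next states are related by the same reflection
    assert np.isclose(L_g(step(s, a)), step(L_g(s), K_g(s, a)))
```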
Equivariant convolutional layer. Let $G$ be a Euclidean group, with the special orthogonal group and the reflection group as subgroups. We use the equivariant convolutional layer developed by Weiler and Cesa [36], where each layer consists of $G$-steerable kernels $k: \mathbb{R}^2 \to \mathbb{R}^{c_{out} \times c_{in}}$ that satisfy
$$k(gx) = \rho_{out}(g)\, k(x)\, \rho_{in}(g^{-1}), \qquad \forall g \in G, x \in \mathbb{R}^2,$$
where $\rho_{in}$ and $\rho_{out}$ are the types of the input vector field $f_{in}: \mathbb{R}^2 \to \mathbb{R}^{c_{in}}$ and the output vector field $f_{out}: \mathbb{R}^2 \to \mathbb{R}^{c_{out}}$, respectively.
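For reference, a steerable convolution of this kind can be instantiated with Weiler and Cesa's e2cnn library [36]; the group (C4 rotations), channel counts, and kernel size below are arbitrary choices for illustration and need not match the architecture used in this paper.

```python
import torch
from e2cnn import gspaces
from e2cnn import nn as enn

# Symmetry group: the four planar rotations (C4) acting on 2-D feature maps.
r2_act = gspaces.Rot2dOnR2(N=4)

# Input: a 3-channel image whose channels transform as scalar fields.
in_type  = enn.FieldType(r2_act, 3 * [r2_act.trivial_repr])
# Output: 8 copies of the regular representation of C4.
out_type = enn.FieldType(r2_act, 8 * [r2_act.regular_repr])

conv = enn.R2Conv(in_type, out_type, kernel_size=5, padding=2)

x = enn.GeometricTensor(torch.randn(1, 3, 32, 32), in_type)
y = conv(x)
# Equivariance check: rotating the input rotates/permutes the output fields.
for g in r2_act.testing_elements:
    assert torch.allclose(conv(x.transform(g)).tensor,
                          y.transform(g).tensor, atol=1e-5)
```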
Equivariant MLP. An equivariant multi-layer perceptron (MLP) consists of both equivariant linear layers and equivariant nonlinearities. An equivariant linear layer is a linear map $W$ from one vector space $V_{in}$ with type $\rho_{in}$ to another vector space $V_{out}$ with type $\rho_{out}$ for a given group $G$. Formally, $\forall x \in V_{in}, g \in G: \rho_{out}(g) W x = W \rho_{in}(g) x$. Here we use the numerical method proposed by Finzi et al. [37] to parameterize MLPs that are equivariant to arbitrary groups.
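The constraint above can be solved numerically, which is the essence of the approach of Finzi et al. [37]: stack the linear constraints induced by the group generators and take the null space. Below is a minimal NumPy sketch of our own for a single generator of a finite group (stacking the constraint matrices of all generators extends it to the full group); the function name and tolerance are illustrative.

```python
import numpy as np

def equivariant_basis(rho_in, rho_out, tol=1e-10):
    """Numerically solve rho_out @ W = W @ rho_in for one group generator.
    Returns basis matrices spanning the space of equivariant linear maps."""
    n_out, n_in = rho_out.shape[0], rho_in.shape[0]
    # Column-major vec identities:
    #   vec(rho_out @ W) = (I kron rho_out) @ vec(W)
    #   vec(W @ rho_in)  = (rho_in.T kron I) @ vec(W)
    C = np.kron(np.eye(n_in), rho_out) - np.kron(rho_in.T, np.eye(n_out))
    _, s, Vt = np.linalg.svd(C)
    s = np.concatenate([s, np.zeros(Vt.shape[0] - s.size)])  # pad if needed
    return [Vt[i].reshape((n_out, n_in), order="F")
            for i in range(Vt.shape[0]) if s[i] < tol]

# Example: the reflection group C2 acting as diag(-1, 1) on input and output.
rho = np.diag([-1.0, 1.0])
for W in equivariant_basis(rho, rho):
    print(W)  # only diagonal matrices commute with diag(-1, 1)
```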