Decomposed Mutual Information Optimization for
Generalized Context in Meta-Reinforcement
Learning
Yao Mu
The University of Hong Kong
muyao@connect.hku.hk
Yuzheng Zhuang
Huawei Noah’s Ark Lab
zhuangyuzheng@huawei.com
Fei Ni
Tianjin University
fei_ni@tju.edu.cn
Bin Wang
Huawei Noah’s Ark Lab
wangbin158@huawei.com
Jianyu Chen
Tsinghua University
jianyuchen@tsinghua.edu.cn
Jianye Hao
Huawei Noah’s Ark Lab
haojianye@huawei.com
Ping Luo
The University of Hong Kong
pluo@cs.hku.hk
Abstract
Adapting to the changes in transition dynamics is essential in robotic applications.
By learning a conditional policy with a compact context, context-aware meta-
reinforcement learning provides a flexible way to adjust behavior according to
dynamics changes. However, in real-world applications, the agent may encounter
complex dynamics changes. Multiple confounders can influence the transition
dynamics, making it challenging to infer accurate context for decision-making.
This paper addresses such a challenge by DecOmposed Mutual INformation Optimization (DOMINO) for context learning, which explicitly learns a disentangled context to maximize the mutual information between the context and historical
trajectories, while minimizing the state transition prediction error. Our theoretical
analysis shows that DOMINO can overcome the underestimation of the mutual information caused by multiple confounders by learning a disentangled context, and reduce the demand for the number of samples collected in various
environments. Extensive experiments show that the context learned by DOMINO
benefits both model-based and model-free reinforcement learning algorithms for
dynamics generalization in terms of sample efficiency and performance in unseen
environments. Open-sourced code is released on our homepage.
1 Introduction
Dynamics generalization in deep reinforcement learning (RL) investigates the problem of training an RL agent in a few kinds of environments and adapting across unseen system dynamics or structures, such as different physical parameters or robot morphologies. Meta-Reinforcement Learning (Meta-RL) has been proposed to tackle this problem by training on a range of tasks and quickly adapting to a new task with the learned prior knowledge. However, training in meta-RL requires orders of magnitude more samples than single-task RL, since the agent not only has to learn to infer the changes in the environment but also has to learn the corresponding policies.
Ping Luo is the corresponding author. Yao Mu and Fei Ni conducted this work during an internship at Huawei Noah’s Ark Lab.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04209v1 [cs.LG] 9 Oct 2022
[Figure 1 panels: Cartpole, Cheetah, and Ant environments whose dynamics depend on confounders such as mass, leg length, and a crippled leg. (a) Example of the multi-confounded environments. (b) Performance comparison.]
Figure 1: Generalization with complex dynamics changes. The transition dynamics of the robot may be influenced by multiple confounders simultaneously, such as the mass, the length of a leg, or a crippled leg. In real-world situations, all the possible confounders may change at the same time, which brings challenges to robotic dynamics generalization. DOMINO addresses this problem by decomposed MI optimization and achieves state-of-the-art performance.
Context-aware meta-RL methods take a step further and show promising potential to capture local dynamics explicitly by learning an additional context vector from historical trajectories [1-4]. The historical trajectories are sampled from the joint distribution of multiple confounders, which are the key factors that cause the dynamics changes. Accordingly, if multiple confounders affect the dynamics simultaneously, the state transition distribution becomes highly multi-modal, leading to challenges in extracting accurate context.
Recent advanced context-aware meta-RL methods [5-8] further improve meta-RL via contrastive learning, which in essence optimizes the InfoNCE bound [9] of the mutual information. These methods show promising improvements in entangled context learning, which performs well in single-confounded environments. However, as demonstrated in Figure 1a, in real-world situations for
robotic applications with partially unspecified dynamics, the transition dynamics can be influenced
by multiple confounders simultaneously, such as mass changes, damping, friction, or malfunctioning modules like a crippled leg. For example, when a transportation robot works in the wild, the load changes dynamically as the task progresses, while the humidity and roughness of the road may also vary. Moreover, some works construct a confounder set for unsupervised RL environment generalization [10-15]. RIA [16] also constructs confounder sets with multiple confounders for unsupervised dynamics generalization. Such changeable environments pose great challenges to the robot in capturing contextual information, which motivates our study.
Contribution.
In this paper, we give a theoretical analysis demonstrating that when the number of confounders increases, InfoNCE becomes a loose bound of the mutual information (MI) given the samples available from a limited set of seen environments, which is called MI underestimation [17]. To tackle this problem, we propose a DecOmposed Mutual INformation Optimization (DOMINO) framework for context learning in meta-RL. The context encoder aims to embed the past state-action pairs into
disentangled context vectors and is optimized by maximizing the mutual information between the
disentangled context vectors and historical trajectories while minimizing the state transition prediction
error. DOMINO decomposes the full MI optimization problem into a summation of $N$ smaller MI optimization problems by learning disentangled context. We then theoretically prove that DOMINO can alleviate the underestimation bias of InfoNCE and reduce the demand for samples collected in various environments [18, 19]. Last, with the learned disentangled context, we further
develop the context-aware model-based and model-free algorithms to learn the context-conditioned
policy and illustrate that DOMINO can consistently improve generalization performance in both
ways to overcome the challenge of multi-confounded dynamics.
Extensive experiments demonstrate that DOMINO benefits meta-RL on both the generalization
performance in unseen environments and sample efficiency during the training process under the
challenging multi-confounded setting. For example, as shown in Figure 1b, it achieves a 1.5 times performance improvement over T-MCL [3] in the Cheetah domain and a 2.6 times performance improvement over T-MCL in the Crippled-Ant domain. Visualization of the learned context demonstrates that the disentangled contexts generated by DOMINO under different environments can be more clearly distinguished in the embedding space, which indicates its advantage in extracting high-quality contextual information from the environment.
2 Related Work
2.1 Meta-Reinforcement Learning
Meta-RL extends the framework of meta-learning [20, 21] to reinforcement learning, aiming to learn an adaptive policy that generalizes to unseen tasks. Specifically, meta-RL methods learn the policy based on the prior knowledge discovered from various training environments and reuse the policy to fast adapt to unseen testing environments after zero or few shots. Gradient-based meta-RL algorithms [22-25] learn a model initialization and adapt the parameters with a few policy gradient updates in new dynamics. Context-based meta-RL algorithms [1-4] learn contextual
information to capture local dynamics explicitly and show great potential to tackle generalization tasks
in complicated environments. Many model-free context-based methods are proposed to learn a policy conditioned on the latent context that can adapt with off-policy data by leveraging context information and is trained by maximizing the expected return. PEARL [1] adapts to a new environment by inferring latent context variables from a small number of trajectories. Recent advanced methods further improve the quality of contextual representations by leveraging contrastive learning [5-8]. Unlike the model-free methods mentioned above, context-aware world models are proposed to learn the dynamics with confounders directly. CaDM [26] learns a global model that generalizes across tasks by training a latent context to capture the local dynamics. T-MCL [4] combines multiple-choice learning with a context-aware world model and achieves state-of-the-art results on dynamics generalization tasks. RIA [16] further extends this approach to the unsupervised setting without environment labels via intervention, and enhances context learning through MI optimization.
However, existing context-based approaches focus on learning an entangled context, in which each trajectory is encoded into only one context vector. In a multi-confounded environment, learning entangled contexts requires orders of magnitude more samples to capture accurate dynamics information. To tackle this challenge, and in contrast to RIA [16] and T-MCL [4], DOMINO infers several disentangled context vectors from a single trajectory and divides the whole MI optimization into a summation of smaller ones. The proposed decomposed MI optimization reduces the demand for diverse samples and thus improves the generalization of the policy, overcoming the adaptation problem in multi-confounded unseen environments.
2.2 Mutual Information Optimization for Representation Learning
Representation learning based on mutual information (MI) maximization has been applied in various tasks such as computer vision [27, 28], natural language processing [29, 19], and RL [30], exploiting noise-contrastive estimation (NCE) [31], InfoNCE [9], and variational objectives [32]. InfoNCE has gained recent interest over variational approaches due to its lower variance [33] and superior performance in downstream tasks. However, InfoNCE may underestimate the true MI, since the estimate is limited by the number of samples. To tackle this problem, DEMI [17] first scaffolds the total MI estimation into a sequence of smaller estimation problems. In this paper, since confounders in the real world are commonly independent, we simplify the MI decomposition and eliminate the need to learn conditional MI as a sub-term, assuming that multiple confounders are independent of each other.
3 Preliminaries
We consider the standard RL framework where an agent optimizes a specified reward function through interacting with an environment. Formally, we formulate our problem as a Markov decision process (MDP) [34], defined as a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma, \rho_0)$. Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s'|s, a)$ is the transition dynamics, $r(s, a)$ is the reward function, $\rho_0$ is the initial state distribution, and $\gamma \in [0, 1)$ is the discount factor. In order to address the problem of generalization, we further consider a distribution of MDPs, where the transition dynamics $p_{\tilde{u}}(s'|s, a)$ varies according to multiple confounders $\tilde{u} = \{u_0, u_1, \ldots, u_N\}$. The confounders can be continuous random variables, like the mass, damping, or a random disturbance force, or discrete random variables, such as whether one of the robot's legs is crippled. We assume that the true transition dynamics model is unknown, but state transition data can be sampled by taking actions in the environment. Given a set of training settings sampled from $p(\tilde{u}_{\text{train}})$, the meta-training process learns a policy $\pi(s, c)$ that adapts to the task at hand by conditioning on the embedding of the history of past transitions, which we refer to as the context $c$.
[Figure 2 schematic: the context encoder maps past $(s, a)$ pairs to disentangled context vectors $c_0, \ldots, c_N$; a prediction network predicts $\tilde{s}_{t+1}$ from $(s_t, a_t)$ and the context; decomposed MI optimization contrasts positive and negative sampled trajectories in embedding space. Confounders shown: mass, length, damping, crippled leg.]
Figure 2: The overall framework of DOMINO. The context encoder embeds the past state-action pairs into disentangled context vectors. The disentangled context vectors are learned via the decomposed mutual information optimization while minimizing the state transition prediction error. Minimize transition prediction error: with the current state-action pair and the learned context vectors, the future state can be predicted by the prediction network. The gradient of the prediction error is used to update both the context encoder and the prediction network. Decomposed MI optimization: we optimize the MI between the learned context and the historical trajectories under the same confounder setting by maximizing the InfoNCE bound, which aims to minimize the embedding distance between the positive sampled trajectories and the context, while maximizing the embedding distance between the negative sampled trajectories and the context.
At test time, the policy should adapt to the new MDP under the test setting $\tilde{u}_{\text{test}}$ drawn from $p(\tilde{u}_{\text{test}})$.
Our goal is to learn a policy that maximizes the expected return $R_{\text{train}}$ conditioned on the context $c$, which is encoded from the sequence of recent state-action pairs $\{s_\tau, a_\tau, s_{\tau+1}\}_{\tau=t-H}^{t}$ in several training scenarios, and enable it to perform well and achieve a high expected return $R_{\text{test}}$ in test scenarios never seen before:
$$\max_{\pi} \; R_{\#} = \mathbb{E}_{\tilde{u} \sim p(\tilde{u}_{\#})}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right], \quad a_t \sim \pi(s_t, c), \quad \# \in \{\text{train}, \text{test}\} \tag{1}$$
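To make this setup concrete, the following is a minimal sketch (not the authors' released code) of sampling a confounder setting $\tilde{u}$, rolling out a context-conditioned policy, and accumulating the return in Eq. (1); the particular confounder names, the gym-style environment interface, and the `policy`/`encode_context` signatures are illustrative assumptions.

```python
import numpy as np

def sample_confounders(rng):
    # Hypothetical confounder setting u~: continuous (mass, damping) and
    # discrete (which leg is crippled) factors, hidden from the agent.
    return {
        "mass_scale": rng.uniform(0.5, 1.5),
        "damping_scale": rng.uniform(0.5, 1.5),
        "crippled_leg": int(rng.integers(0, 4)),
    }

def rollout_return(env, policy, encode_context, horizon=1000, history_len=15, gamma=0.99):
    """Discounted return of a context-conditioned policy a_t ~ pi(s_t, c)."""
    history, ret = [], 0.0
    s = env.reset()
    for t in range(horizon):
        # Context c is encoded from the most recent transitions (empty at t = 0).
        c = encode_context(history[-history_len:])
        a = policy(s, c)
        s_next, r, done, _ = env.step(a)  # classic gym-style step
        history.append((s, a, s_next))
        ret += (gamma ** t) * r
        s = s_next
        if done:
            break
    return ret

# Meta-training averages this return over settings drawn from p(u_train);
# at test time the same policy is evaluated on unseen settings from p(u_test).
```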
4 Decomposed Mutual Information Optimization for Context Learning
In this section, we first provide a theoretical analysis to show why multi-confounded environments are more challenging. We find that when the number of confounders increases, InfoNCE becomes a loose bound of the MI given the samples available from a limited set of seen environments, resulting in the underestimation of the MI. To solve this problem, we develop the DOMINO framework to learn disentangled context by decomposed MI optimization. We theoretically illustrate that the decomposed MI optimization can alleviate the underestimation of the MI and reduce the demand for the number of samples. The disentangled context $c = \{c_0, c_1, \ldots, c_N\}$ is embedded by the context encoder with parameters $\phi$ from the past state-action pairs $\tau = \{s_l, a_l\}_{l=t-H}^{t-1}$ in the current episode. DOMINO explicitly maximizes the MI between the context $c$ and the historical trajectories $\mathcal{T} = \{\tau_i\}_{i=1}^{M}$ (with $\tau_i = \{s^i_l, a^i_l\}_{l=0}^{T}$) collected under the same combination of multiple confounders $\tilde{u} = \{u_0, u_1, \ldots, u_N\}$ as the current confounder setting, while minimizing the state transition prediction error conditioned on the learned context. We solve the MI optimization problem by optimizing the InfoNCE lower bound on MI [9], which can be viewed as a contrastive method for MI optimization, and decompose the full MI optimization into smaller ones to alleviate the underestimation of the mutual information and reduce the demand for the number of samples collected in various environments.
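As a concrete illustration of this interface, the sketch below shows a context encoder that maps a window of past transitions to $N$ disentangled context vectors, together with a context-conditioned prediction network whose error trains both modules. The architecture details (MLP sizes, mean pooling over the history, splitting one embedding into several heads) are our assumptions for illustration, not the exact design used in the paper.

```python
import torch
import torch.nn as nn

class DisentangledContextEncoder(nn.Module):
    """Encodes past (s, a, s') transitions into N disentangled context vectors."""
    def __init__(self, state_dim, action_dim, num_contexts=4, context_dim=8, hidden=128):
        super().__init__()
        self.num_contexts, self.context_dim = num_contexts, context_dim
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_contexts * context_dim),
        )

    def forward(self, states, actions, next_states):
        # states, actions, next_states: (batch, H, dim); pool over the history window H.
        x = torch.cat([states, actions, next_states], dim=-1)
        h = self.net(x).mean(dim=1)                              # (batch, N * context_dim)
        return h.view(-1, self.num_contexts, self.context_dim)   # (batch, N, context_dim)

class PredictionNet(nn.Module):
    """Predicts the next state from (s_t, a_t) and the concatenated context vectors."""
    def __init__(self, state_dim, action_dim, num_contexts=4, context_dim=8, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + num_contexts * context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, a, contexts):
        c = contexts.flatten(start_dim=1)
        return self.net(torch.cat([s, a, c], dim=-1))

# The transition prediction error, e.g. torch.nn.functional.mse_loss(
# pred_net(s, a, encoder(states, actions, next_states)), s_next),
# is backpropagated into both the prediction network and the context encoder.
```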
4.1 InfoNCE Bound for Mutual Information Optimization
The InfoNCE bound $I_{\mathrm{NCE}}(x;y)$ is a lower bound of the mutual information $I(x;y)$; NCE stands for Noise-Contrastive Estimation, a type of contrastive loss used for self-supervised learning. InfoNCE is obtained by comparing pairs sampled from the joint distribution $x, y_1 \sim p(x, y)$ ($y_1$ is called the positive example) to pairs $(x, y_k)$ built using a set of negative examples $y_{2:K} \sim p(y_{2:K}) = \prod_{k=2}^{K} p(y_k)$:
$$I(x;y) \geq I_{\mathrm{NCE}}(x;y \mid \psi, K) = \mathbb{E}\!\left[\log \frac{e^{\psi(x, y_1)}}{\frac{1}{K}\sum_{k=1}^{K} e^{\psi(x, y_k)}}\right] \tag{2}$$
where $\psi$ is a function assigning a similarity score to $(x, y)$ pairs and $K$ denotes the number of samples. By discriminating the naturally paired positive instances from randomly paired negative instances, InfoNCE has been shown to bring universal performance gains in various domains, such as computer vision and natural language processing.
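For readers who prefer code, here is a minimal sketch of the InfoNCE estimator in Eq. (2) with a bilinear score $\psi(x, y) = x^{\top} W y$; the choice of score function and the convention of treating the other samples in a batch as negatives are common practice rather than details prescribed by this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCE(nn.Module):
    """InfoNCE lower bound on I(x; y) with a bilinear score psi(x, y) = x^T W y."""
    def __init__(self, x_dim, y_dim):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(x_dim, y_dim))

    def forward(self, x, y):
        # x: (K, x_dim), y: (K, y_dim); (x_i, y_i) are positive pairs and
        # (x_i, y_j) with j != i act as the K - 1 negative pairs.
        scores = x @ self.W @ y.t()                      # (K, K) matrix of psi(x_i, y_j)
        labels = torch.arange(x.size(0), device=x.device)
        # Row-wise log-softmax recovers log e^{psi(x, y_pos)} / sum_k e^{psi(x, y_k)};
        # the constant log K offset in Eq. (2) is dropped, as is standard.
        return -F.cross_entropy(scores, labels)          # estimate to be maximized

# Usage: maximize InfoNCE(ctx_dim, traj_dim)(context_emb, trajectory_emb),
# i.e. use its negative as the training loss.
```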
Lemma 1 Since $I_{\mathrm{NCE}}(X;Y \mid K) \leq \log K$, the condition $I(x;y) \leq \log K$ is necessary for $I_{\mathrm{NCE}}(X;Y \mid K)$ to be a tight bound of $I(x;y)$. (See proof in Appendix A.)
Some previous context-aware methods learn an entangled context $c$ by maximizing the mutual information between the context $c$, embedded from the past state-action pairs in the current episode, and the historical trajectories $\mathcal{T}$ collected under the same confounder setting as the current episode. They solve this problem by maximizing the InfoNCE lower bound $I_{\mathrm{NCE}}(c;\mathcal{T})$, which can be viewed as a contrastive estimation [9] of $I(c;\mathcal{T})$, and obtain promising improvements in single-confounded environments. However, according to Lemma 1, $I_{\mathrm{NCE}}(c;\mathcal{T})$ may be loose if the true mutual information $I(c;\mathcal{T})$ is larger than $\log K$, which is called underestimation of the mutual information. Therefore, for the InfoNCE bound to be a tight bound of $I(c;\mathcal{T})$, the minimum number of samples is $e^{I(c;\mathcal{T})}$. In real-world robotic control tasks, the dynamics of the robot are commonly influenced by multiple confounders $\tilde{u} = \{u_0, \ldots, u_i, \ldots, u_j, \ldots, u_N\}$ simultaneously. Under the assumption that the confounders are independent (such as mass and damping), the mutual information between the historical trajectories $\mathcal{T}$ and the context $c$ can be derived as
$$I(c;\mathcal{T}) = \mathbb{E}_{p(\tau, c)}\!\left[\log \frac{p(\mathcal{T} \mid c)}{p(\mathcal{T})}\right] = \mathbb{E}_{p(\tau, c)}\!\left[\log \frac{\int p(\mathcal{T} \mid \tilde{u})\, p(\tilde{u} \mid c)\, \mathrm{d}\tilde{u}}{p(\mathcal{T})}\right] \geq \mathbb{E}_{p(\tau, c)\, p(\tilde{u} \mid c)}\!\left[\log \frac{p(\mathcal{T} \mid \tilde{u})}{p(\mathcal{T})}\right] = I(\tilde{u};\mathcal{T}) \overset{u_i \perp u_j}{=} \sum_{i=0}^{N} I(u_i;\mathcal{T}) \tag{3}$$
As the number of confounders increases, the lower bound of $I(c;\mathcal{T})$ becomes larger, and the necessary condition for $I_{\mathrm{NCE}}(\mathcal{T}; c \mid K)$ to be a tight bound of $I(c;\mathcal{T})$ becomes harder to satisfy. Since $I(c;\mathcal{T}) \geq \sum_{i=0}^{N} I(u_i;\mathcal{T})$, for the necessary condition to hold, the number of samples $K$ must be larger than $e^{\sum_{i=0}^{N} I(u_i;\mathcal{T})}$ according to Lemma 1, so the demand for data increases significantly. Since the confounders are commonly independent in the real world, can we relax this condition by learning disentangled context vectors instead of an entangled context?
4.2 Decomposed MI Optimization
If the context vectors $c = \{c_0, c_1, \ldots, c_N\}$ can be independent, then we can ease this problem by applying the chain rule of MI to decompose the total MI into a sum of smaller MI terms, i.e.,
$$I_{\mathrm{NCE}}(c;\mathcal{T} \mid K) = \sum_{i=0}^{N} I_{\mathrm{NCE}}(c_i;\mathcal{T} \mid K) \leq N \log K \tag{4}$$
Theorem 1 If the context vectors $\{c_0, c_1, \ldots, c_N\}$ can be independent, then the necessary condition for $I_{\mathrm{NCE}}(c;\mathcal{T})$ to be a tight bound can be relaxed to $I(c;\mathcal{T}) \leq N \log K = \log K^{N}$. Thus, the required number of samples can be reduced from $K \geq e^{I(c;\mathcal{T})}$ to $K \geq e^{\frac{1}{N} I(c;\mathcal{T})}$.
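For intuition, consider a back-of-the-envelope example; the mutual-information values below are assumed purely for illustration. Suppose there are four independent confounders, each contributing roughly $2$ nats, so that $I(c;\mathcal{T}) \approx \sum_{i} I(u_i;\mathcal{T}) = 8$ nats. Then
$$\text{entangled: } K \geq e^{I(c;\mathcal{T})} = e^{8} \approx 2981, \qquad \text{disentangled } (N = 4): \; K \geq e^{I(c;\mathcal{T})/N} = e^{2} \approx 8,$$
so the decomposition reduces the number of samples needed per contrastive batch by more than two orders of magnitude.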
Inspired by Theorem 1, we learn disentangled context vectors and maximize the mutual information between the historical trajectories $\mathcal{T}$ and the context vectors $\{c_0, \ldots, c_N\}$ while minimizing the InfoNCE between the context vectors, i.e., we maximize $\mathcal{L}_{\mathrm{NCE}}$:
$$\mathcal{L}_{\mathrm{NCE}}(\phi, w) = \sum_{i=0}^{N} I_{\mathrm{NCE}}(c_i;\mathcal{T}) - \sum_{j=0}^{N} \sum_{i=0,\, i \neq j}^{N} I_{\mathrm{NCE}}(c_i;c_j) \tag{5}$$
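A minimal sketch of how the objective in Eq. (5) might be assembled from the InfoNCE estimator sketched in Section 4.1 is given below; reusing that bilinear estimator and the specific way positives and negatives are paired are our assumptions rather than the paper's exact implementation.

```python
def decomposed_mi_loss(info_nce_ct, info_nce_cc, contexts, traj_embeddings):
    """Negative of L_NCE in Eq. (5), to be minimized by gradient descent.

    contexts:        (K, N, context_dim) disentangled context vectors.
    traj_embeddings: (K, traj_dim) embeddings of historical trajectories T, aligned so
                     that row k shares the confounder setting of contexts[k].
    info_nce_ct / info_nce_cc: InfoNCE estimators (see the earlier sketch) for
                     context-vs-trajectory and context-vs-context pairs.
    """
    num_contexts = contexts.size(1)

    # Maximize sum_i I_NCE(c_i; T): each context slot is contrasted with the trajectories.
    mi_context_traj = sum(
        info_nce_ct(contexts[:, i, :], traj_embeddings) for i in range(num_contexts)
    )

    # Minimize sum_{i != j} I_NCE(c_i; c_j): push the context slots toward independence.
    mi_between_contexts = sum(
        info_nce_cc(contexts[:, i, :], contexts[:, j, :])
        for i in range(num_contexts) for j in range(num_contexts) if i != j
    )

    # L_NCE = sum_i I_NCE(c_i; T) - sum_{i != j} I_NCE(c_i; c_j)
    return -(mi_context_traj - mi_between_contexts)
```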