2 Related Work
2.1 Meta-Reinforcement Learning
Meta-RL extends the framework of meta-learning [20, 21] to reinforcement learning, aiming to
learn an adaptive policy that can generalize to unseen tasks. Specifically, meta-RL methods
learn the policy from prior knowledge discovered across various training environments and
reuse it to adapt quickly to unseen testing environments after zero or few shots. Gradient-based
meta-RL algorithms [22–25] learn a model initialization and adapt the parameters with a few
policy gradient updates in new dynamics. Context-based meta-RL algorithms [1–4] learn contextual
information to capture local dynamics explicitly and show great potential for generalization in
complicated environments. Many model-free context-based methods learn a policy conditioned on a
latent context; the policy is trained to maximize the expected return and can adapt with off-policy
data by leveraging the context information. PEARL [1] adapts to a new environment by
inferring latent context variables from a small number of trajectories. More recent methods
further improve the quality of the contextual representation by leveraging contrastive learning [5–8]. Unlike
the model-free methods mentioned above, context-aware world models learn the
dynamics with confounders directly. CaDM [26] learns a global model that generalizes across tasks by
training a latent context to capture the local dynamics. T-MCL [4] combines multiple-choice learning
with a context-aware world model and achieves state-of-the-art results on dynamics generalization
tasks. RIA [16] further extends this approach to the unsupervised setting without environment labels
via intervention, and enhances context learning through MI optimization.
However, existing context-based approaches focus on learning an entangled context, in which each
trajectory is encoded into a single context vector. In a multi-confounded environment, learning
entangled contexts requires orders of magnitude more samples to capture accurate dynamics
information. To tackle this challenge, and in contrast to RIA [16] and T-MCL [4], DOMINO infers several
disentangled context vectors from a single trajectory and divides the whole MI optimization into
a summation of smaller ones. The proposed decomposed MI optimization reduces the demand for
diverse samples and thus improves the generalization of the policy, overcoming the
adaptation problem in multi-confounded unseen environments.
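To make the decomposition concrete, a brief sketch in our notation (not necessarily the paper's exact objective): let $\tau$ denote a trajectory and $c = (c_1, \ldots, c_N)$ the disentangled context vectors. If the $c_i$ are mutually independent, the chain rule of mutual information gives
\[
I(c_1, \ldots, c_N; \tau) \,=\, \sum_{i=1}^{N} I\!\left(c_i; \tau \mid c_{1:i-1}\right) \,\ge\, \sum_{i=1}^{N} I(c_i; \tau),
\]
so maximizing the sum of $N$ smaller MI terms lower-bounds the entangled objective, while each term can be estimated from fewer samples.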
2.2 Mutual Information Optimization for Representation Learning
Representation learning based on mutual information (MI) maximization has been applied in various
tasks such as computer vision [27, 28], natural language processing [29, 19], and RL [30], exploiting
noise-contrastive estimation (NCE) [31], InfoNCE [9], and variational objectives [32]. InfoNCE
has gained recent interest over variational approaches due to its lower variance [33] and
superior performance in downstream tasks. However, InfoNCE may underestimate the true MI,
since its estimate is limited by the number of samples. To tackle this problem, DEMI [17] scaffolds
the total MI estimation into a sequence of smaller estimation problems. In this paper, because
confounders in the real world are commonly independent, we assume that the multiple confounders
are independent of each other; this simplifies the MI decomposition and eliminates the need to learn
conditional mutual information as a sub-term.
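To see why the decomposition helps estimation, a sketch in our notation: with a critic $f$ and $K$ samples (including the positive pair), the standard InfoNCE estimate lower-bounds $I(X;Y)$ but is itself capped at $\log K$,
\[
\hat{I}_{\mathrm{InfoNCE}} \,=\, \mathbb{E}\!\left[\log \frac{f(x, y)}{\tfrac{1}{K} \sum_{k=1}^{K} f(x, y_k)}\right] \,\le\, \log K,
\]
so any MI larger than $\log K$ is necessarily underestimated. Whereas DEMI recovers the total MI through a sequence of conditional sub-terms, the independence assumption lets each confounder-specific term $I(c_i; \tau)$ be estimated on its own; each such term is smaller than the total $I(c; \tau)$ and therefore far less likely to exceed the $\log K$ ceiling.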
3 Preliminaries
We consider the standard RL framework in which an agent optimizes a specified reward function by
interacting with an environment. Formally, we formulate our problem as a Markov decision process
(MDP) [34], defined as a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma, \rho_0)$. Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action
space, $p(s' \mid s, a)$ is the transition dynamics, $r(s, a)$ is the reward function, $\rho_0$ is the initial state
distribution, and $\gamma \in [0, 1)$ is the discount factor. To address the problem of generalization,
we further consider a distribution of MDPs, where the transition dynamics $p_{\tilde{u}}(s' \mid s, a)$ vary
according to multiple confounders $\tilde{u} = \{u_0, u_1, \ldots, u_N\}$. The confounders can be continuous
random variables, such as the mass, damping, or a random disturbance force, or discrete random variables,
such as which of the robot's legs is crippled.
We assume that the true transition dynamics model is unknown, but state transition data can be
sampled by taking actions in the environment. Given a set of training settings sampled from
$p(\tilde{u}_{\text{train}})$, the meta-training process learns a policy $\pi(s, c)$ that
adapts to the task at hand by conditioning on the embedding $c$ of the history of past transitions, which