2 Related Work
2.1 Meta-Reinforcement Learning
Meta-RL extends the framework of meta-learning [20, 21] to reinforcement learning, aiming to
learn an adaptive policy that can generalize to unseen tasks. Specifically, meta-RL methods
learn the policy from prior knowledge discovered across various training environments and
reuse it to adapt quickly to unseen testing environments after zero or few shots. Gradient-based
meta-RL algorithms [22–25] learn a model initialization and adapt the parameters with a few
policy gradient updates in new dynamics. Context-based meta-RL algorithms [1–4] learn contextual
information to capture local dynamics explicitly and show great potential for generalization in
complicated environments. Many model-free context-based methods learn a policy conditioned on a
latent context; the policy is trained to maximize the expected return and can adapt with off-policy
data by leveraging the context information. PEARL [1] adapts to a new environment by
inferring latent context variables from a small number of trajectories. More recent methods
further improve the quality of the contextual representation by leveraging contrastive learning [5–8]. Unlike
the model-free methods mentioned above, context-aware world models learn the
dynamics with confounders directly. CaDM [26] learns a global model that generalizes across tasks by
training a latent context to capture the local dynamics. T-MCL [4] combines multiple-choice learning
with a context-aware world model and achieves state-of-the-art results on dynamics generalization
tasks. RIA [16] further extends this approach to the unsupervised setting without environment labels
via intervention, and enhances context learning through MI optimization.
However, existing context-based approaches focus on learning an entangled context, in which each
trajectory is encoded into a single context vector. In a multi-confounded environment, learning
entangled contexts requires orders of magnitude more samples to capture accurate dynamics
information. To tackle this challenge, and in contrast to RIA [16] and T-MCL [4], DOMINO infers several
disentangled context vectors from a single trajectory and divides the whole MI optimization into
a summation of smaller ones. The proposed decomposed MI optimization reduces the demand for
diverse samples and thus improves the generalization of the policy, overcoming the
adaptation problem in multi-confounded unseen environments.
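To make the decomposition concrete, a brief sketch in our notation (not necessarily the paper's exact objective): let $\tau$ denote a trajectory and $c = (c_1, \ldots, c_N)$ the disentangled context vectors. If the $c_i$ are mutually independent, the chain rule of mutual information gives
\[
I(c_1, \ldots, c_N; \tau) \,=\, \sum_{i=1}^{N} I\!\left(c_i; \tau \mid c_{1:i-1}\right) \,\ge\, \sum_{i=1}^{N} I(c_i; \tau),
\]
so maximizing the sum of $N$ smaller MI terms lower-bounds the entangled objective, while each term can be estimated from fewer samples.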
2.2 Mutual Information Optimization for Representation Learning
Representation learning based on mutual information (MI) maximization has been applied in various
tasks such as computer vision [27, 28], natural language processing [29, 19], and RL [30], exploiting
noise-contrastive estimation (NCE) [31], InfoNCE [9], and variational objectives [32]. InfoNCE
has gained recent interest over variational approaches due to its lower variance [33] and
superior performance in downstream tasks. However, InfoNCE may underestimate the true MI,
since its estimate is limited by the number of samples. To tackle this problem, DEMI [17] scaffolds
the total MI estimation into a sequence of smaller estimation problems. In this paper, because
confounders in the real world are commonly independent, we assume that the multiple confounders
are independent of each other; this simplifies the MI decomposition and eliminates the need to learn
conditional mutual information as a sub-term.
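To see why the decomposition helps estimation, a sketch in our notation: with a critic $f$ and $K$ samples (including the positive pair), the standard InfoNCE estimate lower-bounds $I(X;Y)$ but is itself capped at $\log K$,
\[
\hat{I}_{\mathrm{InfoNCE}} \,=\, \mathbb{E}\!\left[\log \frac{f(x, y)}{\tfrac{1}{K} \sum_{k=1}^{K} f(x, y_k)}\right] \,\le\, \log K,
\]
so any MI larger than $\log K$ is necessarily underestimated. Whereas DEMI recovers the total MI through a sequence of conditional sub-terms, the independence assumption lets each confounder-specific term $I(c_i; \tau)$ be estimated on its own; each such term is smaller than the total $I(c; \tau)$ and therefore far less likely to exceed the $\log K$ ceiling.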
3 Preliminaries
We consider the standard RL framework in which an agent optimizes a specified reward function by
interacting with an environment. Formally, we formulate our problem as a Markov decision process
(MDP) [34], defined as a tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma, \rho_0)$. Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action
space, $p(s' \mid s, a)$ is the transition dynamics, $r(s, a)$ is the reward function, $\rho_0$ is the initial state
distribution, and $\gamma \in [0, 1)$ is the discount factor. To address the problem of generalization,
we further consider a distribution of MDPs, where the transition dynamics $p_{\tilde{u}}(s' \mid s, a)$ vary
according to multiple confounders $\tilde{u} = \{u_0, u_1, \ldots, u_N\}$. The confounders can be continuous
random variables, such as the mass, damping, or a random disturbance force, or discrete random variables,
such as which of the robot's legs is crippled.
We assume that the true transition dynamics model is unknown, but state transition data can be
sampled by taking actions in the environment. Given a set of training settings sampled from
$p(\tilde{u}_{\text{train}})$, the meta-training process learns a policy $\pi(s, c)$ that
adapts to the task at hand by conditioning on the embedding $c$ of the history of past transitions, which