
Multi-task RL. Learning a suite of tasks with an RL agent has been studied under different frameworks [3, 44], such as Latent MDP [45], Multi-model MDP [5], Contextual MDP [46], and Hidden Parameter MDP [47], among others [48]. Our proposed HLMDP builds on the Latent MDP [45], which contains a finite number of MDPs, each accompanied by a weight. In contrast to Latent MDP, which uses a flat structure to model each MDP's probability, HLMDP leverages a richer hierarchical model that clusters MDPs into a finite number of mixtures. In addition, HLMDP is a special yet important subclass of POMDP [49]: it treats the latent task mixture that the current environment belongs to as the unobservable variable. HLMDP resembles the recently proposed Hierarchical Bayesian Bandit model [50] but focuses on the more complex MDP setting.
3 Preliminary
This section introduces Latent MDP and the adaptive belief
setting, both serving as building blocks for our proposed
HLMDP (Section 4) and GDR-MDP (Section 5).
Latent MDP. An episodic Latent MDP [45] is specified by a tuple $(\mathcal{M}, T, \mathcal{S}, \mathcal{A}, \mu)$. $\mathcal{M}$ is a set of MDPs with cardinality $|\mathcal{M}| = M$. Here $T$, $\mathcal{S}$, and $\mathcal{A}$ are the shared episode length (planning horizon), state space, and action space, respectively. $\mu$ is a categorical distribution over MDPs with $\sum_{m=1}^{M} \mu(m) = 1$. Each MDP $\mathcal{M}_m \in \mathcal{M}$, $m \in [M]$, is a tuple $(T, \mathcal{S}, \mathcal{A}, P_m, R_m, \nu_m)$, where $P_m$ is the transition probability, $R_m$ is the reward function, and $\nu_m$ is the initial state distribution.
Latent MDP assumes that at the beginning of each episode, one MDP from the set $\mathcal{M}$ is sampled according to $\mu(m)$. It aims to find a policy $\pi$ that maximizes the expected cumulative return by solving $\max_{\pi} \sum_{m=1}^{M} \mu(m)\, \mathbb{E}^{\pi}_{m}\!\left[\sum_{t=1}^{T} r_t\right]$, where $\mathbb{E}_m[\cdot]$ denotes $\mathbb{E}_{P_m, R_m}[\cdot]$.
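For concreteness, the following is a minimal Python sketch of this objective for a fixed policy: one MDP is sampled per episode from $\mu$ and the return is estimated by Monte Carlo. It is only an illustration; the environment interface (`reset`/`step` returning `(s, r, done)`) and the `policy` callable are assumptions, not part of the formal framework.

```python
import numpy as np

def latent_mdp_return(envs, mu, policy, T, num_episodes=1000, rng=None):
    """Monte Carlo estimate of sum_m mu(m) * E_m^pi [ sum_{t=1}^T r_t ].

    envs   : list of M environments with reset() -> s and step(a) -> (s, r, done)
    mu     : length-M array of MDP weights, summing to 1
    policy : function mapping a state to an action
    """
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(num_episodes):
        m = rng.choice(len(envs), p=mu)      # sample one MDP per episode
        env = envs[m]
        s, ep_return = env.reset(), 0.0
        for _ in range(T):
            a = policy(s)
            s, r, done = env.step(a)
            ep_return += r
            if done:
                break
        total += ep_return
    return total / num_episodes
```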
The Adaptive Belief Setting. In general, a belief distribution assigns a probability to each possible MDP that the current environment may belong to. The adaptive belief setting [5] maintains a belief distribution that is dynamically updated with streaming observed interactions and prior knowledge about the MDPs. In practice, prior knowledge may be acquired by rule-based policies or data-driven learning methods. For example, it is possible to pre-train in simulated complete-information scenarios or to exploit unsupervised learning methods on data collected online [51]. There also exist multiple choices for updating the belief, such as applying Bayes' rule as in POMDPs [49] or representing beliefs with deep recurrent neural networks [52].
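As one illustration of the recurrent-network option, the sketch below (our own illustration, not the architecture of [52]) encodes the interaction history with a GRU and outputs a categorical belief through a softmax head; the dimensions and class name are assumptions.

```python
import torch
import torch.nn as nn

class RecurrentBelief(nn.Module):
    """Maps a history of (state, action, reward) tuples to a belief over M MDPs."""

    def __init__(self, obs_dim, act_dim, num_mdps, hidden_dim=64):
        super().__init__()
        # Each timestep's input concatenates state, action, and scalar reward.
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_mdps)

    def forward(self, history):
        # history: tensor of shape (batch, time, obs_dim + act_dim + 1)
        _, h = self.gru(history)
        return torch.softmax(self.head(h[-1]), dim=-1)  # belief b(m), sums to 1
```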
4 Hierarchical Latent MDP
In realistic settings, tasks share similarities, and task subpopulations are common. Although different MDP formulations have been proposed to solve multi-task RL, task relationships are generally overlooked. To fill this gap, we first propose the Hierarchical Latent MDP (HLMDP), which utilizes a hierarchical mixture model to represent distributions over MDPs. Moreover, we consider the adaptive belief setting to leverage prior information about tasks.
Definition 1 (Hierarchical Latent MDPs). An episodic HLMDP is defined by a tuple $(\mathcal{Z}, \mathcal{M}, T, \mathcal{S}, \mathcal{A}, w)$. $\mathcal{Z}$ denotes a set of Latent MDPs with $|\mathcal{Z}| = Z$. $\mathcal{M}$ is a set of MDPs with cardinality $|\mathcal{M}| = M$ shared by the different Latent MDPs. $T$, $\mathcal{S}$, and $\mathcal{A}$ are the shared episode length (planning horizon), state space, and action space, respectively. Each Latent MDP $\mathcal{Z}_z \in \mathcal{Z}$, $z \in [Z]$, consists of the set of joint MDPs $\{\mathcal{M}_m\}_{m=1}^{M}$ and their weights $\mu_z$ satisfying $\sum_{m=1}^{M} \mu_z(m) = 1$. $w$ is the categorical distribution over Latent MDPs with $\sum_{z=1}^{Z} w(z) = 1$.
We provide a graphical model of HLMDP in Figure 1(c). HLMDP assumes that at the beginning of each episode, the environment first samples a Latent MDP $z \sim w(z)$ and then samples an MDP $m \sim \mu_z(m)$. HLMDP encodes task similarity information via the mixture model and thus contains richer task information than the Latent MDP proposed in [45]. For instance, we can always recover one corresponding Latent MDP from each HLMDP by marginalizing over the mixtures, whereas there may exist infinitely many corresponding HLMDPs for a given Latent MDP.
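A minimal sketch of this generative process and of the marginalization argument (our own illustration; the toy weights are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical HLMDP with Z = 2 mixtures over M = 3 shared MDPs.
w = np.array([0.6, 0.4])                 # w(z): distribution over Latent MDPs
mu = np.array([[0.7, 0.2, 0.1],          # mu_z(m): per-mixture MDP weights
               [0.1, 0.3, 0.6]])

def sample_episode_task():
    """Hierarchical sampling at the start of each episode: z ~ w, then m ~ mu_z."""
    z = rng.choice(len(w), p=w)
    m = rng.choice(mu.shape[1], p=mu[z])
    return z, m

# Marginalizing over mixtures recovers an equivalent flat Latent MDP:
# mu_flat(m) = sum_z w(z) * mu_z(m)
mu_flat = w @ mu
print(mu_flat)  # [0.46 0.24 0.3 ]
```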
HLMDP in Adaptive Belief Setting. When solving multi-task RL problems, the adaptive setting has been shown to yield higher-performing policies than the non-adaptive one [5], since it leverages prior knowledge about the transition model as well as data collected online in the unseen environment. Hence we are motivated to formulate HLMDP in the adaptive belief setting. HLMDP maintains a belief distribution $b(z)$ over task groups to model the probability that the current environment belongs to each group $z$. At the beginning of each episode, we initialize the belief with a uniform distribution $b_0$. We use Bayes' rule to update beliefs based on interactions and a prior knowledge base. Note that the knowledge base may not be accurate enough and may therefore lead to inaccurate belief updates. At timestep $t$, we obtain the next belief estimate $b_{t+1}$ with the state estimation function $\mathrm{SE}$:
\begin{equation}
b_{t+1}(j) = \mathrm{SE}(b_t, s_t)(j) = \frac{b_t(j)\, L(j)}{\sum_{i \in [Z]} b_t(i)\, L(i)}, \quad \forall j \in [Z],
\end{equation}
where $L(j)$ denotes the likelihood of the new observation $s_t$ under Latent MDP $j$.
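A minimal NumPy sketch of this belief update (an illustration only; the likelihood values stand in for the prior knowledge base):

```python
import numpy as np

def belief_update(b_t, likelihoods):
    """One step of Eq. (1): b_{t+1}(j) is proportional to b_t(j) * L(j), renormalized.

    b_t         : length-Z array, current belief over Latent MDPs
    likelihoods : length-Z array, L(j) = likelihood of the new observation under group j
    """
    unnormalized = b_t * likelihoods
    return unnormalized / unnormalized.sum()

# Usage: uniform initial belief over Z = 3 groups, hypothetical likelihoods.
b0 = np.ones(3) / 3
print(belief_update(b0, np.array([0.5, 0.2, 0.1])))  # -> [0.625 0.25  0.125]
```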
Under the adaptive belief setting, HLMDP aims to find an optimal policy $\bar{\pi}^\star$ within a history-dependent policy class $\Pi$ that maximizes the discounted expected cumulative reward as in Equation 2. Following standard POMDP notation, we denote the history at time $t$ as $h_t = (s_0, a_0, s_1, \ldots, s_{t-1}, a_{t-1}, s_t) \in \mathcal{H}_t$, containing state-action pairs $(s, a)$. At timestep $t$, we use both the observed state $s_t$ and the inferred belief distribution $b_t(z)$ as the sufficient statistics for the history $h_t$.
\begin{equation}
\bar{V}^\star = \max_{\pi \in \Pi} \; \mathbb{E}_{b_{0:T}(z)}\, \mathbb{E}_{\mu_z(m)}\, \mathbb{E}^{\pi}_{m}\!\left[\sum_{t=1}^{T} \gamma^t r_t\right].
\end{equation}
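To tie Equations 1 and 2 together, the sketch below (our own illustration) rolls out a single episode in the adaptive belief setting: the policy conditions on $(s_t, b_t)$, the belief is updated with Eq. (1) at every step, and discounted rewards are accumulated as in Eq. (2). The `likelihood_fn` argument is an assumed stand-in for the prior knowledge base, and the environment interface is hypothetical.

```python
import numpy as np

def adaptive_rollout(env, policy, likelihood_fn, Z, T, gamma):
    """One episode in the adaptive belief setting.

    policy        : function (state, belief) -> action
    likelihood_fn : function (s, a, s_next) -> length-Z array of likelihoods L(j)
    """
    b = np.ones(Z) / Z                      # uniform initial belief b_0
    s, ep_return = env.reset(), 0.0
    for t in range(1, T + 1):
        a = policy(s, b)                    # history summarized by (s_t, b_t)
        s_next, r, done = env.step(a)
        ep_return += (gamma ** t) * r       # discounted return as in Eq. (2)
        L = likelihood_fn(s, a, s_next)     # likelihood of the new observation per group
        b = b * L / np.sum(b * L)           # belief update from Eq. (1)
        s = s_next
        if done:
            break
    return ep_return, b
```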