
icy for deterministic MDPs. However, their work differs from ours since they study a restricted setting under structural assumptions on the environments, such as all MDPs sharing a common optimal policy, and their algorithm requires access to a query model. Our paper is also related to recent works studying multi-task learning in RL (Brunskill & Li, 2013; Tirinzoni et al., 2020; Hu et al., 2021; Zhang & Wang, 2021; Lu et al., 2021), which study how to transfer the knowledge learned from previous tasks to new tasks. Their problem formulation is different from ours since they consider the multi-task setting where the MDP is selected from a given MDP set without an underlying probability distribution. In addition, they typically assume that all the tasks have similar transition dynamics or share common representations.
Provably Efficient Exploration in RL. Recent years have witnessed many theoretical results studying provably efficient exploration in RL (Osband et al., 2013; Azar et al., 2017; Osband & Van Roy, 2017; Jin et al., 2018; 2020b; Wang et al., 2020; Zhang et al., 2021), with the minimax regret for tabular MDPs with non-stationary transition being $\tilde{O}(\sqrt{HSAK})$. These results indicate that polynomial dependence on the whole state-action space is unavoidable without additional assumptions. Their formulation corresponds to the single-task setting where the agent only interacts with a single environment aiming to maximize its cumulative rewards without pre-training. The regret defined in the fine-tuning setting coincides with the concept of Bayesian regret in the previous literature (Osband et al., 2013; Osband & Van Roy, 2017; O'Donoghue, 2021). The best-known Bayesian regret for tabular RL is $\tilde{O}(\sqrt{HSAK})$ when applied to our setting (O'Donoghue, 2021).
Latent MDPs. This work is also related to previous results on latent MDPs (Kwon et al., 2021b;a), which are a special class of Partially Observable MDPs (Azizzadenesheli et al., 2016; Guo et al., 2016; Jin et al., 2020a). In latent MDPs, the MDP representing the dynamics of the environment is sampled from an unknown distribution at the start of each episode. Their work is different from ours since they focus on tackling the challenge of partial observability.
3. Preliminary and Framework
Notations. Throughout the paper, we use $[N]$ to denote the set $\{1, \cdots, N\}$ where $N \in \mathbb{N}_+$. For an event $E$, let $\mathbb{I}[E]$ be the indicator function of event $E$, i.e., $\mathbb{I}[E] = 1$ if and only if $E$ is true. For any domain $\Omega$, we use $C(\Omega)$ to denote the set of continuous functions on $\Omega$. We use $O(\cdot)$ to denote the standard big-$O$ notation, and $\tilde{O}(\cdot)$ to denote the big-$O$ notation with logarithmic factors omitted.
3.1. Episodic MDPs
An episodic MDP $\mathcal{M}$ is specified as a tuple $(\mathcal{S}, \mathcal{A}, P_{\mathcal{M}}, R_{\mathcal{M}}, H)$, where $\mathcal{S}, \mathcal{A}$ are the state and action space with cardinality $S$ and $A$ respectively, and $H$ is the number of steps in one episode. $P_{\mathcal{M},h} : \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathcal{S})$ is the transition kernel such that $P_{\mathcal{M},h}(s' \mid s, a)$ denotes the probability of transiting to state $s'$ if action $a$ is taken in state $s$ at step $h$. $R_{\mathcal{M},h} : \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathbb{R})$ is the reward function such that $R_{\mathcal{M},h}(s, a)$ is the distribution of the reward, with non-negative mean $r_{\mathcal{M},h}(s, a)$, when action $a$ is taken in state $s$ at step $h$. In order to compare with traditional generalization, we make the following assumption:
Assumption 3.1. The total mean reward is bounded by 1, i.e., $\forall \mathcal{M} \in \Omega$, $\sum_{h=1}^{H} r_{\mathcal{M},h}(s_h, a_h) \le 1$ for all trajectories $(s_1, a_1, \cdots, s_H, a_H)$ with positive probability in $\mathcal{M}$; the reward mechanism $R_{\mathcal{M}}(s, a)$ is 1-subgaussian, i.e.,
$$\mathbb{E}_{X \sim R_{\mathcal{M},h}(s,a)}\big[\exp\big(\lambda [X - r_{\mathcal{M},h}(s, a)]\big)\big] \le \exp\Big(\frac{\lambda^2}{2}\Big)$$
for all $\lambda \in \mathbb{R}$.
The total reward assumption follows previous works on horizon-free RL (Ren et al., 2021; Zhang et al., 2021; Li et al., 2022) and covers the traditional setting where $r_{\mathcal{M},h}(s, a) \in [0, 1]$ by scaling with $H$; it is also more natural in environments with sparse rewards (Vecerik et al., 2017; Riedmiller et al., 2018). In addition, it allows us to compare with supervised learning bounds, where $H = 1$ and the loss is bounded in $[0, 1]$. The subgaussian assumption is common in practice and is widely used in bandits (Lattimore & Szepesvári, 2020). It also covers the traditional RL setting where $R_{\mathcal{M},h}(s, a) \in \Delta([0, 1])$, and allows us to study MDP environments with a wider range of rewards. For convenience of exposition, we assume the agent always starts from the same state $s_1$. It is straightforward to recover an initial state distribution $\mu$ from this setting by adding an initial state $s_0$ with transition $\mu$ (Du et al., 2019; Chen et al., 2021).
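To make the reduction concrete, the following short derivation is our own illustration (it is not part of the original argument and assumes the rewards of the traditional setting are rescaled by $1/H$, i.e., $\tilde{R}_{\mathcal{M},h} = R_{\mathcal{M},h}/H$). If $r_{\mathcal{M},h}(s, a) \in [0, 1]$, then
$$\sum_{h=1}^{H} \tilde{r}_{\mathcal{M},h}(s_h, a_h) = \frac{1}{H} \sum_{h=1}^{H} r_{\mathcal{M},h}(s_h, a_h) \le 1,$$
and by Hoeffding's lemma any random variable $X$ supported on $[0, 1]$ with mean $m$ satisfies
$$\mathbb{E}\big[\exp\big(\lambda (X - m)\big)\big] \le \exp\Big(\frac{\lambda^2}{8}\Big) \le \exp\Big(\frac{\lambda^2}{2}\Big) \quad \text{for all } \lambda \in \mathbb{R},$$
so every reward distribution in $\Delta([0, 1])$ is 1-subgaussian, matching the two conditions of Assumption 3.1.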
Policy and Value Function. A policy $\pi$ is a set of $H$ functions, each mapping a state to an action distribution, i.e., $\pi = \{\pi_h\}_{h=1}^{H}$ with $\pi_h : \mathcal{S} \mapsto \Delta(\mathcal{A})$, and $\pi$ can be stochastic. We denote the set of all policies described above as $\Pi$. We define $N^{\Pi}_{\epsilon}$ as the $\epsilon$-covering number of the policy space $\Pi$ w.r.t. the distance $d(\pi^1, \pi^2) = \max_{s \in \mathcal{S}, h \in [H]} \|\pi^1_h(\cdot \mid s) - \pi^2_h(\cdot \mid s)\|_1$. Given $\pi$ and $h \in [H]$, we define the Q-function $Q^{\pi}_{\mathcal{M},h} : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}_+$, where
$$Q^{\pi}_{\mathcal{M},h}(s, a) = r_{\mathcal{M},h}(s, a) + \sum_{s' \in \mathcal{S}} P_{\mathcal{M},h}(s' \mid s, a) \, V^{\pi}_{\mathcal{M},h+1}(s'),$$
and the V-function $V^{\pi}_{\mathcal{M},h} : \mathcal{S} \mapsto \mathbb{R}_+$, where
$$V^{\pi}_{\mathcal{M},h}(s) = \mathbb{E}_{a \sim \pi_h(\cdot \mid s)} \, Q^{\pi}_{\mathcal{M},h}(s, a).$$
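For concreteness, $Q^{\pi}_{\mathcal{M},h}$ and $V^{\pi}_{\mathcal{M},h}$ can be computed exactly by backward induction in the tabular case. The following Python sketch is our illustration rather than part of the paper; the array layout and the names P, r, pi are assumptions made for this example, and steps are zero-indexed.

```python
import numpy as np

def policy_evaluation(P, r, pi):
    """Evaluate Q^pi and V^pi by backward induction in a tabular episodic MDP.

    Assumed (illustrative) array shapes:
      P  : (H, S, A, S), P[h, s, a, s'] = transition probability P_{M,h}(s'|s,a)
      r  : (H, S, A),    mean rewards r_{M,h}(s, a)
      pi : (H, S, A),    pi[h, s, a] = probability of action a in state s at step h
    """
    H, S, A = r.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))  # V[H] = 0: value after the final step is zero
    for h in reversed(range(H)):
        # Q^pi_h(s, a) = r_h(s, a) + sum_{s'} P_h(s'|s, a) V^pi_{h+1}(s')
        Q[h] = r[h] + P[h] @ V[h + 1]
        # V^pi_h(s) = E_{a ~ pi_h(.|s)} Q^pi_h(s, a)
        V[h] = np.sum(pi[h] * Q[h], axis=-1)
    return Q, V
```

Here the convention $V^{\pi}_{\mathcal{M},H+1} \equiv 0$ closes the recursion, matching the definitions above.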