
In supervised learning, predictors only learn from labelled data. However, labelled training examples often require human annotation effort and are thus hard to obtain, whereas unlabelled data can be comparatively easy to collect. Research on semi-supervised learning spans several decades. One of the oldest SSL techniques, self-training, was originally proposed in the 1960s (Fralick, 1967). There, the predictor is first trained on the labelled data. Then, at each training round, a portion of the unlabelled data, selected according to criteria such as model uncertainty, is annotated by the predictor and added to the training set for the next round. This process is repeated multiple times. We refer the readers to Zhu (2005); Chapelle et al. (2006); Ouali et al. (2020); Van Engelen & Hoos (2020) for comprehensive literature surveys.
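For concreteness, a minimal sketch of such a round-based self-training loop is given below. It assumes a scikit-learn-style classifier exposing fit and predict_proba, NumPy arrays for the data, and a simple confidence threshold as the selection criterion; all names are illustrative and not taken from any particular SSL implementation.

import numpy as np

def self_training(model, x_lab, y_lab, x_unlab, rounds=5, threshold=0.95):
    # Round-based self-training: train on the labelled set, pseudo-label the
    # unlabelled pool, and absorb only the most confident predictions.
    for _ in range(rounds):
        model.fit(x_lab, y_lab)               # train on the current labelled set
        if len(x_unlab) == 0:
            break
        probs = model.predict_proba(x_unlab)  # per-class confidence scores
        conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
        keep = conf >= threshold              # selection criterion (here: confidence)
        if not keep.any():
            break
        x_lab = np.concatenate([x_lab, x_unlab[keep]])
        y_lab = np.concatenate([y_lab, pseudo[keep]])
        x_unlab = x_unlab[~keep]
    return model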
Imitation Learning from Observations There have been several works in imitation learning (IL) that do not assume access to the full set of actions, such as BCO (Torabi et al., 2018a), MoBILE (Kidambi et al., 2021), GAIfO (Torabi et al., 2018b), or third-person IL approaches (Stadie et al., 2017; Sharma et al., 2019). The recent work of Baker et al. (2022) also considered a setup where a small number of labelled actions are available in addition to a large unlabelled dataset. A key difference from our work is that the IL setup typically assumes that all trajectories are generated by an expert, unlike our offline setup. Further, some of these methods even permit reward-free interactions with the environment, which is not possible in the offline setup.
Learning from Videos Several works consider training agents with human video demonstrations (Schmeckpeper et al., 2020a;b), which lack action annotations. Distinct from our setup, some of these works allow online interactions or assume expert videos; more broadly, video data typically depicts agents with different embodiments.
3 Semi-Supervised Offline RL
Preliminaries We model our environment as a Markov decision process (MDP) (Bellman, 1957) denoted by $\langle \mathcal{S}, \mathcal{A}, p, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s_1)$ is the distribution of the initial state, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution, $R(s_t, a_t)$ is the deterministic reward function, and $\gamma$ is the discount factor. At each timestep $t$, the agent observes a state $s_t \in \mathcal{S}$ and executes an action $a_t \in \mathcal{A}$. The environment then moves the agent to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and also returns the agent a reward $r_t = R(s_t, a_t)$.
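Under this formulation, the goal of the agent is, as usual, to find a policy $\pi(a_t \mid s_t)$ that maximizes the expected discounted return (stated here for completeness):

$$ J(\pi) \;=\; \mathbb{E}\!\left[\sum_{t \ge 1} \gamma^{t-1} R(s_t, a_t)\right], \qquad s_1 \sim p,\quad a_t \sim \pi(\cdot \mid s_t),\quad s_{t+1} \sim P(\cdot \mid s_t, a_t). $$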
3.1 Proposed Setup
We assume the agent has access to a static offline dataset $\mathcal{T}_{\text{offline}}$. The dataset consists of trajectories collected by unknown policies, which are generally suboptimal. Let $\tau$ denote a trajectory and $|\tau|$ denote its length. We assume that all the trajectories in $\mathcal{T}_{\text{offline}}$ contain complete rewards and states. However, only a small subset of them contain actions. We are interested in learning a policy by leveraging the offline dataset without interacting with the environment. This setup is analogous to semi-supervised learning, where actions serve the role of labels. Hence, we also refer to the complete trajectories as labelled data (denoted by $\mathcal{T}_{\text{labelled}}$) and the action-free trajectories as unlabelled data (denoted by $\mathcal{T}_{\text{unlabelled}}$). Further, we assume the labelled and unlabelled data are sampled from two distributions $P_{\text{labelled}}$ and $P_{\text{unlabelled}}$, respectively. In general, the two distributions can be different. One case we are particularly interested in is when $P_{\text{labelled}}$ generates low-to-moderate-quality trajectories, whereas $P_{\text{unlabelled}}$ generates trajectories of diverse quality, including ones with high returns, as shown in Fig. 1.1.
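For concreteness, one possible in-memory representation of this split is sketched below; the container and function names are hypothetical rather than part of our method. Every trajectory stores states and rewards, and only the labelled subset stores actions.

from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class Trajectory:
    states: np.ndarray                    # shape (|tau| + 1, state_dim)
    rewards: np.ndarray                   # shape (|tau|,)
    actions: Optional[np.ndarray] = None  # shape (|tau|, action_dim); None if action-free

    @property
    def is_labelled(self) -> bool:
        return self.actions is not None

def split_offline_dataset(
    T_offline: List[Trajectory],
) -> Tuple[List[Trajectory], List[Trajectory]]:
    # T_offline is the union of the labelled and unlabelled parts;
    # only a small subset of trajectories carries actions.
    T_labelled = [tau for tau in T_offline if tau.is_labelled]
    T_unlabelled = [tau for tau in T_offline if not tau.is_labelled]
    return T_labelled, T_unlabelled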
Our setup shares some similarities with state-only imitation learning (Ijspeert et al., 2002; Bentivegna et al., 2002; Torabi et al., 2019) in the use of action-unlabelled trajectories. However, there are two fundamental differences. First, in state-only IL, the unlabelled demonstrations are from the same distribution as the labelled demonstrations, and both are generated by a near-optimal expert policy. In our setting, $P_{\text{labelled}}$ and $P_{\text{unlabelled}}$ can be different and are not assumed to be optimal. Second, many state-only imitation learning algorithms (e.g., Gupta et al. (2017); Torabi et al. (2018a;b); Liu et al. (2018); Sermanet et al. (2018)) permit (reward-free) interactions with the environment, similar to their original counterparts (e.g., Ho & Ermon (2016); Kim et al. (2020)). This is not allowed in our offline setup, where the agents are only provided with $\mathcal{T}_{\text{labelled}}$ and $\mathcal{T}_{\text{unlabelled}}$.
3.2 Training Pipeline
RL policies trained on low-to-moderate-quality offline trajectories are often suboptimal, as many of the trajectories might not have high returns and only cover a limited part of the state space. Our goal is to find a way to combine the action-labelled trajectories and the action-free unlabelled trajectories, so that the offline agent can exploit structure in the unlabelled data to improve performance.
One natural strategy is to fill in proxy actions for those unlabelled trajectories, and use the proxy-labelled data together with the labelled data as a whole to train an offline RL agent. Since we assume both the labelled and unlabelled trajectories contain the states, we can train an inverse dynamics model (IDM) $\phi$ that predicts actions using the states. Once we obtain the IDM, we use it to generate the proxy actions for the unlabelled trajectories. Finally, we combine those proxy-labelled trajectories with the labelled trajectories, and train an agent using the offline RL algorithm of choice. Our meta-algorithmic pipeline is summarized in Algorithm 1.
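A minimal sketch of this pipeline is given below. It reuses the hypothetical Trajectory container from Section 3.1, assumes continuous actions (so the IDM is fit with a squared-error loss), and uses a plain single-transition IDM mapping $(s_t, s_{t+1})$ to $a_t$ for simplicity; the names train_idm, proxy_label, and offline_rl are illustrative placeholders rather than our actual implementation, which uses the multi-transition IDM introduced next.

import numpy as np
import torch
import torch.nn as nn

def train_idm(T_labelled, state_dim, action_dim, epochs=100, lr=1e-3):
    """Fit an IDM phi(s_t, s_{t+1}) -> a_t on the action-labelled trajectories."""
    # Stack all (s_t, s_{t+1}, a_t) transitions from the labelled trajectories.
    s = np.concatenate([tau.states[:-1] for tau in T_labelled])
    s_next = np.concatenate([tau.states[1:] for tau in T_labelled])
    a = np.concatenate([tau.actions for tau in T_labelled])
    x = torch.as_tensor(np.concatenate([s, s_next], axis=1), dtype=torch.float32)
    y = torch.as_tensor(a, dtype=torch.float32)

    idm = nn.Sequential(
        nn.Linear(2 * state_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, action_dim),
    )
    opt = torch.optim.Adam(idm.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(idm(x), y)  # regression loss for continuous actions
        loss.backward()
        opt.step()
    return idm

def proxy_label(idm, T_unlabelled):
    """Fill in proxy actions for the action-free trajectories."""
    with torch.no_grad():
        for tau in T_unlabelled:
            x = torch.as_tensor(
                np.concatenate([tau.states[:-1], tau.states[1:]], axis=1),
                dtype=torch.float32)
            tau.actions = idm(x).numpy()
    return T_unlabelled

# Pipeline: train the IDM on T_labelled, proxy-label T_unlabelled, then run any
# offline RL algorithm of choice on the union of the two sets.
# idm = train_idm(T_labelled, state_dim, action_dim)
# T_proxy = proxy_label(idm, T_unlabelled)
# agent = offline_rl(T_labelled + T_proxy)   # offline_rl is a placeholder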
Particularly, we propose a novel stochastic multi-transition