
Compared with sampling uniformly from the offline dataset, varying the sampling rate of offline samples to focus on trajectories with higher cumulative returns, i.e., changing the action density, does not change the support and produces a better behavior policy. Matching the learned policy to such a resampled behavior policy therefore approximately performs support alignment.
In this work, we boost offline RL by designing data rebalanc-
ing strategies to construct better behavior policies. We first
show that existing offline datasets are extremely imbalanced
in terms of episodic return (as shown in Fig. 1). In some
datasets, most actions lead to low returns, which suggests that current density-based constraints may be too restrictive. We thus propose to resample the dataset during training based on episodic return, assigning larger weights to transitions with higher returns. The method is thus dubbed Return-based Data Rebalance (ReD) and can be implemented in fewer than 10 lines of code.
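For illustration, one minimal way to implement such return-based resampling is sketched below, assuming per-transition episodic returns and a simple min-max proportional weighting; the function interface and weighting choice are illustrative, and other weighting schemes are possible.

import numpy as np

def resample_indices(episode_returns, batch_size, rng=np.random):
    # episode_returns: array of shape [num_transitions]; each entry is the
    # episodic return of the trajectory that the transition belongs to.
    # Transitions from higher-return trajectories get proportionally larger
    # sampling weights (min-max normalized); the small offset keeps every
    # transition reachable.
    r = np.asarray(episode_returns, dtype=np.float64)
    w = (r - r.min()) / (r.max() - r.min() + 1e-8)
    p = (w + 1e-3) / (w + 1e-3).sum()
    return rng.choice(len(r), size=batch_size, p=p)

# Usage: idx = resample_indices(returns_per_transition, batch_size=256)
#        batch = {k: v[idx] for k, v in dataset.items()}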
Without any modification to prior hyperparameters, we find that ReD effectively boosts the performance of various popular offline RL algorithms by a large margin on diverse domains in D4RL (Brockman et al., 2016; Fu et al., 2020). Then, as a minor contribution, we propose a more elaborate implementation of data rebalance, Decoupled ReD (DeReD), inspired by decoupled training strategies for data rebalancing in long-tailed classification (Kang et al., 2020); a schematic of the two-stage idea follows below. The proposed DeReD combined with IQL achieves state-of-the-art performance on D4RL. The effectiveness of return-based data rebalance may imply that the data dimension is as important as the algorithmic dimension in offline RL.
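For reference, the decoupling recipe of Kang et al. (2020), which first trains with uniform sampling and then continues training under rebalanced sampling, can be transplanted to the return-based setting as in the schematic below; the agent interface and stage lengths are placeholders for illustration rather than the exact DeReD procedure.

def decoupled_training(agent, dataset, uniform_sampler, rebalanced_sampler,
                       stage1_steps=500_000, stage2_steps=500_000):
    # Stage 1: standard training on uniformly sampled minibatches.
    for _ in range(stage1_steps):
        agent.update(uniform_sampler(dataset))
    # Stage 2: continue training on return-rebalanced minibatches,
    # analogous to classifier re-training in Kang et al. (2020).
    for _ in range(stage2_steps):
        agent.update(rebalanced_sampler(dataset))
    return agent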
Figure 1. Visualization of trajectory return distributions for hopper-medium-replay-v2, walker2d-medium-replay-v2, hopper-medium-expert-v2, and walker2d-medium-expert-v2. Medium-replay datasets tend to have a long-tailed return distribution, while medium-expert datasets tend to have two peaks.
2. Related Work
Offline RL.
To alleviate extrapolation error and address
the distributional shift problem, a general framework for
prior offline RL works is to constrain the learned policy
to stay close to the behavior policy. Because the KL divergence is easy to compute accurately under a Gaussian distribution assumption, many works choose it as the policy constraint. There are many concrete implementation choices, e.g., explicitly modeling the behavior prior with a VAE, or avoiding explicit modeling via the dual form (Wu et al., 2019; Jaques et al., 2019). Exponentially advantage-weighted regression, an implicit form of the KL-divergence constraint, is derived in AWR (Peng et al., 2019), CRR (Wang et al., 2020), and AWAC (Nair et al., 2020). IQL (Kostrikov et al., 2021b) also extracts the policy via advantage-weighted regression from an expectile value function, thereby enforcing a KL constraint.
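Concretely, these methods extract the policy by maximizing an exponentially advantage-weighted log-likelihood over the dataset (notation ours):

$$\max_{\pi}\;\mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\exp\!\big(A(s,a)/\lambda\big)\,\log\pi(a\mid s)\Big],$$

which arises from KL-constrained policy improvement: the optimal policy satisfies $\pi^{*}(a\mid s)\propto\mu(a\mid s)\exp\big(A(s,a)/\lambda\big)$ for behavior policy $\mu$ and temperature $\lambda>0$, and projecting $\pi^{*}$ onto the parametric policy class yields the weighted regression above.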
Behavior cloning (BC) is another way to implement the constraint (Fujimoto & Gu, 2021). Other directions for offline RL also exist. One line regularizes the Q-function with conservative estimates (Kumar et al., 2020; Buckman et al., 2020). Surprisingly, our experiments show that data rebalance also works with conservative Q-learning (Kumar et al., 2020). Another line views offline RL as a sequence modeling problem handled by masked transformers (Chen et al., 2021; Janner et al., 2021), where the transformer outputs actions to attain a given target return.
Some works attempt to approximately satisfy support alignment. BEAR (Kumar et al., 2019) utilizes the maximum mean discrepancy (MMD) to approximately optimize support alignment.
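As a reference point, a sampled estimate of the squared MMD between policy actions and behavior actions with a Gaussian kernel can be written as below; BEAR's actual kernel choice, bandwidth, and the dual-gradient machinery around the constraint are not shown here, and the bandwidth value is an assumption.

import torch

def mmd_squared(policy_actions, behavior_actions, sigma=10.0):
    # policy_actions: [n, d] actions sampled from the learned policy;
    # behavior_actions: [m, d] actions sampled from the dataset (behavior policy).
    # sigma is an assumed Gaussian-kernel bandwidth.
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2.0 * sigma))
    return (kernel(policy_actions, policy_actions).mean()
            + kernel(behavior_actions, behavior_actions).mean()
            - 2.0 * kernel(policy_actions, behavior_actions).mean())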
However, the effectiveness of the sampled MMD in constraining two distributions to the same support has only been shown empirically for low-dimensional distributions with diagonal covariance matrices, without theoretical guarantees, and MMD is quite complex to implement. This is a possible reason why Wu et al. (2019) find that MMD brings no gain over KL. Another attempt to relax the restrictive constraint is to adaptively adjust the weight of the constraint term by dual gradient ascent. However, Wu et al. (2019) observe that the adaptive weight is slightly worse than a fixed one.
Data Rebalance.
Dataset rebalancing is widely used in visual tasks that face long-tailed distributions (Zhang et al., 2021). For decision making, imitation learning (IL) aims to learn from demonstrations, where rebalancing is naturally applied to filter out bad demonstrations. BAIL (Chen et al., 2020) employs a neural network to approximate the upper envelope (i.e., the optimal return from data) and selects good state-action pairs to imitate. Another form of rebalancing in IL is 10%BC, where behavior cloning uses only the top 10% of transitions ordered by episodic return (Chen et al., 2021).
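A minimal sketch of this kind of filtering, assuming a per-transition array of episodic returns (the array name and dataset layout are illustrative):

import numpy as np

def top_return_mask(episode_returns, fraction=0.10):
    # Keep only transitions whose trajectory return is at or above the
    # (1 - fraction) quantile of the dataset, e.g., the top 10%.
    r = np.asarray(episode_returns)
    return r >= np.quantile(r, 1.0 - fraction)

# Usage: mask = top_return_mask(returns_per_transition)
#        bc_dataset = {k: v[mask] for k, v in dataset.items()}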
MARWIL (Wang et al., 2018) and AWR (Peng et al., 2019) employ exponentially advantage-weighted behavior cloning, which is equivalent to the policy improvement step with a KL constraint in RL. Our experiments show that our method can further improve such KL-constrained methods.