
The method of Pitis et al. (2020) does not directly take into account any information about the agent's
epistemic uncertainty or performance. It simply fits a kernel density estimate over the set of
previously visited goals, draws samples as goal candidates, and selects the sample with the
lowest likelihood under the kernel density estimate. This method has the benefit of reliably producing
increasingly difficult goals for the agent and of being immune to problems such as a goal-generator
network destabilizing. However, it has several drawbacks. First, if the agent suffers from catastrophic
forgetting, the curriculum cannot correct the problem effectively. Second, if the incremental
goals are insufficiently or overly challenging for the agent, the curriculum does not adapt accordingly.
Pitis et al. (2020) shares commonalities with pseudocount techniques in that it sets goals exclusively
based on visitation frequency.
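For concreteness, the following is a minimal sketch of this selection rule, assuming achieved goals are stored as a NumPy array and using scikit-learn's KernelDensity as a stand-in for the density model; the candidate-sampling scheme, bandwidth, and function name are illustrative choices, not the authors' exact implementation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def select_low_density_goal(achieved_goals, num_candidates=100, bandwidth=0.1):
    """Pick the candidate goal with the lowest likelihood under a KDE
    fit to previously achieved goals (illustrative sketch)."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(achieved_goals)                      # (N, goal_dim) array of visited goals
    idx = np.random.randint(len(achieved_goals), size=num_candidates)
    candidates = achieved_goals[idx]             # sample candidates from the buffer
    log_density = kde.score_samples(candidates)  # log-likelihood of each candidate
    return candidates[np.argmin(log_density)]    # least-visited region becomes the goal
```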
Similarly, count-based and pseudocount-based algorithms (Tang et al.; Bellemare et al.) rely on
counting the number of times that a state or a region of state space has been visited. In high-dimensional
or continuous environments, these algorithms rely on various compression or learned
similarity metrics (Machado et al.; Schmidhuber, 2008), or on basic statistics, to identify which
regions of the state space are considered similar and adjacent to one another. Intrinsic exploration
bonuses can then be awarded, and goals can be set, based on visitation frequencies to accelerate exploration.
Because these methods take visitation frequency into account exclusively, they do not directly account
for the agent's epistemic uncertainty or performance: they may set goals that are too ambitious or
too conservative, and they cannot correct for catastrophic forgetting.
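As a concrete illustration, the sketch below implements a common form of count-based bonus, r_int = β/√N(φ(s)), where φ is a coarse discretization of the state; the rounding-based hash and the coefficient β are assumptions standing in for the learned hashing and density models used in these works.

```python
from collections import defaultdict
import numpy as np

class CountBonus:
    """Count-based intrinsic bonus r_int = beta / sqrt(N(phi(s))),
    where phi coarsely discretizes the state (simplified sketch)."""

    def __init__(self, beta=0.1, precision=1):
        self.beta = beta
        self.precision = precision        # decimals kept when rounding states
        self.counts = defaultdict(int)    # visitation counts per discretized state

    def _key(self, state):
        # Coarse rounding stands in for learned hashing or density models.
        return tuple(np.round(np.asarray(state), self.precision))

    def bonus(self, state):
        key = self._key(state)
        self.counts[key] += 1
        return self.beta / np.sqrt(self.counts[key])
```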
Other techniques, such as Bharadhwaj et al. (2020); Portelas et al. (2019); Florensa et al. (2018),
take the agent's understanding of the state space into account only indirectly, through approximations
based on statistics about the agent's historical ability to reach goals. Since such statistics offer
minimal insight into the agent's understanding of the task, these methods may select goals that are not
maximally informative to the agent. Their major weakness is that the goal-generation networks can
destabilize and must set goals from only very abstract information about the agent's understanding of
the state space.
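As an example of the kind of statistic these methods condition on, the sketch below keeps only goals whose empirical success rate falls between two thresholds, in the spirit of the goals of intermediate difficulty of Florensa et al. (2018); the success-history bookkeeping and the threshold values are illustrative assumptions.

```python
def intermediate_difficulty_goals(success_history, r_min=0.1, r_max=0.9):
    """Keep goals whose empirical success rate lies in [r_min, r_max].

    success_history: dict mapping a goal (hashable) to a list of 0/1 outcomes.
    Returns the subset of goals considered neither too easy nor too hard.
    """
    selected = []
    for goal, outcomes in success_history.items():
        if not outcomes:
            continue
        rate = sum(outcomes) / len(outcomes)
        if r_min <= rate <= r_max:
            selected.append(goal)
    return selected
```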
Another class of algorithms, such as Pathak et al.; Zhang et al. (2020), relies on an external ensemble
of networks to estimate epistemic uncertainty throughout the state space. Pathak et al. maintains
an external ensemble of forward models, which observe the same data as the agent, and uses the
disagreement among the forward models as an epistemic uncertainty estimate. Zhang et al. (2020)
does the same, but with three external critics. The major problem with these algorithms is that they
estimate the epistemic uncertainty of other networks, not of the agent's network. Because these
external networks' understanding of the state space drifts away from the agent's over time, the
epistemic uncertainty estimate of the external networks is not representative of the agent's epistemic
uncertainty. Additionally, Zhang et al. (2020) requires privileged access to the entire state
space in order to stabilize.
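The following sketch illustrates disagreement-based uncertainty in the spirit of Pathak et al.: the variance across an ensemble of forward models' next-state predictions serves as the intrinsic signal. The PyTorch framework, MLP architecture, and hidden width are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def make_forward_model(state_dim, action_dim, hidden=128):
    """One member of the forward-model ensemble: predicts the next state."""
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, state_dim),
    )

def disagreement_bonus(ensemble, state, action):
    """Variance across ensemble next-state predictions, averaged over
    state dimensions, used as an epistemic-uncertainty proxy."""
    x = torch.cat([state, action], dim=-1)
    preds = torch.stack([model(x) for model in ensemble])  # (E, B, state_dim)
    return preds.var(dim=0).mean(dim=-1)                   # (B,) per-sample bonus
```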
Other curiosity-related methods, such as random network distillation (Burda et al.), use different
formulations of a similar mechanism: estimating the uncertainty of networks disconnected from the
agent's computation graph. Random network distillation uses the agent's experiences
to train an external predictor network to match the output of a frozen, randomly initialized network known
as the target network. As the agent visits a particular state more often, the predictor's error on that
state shrinks, so intrinsic reward bonuses proportional to the prediction error provide an effective
exploration signal. However, this approach estimates the agent's error on the random-network prediction
task in order to gauge its uncertainty on the Q-learning task. These are two very different tasks, with
different difficulties, that are learned at different speeds; consequently, this technique provides a poor
estimate of epistemic uncertainty.
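The sketch below illustrates the random network distillation bonus: a predictor network is trained to match a frozen, randomly initialized target network, and the squared prediction error on a state is used as the intrinsic reward (and as the predictor's training loss). The network widths and feature dimension are assumed values.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random network distillation: intrinsic reward is the predictor's error
    at matching a frozen, randomly initialized target network."""

    def __init__(self, state_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():   # target network stays fixed
            p.requires_grad_(False)

    def intrinsic_reward(self, state):
        with torch.no_grad():
            target_feat = self.target(state)
        error = (self.predictor(state) - target_feat).pow(2).mean(dim=-1)
        return error  # also serves as the predictor's training loss
```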
A number of techniques (Chane-Sane et al., 2021; Nachum et al., 2018; Christen et al., 2021; OpenAI
et al., 2021) have used multiple reinforcement learning agents to solve long-horizon tasks,
either by pitting the agents against each other adversarially or by arranging them in a hierarchy.
Using multiple reinforcement learning agents incurs a significant computational expense and introduces
significant technical challenges, such as nonstationarity between agents. These techniques are outside
the scope of this work, which focuses on maximizing the learning of a single agent.