QUERY THE AGENT: IMPROVING SAMPLE EFFICIENCY
THROUGH EPISTEMIC UNCERTAINTY ESTIMATION
Julian Alverio
MIT CSAIL
jalverio@mit.edu
Boris Katz
MIT CSAIL
boris@mit.edu
Andrei Barbu
MIT CSAIL
abarbu@mit.edu
ABSTRACT
Curricula for goal-conditioned reinforcement learning agents typically rely on
poor estimates of the agent's epistemic uncertainty or fail to consider the agent's
epistemic uncertainty altogether, resulting in poor sample efficiency. We propose a
novel algorithm, Query The Agent (QTA), which significantly improves sample
efficiency by estimating the agent’s epistemic uncertainty throughout the state
space and setting goals in highly uncertain areas. Encouraging the agent to collect
data in highly uncertain states allows the agent to improve its estimation of the
value function rapidly. QTA utilizes a novel technique for estimating epistemic
uncertainty, Predictive Uncertainty Networks (PUN), which allows QTA to assess the
agent's uncertainty in all previously observed states. We demonstrate that QTA
offers decisive sample efficiency improvements over preexisting methods.
1 INTRODUCTION
Deep reinforcement learning has been demonstrated to be highly effective in a diverse array of se-
quential decision-making tasks (Silver et al., 2016; Berner et al., 2019). However, deep reinforcement
learning remains challenging to implement in the real world, in part because of the massive amount of
data required for training. This challenge is acute in robotics (Sünderhauf et al., 2018; Dulac-Arnold
et al., 2019), in tasks such as manipulation (Liu et al.), and in self-driving cars (Kothari et al.).
Existing curriculum methods for training goal-conditioned reinforcement learning (RL) agents suffer
from poor sample efficiency (Dulac-Arnold et al., 2019) and often fail to consider agents’ specific
deficiencies and epistemic uncertainty when selecting goals. Instead, they rely on poor proxy
estimates of epistemic uncertainty or high-level statistics from rollouts, such as the task success rate.
Without customizing learning according to agents’ epistemic uncertainties, existing methods inhibit
the agent’s learning with three modes of failure. Firstly, a given curriculum may not be sufficiently
challenging for an agent, thus using timesteps inefficiently. Secondly, a given curriculum may be
too challenging for an agent, causing the agent to learn more slowly than it could otherwise. Thirdly,
a curriculum may fail to account for an agent catastrophically forgetting the value manifold
in a previously learned region of the state space. Curriculum algorithms need a detailed estimate
of the agent’s epistemic uncertainty throughout the state space in order to maximize learning by
encouraging agents to explore the regions of the state space the agent least understands (Kaelbling,
1993; Plappert et al., 2018).
We propose a novel curriculum algorithm, Query The Agent (QTA), to accelerate learning in goal-
conditioned settings. QTA estimates the agent’s epistemic uncertainty in all previously observed
states, then drives the agent to reduce its epistemic uncertainty as quickly as possible by setting
goals in states with high epistemic uncertainty. QTA estimates epistemic uncertainty using a novel
neural architecture, Predictive Uncertainty Networks (PUN). By taking into account the agent’s
epistemic uncertainty throughout the state space, QTA aims to explore neither too quickly nor too
slowly, and revisit previously explored states when catastrophic forgetting occurs. We demonstrate in
a 2D continuous maze environment that QTA is significantly more sample efficient than preexisting
methods. We also provide a detailed analysis of how QTA's approximation of the optimal value
manifold evolves over time, demonstrating that QTA's learning dynamics are meaningfully driven by
epistemic uncertainty estimation. An overview of QTA and our maze environments is shown in Figure 1.

Figure 1: Left: An overview of our goal-selection method. *The threshold filter is only applied when
sampling the first goal of an episode. Right: An example of a 1-period square-wave maze and an
M-maze. Starting locations are sampled uniformly in the blue regions and goal locations are sampled
uniformly in the green regions. The environment is described in detail in Section 4.
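As a concrete illustration of this goal-selection step, the following is a minimal sketch in which a hypothetical `uncertainty_fn` returns non-negative per-state epistemic-uncertainty estimates (for instance, produced by a PUN-style predictor). The threshold filter mirrors the note in the Figure 1 caption, but the exact selection rule used by QTA may differ.

```python
import numpy as np

def select_goal(visited_states, uncertainty_fn, rng, first_goal=False, threshold=None):
    """Sketch of uncertainty-driven goal selection.

    visited_states : np.ndarray of shape (N, state_dim), previously observed states.
    uncertainty_fn : hypothetical callable returning a non-negative epistemic-
                     uncertainty estimate per state (shape (N,)).
    """
    scores = np.asarray(uncertainty_fn(visited_states), dtype=float)
    if first_goal and threshold is not None:
        # Apply a threshold filter only for the first goal of an episode,
        # mirroring the note in the Figure 1 caption.
        mask = scores >= threshold
        if mask.any():
            visited_states, scores = visited_states[mask], scores[mask]
    # Sample a goal with probability proportional to estimated uncertainty,
    # so highly uncertain regions of the state space are targeted first.
    total = scores.sum()
    probs = scores / total if total > 0 else np.full(len(scores), 1.0 / len(scores))
    idx = rng.choice(len(visited_states), p=probs)
    return visited_states[idx]
```

Sampling proportionally to uncertainty, rather than taking the argmax, is one reasonable choice to avoid repeatedly targeting a single state; QTA's actual selection rule may differ.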
We further demonstrate the importance of utilizing the agent’s epistemic uncertainty by extending
QTA with a modified Prioritized Experience Replay (Schaul et al., 2016) (PER) buffer. This modified
PER buffer assigns sampling priorities to transitions based on states’ estimated epistemic uncertainty.
QTA augmented with our modified PER buffer outperforms QTA, while QTA using a standard
PER buffer does not. This demonstrates the benefits of integrating detailed epistemic uncertainty
estimation into reinforcement learning curricula.
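To make the modified PER idea concrete, below is a minimal sketch of a replay buffer whose sampling priorities come from per-transition epistemic-uncertainty estimates rather than TD errors. The class name, the priority exponent `alpha`, and the exact priority definition are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class UncertaintyPrioritizedReplay:
    """Sketch of a PER-style buffer prioritized by epistemic uncertainty."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def add(self, transition, uncertainty):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        # Priority is the estimated epistemic uncertainty of the transition's state.
        self.priorities.append(float(uncertainty) + 1e-6)  # avoid zero priority

    def sample(self, batch_size, rng):
        p = np.asarray(self.priorities) ** self.alpha
        p /= p.sum()
        idx = rng.choice(len(self.buffer), size=batch_size, p=p)
        # Importance-sampling weights correct the bias from non-uniform sampling.
        weights = (len(self.buffer) * p[idx]) ** -1.0
        weights /= weights.max()
        return [self.buffer[i] for i in idx], weights, idx
```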
Our contributions are:
1. Query the Agent (QTA): A novel curriculum algorithm that adapts goals to the agent's epistemic uncertainty and pushes an agent to improve rapidly by collecting data in highly uncertain states in the environment.
2. Predictive Uncertainty Networks (PUN): A new technique, broadly applicable to all Q-learning approaches, for measuring an agent's epistemic uncertainty throughout the state space using the agent's own latent representation. We demonstrate an implementation with DDPG; a sketch of one possible instantiation appears after this list.
3. An analysis, including ablation experiments, of how QTA estimates epistemic uncertainty throughout the state space and how QTA evolves an agent's understanding of its environment over time.
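The sketch below illustrates one way a predictive-uncertainty head could be attached to a DDPG critic so that the uncertainty signal is derived from the agent's own latent representation. The auxiliary-head design and the use of the critic-versus-head discrepancy as the uncertainty measure are assumptions for illustration, not necessarily the paper's exact PUN architecture.

```python
import torch
import torch.nn as nn

class CriticWithUncertaintyHead(nn.Module):
    """Hypothetical PUN-style head on top of a DDPG critic's latent features."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.q_head = nn.Linear(hidden, 1)            # standard DDPG critic output
        self.pun_head = nn.Sequential(                # auxiliary predictor head
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        z = self.encoder(torch.cat([state, action], dim=-1))
        # The detached latent keeps the auxiliary head from altering the critic.
        return self.q_head(z), self.pun_head(z.detach())

    def uncertainty(self, state, action):
        # Discrepancy between the critic and the auxiliary predictor serves as
        # the per-(state, action) epistemic-uncertainty estimate.
        q, q_pred = self.forward(state, action)
        return (q - q_pred).abs().squeeze(-1)
```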
2 RELATED WORK
Recent advances build upon Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) by
designing curricula that allow the agent to cleverly select goals to accelerate learning. None of
these appropriately take into account the agent’s epistemic uncertainty. We will categorize methods
according to which information about the agent they utilize to select goals, then discuss them.
Pitis et al. (2020) does not directly take into account any information regarding the agent’s epistemic
uncertainty or performance. Their method simply fits a kernel density estimate over the set of
previously visited goals, then draws samples as goal candidates and selects the sample with the
lowest likelihood under the kernel density estimate. This method benefits from reliably producing
increasingly difficult goals for the agent and being immune to problems such as a goal generator
network destabilizing. There are multiple drawbacks. Firstly, if an agent suffers from catastrophic
forgetting, the curriculum will not be able to correct the problem effectively. Secondly, if the incremental
goals are insufficiently or overly challenging for the agent, the curriculum will not properly adapt. Pitis
et al. (2020) shares some commonalities with pseudocount techniques in setting goals exclusively
based on visitation frequency.
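A minimal sketch of this density-based goal selection is shown below, using scikit-learn's `KernelDensity`; the candidate-sampling scheme and the bandwidth are illustrative choices, not the exact procedure of Pitis et al. (2020).

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def lowest_density_goal(visited_goals, n_candidates=100, bandwidth=0.1, rng=None):
    """Fit a KDE over previously achieved goals and pick the candidate
    with the lowest likelihood (i.e., in the least-visited region)."""
    rng = rng or np.random.default_rng()
    kde = KernelDensity(bandwidth=bandwidth).fit(visited_goals)
    # Draw candidates from the visited goals themselves and score their density.
    n = min(n_candidates, len(visited_goals))
    idx = rng.choice(len(visited_goals), size=n, replace=False)
    candidates = visited_goals[idx]
    log_density = kde.score_samples(candidates)
    return candidates[np.argmin(log_density)]
```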
Similarly, count-based and pseudocount-based algorithms (Tang et al.; Bellemare et al.) rely on
counting the number of times that a state or a region of state space has been visited. In high
dimensional or continuous environments, these algorithms rely on various compression or learned
similarity metrics (Machado et al.; Schmidhuber, 2008) or basic statistics in order to identify which
regions of the state space are considered similar and adjacent to one another. Intrinsic exploration
bonuses can be awarded and goals can be set based on visitation frequencies to accelerate exploration.
Because these methods take into account only visitation frequency, they do not directly account for
the agent's epistemic uncertainty or performance; they may set goals too ambitiously or too
conservatively and will not correct for catastrophic forgetting.
Other techniques, such as Bharadhwaj et al. (2020); Portelas et al. (2019); Florensa et al. (2018),
only indirectly take into account the agent’s understanding of the state space through approximations
based on statistics regarding agents’ historical ability to reach goals. Since such metrics offer
minimal insight into the agent’s understanding of the task, these methods select goals that may not be
maximally informative to the agent. The major weakness of these methods is that the goal-generation
networks can destabilize and must set goals using only very abstract data about the agent's
understanding of the state space.
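As an illustration of goal selection driven by such rollout statistics, the sketch below filters goals to an intermediate band of empirical success rates, loosely in the spirit of Florensa et al. (2018); the thresholds and the `goal_history` structure are assumptions for illustration.

```python
import numpy as np

def goals_of_intermediate_difficulty(goal_history, r_min=0.1, r_max=0.9):
    """Keep goals whose empirical success rate lies in an intermediate band.

    goal_history : dict mapping a (hashable) goal to a list of 0/1 outcomes.
    """
    keep = []
    for goal, outcomes in goal_history.items():
        rate = float(np.mean(outcomes))
        if r_min <= rate <= r_max:   # neither trivially easy nor hopeless
            keep.append(goal)
    return keep
```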
Another class of algorithms, such as Pathak et al. and Zhang et al. (2020), relies on an external ensemble
of networks to estimate epistemic uncertainty throughout the state space. Pathak et al. maintains
an external ensemble of forward models, which observe the same data as the agent, and uses the
disagreement among the forward models as an epistemic uncertainty estimate. Zhang et al. (2020)
does the same, but with three external critics. The major problem with these algorithms is that they
are estimating the epistemic uncertainty of other networks, not the agent’s network. Because these
external networks’ understanding of the state space will drift away from the agent’s over time, the
epistemic uncertainty estimate of the external networks is not representative of the agent's epistemic
uncertainty. Additionally, Zhang et al. (2020) requires privileged access to the entire state
space in order to stabilize.
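A minimal sketch of this disagreement signal is shown below: an ensemble of forward models predicts the next state, and the variance of their predictions serves as the (external) uncertainty estimate. The network widths and ensemble size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ForwardModelEnsemble(nn.Module):
    """Ensemble of forward models whose prediction variance measures uncertainty."""

    def __init__(self, state_dim, action_dim, n_models=5, hidden=128):
        super().__init__()
        self.models = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, state_dim),
            )
            for _ in range(n_models)
        ])

    def disagreement(self, state, action):
        x = torch.cat([state, action], dim=-1)
        preds = torch.stack([m(x) for m in self.models])   # (n_models, batch, state_dim)
        # Mean per-dimension variance across ensemble members.
        return preds.var(dim=0).mean(dim=-1)
```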
Other curiosity-related methods, such as random network distillation (Burda et al.), use different
formulations of a similar mechanism: estimating the uncertainty of networks disconnected from the
agent’s computation graph. Random network distillation relies on using the agent’s experiences
to train an external network to predict the output of a frozen, randomly initialized network known
as the target network. As an agent visits a particular state more, the external network improves its
prediction of the output of the target network, making it an effective exploration technique to provide
intrinsic reward bonuses proportionate to the size of the external network’s prediction error. However,
this approach relies on estimating the agent’s error in the random network prediction task to gauge
uncertainty in the Q-learning task. These are two very different tasks with different difficulties, and the
corresponding networks will learn at different speeds; thus this technique provides a poor estimate of the
agent's epistemic uncertainty.
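For reference, a minimal sketch of the random network distillation mechanism described above is given below; the embedding size and network widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random network distillation: a predictor is trained to match a frozen,
    randomly initialized target network, and the prediction error serves as
    an exploration bonus."""

    def __init__(self, state_dim, embed_dim=64, hidden=128):
        super().__init__()
        def mlp():
            return nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, embed_dim),
            )
        self.target, self.predictor = mlp(), mlp()
        for p in self.target.parameters():
            p.requires_grad_(False)        # the target network stays frozen

    def bonus(self, state):
        # Per-state prediction error; large for rarely visited states.
        return (self.predictor(state) - self.target(state)).pow(2).mean(dim=-1)
```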
A number of techniques (Chane-Sane et al., 2021; Nachum et al., 2018; Christen et al., 2021; OpenAI
et al., 2021) have used multiple reinforcement learning agents in order to solve long-horizon tasks,
by either using the agents in an adversarial fashion or in a hierarchy. Using multiple reinforcement
learning agents incurs a significant computational expense and comes with significant technical
challenges, such as nonstationarity between agents. Consideration of these techniques is outside the
scope of this work, as this work focuses on maximizing the learning of a single agent.