LEARNING DYNAMIC ABSTRACT REPRESENTATIONS
FOR SAMPLE-EFFICIENT REINFORCEMENT LEARNING
Mehdi Dadvar
Arizona State University
Tempe, AZ, 85281, USA
mdadvar@asu.edu
Rashmeet Kaur Nayyar
Arizona State University
Tempe, AZ, 85281, USA
rmnayyar@asu.edu
Siddharth Srivastava
Arizona State University
Tempe, AZ, 85281, USA
siddharths@asu.edu
ABSTRACT
In many real-world problems, the learning agent needs to learn a problem’s ab-
stractions and solution simultaneously. However, most such abstractions need to
be designed and refined by hand for different problems and domains of applica-
tion. This paper presents a novel top-down approach for constructing state abstrac-
tions while carrying out reinforcement learning. Starting with state variables and a
simulator, it presents a novel domain-independent approach for dynamically com-
puting an abstraction based on the dispersion of Q-values in abstract states as the
agent continues acting and learning. Extensive empirical evaluation on multiple
domains and problems shows that this approach automatically learns abstractions
that are finely-tuned to the problem, yield powerful sample efficiency, and result
in the RL agent significantly outperforming existing approaches.
1 INTRODUCTION
It is well known that good abstract representations can play a vital role in improving the scalability
and efficiency of reinforcement learning (RL) (Sutton & Barto, 2018; Yu, 2018; Konidaris, 2019).
However, it is not very clear how good abstract representations could be efficiently learned without
extensive hand-coding. Several authors have investigated methods for aggregating concrete states
based on similarities in value functions, but this approach can be difficult to scale as the number of
concrete states or the size of the transition graph grows.
This paper presents a novel approach for top-down construction and refinement of abstractions for
sample efficient reinforcement learning. Rather than aggregating concrete states based on the agent’s
experience, our approach starts with a default, auto-generated coarse abstraction that collapses the
domain of each state variable (e.g., the location of each taxi and each passenger in the classic taxi
world) to one or two abstract values. This eliminates the need to consider concrete states individ-
ually, although this initial abstraction is likely to be too coarse for most practical problems. The
overall algorithm proceeds by interleaving the process of refining this abstraction with learning and
evaluation of policies, and results in automatically generated, problem and reward-function specific
abstractions that aid learning. This process not only helps in creating a succinct representation of cu-
mulative value functions, but it also makes learning more sample efficient by using the abstraction
to locally transfer states’ values and cleaving abstract states only when it is observed that an abstract
state contains states featuring a large spread in their value functions.
This approach is related to research on abstraction for reinforcement learning and on abstraction
refinement for model checking (Dams & Grumberg, 2018; Clarke et al., 2000) (a detailed survey of
related work is presented in the next section). However, unlike existing streams of work, we develop
a process that automatically generates conditional abstractions, where the final abstraction on the
set of values of a variable can depend on the specific values of other variables. For instance, Fig.
1 displays a taxi world where, for different values of the state variables (destination and passenger
locations), meaningful conditional abstractions are constructed for the taxi location. A meaning-
ful abstraction provides greater details in the taxi-location variable around the passenger location
when the taxi needs to pick up a passenger (Fig. 1 (middle)). When the taxi has the passenger, the
abstraction should show greater details around the destination (Fig. 1(right)). Furthermore, our ap-
proach goes beyond the concept of counter-example driven abstraction refinement to consider the
reward function as well as stochastic dynamics, and it uses measures of dispersion such as the stan-
dard deviation of Q-values to drive the refinement process. The main contributions of this paper
are mechanisms for building conditional abstraction trees that help compute and represent such ab-
stractions, and the process of interleaving RL episodes with phases of abstraction and refinement.
Although this process could be adapted to numerous RL algorithms, we focus on developing and
investigating it with Q-learning in this paper.
Figure 1: Consider a classic taxi world with two passengers
and a building as the drop-off location where the green area is
impassable (left). Meaningful conditional abstractions can be
constructed, for example, for situations where both passen-
gers are at their pickup locations (middle), or one passenger
has already been picked up (right).
The presented approach for dynamic ab-
stractions for RL (DAR+RL) can be
thought of as a dynamic abstraction
scheme because the refinement is tied to
the dispersion of Q-values based on the
agent’s evolving policy during learning. It
provides adjustable degrees of compres-
sion (Abel et al., 2016) where the aggres-
siveness of abstraction can be controlled
by tuning the definition of variation in the
dispersion of Q-values. Extensive empir-
ical evaluation on multiple domains and
problems shows that this approach auto-
matically learns abstract representations
that effectively draw out similarities across
the state space, and yield powerful sample efficiency in learning. Comparative evaluation shows that
Q-learning based RL agents enhanced with our approach outperform state-of-the-art RL approaches
in both discrete and continuous domains while learning meaningful abstract representations.
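To make the interleaving of learning and refinement concrete, the following is a minimal sketch (not the authors' implementation) of Q-learning over abstract states with a periodic, dispersion-driven refinement step. The environment and abstraction interfaces (reset/step, map/refine), the sample-count and standard-deviation thresholds, and all hyperparameter values are hypothetical.

```python
import numpy as np
from collections import defaultdict

def dar_rl_sketch(env, abstraction, episodes=500, refine_every=50,
                  alpha=0.1, gamma=0.99, epsilon=0.1, std_threshold=1.0):
    """Hypothetical sketch: interleave abstract Q-learning with refinement.

    `env` is assumed to expose reset()/step(a) like a Gym environment, and
    `abstraction` is assumed to expose map(state) -> abstract state id and
    refine(abstract_id), which splits that abstract state on some variable.
    """
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # Per-(abstract state, action) samples of bootstrapped targets, used
    # below as a crude measure of Q-value dispersion inside an abstract state.
    samples = defaultdict(list)

    for ep in range(episodes):
        s, done = env.reset(), False
        while not done:
            a_s = abstraction.map(s)
            a = (env.action_space.sample() if np.random.rand() < epsilon
                 else int(np.argmax(Q[a_s])))
            s2, r, done, _ = env.step(a)
            a_s2 = abstraction.map(s2)
            target = r + (0.0 if done else gamma * np.max(Q[a_s2]))
            Q[a_s][a] += alpha * (target - Q[a_s][a])
            samples[(a_s, a)].append(target)
            s = s2

        # Periodically refine abstract states whose sampled targets disperse.
        if (ep + 1) % refine_every == 0:
            for (a_s, a), vals in list(samples.items()):
                if len(vals) > 10 and np.std(vals) > std_threshold:
                    abstraction.refine(a_s)   # split this abstract state
            samples.clear()
    return Q
```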
The rest of this paper is organized as follows. Sec. 2 summarizes the related work, followed by a
discussion of the necessary background in Sec. 3. Sec. 4 presents our dynamic abstraction learning
method for sample-efficient RL. The empirical evaluations are presented in Sec. 5, followed by
the conclusions in Sec. 6.
2 RELATED WORK
Offline State Abstraction. Most early studies focus on action-specific (Dietterich, 1999) and
option-specific (Jonsson & Barto, 2000) state abstraction. Further, Givan et al. (2003) introduced
the notion of state equivalence for reducing the state space size, under which two states can be
aggregated into one abstract state if applying the same action leads to equivalent states with similar
rewards. Later on, Ravindran & Barto (2004) relaxed this definition of state equivalence by allowing
the actions to be different if there is a valid mapping between them. Offline state abstraction has
further been studied for generalization and transfer in RL (Karia & Srivastava, 2022) and planning
(Srivastava et al., 2012).
Graph-Theoretic State Abstraction. Mannor et al. (2004) developed a graph-theoretic state ab-
straction approach that utilizes the topological similarities of a state transition graph (STG) to ag-
gregate states in an online manner. Mannor's definition of state abstraction follows Givan's notion
of equivalent states, except that the partial STG is updated iteratively to find the abstractions. Another
comparable method, proposed by Chiu & Soo (2010), carries out spectral graph analysis on the STG to
decompose the graph into multiple sub-graphs. However, most graph-theoretic analyses on STGs,
such as computing the eigenvectors in Chiu & Soo's work, can become infeasible for problems with
large state spaces.
Monte-Carlo Tree Search (MCTS). MCTS approaches offer viable and tractable algorithms for
large state-space Markovian decision problems (Kocsis & Szepesvári, 2006). Jiang et al. (2014)
demonstrated that proper abstraction effectively enhances the performance of MCTS algorithms.
However, their clustering-based state abstraction approach is limited to the states enumerated by
their algorithm within the partially expanded tree, which makes it ineffectual when limited samples
are available to the planning/learning agent. Anand et al. (2015) advanced Jiang’s method by com-
prehensively aggregating states and state-action pairs, aiming to uncover more symmetries in the do-
main. Owing to their novel state-action pair abstraction, which extends Givan's and Ravindran's notions of
abstraction, Anand et al.'s method results in higher-quality policies compared to other approaches
based on MCTS. However, their bottom-up abstraction scheme makes their method computationally
vulnerable on problems with significantly larger state spaces. Moreover, their proposed state
abstraction method is limited to the explored states since it applies to the partially expanded tree.
Counterexample Guided Abstraction Refinement (CEGAR). CEGAR is a model checking
methodology that initially assumes a coarse abstract model and then validates or refines the initial
abstraction to eliminate spurious counterexamples to the property that needs to be verified (Clarke
et al., 2000). While most work in this direction focuses on deterministic systems, research on
the topic also considers the problem of defining the appropriate notion for a “counterexample”
in stochastic settings. E.g., Chadha & Viswanathan (2010) propose that a counterexample can be
considered as a small MDP that violates the desired property. However, searching for such coun-
terexamples can be difficult in the RL setting where the transition function of the MDP is not avail-
able. Seipp & Helmert (2018) developed algorithms for planning in deterministic environments that
invoke the CEGAR loop iteratively on the same original task to obtain more efficient abstraction
refinement. These methods do not consider the problem of building abstractions for stochastic plan-
ning or reinforcement learning.
3 BACKGROUND
Markov Decision Processes (MDPs) (Bellman, 1957; Puterman, 2014) are defined as a tuple
$\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces respectively. Generally, a con-
crete state $s \in \mathcal{S}$ can be defined as a set of $n$ state variables such that $V = \{v_i \mid i = 1, \ldots, n\}$. In this
paper, we focus on problems where the state is defined using a set of variables. An extension to par-
tially observable settings where the agent receives an image of the state is a promising direction for
future work. $T: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is a transition probability function, $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward
function, and $\gamma$ is the discount factor. The unknown policy $\pi$ is the solution to an MDP, denoted as
$\pi: \mathcal{S} \to \mathcal{A}$. We consider the RL settings where an agent needs to interact with an environment that
can be modeled as an MDP with unknown $T$. The objective is to learn an optimal policy for this
MDP.
When the size of the state space increases significantly, most RL algorithms fail to solve
the given MDP due to the curse of dimensionality. Abstraction is a dimension-reduction mecha-
nism by which the original problem representation maps to a new, reduced problem representation
(Giunchiglia & Walsh, 1992). We adopt the general definition of state abstraction proposed by Li
et al. (2006).
Definition 1 Let $M = \langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$ be the ground MDP from which the abstract MDP $\bar{M} = \langle \bar{\mathcal{S}}, \mathcal{A}, \bar{T}, \bar{R}, \gamma \rangle$ can be derived via a state abstraction function $\phi: \mathcal{S} \to \bar{\mathcal{S}}$, where the abstract
state mapped to concrete state $s$ is denoted as $\phi(s) \in \bar{\mathcal{S}}$ and $\phi^{-1}(\bar{s})$ is the set of concrete states
associated to $\bar{s} \in \bar{\mathcal{S}}$. Further, a weighting function over concrete states is denoted as $w(s)$ with
$s \in \mathcal{S}$ s.t. for each $\bar{s} \in \bar{\mathcal{S}}$, $\sum_{s \in \phi^{-1}(\bar{s})} w(s) = 1$, where $w(s) \in [0, 1]$. Accordingly, the abstract
transition probability function $\bar{T}$ and reward function $\bar{R}$ are defined as follows:

$$\bar{R}(\bar{s}, a) = \sum_{s \in \phi^{-1}(\bar{s})} w(s)\, R(s, a), \qquad \bar{T}(\bar{s}, a, \bar{s}') = \sum_{s \in \phi^{-1}(\bar{s})} \sum_{s' \in \phi^{-1}(\bar{s}')} w(s)\, T(s, a, s').$$
When it comes to decision-making in an abstract MDP, all concrete states associated with an
abstract state $\bar{s} \in \bar{\mathcal{S}}$ are perceived identically. Accordingly, the relation between the abstract policy
$\bar{\pi}: \bar{\mathcal{S}} \to \mathcal{A}$ and the concrete policy $\pi: \mathcal{S} \to \mathcal{A}$ can be defined as $\pi(s) = \bar{\pi}(\phi(s))$ for all $s \in \mathcal{S}$.
Further, the value functions for an abstract MDP are denoted as $V^{\bar{\pi}}(\bar{\mathcal{S}})$, $V^{*}(\bar{\mathcal{S}})$, $Q^{\bar{\pi}}(\bar{\mathcal{S}}, a)$, and
$Q^{*}(\bar{\mathcal{S}}, a)$. For more on RL and value functions see Sutton & Barto (2018); for MDPs, see (Bellman,
1957; Puterman, 2014); and for more on the notion of abstraction, refer to Giunchiglia & Walsh (1992);
Li et al. (2006).
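As a small numerical illustration of Definition 1, the sketch below (our own, assuming a tiny tabular MDP with a uniform weighting function; all names and the example MDP are hypothetical) aggregates a ground reward and transition function into their abstract counterparts.

```python
from collections import defaultdict

def abstract_mdp(states, actions, T, R, phi, w):
    """Build the abstract reward and transition functions of Definition 1.

    T[s][a][s2] and R[s][a] are the ground dynamics/rewards, phi maps a
    concrete state to its abstract state, and w(s) weights concrete states so
    that the weights within each abstract state sum to 1 (assumed, not checked).
    """
    R_bar = defaultdict(float)   # (s_bar, a) -> abstract reward
    T_bar = defaultdict(float)   # (s_bar, a, s_bar2) -> abstract probability
    for s in states:
        for a in actions:
            R_bar[(phi(s), a)] += w(s) * R[s][a]
            for s2, p in T[s][a].items():
                T_bar[(phi(s), a, phi(s2))] += w(s) * p
    return R_bar, T_bar

# Tiny hypothetical example: states {0,1,2,3} collapsed into two abstract
# states by parity, with uniform weights inside each abstract state.
states, actions = [0, 1, 2, 3], ["a"]
T = {s: {"a": {(s + 1) % 4: 1.0}} for s in states}   # deterministic cycle
R = {s: {"a": float(s)} for s in states}
phi = lambda s: s % 2
w = lambda s: 0.5
R_bar, T_bar = abstract_mdp(states, actions, T, R, phi, w)
# R_bar[(0, 'a')] == 0.5*0 + 0.5*2 == 1.0 ; T_bar[(0, 'a', 1)] == 1.0
```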
4 OUR APPROACH
4.1 OVERVIEW
Starting with state variables and a simulator, we develop a domain-independent approach for dynam-
ically computing an abstraction based on the dispersion of Q-values in abstract states. The idea of
dynamic abstraction is to learn a problem’s solution and abstractions simultaneously. We propose a
top-down abstraction refinement mechanism by which the learning agent effectively refines an initial
coarse abstraction through acting and learning. We illustrate this mechanism with an example.
Example 1 Consider a 4x4 Wumpus world consisting of a pit at (2,2) and a goal at (4,4). In this
domain, every movement yields a reward of -1. Reaching the goal results in a positive reward of 10, and
the agent receives a negative reward of -10 for falling into the pit. The goal and the pit are the terminal
states of the domain. The agent's actions consist of moving to non-diagonally adjacent cells at each time
step, i.e., $\mathcal{A} = \{up, down, left, right\}$.
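For concreteness, a minimal simulator for Example 1 might look like the sketch below. The class name, the start cell (1,1), and the convention that off-grid moves leave the agent in place are our assumptions, not part of the example.

```python
class WumpusWorld:
    """4x4 grid of Example 1: pit at (2,2), goal at (4,4), -1 per move."""
    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self):
        self.pit, self.goal = (2, 2), (4, 4)
        self.reset()

    def reset(self):
        self.state = (1, 1)   # assumed start cell (not specified in Example 1)
        return self.state

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        x, y = self.state
        # Moves that would leave the 4x4 grid keep the agent in place.
        nx, ny = min(max(x + dx, 1), 4), min(max(y + dy, 1), 4)
        self.state = (nx, ny)
        if self.state == self.goal:
            return self.state, 10.0, True    # terminal: goal reached
        if self.state == self.pit:
            return self.state, -10.0, True   # terminal: fell into the pit
        return self.state, -1.0, False       # ordinary movement cost
```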
Figure 2: An example of dynamic heterogeneous abstraction refinement for a Wumpus world: the initial coarse abstraction (left) and a refined abstraction (right).
Considering Example 1, Fig. 2 (left) shows a potential initial coarse abstraction in which the domain of
each state variable (here, the horizontal and vertical locations) is split into two abstract values, and
$\bar{S}_1$ and $\bar{S}_4$ contain the pit and the goal location, respectively. As a result, when learning, the agent
will observe a high standard deviation on the values of $Q(\bar{S}_1, right)$, $Q(\bar{S}_1, down)$, $Q(\bar{S}_4, right)$,
and $Q(\bar{S}_4, down)$ because of the presence of terminal states with large negative or positive rewards.
Guided by this dispersion of Q-values, the initial coarse abstraction should be refined to resolve the
observed variations. Fig. 2 (right) exemplifies an effective abstraction refinement for Example 1,
shown as a heatmap of Q-values. Notice that the desired abstraction is a heterogeneous abstraction on
the domains of state-variable values, where the abstraction on a variable depends on the values of the
other variables: let $x$ and $y$ be the horizontal and vertical locations of the agent in Example 1,
respectively, and let their domains be $\{1,2,3,4\}$. When $y > 2$, the domain of $x$ (originally $\{1,2,3,4\}$)
is abstracted into the sets $\{1,2\}$, $\{3\}$, and $\{4\}$, but when $y \leq 2$, the domain of $x$ is abstracted into
the sets $\{1\}$, $\{2\}$, and $\{3,4\}$.
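To make this heterogeneous abstraction concrete, a hypothetical mapping function for Example 1 (our illustration, not necessarily the abstraction the algorithm would learn; the grouping of $y$ into halves is an assumption) could be written as:

```python
def phi(state):
    """Conditional abstraction for Example 1: how x is grouped depends on y."""
    x, y = state
    if y > 2:
        # Upper half: the domain of x is abstracted into {1,2}, {3}, {4}.
        x_block = {1: "12", 2: "12", 3: "3", 4: "4"}[x]
        y_block = "34"
    else:
        # Lower half: the domain of x is abstracted into {1}, {2}, {3,4}.
        x_block = {1: "1", 2: "2", 3: "34", 4: "34"}[x]
        y_block = "12"
    return (x_block, y_block)

assert phi((1, 4)) == phi((2, 4))   # grouped together in the upper half
assert phi((1, 1)) != phi((2, 1))   # but split apart in the lower half
```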
The next section (Sec. 4.2) presents our novel approach for automatically computing such abstrac-
tions while carrying out RL.
4.2 CONDITIONAL ABSTRACTION TREES
The value of a state variable $v_i$ inherently falls within a known range. Partitioning these ranges is
one way to construct state abstractions. However, in practice, the useful abstraction of one state variable
may be conditioned on specific ranges of the other state variables. Accordingly, we need to maintain
and update such conditional abstractions via structures that we call Conditional Abstraction Trees
(CATs) while constructing the state abstractions.
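A minimal sketch of such a structure is given below (the class and method names are ours, and the midpoint splitting rule is an assumption for illustration): each node stores one value range per state variable, and refining a node on a variable splits that variable's range while the other variables' ranges, i.e. the condition, are inherited unchanged.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class CATNode:
    """One node of a Conditional Abstraction Tree: a box of variable ranges."""
    ranges: Dict[str, Tuple[int, int]]            # e.g. {"x": (1, 4), "y": (1, 4)}
    children: List["CATNode"] = field(default_factory=list)

    def refine(self, var: str) -> None:
        """Split this node's range for `var` at its midpoint, keeping the
        other variables' ranges fixed (the 'condition' of the abstraction)."""
        lo, hi = self.ranges[var]
        if lo >= hi:
            return                                 # cannot split a unit range
        mid = (lo + hi) // 2
        for new_range in [(lo, mid), (mid + 1, hi)]:
            self.children.append(CATNode({**self.ranges, var: new_range}))

    def lookup(self, state: Dict[str, int]) -> "CATNode":
        """Descend to the leaf (i.e. the current abstract state) whose ranges
        contain the concrete state `state`."""
        for child in self.children:
            if all(lo <= state[v] <= hi for v, (lo, hi) in child.ranges.items()):
                return child.lookup(state)
        return self
```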
Figure 3: This figure illustrates a Conditional Abstraction Tree (CAT) for Example 1. The ranges written inside the nodes represent $\theta_i \in \Theta$. Each node represents a conditional abstraction.
Fig. 3 (right) exemplifies a partially expanded CAT for the problem in Example 1. This problem can be
represented by two state variables, i.e., the agent's horizontal and vertical locations, denoted as $x$ and
$y$ respectively. The tree's root node contains the global ranges for both of these state variables (the
first range refers to the horizontal location $x$ and the second range refers to the vertical location $y$),
representing an initial coarse abstraction (in white). The annotations visualize how this initial
abstraction can be further refined w.r.t. a state variable, resulting in new conditional abstractions
(symmetric annotations are not shown for the sake of clarity).
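Using the hypothetical CATNode sketch from earlier in this section, the root and the first refinements of the CAT in Fig. 3 could be constructed as follows (an illustration, not the authors' code):

```python
# Root: the initial coarse abstraction holding the global ranges of x and y.
root = CATNode({"x": (1, 4), "y": (1, 4)})

# Refining the root on x yields two conditional abstractions, one covering
# x in [1,2] and one covering x in [3,4], each still with y in [1,4].
root.refine("x")

# Further refining the first child on y conditions the abstraction of y on
# x being in [1,2], mirroring the annotated expansion in Fig. 3.
root.children[0].refine("y")

# A concrete state such as x=2, y=3 now maps to the abstract state (leaf)
# whose ranges are {"x": (1, 2), "y": (3, 4)}.
leaf = root.lookup({"x": 2, "y": 3})
```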