LEARNING DYNAMIC ABSTRACT REPRESENTATIONS
FOR SAMPLE-EFFICIENT REINFORCEMENT LEARNING
Mehdi Dadvar
Arizona State University
Tempe, AZ, 85281, USA
mdadvar@asu.edu
Rashmeet Kaur Nayyar
Arizona State University
Tempe, AZ, 85281, USA
rmnayyar@asu.edu
Siddharth Srivastava
Arizona State University
Tempe, AZ, 85281, USA
siddharths@asu.edu
ABSTRACT
In many real-world problems, the learning agent needs to learn a problem’s ab-
stractions and solution simultaneously. However, most such abstractions need to
be designed and refined by hand for different problems and domains of applica-
tion. This paper presents a novel top-down approach for constructing state abstrac-
tions while carrying out reinforcement learning. Starting with state variables and a
simulator, it presents a novel domain-independent approach for dynamically com-
puting an abstraction based on the dispersion of Q-values in abstract states as the
agent continues acting and learning. Extensive empirical evaluation on multiple
domains and problems shows that this approach automatically learns abstractions
that are finely-tuned to the problem, yield powerful sample efficiency, and result
in the RL agent significantly outperforming existing approaches.
1 INTRODUCTION
It is well known that good abstract representations can play a vital role in improving the scalability
and efficiency of reinforcement learning (RL) (Sutton & Barto, 2018; Yu, 2018; Konidaris, 2019).
However, it is not very clear how good abstract representations could be efficiently learned without
extensive hand-coding. Several authors have investigated methods for aggregating concrete states
based on similarities in value functions, but this approach can be difficult to scale as the number of
concrete states or the size of the transition graph grows.
This paper presents a novel approach for top-down construction and refinement of abstractions for
sample efficient reinforcement learning. Rather than aggregating concrete states based on the agent’s
experience, our approach starts with a default, auto-generated coarse abstraction that collapses the
domain of each state variable (e.g., the location of each taxi and each passenger in the classic taxi
world) to one or two abstract values. This eliminates the need to consider concrete states individ-
ually, although this initial abstraction is likely to be too coarse for most practical problems. The
overall algorithm proceeds by interleaving the process of refining this abstraction with learning and
evaluation of policies, and results in automatically generated, problem and reward-function specific
abstractions that aid learning. This process not only helps in creating a succinct representation of cu-
mulative value functions, but it also makes learning more sample efficient by using the abstraction
to locally transfer states’ values and cleaving abstract states only when it is observed that an abstract
state contains states featuring a large spread in their value functions.
This approach is related to research on abstraction for reinforcement learning and on abstraction
refinement for model checking (Dams & Grumberg, 2018; Clarke et al., 2000) (a detailed survey of
related work is presented in the next section). However, unlike existing streams of work, we develop
a process that automatically generates conditional abstractions, where the final abstraction on the
set of values of a variable can depend on the specific values of other variables. For instance, Fig.
1 displays a taxi world where, for different values of the state variables (destination and passenger
locations), meaningful conditional abstractions are constructed for the taxi location. A meaning-
ful abstraction provides greater details in the taxi-location variable around the passenger location
when the taxi needs to pick up a passenger (Fig. 1 (middle)). When the taxi has the passenger, the
abstraction should show greater details around the destination (Fig. 1(right)). Furthermore, our ap-
proach goes beyond the concept of counter-example driven abstraction refinement to consider the
reward function as well as stochastic dynamics, and it uses measures of dispersion such as the stan-
dard deviation of Q-values to drive the refinement process. The main contributions of this paper
are mechanisms for building conditional abstraction trees that help compute and represent such ab-
stractions, and the process of interleaving RL episodes with phases of abstraction and refinement.
Although this process could be adapted to numerous RL algorithms, we focus on developing and
investigating it with Q-learning in this paper.
Figure 1: Consider a classic taxi world with two passengers
and a building as the drop-off location where the green area is
impassable (left). Meaningful conditional abstractions can be
constructed, for example, for situations where both passen-
gers are at their pickup locations (middle), or one passenger
has already been picked up (right).
The presented approach for dynamic ab-
stractions for RL (DAR+RL) can be
thought of as a dynamic abstraction
scheme because the refinement is tied to
the dispersion of Q-values based on the
agent’s evolving policy during learning. It
provides adjustable degrees of compres-
sion (Abel et al., 2016) where the aggres-
siveness of abstraction can be controlled
by tuning the definition of variation in the
dispersion of Q-values. Extensive empir-
ical evaluation on multiple domains and
problems shows that this approach auto-
matically learns abstract representations
that effectively draw out similarities across
the state space, and yield powerful sample efficiency in learning. Comparative evaluation shows that
Q-learning based RL agents enhanced with our approach outperform state-of-the-art RL approaches
in both discrete and continuous domains while learning meaningful abstract representations.
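To make the interleaving of learning and refinement concrete, the following is a minimal sketch (not the authors' implementation) of Q-learning over abstract states with a periodic, dispersion-driven refinement step. The environment and abstraction interfaces (reset/step, map/refine), the sample-count and standard-deviation thresholds, and all hyperparameter values are hypothetical.

```python
import numpy as np
from collections import defaultdict

def dar_rl_sketch(env, abstraction, episodes=500, refine_every=50,
                  alpha=0.1, gamma=0.99, epsilon=0.1, std_threshold=1.0):
    """Hypothetical sketch: interleave abstract Q-learning with refinement.

    `env` is assumed to expose reset()/step(a) like a Gym environment, and
    `abstraction` is assumed to expose map(state) -> abstract state id and
    refine(abstract_id), which splits that abstract state on some variable.
    """
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # Per-(abstract state, action) samples of bootstrapped targets, used
    # below as a crude measure of Q-value dispersion inside an abstract state.
    samples = defaultdict(list)

    for ep in range(episodes):
        s, done = env.reset(), False
        while not done:
            a_s = abstraction.map(s)
            a = (env.action_space.sample() if np.random.rand() < epsilon
                 else int(np.argmax(Q[a_s])))
            s2, r, done, _ = env.step(a)
            a_s2 = abstraction.map(s2)
            target = r + (0.0 if done else gamma * np.max(Q[a_s2]))
            Q[a_s][a] += alpha * (target - Q[a_s][a])
            samples[(a_s, a)].append(target)
            s = s2

        # Periodically refine abstract states whose sampled targets disperse.
        if (ep + 1) % refine_every == 0:
            for (a_s, a), vals in list(samples.items()):
                if len(vals) > 10 and np.std(vals) > std_threshold:
                    abstraction.refine(a_s)   # split this abstract state
            samples.clear()
    return Q
```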
The rest of this paper is organized as follows. Sec. 2 summarizes the related work, followed by a
discussion of the necessary background in Sec. 3. Sec. 4 presents our dynamic abstraction learning
method for sample-efficient RL. The empirical evaluations are presented in Sec. 5, followed by
the conclusions in Sec. 6.
2 RELATED WORK
Offline State Abstraction. Most early studies focus on action-specific (Dietterich, 1999) and
option-specific (Jonsson & Barto, 2000) state abstraction. Further, Givan et al. (2003) introduced
the notion of state equivalence for reducing the state space size, under which two states can be
aggregated into one abstract state if applying the same action leads to equivalent states with similar
rewards. Later on, Ravindran & Barto (2004) relaxed this definition of state equivalence by allowing
the actions to be different if there is a valid mapping between them. Offline state abstraction has
further been studied for generalization and transfer in RL (Karia & Srivastava, 2022) and planning
(Srivastava et al., 2012).
Graph-Theoretic State Abstraction. Mannor et al. (2004) developed a graph-theoretic state ab-
straction approach that utilizes the topological similarities of a state transition graph (STG) to ag-
gregate states in an online manner. Mannor's definition of state abstraction follows Givan's notion
of equivalent states, except that the partial STG is updated iteratively to find the abstractions. Another
comparable method, proposed by Chiu & Soo (2010), carries out spectral graph analysis on the STG to
decompose the graph into multiple sub-graphs. However, most graph-theoretic analyses on STGs,
such as computing the eigenvectors in Chiu & Soo's work, can become infeasible for problems with
large state spaces.
Monte-Carlo Tree Search (MCTS). MCTS approaches offer viable and tractable algorithms for
large state-space Markovian decision problems (Kocsis & Szepesvári, 2006). Jiang et al. (2014)
demonstrated that proper abstraction effectively enhances the performance of MCTS algorithms.
However, their clustering-based state abstraction approach is limited to the states enumerated by
their algorithm within the partially expanded tree, which makes it ineffectual when limited samples
are available to the planning/learning agent. Anand et al. (2015) advanced Jiang’s method by com-
prehensively aggregating states and state-action pairs, aiming to uncover more symmetries in the do-
main. Owing to their novel state-action pair abstraction, which extends Givan's and Ravindran's notions of
abstraction, Anand et al.'s method results in higher-quality policies compared to other approaches
based on MCTS. However, their bottom-up abstraction scheme makes their method computationally
vulnerable on problems with significantly larger state spaces. Moreover, their proposed state
abstraction method is limited to the explored states since it applies to the partially expanded tree.
Counterexample Guided Abstraction Refinement (CEGAR). CEGAR is a model checking
methodology that initially assumes a coarse abstract model and then validates or refines the initial
abstraction to eliminate spurious counterexamples to the property that needs to be verified (Clarke
et al., 2000). While most work in this direction focuses on deterministic systems, research on
the topic also considers the problem of defining the appropriate notion for a “counterexample”
in stochastic settings. E.g., Chadha & Viswanathan (2010) propose that a counterexample can be
considered as a small MDP that violates the desired property. However, searching for such coun-
terexamples can be difficult in the RL setting where the transition function of the MDP is not avail-
able. Seipp & Helmert (2018) developed algorithms for planning in deterministic environments that
invoke the CEGAR loop iteratively on the same original task to obtain more efficient abstraction
refinement. These methods do not consider the problem of building abstractions for stochastic plan-
ning or reinforcement learning.
3 BACKGROUND
Markov Decision Processes (MDPs) (Bellman, 1957; Puterman, 2014) are defined as a tuple
$\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces respectively. Generally, a con-
crete state $s \in \mathcal{S}$ can be defined as a set of $n$ state variables such that $V = \{v_i \mid i = 1, \ldots, n\}$. In this
paper, we focus on problems where the state is defined using a set of variables. An extension to par-
tially observable settings where the agent receives an image of the state is a promising direction for
future work. $T: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is a transition probability function, $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward
function, and $\gamma$ is the discount factor. The unknown policy $\pi$ is the solution to an MDP, denoted as
$\pi: \mathcal{S} \to \mathcal{A}$. We consider the RL settings where an agent needs to interact with an environment that
can be modeled as an MDP with unknown $T$. The objective is to learn an optimal policy for this
MDP.
When the size of the state space increases significantly, most RL algorithms fail to solve
the given MDP due to the curse of dimensionality. Abstraction is a dimension-reduction mecha-
nism by which the original problem representation maps to a new, reduced problem representation
(Giunchiglia & Walsh, 1992). We adopt the general definition of state abstraction proposed by Li
et al. (2006).
Definition 1 Let $M = \langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$ be the ground MDP from which the abstract MDP $\bar{M} = \langle \bar{\mathcal{S}}, \mathcal{A}, \bar{T}, \bar{R}, \gamma \rangle$ can be derived via a state abstraction function $\phi: \mathcal{S} \to \bar{\mathcal{S}}$, where the abstract
state mapped to concrete state $s$ is denoted as $\phi(s) \in \bar{\mathcal{S}}$ and $\phi^{-1}(\bar{s})$ is the set of concrete states
associated to $\bar{s} \in \bar{\mathcal{S}}$. Further, a weighting function over concrete states is denoted as $w(s)$ with
$s \in \mathcal{S}$ s.t. for each $\bar{s} \in \bar{\mathcal{S}}$, $\sum_{s \in \phi^{-1}(\bar{s})} w(s) = 1$, where $w(s) \in [0, 1]$. Accordingly, the abstract
transition probability function $\bar{T}$ and reward function $\bar{R}$ are defined as follows:

$$\bar{R}(\bar{s}, a) = \sum_{s \in \phi^{-1}(\bar{s})} w(s)\, R(s, a), \qquad \bar{T}(\bar{s}, a, \bar{s}') = \sum_{s \in \phi^{-1}(\bar{s})} \sum_{s' \in \phi^{-1}(\bar{s}')} w(s)\, T(s, a, s').$$
When it comes to decision-making in an abstract MDP, all concrete states associated with an
abstract state $\bar{s} \in \bar{\mathcal{S}}$ are perceived identically. Accordingly, the relation between the abstract policy
$\bar{\pi}: \bar{\mathcal{S}} \to \mathcal{A}$ and the concrete policy $\pi: \mathcal{S} \to \mathcal{A}$ can be defined as $\pi(s) = \bar{\pi}(\phi(s))$ for all $s \in \mathcal{S}$.
Further, the value functions for an abstract MDP are denoted as $V^{\bar{\pi}}(\bar{\mathcal{S}})$, $V^{*}(\bar{\mathcal{S}})$, $Q^{\bar{\pi}}(\bar{\mathcal{S}}, a)$, and
$Q^{*}(\bar{\mathcal{S}}, a)$. For more on RL and value functions see Sutton & Barto (2018); for MDPs, see (Bellman,
1957; Puterman, 2014); and for more on the notion of abstraction, refer to Giunchiglia & Walsh (1992);
Li et al. (2006).
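As a small numerical illustration of Definition 1, the sketch below (our own, assuming a tiny tabular MDP with a uniform weighting function; all names and the example MDP are hypothetical) aggregates a ground reward and transition function into their abstract counterparts.

```python
from collections import defaultdict

def abstract_mdp(states, actions, T, R, phi, w):
    """Build the abstract reward and transition functions of Definition 1.

    T[s][a][s2] and R[s][a] are the ground dynamics/rewards, phi maps a
    concrete state to its abstract state, and w(s) weights concrete states so
    that the weights within each abstract state sum to 1 (assumed, not checked).
    """
    R_bar = defaultdict(float)   # (s_bar, a) -> abstract reward
    T_bar = defaultdict(float)   # (s_bar, a, s_bar2) -> abstract probability
    for s in states:
        for a in actions:
            R_bar[(phi(s), a)] += w(s) * R[s][a]
            for s2, p in T[s][a].items():
                T_bar[(phi(s), a, phi(s2))] += w(s) * p
    return R_bar, T_bar

# Tiny hypothetical example: states {0,1,2,3} collapsed into two abstract
# states by parity, with uniform weights inside each abstract state.
states, actions = [0, 1, 2, 3], ["a"]
T = {s: {"a": {(s + 1) % 4: 1.0}} for s in states}   # deterministic cycle
R = {s: {"a": float(s)} for s in states}
phi = lambda s: s % 2
w = lambda s: 0.5
R_bar, T_bar = abstract_mdp(states, actions, T, R, phi, w)
# R_bar[(0, 'a')] == 0.5*0 + 0.5*2 == 1.0 ; T_bar[(0, 'a', 1)] == 1.0
```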
4 OUR APPROACH
4.1 OVERVIEW
Starting with state variables and a simulator, we develop a domain-independent approach for dynam-
ically computing an abstraction based on the dispersion of Q-values in abstract states. The idea of
dynamic abstraction is to learn a problem’s solution and abstractions simultaneously. We propose a
top-down abstraction refinement mechanism by which the learning agent effectively refines an initial
coarse abstraction through acting and learning. We illustrate this mechanism with an example.
Example 1 Consider a 4x4 Wumpus world consisting of a pit at (2,2) and a goal at (4,4). In this
domain, every movement yields a reward of -1. Reaching the goal results in a positive reward of 10, and
the agent receives a negative reward of -10 for falling into the pit. The goal and the pit are the terminal
states of the domain. The agent's actions consist of moving to non-diagonally adjacent cells at each time
step, i.e., $\mathcal{A} = \{up, down, left, right\}$.
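For concreteness, a minimal simulator for Example 1 might look like the sketch below. The class name, the start cell (1,1), and the convention that off-grid moves leave the agent in place are our assumptions, not part of the example.

```python
class WumpusWorld:
    """4x4 grid of Example 1: pit at (2,2), goal at (4,4), -1 per move."""
    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self):
        self.pit, self.goal = (2, 2), (4, 4)
        self.reset()

    def reset(self):
        self.state = (1, 1)   # assumed start cell (not specified in Example 1)
        return self.state

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        x, y = self.state
        # Moves that would leave the 4x4 grid keep the agent in place.
        nx, ny = min(max(x + dx, 1), 4), min(max(y + dy, 1), 4)
        self.state = (nx, ny)
        if self.state == self.goal:
            return self.state, 10.0, True    # terminal: goal reached
        if self.state == self.pit:
            return self.state, -10.0, True   # terminal: fell into the pit
        return self.state, -1.0, False       # ordinary movement cost
```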
Figure 2: An example of dynamic heterogeneous abstraction refinement for a Wumpus world: the initial coarse abstraction (left) and a refined abstraction (right).
Considering Example 1, Fig. 2 (left) shows a potential initial coarse abstraction in which the domain of
each state variable (here, the horizontal and vertical locations) is split into two abstract values, and
$\bar{S}_1$ and $\bar{S}_4$ contain the pit and the goal location, respectively. As a result, when learning, the agent
will observe a high standard deviation on the values of $Q(\bar{S}_1, right)$, $Q(\bar{S}_1, down)$, $Q(\bar{S}_4, right)$,
and $Q(\bar{S}_4, down)$ because of the presence of terminal states with large negative or positive rewards.
Guided by this dispersion of Q-values, the initial coarse abstraction should be refined to resolve the
observed variations. Fig. 2 (right) exemplifies an effective abstraction refinement for Example 1,
shown as a heatmap of Q-values. Notice that the desired abstraction is a heterogeneous abstraction on
the domains of state-variable values, where the abstraction on a variable depends on the values of the
other variables: let $x$ and $y$ be the horizontal and vertical locations of the agent in Example 1,
respectively, and let their domains be $\{1,2,3,4\}$. When $y > 2$, the domain of $x$ (originally $\{1,2,3,4\}$)
is abstracted into the sets $\{1,2\}$, $\{3\}$, and $\{4\}$, but when $y \leq 2$, the domain of $x$ is abstracted into
the sets $\{1\}$, $\{2\}$, and $\{3,4\}$.
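To make this heterogeneous abstraction concrete, a hypothetical mapping function for Example 1 (our illustration, not necessarily the abstraction the algorithm would learn; the grouping of $y$ into halves is an assumption) could be written as:

```python
def phi(state):
    """Conditional abstraction for Example 1: how x is grouped depends on y."""
    x, y = state
    if y > 2:
        # Upper half: the domain of x is abstracted into {1,2}, {3}, {4}.
        x_block = {1: "12", 2: "12", 3: "3", 4: "4"}[x]
        y_block = "34"
    else:
        # Lower half: the domain of x is abstracted into {1}, {2}, {3,4}.
        x_block = {1: "1", 2: "2", 3: "34", 4: "34"}[x]
        y_block = "12"
    return (x_block, y_block)

assert phi((1, 4)) == phi((2, 4))   # grouped together in the upper half
assert phi((1, 1)) != phi((2, 1))   # but split apart in the lower half
```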
The next section (Sec. 4.2) presents our novel approach for automatically computing such abstrac-
tions while carrying out RL.
4.2 CONDITIONAL ABSTRACTION TREES
The value of a state variable $v_i$ inherently falls within a known range. Partitioning these ranges is
one way to construct state abstractions. However, in practice, the useful abstraction of one state variable
may be conditioned on specific ranges of the other state variables. Accordingly, we need to maintain
and update such conditional abstractions via structures that we call Conditional Abstraction Trees
(CATs) while constructing the state abstractions.
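A minimal sketch of such a structure is given below (the class and method names are ours, and the midpoint splitting rule is an assumption for illustration): each node stores one value range per state variable, and refining a node on a variable splits that variable's range while the other variables' ranges, i.e. the condition, are inherited unchanged.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class CATNode:
    """One node of a Conditional Abstraction Tree: a box of variable ranges."""
    ranges: Dict[str, Tuple[int, int]]            # e.g. {"x": (1, 4), "y": (1, 4)}
    children: List["CATNode"] = field(default_factory=list)

    def refine(self, var: str) -> None:
        """Split this node's range for `var` at its midpoint, keeping the
        other variables' ranges fixed (the 'condition' of the abstraction)."""
        lo, hi = self.ranges[var]
        if lo >= hi:
            return                                 # cannot split a unit range
        mid = (lo + hi) // 2
        for new_range in [(lo, mid), (mid + 1, hi)]:
            self.children.append(CATNode({**self.ranges, var: new_range}))

    def lookup(self, state: Dict[str, int]) -> "CATNode":
        """Descend to the leaf (i.e. the current abstract state) whose ranges
        contain the concrete state `state`."""
        for child in self.children:
            if all(lo <= state[v] <= hi for v, (lo, hi) in child.ranges.items()):
                return child.lookup(state)
        return self
```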
Figure 3: This figure illustrates a Conditional Abstraction Tree (CAT) for Example 1. The ranges written inside the nodes represent $\theta_i \in \Theta$. Each node represents a conditional abstraction.
Fig. 3 (right) exemplifies a partially expanded CAT for the problem in Example 1. This problem can be
represented by two state variables, i.e., the agent's horizontal and vertical locations, denoted as $x$ and
$y$ respectively. The tree's root node contains the global ranges for both of these state variables (the
first range refers to the horizontal location $x$ and the second range refers to the vertical location $y$),
representing an initial coarse abstraction (in white). The annotations visualize how this initial
abstraction can be further refined w.r.t. a state variable, resulting in new conditional abstractions
(symmetric annotations are not shown for the sake of clarity).
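Using the hypothetical CATNode sketch from earlier in this section, the root and the first refinements of the CAT in Fig. 3 could be constructed as follows (an illustration, not the authors' code):

```python
# Root: the initial coarse abstraction holding the global ranges of x and y.
root = CATNode({"x": (1, 4), "y": (1, 4)})

# Refining the root on x yields two conditional abstractions, one covering
# x in [1,2] and one covering x in [3,4], each still with y in [1,4].
root.refine("x")

# Further refining the first child on y conditions the abstraction of y on
# x being in [1,2], mirroring the annotated expansion in Fig. 3.
root.children[0].refine("y")

# A concrete state such as x=2, y=3 now maps to the abstract state (leaf)
# whose ranges are {"x": (1, 2), "y": (3, 4)}.
leaf = root.lookup({"x": 2, "y": 3})
```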