Sufficient Exploration for Convex Q-learning
Fan Lu, Prashant G. Mehta, Sean P. Meyn and Gergely Neu
October 19, 2022
Abstract
In recent years there has been a collective research effort to find new formulations of reinforcement learning that are simultaneously more efficient and more amenable to analysis. This paper concerns one approach that builds on the linear programming (LP) formulation of optimal control of Manne. A primal version is called logistic Q-learning, and a dual variant is convex Q-learning. This paper focuses on the latter, while building bridges with the former. The main contributions follow: (i) The dual of convex Q-learning is not precisely Manne's LP or a version of logistic Q-learning, but has similar structure that reveals the need for regularization to avoid over-fitting. (ii) A sufficient condition is obtained for a bounded solution to the Q-learning LP. (iii) Simulation studies reveal numerical challenges when addressing sampled-data systems based on a continuous time model. The challenge is addressed using state-dependent sampling. The theory is illustrated with applications to examples from OpenAI gym. It is shown that convex Q-learning is successful in cases where standard Q-learning diverges, such as the LQR problem.

Note: this is a slightly extended version of CDC 2022 [12], including an Appendix with proofs of the main results.
1 Introduction
Ever since the introduction of Watkins' Q-learning algorithm in the 1980s, the research community has searched for a general theory beyond the so-called tabular settings (in which the function class spans all possible functions of state and action). The natural extension of Q-learning to a general function approximation setting seeks to solve what is known as a projected Bellman equation (PBE). There are few results available giving sufficient conditions for the existence of a solution, or convergence of the algorithm if a solution does exist [24, 17, 10]. Counterexamples show that conditions on the function class are required in general, even in a linear function approximation setting [1, 25, 6]. The GQ-algorithm of [14] is one success story, based on a relaxation of the PBE.
Even if existence and stability of the algorithm were settled, we would still face the challenge of interpreting the output of a Q-learning algorithm based on the PBE criterion. Inverse dynamic programming provides bounds on performance, but only subject to a weighted $L_\infty$ bound on the Bellman error, while RL theory is largely based on $L_2$ bounds [23, 18].
SPM and FL are with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611; SPM holds an Inria International Chair, Paris, France. PGM is with the Coordinated Science Laboratory and the Department of Mechanical Science and Engineering at the University of Illinois at Urbana-Champaign (UIUC). GN is with the Department of Information and Communication Technologies, Universitat Pompeu Fabra (Barcelona, Spain). Financial support from ARO award W911NF2010055 and NSF award EPCN 1935389.
[Figure 1: three panels (CartPole, Acrobot, MountainCar) plotting reward under the policy φ^{θ_n} against episodes (×10^4), showing the median together with the 10-25, 25-75, and 75-90 percentile bands.]

Figure 1: Average cumulative reward as a function of iteration for three examples. Percentiles are estimated using independent runs.
Both logistic Q-learning [2] and convex Q-learning [16,9,11] are based on the convex analytic
approach to optimal control of [15] and its significant development over the past 50 years in the
control and operations research literature [4,22,26,5]. There is a wealth of unanswered questions:
(i) It is shown in a tabular setting that the dual of convex Q-learning is somewhat similar to
Manne’s primal LP [15], but its sample path form also brings differences.
(ii) The most basic version of convex Q-learning is a linear program (LP). It is always feasible, but boundedness has been an open topic for research (except for stylized special cases). It is shown here for the first time that boundedness holds if the covariance associated with the basis is full rank. This may sound familiar to those acquainted with the literature, but the proof is non-trivial since the theory is far from the typical $L_2$ setting of TD-learning [25, 23, 18, 3].
So far, LP formulations of reinforcement learning (RL) have been restricted to either the tabular or "linear MDP" settings [2, 9], or deterministic optimal control with general linear function approximation [11]. In this paper we focus on the latter, since the challenges in the stochastic setting would add significant complexity.
We consider a nonlinear state space model in discrete time,
$x(k+1) = F(x(k), u(k)), \qquad k \ge 0, \quad x(0) = x \in X.$   (1)
The state space $X$ is a closed subset of $\mathbb{R}^n$, and the input (or action) space $U$ is finite, with cardinality $n_U := |U|$, and where $F \colon X \times U \to X$. We may have state-dependent constraints, so that for each $x \in X$ there is a set $U(x) \subseteq U$ for which $u(k)$ is constrained to $U(x(k))$ for each $k$. Notation is simplified by denoting $\{z(k) = (x(k), u(k)) : k \ge 0\}$, evolving on $Z := \{(x, u) : x \in X, \, u \in U(x)\}$.
The paper concerns infinite-horizon optimal control, whose definition requires a cost function $c \colon Z \to \mathbb{R}_+$, and a pair $z^e := (x^e, u^e) \in Z$ that achieves equilibrium:
$x^e = F(x^e, u^e).$
The cost function $c \colon Z \to \mathbb{R}_+$ vanishes at $z^e$.
These assumptions are imposed so that there is hope that the (optimal) Q-function is finite valued:
$Q^\star(x, u) = \min_{u(1), u(2), \ldots} \, \sum_{k=0}^{\infty} c(x(k), u(k)), \qquad x(0) = x \in X, \ u(0) = u \in U(x).$   (2)
The Bellman equation may be expressed
$Q^\star(x, u) = c(x, u) + \underline{Q}^\star(F(x, u)),$   (3)
with $\underline{Q}(x) := \min_u Q(x, u)$ for any function $Q$. The optimal input is state feedback $u^\star(k) = \phi^\star(x^\star(k))$, using the "$Q^\star$-greedy" policy,
$\phi^\star(x) \in \arg\min_{u \in U(x)} Q^\star(x, u), \qquad x \in X.$   (4)
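A familiar worked instance of (3), included here only for orientation (it is not taken from this paper, and the input in this classical setting is continuous rather than finite): the deterministic linear quadratic regulator. For $x(k+1) = A x(k) + B u(k)$ with cost $c(x, u) = x^\intercal M x + u^\intercal R u$, $M \ge 0$, $R > 0$, the optimal Q-function is the quadratic
$Q^\star(x, u) = x^\intercal M x + u^\intercal R u + (Ax + Bu)^\intercal P^\star (Ax + Bu),$
where $P^\star \ge 0$ solves the discrete-time algebraic Riccati equation, and
$\underline{Q}^\star(x) = \min_u Q^\star(x, u) = x^\intercal P^\star x.$
Substituting these expressions into (3) recovers the Riccati fixed-point identity.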
Q-learning algorithms are designed to approximate $Q^\star$ within a parameterized family of functions $\{Q^\theta : \theta \in \mathbb{R}^d\}$, and based on an appropriate approximation obtain a policy in analogy with (4):
$\phi^\theta(x) \in \arg\min_{u \in U(x)} Q^\theta(x, u).$   (5)
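For a finite input space the arg-min in (5) can be evaluated by enumeration. The following is a minimal sketch, assuming a user-supplied parameterized Q-function q_theta(x, u, theta) and an admissible-input map U_of(x); both names are hypothetical, not from the paper.

```python
import numpy as np

def greedy_policy(x, theta, q_theta, U_of):
    """Evaluate the Q^theta-greedy policy (5) by enumerating the finite set U(x).

    q_theta : callable (x, u, theta) -> float, the parameterized Q-function
    U_of    : callable x -> iterable of admissible inputs U(x)
    """
    actions = list(U_of(x))
    values = [q_theta(x, u, theta) for u in actions]
    return actions[int(np.argmin(values))]  # cost is minimized, so take the arg-min
```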
For any input-state sequence $\{u(k), x(k) : k \ge 0\}$, the Bellman equation (3) implies
$Q^\star(z(k)) = c(z(k)) + \underline{Q}^\star(x(k+1)).$   (6)
This motivates the temporal difference sequence: for any $\theta$, the observed error at time $k$ is denoted
$\mathcal{D}_{k+1}(\theta) := -Q^\theta(z(k)) + c(z(k)) + \underline{Q}^\theta(x(k+1)).$   (7)
Given observations over a time horizon $0 \le k \le N$, one approach is to choose $\theta$ that minimizes the mean-square Bellman error:
$\frac{1}{N} \sum_{k=0}^{N-1} \big[ \mathcal{D}_{k+1}(\theta) \big]^2.$
The GQ-algorithm is designed to solve a similar non-convex optimization problem. There has been great success using an alternative Galerkin relaxation [18]: a sequence of $d$-dimensional eligibility vectors $\{\zeta_k\}$ is constructed, and the goal then is to solve a version of the projected Bellman equation,
$0 = \frac{1}{N} \sum_{k=0}^{N-1} \mathcal{D}_{k+1}(\theta) \, \zeta_k.$   (8)
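To make (7) and (8) concrete, the sketch below evaluates the temporal differences and the empirical Galerkin residual along a recorded trajectory, assuming a linear parameterization $Q^\theta = \theta^\intercal \psi$ as in (12) and the particular choice $\zeta_k = \psi(z(k))$; the feature map psi, cost function, and trajectory format are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def temporal_differences(theta, psi, cost, U_of, traj):
    """D_{k+1}(theta) of (7) along a trajectory of triples (x_k, u_k, x_{k+1})."""
    D = []
    for x, u, x_next in traj:
        q = theta @ psi(x, u)
        # underline{Q}^theta(x') = min over admissible inputs at x'
        q_next = min(theta @ psi(x_next, v) for v in U_of(x_next))
        D.append(-q + cost(x, u) + q_next)
    return np.array(D)

def galerkin_residual(theta, psi, cost, U_of, traj):
    """Left-hand side of the empirical projected Bellman equation (8), with zeta_k = psi(z(k))."""
    D = temporal_differences(theta, psi, cost, U_of, traj)
    Z = np.array([psi(x, u) for x, u, _ in traj])  # row k is zeta_k
    return Z.T @ D / len(traj)
```

The mean-square Bellman error displayed above is then simply np.mean(D**2) for the same array of temporal differences.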
The standard Q-learning algorithm is based on the temporal difference sequence $\mathcal{D}_{k+1} := \mathcal{D}_{k+1}(\theta_k)$ in the recursion
$\theta_{k+1} = \theta_k + \alpha_{k+1} \mathcal{D}_{k+1} \, \zeta^{\theta_k}_k,$   (9)
with $\zeta^{\theta_k}_k = \nabla_\theta Q^\theta(z(k)) \big|_{\theta = \theta_k}$, and $\{\alpha_{k+1}\}$ is a non-negative step-size sequence [23, 18]. When convergent, the limit satisfies a projected Bellman equation similar to (8):
$f(\theta) = 0, \qquad f(\theta) = \mathsf{E}\big[ \mathcal{D}_{k+1}(\theta) \, \zeta^{\theta}_k \big].$   (10)
While not obvious from its description, parameter estimates obtained from the DQN algorithm
solve the same projected Bellman equation, provided it is convergent (see [11]).
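For reference, a bare-bones rendering of the recursion (9) with the linear parameterization (12), for which $\zeta^{\theta_k}_k = \psi(z(k))$, might look as follows. The uniform exploration rule and the $1/k$ step-size are placeholder choices for illustration only.

```python
import numpy as np

def standard_q_learning(psi, cost, F, U_of, x0, d, n_steps, seed=0):
    """Recursion (9) with Q^theta(z) = theta^T psi(z) and deterministic dynamics F as in (1)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    x = x0
    for k in range(n_steps):
        actions = list(U_of(x))
        u = actions[rng.integers(len(actions))]   # placeholder: uniform exploration
        x_next = F(x, u)
        q = theta @ psi(x, u)
        q_next = min(theta @ psi(x_next, v) for v in U_of(x_next))
        D = -q + cost(x, u) + q_next              # temporal difference (7)
        alpha = 1.0 / (k + 1)                     # placeholder step-size sequence
        theta = theta + alpha * D * psi(x, u)     # zeta_k = psi(z(k)) in the linear case
        x = x_next
    return theta
```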
The main contributions and organization of the paper are summarized as follows:
(i) In TD-learning it is known that the basis must be linearly independent to obtain a unique
solution. Theory developed in Section 2.1 implies that a similar condition is both necessary and
sufficient to obtain a bounded constraint region in the convex program that defines convex Q-
learning. This result is obtained in the general setting with linear function approximation, so in
particular the state space need not be finite. The main conclusions are summarized in Thm. 2.2.
[Figure 2: evolution of the parameter estimate θ_n(2) versus n, shown for a small and a large initial condition.]

Figure 2: Evolution of one parameter using the standard Q-learning algorithm (9) in an application to the linear quadratic regulator problem.
(ii) The dual of convex Q-learning is described in Section 2.2, along with a number of consequences: Prop. 2.5 provides an interpretation of complementary slackness as an exact solution to the dynamic programming equation at selected state-action pairs. This suggests that regularization is needed to avoid over-fitting in general.

In the tabular setting, the rank condition ensuring a bounded constraint region is equivalent to full exploration of all state-input pairs; this is also a sufficient condition to ensure that convex Q-learning will compute exactly the optimal Q-function. In this special case, the dual is similar to the primal introduced by Manne [15].

(iii) Simulation studies reveal numerical challenges when addressing sampled-data systems; the challenge is addressed here using state-dependent sampling.

(iv) Theory is illustrated in Section 3 with applications to examples from OpenAI gym. Fig. 1 shows average cumulative reward as a function of training time for three examples. The algorithm is remarkably robust, and is successful in cases where standard Q-learning diverges, such as the LQR problem. An example of divergence is shown in Fig. 2. Details may be found in Section 3.
2 Convex Q-learning
Convex Q-learning is motivated by the classical LP characterization of the value function. For any $Q \colon Z \to \mathbb{R}$, and $r \in \mathbb{R}$, let $S_Q(r)$ denote the sub-level set:
$S_Q(r) = \{ z \in Z : Q(z) \le r \}.$
The function $Q$ is called inf-compact if the set $S_Q(r)$ is pre-compact for $r$ in the range of $Q$. The following may be found in [18, Ch. 4].
Proposition 2.1 Suppose that the value function $Q^\star$ defined in (2) is continuous, inf-compact, and vanishes only at $z^e$. Then, for any positive measure $\mu$ on $X \times U$, $Q^\star$ solves the following convex program:
$\max_{Q} \ \langle \mu, Q \rangle$   (11a)
$\text{s.t.} \ \ Q(z) \le c(z) + \underline{Q}(F(z)), \quad z \in Z,$   (11b)
$Q$ is continuous, and $Q(z^e) = 0$.   (11c)

The nonlinear operation that defines $\underline{Q}$ can be removed if the input space $U$ is finite, so that (11) can be represented as an LP; it is always a convex program (even if infinite dimensional) because this minimization operator is a concave functional. The LP construction is based on the inequalities in (19) below.
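As an illustration of how (11) becomes an LP, the sketch below sets it up for a finite, deterministic model with a common input set: constraint (11b) is replaced by the family of linear inequalities $Q(x,u) \le c(x,u) + Q(F(x,u), u')$ for every input $u'$ (a linearization of $\underline{Q}$, of the same type as the inequalities (19) referenced above), and the measure $\mu$ is taken uniform. This assumes every state can be driven to the equilibrium with finite cost, so that the LP is bounded; it is an illustrative construction under those assumptions, not the convex Q-learning algorithm studied later in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_q_lp(n_x, n_u, F, c, z_e):
    """Solve the LP form of (11) for a finite deterministic model.

    F   : F[x][u] -> next state (deterministic dynamics)
    c   : c[x][u] -> nonnegative cost, vanishing at z_e
    z_e : equilibrium pair (x_e, u_e) with x_e = F[x_e][u_e]
    Decision variable: Q, indexed by idx(x, u) = x * n_u + u.
    """
    idx = lambda x, u: x * n_u + u
    nvar = n_x * n_u

    # Objective: maximize <mu, Q> with mu uniform  ->  minimize -sum(Q)
    obj = -np.ones(nvar)

    # Constraints (11b), linearized: Q(x,u) - Q(F(x,u), u') <= c(x,u) for all u'
    A_ub, b_ub = [], []
    for x in range(n_x):
        for u in range(n_u):
            xn = F[x][u]
            for u2 in range(n_u):
                row = np.zeros(nvar)
                row[idx(x, u)] += 1.0
                row[idx(xn, u2)] -= 1.0
                A_ub.append(row)
                b_ub.append(c[x][u])

    # Constraint (11c): Q(z_e) = 0
    A_eq = np.zeros((1, nvar))
    A_eq[0, idx(*z_e)] = 1.0

    res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[0.0], bounds=[(None, None)] * nvar,
                  method="highs")
    return res.x.reshape(n_x, n_u) if res.success else None
```

Under these assumptions the maximizer coincides with $Q^\star$ at every state-action pair, consistent with Prop. 2.1.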
Convex Q-learning is based on an approximation of the convex program (11), seeking an approximation to $Q^\star$ among a finite-dimensional family $\{Q^\theta : \theta \in \mathbb{R}^d\}$. The value $\theta_i$ might represent the $i$th weight in a neural network function approximation architecture, but to justify the adjective convex we require a linear family:
$Q^\theta(z) = \theta^\intercal \psi(z).$   (12)
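One concrete choice of basis for (12), given here purely as an illustration (it is not the basis used in the paper's experiments), replicates a vector of state features once per input, so that $Q^\theta(x, u) = \theta^\intercal \psi(x, u)$ is linear in $\theta$:

```python
import numpy as np

def make_psi(state_features, n_u):
    """Basis for (12): psi(x, u) places the state-feature vector in the block indexed by u,
    giving d = n_u * len(state_features(x)) parameters in total.  Inputs are assumed to be
    labelled 0, ..., n_u - 1 (an assumption of this sketch)."""
    def psi(x, u):
        f = np.asarray(state_features(x), dtype=float)
        out = np.zeros(n_u * f.size)
        out[u * f.size:(u + 1) * f.size] = f
        return out
    return psi

# Hypothetical usage with quadratic state features, e.g. for an LQR-type example:
# psi = make_psi(lambda x: [1.0, x[0], x[1], x[0] ** 2, x[0] * x[1], x[1] ** 2], n_u=3)
```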