Sufficient Exploration for Convex Q-learning
Fan Lu, Prashant G. Mehta, Sean P. Meyn and Gergely Neu
October 19, 2022
Abstract
In recent years there has been a collective research effort to find new formulations of reinforcement learning that are simultaneously more efficient and more amenable to analysis. This paper concerns one approach that builds on the linear programming (LP) formulation of optimal control of Manne. A primal version is called logistic Q-learning, and a dual variant is convex Q-learning. This paper focuses on the latter, while building bridges with the former. The main contributions follow: (i) The dual of convex Q-learning is not precisely Manne's LP or a version of logistic Q-learning, but has similar structure that reveals the need for regularization to avoid over-fitting. (ii) A sufficient condition is obtained for a bounded solution to the Q-learning LP. (iii) Simulation studies reveal numerical challenges when addressing sampled-data systems based on a continuous time model. The challenge is addressed using state-dependent sampling. The theory is illustrated with applications to examples from OpenAI gym. It is shown that convex Q-learning is successful in cases where standard Q-learning diverges, such as the LQR problem.

Note: this is a slightly extended version of CDC 2022 [12], including an Appendix with proofs of the main results.
1 Introduction
Ever since the introduction of Watkins' Q-learning algorithm in the 1980s, the research community has searched for a general theory beyond the so-called tabular settings (in which the function class spans all possible functions of state and action). The natural extension of Q-learning to a general function approximation setting seeks to solve what is known as a projected Bellman equation (PBE). There are few results available giving sufficient conditions for the existence of a solution, or convergence of the algorithm if a solution does exist [24, 17, 10]. Counterexamples show that conditions on the function class are required in general, even in a linear function approximation setting [1, 25, 6]. The GQ-algorithm of [14] is one success story, based on a relaxation of the PBE.
Even if existence and stability of the algorithm were settled, we would still face the challenge of interpreting the output of a Q-learning algorithm based on the PBE criterion. Inverse dynamic programming provides bounds on performance, but only subject to a weighted $L_\infty$ bound on the Bellman error, while RL theory is largely based on $L_2$ bounds [23, 18].
SPM and FL are with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611; SPM holds an Inria International Chair, Paris, France. PGM is with the Coordinated Science Laboratory and the Department of Mechanical Science and Engineering at the University of Illinois at Urbana-Champaign (UIUC). GN is with the Department of Information and Communication Technologies, Universitat Pompeu Fabra (Barcelona, Spain). Financial support from ARO award W911NF2010055 and NSF award EPCN 1935389.
[Figure 1: three panels (CartPole, Acrobot, MountainCar) plotting reward under the policy φ^{θ_n} against episodes (×10^4), showing the median together with the 10-25, 25-75, and 75-90 percentile bands.]

Figure 1: Average cumulative reward as a function of iteration for three examples. Percentiles are estimated using independent runs.
Both logistic Q-learning [2] and convex Q-learning [16,9,11] are based on the convex analytic
approach to optimal control of [15] and its significant development over the past 50 years in the
control and operations research literature [4,22,26,5]. There is a wealth of unanswered questions:
(i) It is shown in a tabular setting that the dual of convex Q-learning is somewhat similar to
Manne’s primal LP [15], but its sample path form also brings differences.
(ii) The most basic version of convex Q-learning is a linear program (LP). It is always feasible, but boundedness has been an open topic for research (except for stylized special cases). It is shown here for the first time that boundedness holds if the covariance associated with the basis is full rank. This may sound familiar to those acquainted with the literature, but the proof is non-trivial since the theory is far from the typical $L_2$ setting of TD-learning [25, 23, 18, 3].
So far, LP formulations of reinforcement learning (RL) have been restricted to either the tabular or "linear MDP" settings [2, 9], or deterministic optimal control with general linear function approximation [11]. In this paper we focus on the latter, since the challenges in the stochastic setting would add significant complexity.
We consider a nonlinear state space model in discrete time,
$x(k+1) = F(x(k), u(k)), \qquad k \ge 0, \quad x(0) = x \in X.$   (1)
The state space $X$ is a closed subset of $\mathbb{R}^n$, and the input (or action) space $U$ is finite, with cardinality $n_U := |U|$, and where $F \colon X \times U \to X$. We may have state-dependent constraints, so that for each $x \in X$ there is a set $U(x) \subseteq U$ for which $u(k)$ is constrained to $U(x(k))$ for each $k$. Notation is simplified by denoting $\{z(k) = (x(k), u(k)) : k \ge 0\}$, evolving on $Z := \{(x, u) : x \in X, \, u \in U(x)\}$.
The paper concerns infinite-horizon optimal control, whose definition requires a cost function $c \colon Z \to \mathbb{R}_+$, and a pair $z^e := (x^e, u^e) \in Z$ that achieves equilibrium:
$x^e = F(x^e, u^e).$
The cost function $c \colon Z \to \mathbb{R}_+$ vanishes at $z^e$.
These assumptions are imposed so that there is hope that the (optimal) Q-function is finite valued:
$Q^\star(x, u) = \min_{u(1), u(2), \ldots} \, \sum_{k=0}^{\infty} c(x(k), u(k)), \qquad x(0) = x \in X, \ u(0) = u \in U(x).$   (2)
The Bellman equation may be expressed
$Q^\star(x, u) = c(x, u) + \underline{Q}^\star(F(x, u)),$   (3)
with $\underline{Q}(x) := \min_u Q(x, u)$ for any function $Q$. The optimal input is state feedback $u^\star(k) = \phi^\star(x^\star(k))$, using the "$Q^\star$-greedy" policy,
$\phi^\star(x) \in \arg\min_{u \in U(x)} Q^\star(x, u), \qquad x \in X.$   (4)
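A familiar worked instance of (3), included here only for orientation (it is not taken from this paper, and the input in this classical setting is continuous rather than finite): the deterministic linear quadratic regulator. For $x(k+1) = A x(k) + B u(k)$ with cost $c(x, u) = x^\intercal M x + u^\intercal R u$, $M \ge 0$, $R > 0$, the optimal Q-function is the quadratic
$Q^\star(x, u) = x^\intercal M x + u^\intercal R u + (Ax + Bu)^\intercal P^\star (Ax + Bu),$
where $P^\star \ge 0$ solves the discrete-time algebraic Riccati equation, and
$\underline{Q}^\star(x) = \min_u Q^\star(x, u) = x^\intercal P^\star x.$
Substituting these expressions into (3) recovers the Riccati fixed-point identity.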
Q-learning algorithms are designed to approximate $Q^\star$ within a parameterized family of functions $\{Q^\theta : \theta \in \mathbb{R}^d\}$, and based on an appropriate approximation obtain a policy in analogy with (4):
$\phi^\theta(x) \in \arg\min_{u \in U(x)} Q^\theta(x, u).$   (5)
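For a finite input space the arg-min in (5) can be evaluated by enumeration. The following is a minimal sketch, assuming a user-supplied parameterized Q-function q_theta(x, u, theta) and an admissible-input map U_of(x); both names are hypothetical, not from the paper.

```python
import numpy as np

def greedy_policy(x, theta, q_theta, U_of):
    """Evaluate the Q^theta-greedy policy (5) by enumerating the finite set U(x).

    q_theta : callable (x, u, theta) -> float, the parameterized Q-function
    U_of    : callable x -> iterable of admissible inputs U(x)
    """
    actions = list(U_of(x))
    values = [q_theta(x, u, theta) for u in actions]
    return actions[int(np.argmin(values))]  # cost is minimized, so take the arg-min
```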
For any input-state sequence $\{u(k), x(k) : k \ge 0\}$, the Bellman equation (3) implies
$Q^\star(z(k)) = c(z(k)) + \underline{Q}^\star(x(k+1)).$   (6)
This motivates the temporal difference sequence: for any $\theta$, the observed error at time $k$ is denoted
$\mathcal{D}_{k+1}(\theta) := -Q^\theta(z(k)) + c(z(k)) + \underline{Q}^\theta(x(k+1)).$   (7)
Given observations over a time horizon $0 \le k \le N$, one approach is to choose $\theta$ that minimizes the mean-square Bellman error:
$\frac{1}{N} \sum_{k=0}^{N-1} \big[ \mathcal{D}_{k+1}(\theta) \big]^2.$
The GQ-algorithm is designed to solve a similar non-convex optimization problem. There has been great success using an alternative Galerkin relaxation [18]: a sequence of $d$-dimensional eligibility vectors $\{\zeta_k\}$ is constructed, and the goal then is to solve a version of the projected Bellman equation,
$0 = \frac{1}{N} \sum_{k=0}^{N-1} \mathcal{D}_{k+1}(\theta) \, \zeta_k.$   (8)
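To make (7) and (8) concrete, the sketch below evaluates the temporal differences and the empirical Galerkin residual along a recorded trajectory, assuming a linear parameterization $Q^\theta = \theta^\intercal \psi$ as in (12) and the particular choice $\zeta_k = \psi(z(k))$; the feature map psi, cost function, and trajectory format are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def temporal_differences(theta, psi, cost, U_of, traj):
    """D_{k+1}(theta) of (7) along a trajectory of triples (x_k, u_k, x_{k+1})."""
    D = []
    for x, u, x_next in traj:
        q = theta @ psi(x, u)
        # underline{Q}^theta(x') = min over admissible inputs at x'
        q_next = min(theta @ psi(x_next, v) for v in U_of(x_next))
        D.append(-q + cost(x, u) + q_next)
    return np.array(D)

def galerkin_residual(theta, psi, cost, U_of, traj):
    """Left-hand side of the empirical projected Bellman equation (8), with zeta_k = psi(z(k))."""
    D = temporal_differences(theta, psi, cost, U_of, traj)
    Z = np.array([psi(x, u) for x, u, _ in traj])  # row k is zeta_k
    return Z.T @ D / len(traj)
```

The mean-square Bellman error displayed above is then simply np.mean(D**2) for the same array of temporal differences.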
The standard Q-learning algorithm is based on the temporal difference sequence $\mathcal{D}_{k+1} := \mathcal{D}_{k+1}(\theta_k)$ in the recursion
$\theta_{k+1} = \theta_k + \alpha_{k+1} \mathcal{D}_{k+1} \, \zeta^{\theta_k}_k,$   (9)
with $\zeta^{\theta_k}_k = \nabla_\theta Q^\theta(z(k)) \big|_{\theta = \theta_k}$, and $\{\alpha_{k+1}\}$ is a non-negative step-size sequence [23, 18]. When convergent, the limit satisfies a projected Bellman equation similar to (8):
$f(\theta) = 0, \qquad f(\theta) = \mathsf{E}\big[ \mathcal{D}_{k+1}(\theta) \, \zeta^{\theta}_k \big].$   (10)
While not obvious from its description, parameter estimates obtained from the DQN algorithm
solve the same projected Bellman equation, provided it is convergent (see [11]).
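For reference, a bare-bones rendering of the recursion (9) with the linear parameterization (12), for which $\zeta^{\theta_k}_k = \psi(z(k))$, might look as follows. The uniform exploration rule and the $1/k$ step-size are placeholder choices for illustration only.

```python
import numpy as np

def standard_q_learning(psi, cost, F, U_of, x0, d, n_steps, seed=0):
    """Recursion (9) with Q^theta(z) = theta^T psi(z) and deterministic dynamics F as in (1)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    x = x0
    for k in range(n_steps):
        actions = list(U_of(x))
        u = actions[rng.integers(len(actions))]   # placeholder: uniform exploration
        x_next = F(x, u)
        q = theta @ psi(x, u)
        q_next = min(theta @ psi(x_next, v) for v in U_of(x_next))
        D = -q + cost(x, u) + q_next              # temporal difference (7)
        alpha = 1.0 / (k + 1)                     # placeholder step-size sequence
        theta = theta + alpha * D * psi(x, u)     # zeta_k = psi(z(k)) in the linear case
        x = x_next
    return theta
```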
The main contributions and organization of the paper are summarized as follows:
(i) In TD-learning it is known that the basis must be linearly independent to obtain a unique
solution. Theory developed in Section 2.1 implies that a similar condition is both necessary and
sufficient to obtain a bounded constraint region in the convex program that defines convex Q-
learning. This result is obtained in the general setting with linear function approximation, so in
particular the state space need not be finite. The main conclusions are summarized in Thm. 2.2.
[Figure 2: evolution of the parameter estimate θ_n(2) versus n, shown for a small and a large initial condition.]

Figure 2: Evolution of one parameter using the standard Q-learning algorithm (9) in an application to the linear quadratic regulator problem.
(ii) The dual of convex Q-learning is described in Section 2.2, along with a number of consequences: Prop. 2.5 provides an interpretation of complementary slackness as an exact solution to the dynamic programming equation at selected state-action pairs. This suggests that regularization is needed to avoid over-fitting in general.

In the tabular setting, the rank condition ensuring a bounded constraint region is equivalent to full exploration of all state-input pairs; this is also a sufficient condition to ensure that convex Q-learning will compute exactly the optimal Q-function. In this special case, the dual is similar to the primal introduced by Manne [15].

(iii) Simulation studies reveal numerical challenges when addressing sampled-data systems; the challenge is addressed here using state-dependent sampling.

(iv) Theory is illustrated in Section 3 with applications to examples from OpenAI gym. Fig. 1 shows average cumulative reward as a function of training time for three examples. The algorithm is remarkably robust, and is successful in cases where standard Q-learning diverges, such as the LQR problem. An example of divergence is shown in Fig. 2. Details may be found in Section 3.
2 Convex Q-learning
Convex Q-learning is motivated by the classical LP characterization of the value function. For any $Q \colon Z \to \mathbb{R}$, and $r \in \mathbb{R}$, let $S_Q(r)$ denote the sub-level set:
$S_Q(r) = \{ z \in Z : Q(z) \le r \}.$
The function $Q$ is called inf-compact if the set $S_Q(r)$ is pre-compact for $r$ in the range of $Q$. The following may be found in [18, Ch. 4].
Proposition 2.1 Suppose that the value function $Q^\star$ defined in (2) is continuous, inf-compact, and vanishes only at $z^e$. Then, for any positive measure $\mu$ on $X \times U$, $Q^\star$ solves the following convex program:
$\max_{Q} \ \langle \mu, Q \rangle$   (11a)
$\text{s.t.} \ \ Q(z) \le c(z) + \underline{Q}(F(z)), \quad z \in Z,$   (11b)
$Q$ is continuous, and $Q(z^e) = 0$.   (11c)

The nonlinear operation that defines $\underline{Q}$ can be removed if the input space $U$ is finite, so that (11) can be represented as an LP; it is always a convex program (even if infinite dimensional) because this minimization operator is a concave functional. The LP construction is based on the inequalities in (19) below.
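As an illustration of how (11) becomes an LP, the sketch below sets it up for a finite, deterministic model with a common input set: constraint (11b) is replaced by the family of linear inequalities $Q(x,u) \le c(x,u) + Q(F(x,u), u')$ for every input $u'$ (a linearization of $\underline{Q}$, of the same type as the inequalities (19) referenced above), and the measure $\mu$ is taken uniform. This assumes every state can be driven to the equilibrium with finite cost, so that the LP is bounded; it is an illustrative construction under those assumptions, not the convex Q-learning algorithm studied later in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_q_lp(n_x, n_u, F, c, z_e):
    """Solve the LP form of (11) for a finite deterministic model.

    F   : F[x][u] -> next state (deterministic dynamics)
    c   : c[x][u] -> nonnegative cost, vanishing at z_e
    z_e : equilibrium pair (x_e, u_e) with x_e = F[x_e][u_e]
    Decision variable: Q, indexed by idx(x, u) = x * n_u + u.
    """
    idx = lambda x, u: x * n_u + u
    nvar = n_x * n_u

    # Objective: maximize <mu, Q> with mu uniform  ->  minimize -sum(Q)
    obj = -np.ones(nvar)

    # Constraints (11b), linearized: Q(x,u) - Q(F(x,u), u') <= c(x,u) for all u'
    A_ub, b_ub = [], []
    for x in range(n_x):
        for u in range(n_u):
            xn = F[x][u]
            for u2 in range(n_u):
                row = np.zeros(nvar)
                row[idx(x, u)] += 1.0
                row[idx(xn, u2)] -= 1.0
                A_ub.append(row)
                b_ub.append(c[x][u])

    # Constraint (11c): Q(z_e) = 0
    A_eq = np.zeros((1, nvar))
    A_eq[0, idx(*z_e)] = 1.0

    res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[0.0], bounds=[(None, None)] * nvar,
                  method="highs")
    return res.x.reshape(n_x, n_u) if res.success else None
```

Under these assumptions the maximizer coincides with $Q^\star$ at every state-action pair, consistent with Prop. 2.1.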
Convex Q-learning is based on an approximation of the convex program (11), seeking an approximation to $Q^\star$ among a finite-dimensional family $\{Q^\theta : \theta \in \mathbb{R}^d\}$. The value $\theta_i$ might represent the $i$th weight in a neural network function approximation architecture, but to justify the adjective convex we require a linear family:
$Q^\theta(z) = \theta^\intercal \psi(z).$   (12)
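One concrete choice of basis for (12), given here purely as an illustration (it is not the basis used in the paper's experiments), replicates a vector of state features once per input, so that $Q^\theta(x, u) = \theta^\intercal \psi(x, u)$ is linear in $\theta$:

```python
import numpy as np

def make_psi(state_features, n_u):
    """Basis for (12): psi(x, u) places the state-feature vector in the block indexed by u,
    giving d = n_u * len(state_features(x)) parameters in total.  Inputs are assumed to be
    labelled 0, ..., n_u - 1 (an assumption of this sketch)."""
    def psi(x, u):
        f = np.asarray(state_features(x), dtype=float)
        out = np.zeros(n_u * f.size)
        out[u * f.size:(u + 1) * f.size] = f
        return out
    return psi

# Hypothetical usage with quadratic state features, e.g. for an LQR-type example:
# psi = make_psi(lambda x: [1.0, x[0], x[1], x[0] ** 2, x[0] * x[1], x[1] ** 2], n_u=3)
```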