Controlling a Markov Decision Process with an
Abrupt Change in the Transition Kernel
Nathan Dahlin Subhonmesh Bose Venugopal V. Veeravalli *
October 11, 2022
Abstract
We consider the control of a Markov decision process (MDP) that undergoes an abrupt change in its transition kernel (mode). We formulate the problem of minimizing regret under control switching based on mode change detection, compared to a mode-observing controller, as an optimal stopping problem. Using a sequence of approximations, we reduce it to a quickest change detection (QCD) problem with Markovian data, for which we characterize a state-dependent threshold-type optimal change detection policy. Numerical experiments illustrate various properties of our control-switching policy.
1 Introduction
Control engineering has perfected the art of controller synthesis to optimize system costs, given a static environment, often modeled as an MDP. Practical engineering contexts are often non-stationary, i.e., the underlying dynamics of the environment, encoded in the transition kernel of the MDP, change at some point in time. A controller that is agnostic to that change can perform poorly against the changed environment. If the change point is known a priori, or is revealed when it happens, optimizing the control design becomes simple: design controllers for the two "modes" of the system and switch between the controllers when the mode change happens. We tackle the question of mode change detection, followed by switching between mode-specific controllers. Specifically, we consider the case where the underlying transition kernel of an MDP changes at a random point in time, and the change must be detected by observing the state dynamics.
The mode of the system for our problem functions essentially as a hidden state. As a result, controlling a non-stationary MDP can be formulated as a partially observed MDP (POMDP). Even when the state space is small and finite, solving a POMDP can be computationally challenging; see [Mon82] for details. The change-detection based control switching paradigm provides a computationally cheaper alternative.
*All authors are with the Department of Electrical and Computer Engineering at the University of Illinois Urbana-
Champaign, Urbana, IL 61801. Emails: {dahlin,boses,vvv}@illinois.edu. This work was partly supported by a grant from
C3.ai Digital Transformation Institute.
Detection of a change in the statistics of a stochastic process has a long history in the statistics and optimal control literature, with roots in [She25]. Quickest change detection (QCD) theory seeks a causal control policy that an observer of the stochastic process can use to detect a change. In designing such a detector, one must balance between two competing erroneous declarations. A hasty detector might declare the change too early, leading to a false alarm. With controller switching against non-stationary MDPs, an early declaration incurs the extra cost of using a possibly sub-optimal controller over the period when the controller has changed, but the transition kernel has not. A lethargic detector that declares the change too late pays the delay penalty of continuing to use the wrong controller even after the transition kernel has changed. Thus, an optimal change detection mechanism balances between the possibilities of false alarm and delay, optimizing the costs incurred in these situations. In this paper, we formalize the controllers under the changing environment in Section 2 and define the regret of change detection-based controller switching in Section 3.
When the pre- and post-change data are independent and identically distributed (i.i.d.), the optimal Bayesian change detector, with a geometric prior distribution assumed over the change point, was derived in [Shi07]. Alternate formulations without such prior distributions have been derived in [Lor71, Rit90, Pol85]. See [PT12, VB14] for detailed surveys on this history. In this paper, we assume a geometric prior on the change point. In Section 4, we simplify the regret expression for controller switching via approximations that rely on fast mixing of the Markov chains under the mode-specific controllers. Then, in Section 5, we formulate the question of optimal controller switching for minimizing the approximate regret as a QCD problem with Markovian data, and characterize an optimal change detector. Our main result (Theorem 3) proves that the optimal switching policy is threshold-type, where the thresholds depend on the observed Markov state of the system. While our goal in this paper is (approximate) regret minimization with switching between mode-specific controllers for a single change in the transition kernel of an MDP, Theorem 3 applies more broadly to general QCD problems with Markovian observations. We remark that while our results are presented with change in transition kernels, the theory can be extended to consider changes in cost models across the change point.
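As a concrete illustration of the detector structure described above, the following Python sketch updates a Bayesian change statistic (a Shiryaev-type posterior probability that the change has already occurred) recursively under a geometric prior with Markovian observations, and declares a change when the statistic crosses a threshold that may depend on the current state. This is a minimal sketch under our own assumptions, not the detector derived in the paper; the function names, the kernel array layout `P[x, u, x']`, and the `thresholds` vector are hypothetical.

```python
import numpy as np

def shiryaev_markov_update(p, x_prev, x_new, u, P1, P2, rho):
    """One recursive update of p_t = P(change has occurred by time t | observations),
    assuming a geometric(rho) prior on the change point and Markovian observations.
    P1, P2 are pre-/post-change kernels indexed as P[x, u, x']."""
    prior = p + (1.0 - p) * rho                 # probability the change has occurred by t+1
    num = prior * P2[x_prev, u, x_new]          # post-change likelihood of the new observation
    den = num + (1.0 - prior) * P1[x_prev, u, x_new]
    return num / den

def declare_change(p, x, thresholds):
    """State-dependent threshold rule: switch controllers once the posterior
    probability of change exceeds the threshold attached to the current state x."""
    return p >= thresholds[x]
```

In a run of such a detector, `p` would start at the prior probability that the change occurred before the first observation and be updated once per transition, with the applied action `u` supplied by the pre-change controller.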
In Section 6, we present numerical experiments to demonstrate the performance of the controller switching policy that we design on example non-stationary MDPs. Our results demonstrate that change-detection based controller switching performs very similarly to mode-observation based switching under a variety of parameter choices.
Perhaps the closest in spirit to our work is [BLH17], where the authors provide several interesting insights into the use of QCD for controller switching in non-stationary MDPs. Compared to [BLH17], however, our primary goal is a theoretical analysis that establishes the optimality of the state-dependent threshold-type Bayesian change detector for approximate regret minimization, and provides conditions under which such approximations are expected to be accurate. We conclude the paper in Section 7. Proofs of results are included in the appendix.
2 Formulating the Quickest Transition Kernel Change Detection Problem
Consider a controlled, non-stationary Markovian dynamical system evolving over a finite state space $\mathcal{X}$ in discrete time ($t \in \mathbb{Z}_+$) under the action of a controller choosing actions from a finite action space $\mathcal{U}$.
The system is non-stationary in the sense that the state transition kernel remains fixed up until a random time $\Gamma$, when the system changes its mode. When the mode change happens, the state evolution, i.e., the transition kernel, changes. Denote the pre-change and post-change modes by $\theta = 1$ and $\theta = 2$, respectively. The corresponding transition kernels are given by $P_1(\cdot \mid x, u)$ and $P_2(\cdot \mid x, u)$, respectively, for $x \in \mathcal{X}$, $u \in \mathcal{U}$. The non-stationary state transition kernel is then given by
$$
P_t(x_{t+1} \mid x_t, u_t) = \begin{cases} P_1(x_{t+1} \mid x_t, u_t), & \text{if } t < \Gamma, \\ P_2(x_{t+1} \mid x_t, u_t), & \text{otherwise.} \end{cases} \tag{1}
$$
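For intuition, a short rollout routine consistent with (1) might look as follows. It is an illustrative sketch, not code from the paper; the array layout of the kernels, the geometric change-point draw, and the function name are our assumptions.

```python
import numpy as np

def simulate_nonstationary_mdp(P1, P2, policy, x0, horizon, rho, rng):
    """Roll out the dynamics in (1): transitions follow P1 before the change point
    Gamma and P2 afterwards. Gamma is drawn from a geometric(rho) distribution
    supported on {1, 2, ...}; P1, P2 have shape (num_states, num_actions, num_states),
    and `policy` maps a state to an action."""
    change_point = rng.geometric(rho)           # Gamma >= 1
    states, actions = [x0], []
    x = x0
    for t in range(horizon):
        u = policy(x)
        P = P1 if t < change_point else P2      # mode switch at t = Gamma, as in (1)
        x = rng.choice(P.shape[2], p=P[x, u])
        actions.append(u)
        states.append(x)
    return np.array(states), np.array(actions), change_point
```

A call such as `simulate_nonstationary_mdp(P1, P2, lambda x: pi1[x], x0=0, horizon=100, rho=0.05, rng=np.random.default_rng(0))` produces one sample trajectory together with the realized change point.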
In general, the setting described thus far could additionally include mode-dependent stage-cost functions. For simplicity, assume that stage costs are not mode dependent and are given by the time-invariant function $c : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$. Using the notation $Y_0^t$ to denote the sequence $(Y_0, \ldots, Y_t)$ for an arbitrary variable $Y$, we define $h_t := (x_0^t, u_0^{t-1})$ as the state-action history available prior to taking an action at time $t$.
The objective is to choose a policy $\pi$ to minimize the long-term cost, i.e., for all $x \in \mathcal{X}$,
$$
J^\star(x) = \min_{\pi \in \Pi} J(x; \pi) := \min_{\pi \in \Pi} \; \mathbb{E}_{\substack{X_t \sim P_t(\cdot \mid x_{t-1}, u_{t-1}) \\ U_t \sim \pi(\cdot \mid h_t)}} \left[ \sum_{t=0}^{\infty} \gamma^t c(X_t, U_t) \,\Big|\, X_0 = x \right]. \tag{2}
$$
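Since (2) involves an infinite-horizon discounted sum, a truncated Monte Carlo estimate gives a quick sanity check of the cost attained by any fixed policy under the non-stationary dynamics. The sketch below reuses the rollout routine above; the truncation horizon and parameter names are our own choices, not part of the paper.

```python
import numpy as np

def discounted_cost_mc(P1, P2, policy, cost, x0, gamma, rho,
                       num_rollouts=1000, horizon=200, seed=0):
    """Monte Carlo estimate of the objective in (2) for a fixed policy: average the
    truncated discounted cost over independent rollouts of the non-stationary dynamics.
    `cost[x, u]` is the stage cost c(x, u); the truncation error is O(gamma**horizon)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_rollouts):
        xs, us, _ = simulate_nonstationary_mdp(P1, P2, policy, x0, horizon, rho, rng)
        total += sum(gamma**t * cost[xs[t], us[t]] for t in range(horizon))
    return total / num_rollouts
```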
As the change point $\Gamma$ is not observable, a Markovian policy may not, generally speaking, be optimal in (2). Considering $\theta_t \in \{1, 2\}$ together with $x_t$ as part of an augmented state process $s_t := (x_t, \theta_t)$, (2) can be formulated as a partially observed Markov decision process (POMDP). Direct solutions to POMDPs are computationally intensive. We propose and analyze an alternative approach: employ change detection and switch between two mode-specific controllers, i.e., optimal stationary policies for each of the system modes. We denote the mode-specific policies by $\pi_i : \mathcal{X} \to \mathcal{U}$ for $i = 1, 2$, noting that in the fixed-mode settings we may restrict to deterministic, Markovian policies without loss of optimality [Put14]. In particular, $\pi_i \in \Pi$ for $i = 1, 2$, and
$$
J(x; \pi_i, P_i) = J^\star(x; P_i) := \min_{\pi \in \Pi} J(x; \pi, P_i) := \min_{\pi \in \Pi} \; \mathbb{E}_{\substack{X_t \sim P_i(\cdot \mid x_{t-1}, \pi(x_{t-1})) \\ U_t \sim \pi(\cdot \mid h_t)}} \left[ \sum_{t=0}^{\infty} \gamma^t c(X_t, \pi(X_t)) \,\Big|\, X_0 = x \right]. \tag{3}
$$
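The mode-specific policies $\pi_1$ and $\pi_2$ in (3) are ordinary optimal policies of two fixed-mode MDPs, so they can be computed by standard value iteration. A minimal sketch follows; the implementation choices (array layout, tolerance, return values) are ours, not the authors'.

```python
import numpy as np

def mode_specific_policy(P, cost, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration for a fixed-mode MDP with kernel P (shape (S, A, S)),
    stage cost `cost` (shape (S, A)), and discount gamma. Returns a deterministic
    Markovian policy, as in (3), together with its value function."""
    num_states = P.shape[0]
    V = np.zeros(num_states)
    for _ in range(max_iter):
        Q = cost + gamma * (P @ V)              # Q[x, u] = c(x, u) + gamma * E[V(X') | x, u]
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return Q.argmin(axis=1), V                  # greedy (deterministic) policy and its value

# pi1, _ = mode_specific_policy(P1, cost, gamma)
# pi2, _ = mode_specific_policy(P2, cost, gamma)
```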
Limiting our search to policies that switch between $\pi_1$ and $\pi_2$ at a single time $t = \tau$, the more general policy design problem in (2) reduces to an optimal stopping problem, where stopping corresponds to switching from $\pi_1$ to $\pi_2$. In this context, a natural performance baseline policy is one which observes the mode directly and switches controllers at time $t = \Gamma$. Intuitively, if an algorithm can accurately detect the change point, then a switching policy employing such an algorithm should well approximate the performance achieved when mode observations are available. Our goal is to utilize tools from the QCD literature to develop a change detection algorithm for the described non-stationary MDP setting, and to characterize the regret of the resulting change-detection based controller with respect to the mode-observing baseline.
We remark that our regret analysis will not exactly compare a CD-based algorithm with the "optimal" control. To see why, note that $\pi_1$ is designed to minimize the long-run discounted cost, assuming that the system remains in mode 1 for the entire infinite horizon. In other words, the design of $\pi_1$ is oblivious to the possibility of a mode change, and hence, is not strictly optimal in our non-stationary setting, even when $\theta_t = 1$. The choice of operating with $\pi_2$, however, is optimal post-change, as we preclude the possibility of the mode switching back from $\theta = 2$ to $\theta = 1$. On the other hand, optimal policies arising from the POMDP formulation previously mentioned would take into account the potential mode change and, further, continuously adjust actions in accordance with a belief regarding the system mode. Tackling such complications is relegated to future work.
3 Dening the Regret
Let $X = \{X_t\}_{t \geq 0}$ and $X' = \{X'_t\}_{t \geq 0}$ denote the sequences of state observations generated by the change detection (CD) based and mode-observing controllers, under the same random change point $\Gamma$. Note that the state trajectories of the CD and mode-observing controllers will always agree prior to the change point or the time at which the CD controller switches to $\pi_2$, whichever is earlier.
We seek to identify a causal switching rule $\tau$ that defines an extended integer-valued random variable, adapted to the filtration $\{\sigma(X_1, \ldots, X_t)\}$ generated by the stochastic process. We will often use $\tau$ to denote the random stopping time chosen by the underlying decision rule. Note that we need not adapt $\tau$ to the sequence $U = \{U_t\}_{t \geq 0}$, as the observed states are mapped deterministically to actions by both $\pi_1$ and $\pi_2$.
Let $\Theta_t \in \{1, 2\}$ denote the mode of the system at time $t$. Further, let $D_t$ denote the switching decision at time $t$, i.e., $D_t = 1$ for $t < \tau$, and $D_t = 2$ otherwise. Then, the regret minimization problem can be written as
$$
\min_{\tau \in \mathcal{S}} \; \underbrace{\mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t(X_t, D_t, \Theta_t) \right]}_{=:\, R(\tau)}, \tag{4}
$$
where $\mathcal{S}$ is the set of all causal switching policies satisfying $\mathbb{P}(\tau < \infty) = 1$, and
$$
r_t(X_t, D_t, \Theta_t) = \begin{cases}
c(X_t, \pi_1(X_t)) - c(X'_t, \pi_2(X'_t)), & \text{if } D_t = 1,\ \Theta_t = 2, \\
c(X_t, \pi_2(X_t)) - c(X'_t, \pi_1(X'_t)), & \text{if } D_t = 2,\ \Theta_t = 1, \\
c(X_t, \pi_2(X_t)) - c(X'_t, \pi_2(X'_t)), & \text{if } D_t = 2,\ \Theta_t = 2.
\end{cases} \tag{5}
$$
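A single sample of the regret in (4)-(5) can be generated by running the CD-based and mode-observing controllers on a common change point with common random numbers, so that the two trajectories agree until the earlier of the change point and the switching time; averaging such samples estimates $R(\tau)$. The sketch below is illustrative only: the callable `tau_rule` standing in for the switching rule, the coupling via shared uniform draws, and the truncation horizon are all our assumptions.

```python
import numpy as np

def regret_rollout(P1, P2, pi1, pi2, cost, tau_rule, change_point, x0,
                   gamma, horizon, rng):
    """One truncated sample of the discounted regret in (4)-(5). The CD-based
    controller switches from pi1 to pi2 when `tau_rule(t, x)` first returns True;
    the mode-observing baseline switches exactly at the change point."""
    x, x_ref = x0, x0                            # CD-based and mode-observing trajectories
    regret, switched = 0.0, False
    for t in range(horizon):
        P = P1 if t < change_point else P2       # true (unobserved) mode dynamics
        if not switched and tau_rule(t, x):
            switched = True
        u = pi2[x] if switched else pi1[x]                       # CD-based action
        u_ref = pi2[x_ref] if t >= change_point else pi1[x_ref]  # baseline action
        d, theta = (2 if switched else 1), (2 if t >= change_point else 1)
        if (d, theta) != (1, 1):                 # r_t vanishes when D_t = 1, Theta_t = 1
            regret += gamma**t * (cost[x, u] - cost[x_ref, u_ref])
        z = rng.random()                         # common random number couples both chains
        x = int(np.searchsorted(np.cumsum(P[x, u]), z))
        x_ref = int(np.searchsorted(np.cumsum(P[x_ref, u_ref]), z))
    return regret
```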
In order to further analyze the expected regret in (4), we need to introduce additional notation. $J^m_{i|j}(Q)$ denotes the expected $m$-step horizon cost when the initial state is distributed according to $Q$ and policy $\pi_i$ is used under system mode $j$, for $i, j = 1, 2$. If the initial state $X$ is given, we write $J^m_{i|j}(\delta_X)$, where $\delta_X$ is the measure concentrated at $X$. Let $P^m_{i|j}(\delta_X)$ be the $m$-step probability distribution, given that the system started in state $X$ and policy $\pi_i$ was used in system mode $j$. Additionally, we denote the stationary state distribution arising from the use of policy $\pi_i$ in mode $j$, for $i, j = 1, 2$, as $\mu_{i|j}$, and the associated Markov chain as $M_{i|j}$. With this notation in hand, we can give expected regret terms corresponding to the false alarm and delay cases, conditioned on the history of state observations until time $t$.
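These quantities are straightforward to compute for finite chains. As an illustration (the array layout and the eigenvector-based computation are our choices, not the paper's), the sketch below forms the chain $M_{i|j}$ induced by running $\pi_i$ under kernel $P_j$, its stationary distribution, and the $m$-step cost $J^m_{i|j}(\delta_X)$.

```python
import numpy as np

def induced_chain(P_j, pi_i):
    """Transition matrix of the Markov chain M_{i|j}: policy pi_i run under kernel P_j."""
    return np.array([P_j[x, pi_i[x]] for x in range(P_j.shape[0])])

def stationary_distribution(M):
    """Stationary distribution of transition matrix M, via the left eigenvector for
    eigenvalue 1 (assumes the chain has a unique stationary law)."""
    vals, vecs = np.linalg.eig(M.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return mu / mu.sum()

def m_step_cost(P_j, pi_i, cost, x, gamma, m):
    """J^m_{i|j}(delta_x): expected m-step discounted cost from state x,
    using policy pi_i while the system is in mode j."""
    M = induced_chain(P_j, pi_i)
    c_pi = np.array([cost[s, pi_i[s]] for s in range(P_j.shape[0])])
    dist = np.zeros(P_j.shape[0])
    dist[x] = 1.0                                # delta_x
    total = 0.0
    for t in range(m):
        total += gamma**t * dist @ c_pi
        dist = dist @ M                          # dist becomes P^{t+1}_{i|j}(delta_x)
    return total
```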
For switching policies, $D_t = 2$ implies that $D_{t+k} = 2$ for all $k > 0$; that is, the decision to switch ends the initial phase during which $D_t = 1$. Thus, we can fold the remaining infinite-horizon cost-to-go into the cost of switching, and the optimal switching policy for (4) can be obtained from dynamic programming.