Controlling a Markov Decision Process with an
Abrupt Change in the Transition Kernel
Nathan Dahlin Subhonmesh Bose Venugopal V. Veeravalli *
October 11, 2022
Abstract
We consider the control of a Markov decision process (MDP) that undergoes an abrupt change in its transition kernel (mode). We formulate the problem of minimizing regret under control switching based on mode change detection, compared to a mode-observing controller, as an optimal stopping problem. Using a sequence of approximations, we reduce it to a quickest change detection (QCD) problem with Markovian data, for which we characterize a state-dependent threshold-type optimal change detection policy. Numerical experiments illustrate various properties of our control-switching policy.
1 Introduction
Control engineering has perfected the art of controller synthesis to optimize system costs, given a static environment, often modeled as an MDP. Practical engineering contexts are often non-stationary, i.e., the underlying dynamics of the environment, encoded in the transition kernel of the MDP, change at some point in time. A controller that is agnostic to that change can perform poorly against the changed environment. If the change point is known a priori, or is revealed when it happens, optimizing the control design becomes simple: design controllers for the two "modes" of the system and switch between the controllers when the mode change happens. We tackle the question of mode change detection, followed by switching between mode-specific controllers. Specifically, we consider the case where the underlying transition kernel of an MDP changes at a random point in time, and the change must be detected by observing the state dynamics.
The mode of the system for our problem functions essentially as a hidden state. As a result, controlling a non-stationary MDP can be formulated as a partially observed MDP (POMDP). Even when the state space is small and finite, solving a POMDP can be computationally challenging; see [Mon82] for details. The change-detection based control switching paradigm provides a computationally cheaper alternative.
*All authors are with the Department of Electrical and Computer Engineering at the University of Illinois Urbana-
Champaign, Urbana, IL 61801. Emails: {dahlin,boses,vvv}@illinois.edu. This work was partly supported by a grant from
C3.ai Digital Transformation Institute.
Detection of a change in the statistics of a stochastic process has a long history in the statistics and optimal control literature, with roots in [She25]. Quickest change detection (QCD) theory seeks a causal control policy that an observer of the stochastic process can use to detect a change. In designing such a detector, one must balance between two competing erroneous declarations. A hasty detector might declare the change too early, leading to a false alarm. With controller switching against non-stationary MDPs, an early declaration incurs the extra cost of using a possibly sub-optimal controller over the period when the controller has changed, but the transition kernel has not. A lethargic detector that declares the change too late pays the delay penalty of continuing to use the wrong controller even after the transition kernel has changed. Thus, an optimal change detection mechanism balances between the possibilities of false alarm and delay, optimizing the costs incurred in these situations. In this paper, we formalize the controllers under the changing environment in Section 2 and define the regret of change detection-based controller switching in Section 3.
When the pre- and post-change data are independent and identically distributed (i.i.d.), the optimal Bayesian change detector, with a geometric prior distribution assumed over the change point, was derived in [Shi07]. Alternate formulations without such prior distributions have been derived in [Lor71, Rit90, Pol85]. See [PT12, VB14] for detailed surveys on this history. In this paper, we assume a geometric prior on the change point. In Section 4, we simplify the regret expression for controller switching via approximations that rely on fast mixing of the Markov chains under the mode-specific controllers. Then, in Section 5, we formulate the question of optimal controller switching for minimizing the approximate regret as a QCD problem with Markovian data, and characterize an optimal change detector. Our main result (Theorem 3) proves that the optimal switching policy is threshold-type, where the thresholds depend on the observed Markov state of the system. While our goal in this paper is (approximate) regret minimization with switching between mode-specific controllers for a single change in the transition kernel of an MDP, Theorem 3 applies more broadly to general QCD problems with Markovian observations. We remark that while our results are presented with change in transition kernels, the theory can be extended to consider changes in cost models across the change point.
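As a concrete illustration of the detector structure described above, the following Python sketch updates a Bayesian change statistic (a Shiryaev-type posterior probability that the change has already occurred) recursively under a geometric prior with Markovian observations, and declares a change when the statistic crosses a threshold that may depend on the current state. This is a minimal sketch under our own assumptions, not the detector derived in the paper; the function names, the kernel array layout `P[x, u, x']`, and the `thresholds` vector are hypothetical.

```python
import numpy as np

def shiryaev_markov_update(p, x_prev, x_new, u, P1, P2, rho):
    """One recursive update of p_t = P(change has occurred by time t | observations),
    assuming a geometric(rho) prior on the change point and Markovian observations.
    P1, P2 are pre-/post-change kernels indexed as P[x, u, x']."""
    prior = p + (1.0 - p) * rho                 # probability the change has occurred by t+1
    num = prior * P2[x_prev, u, x_new]          # post-change likelihood of the new observation
    den = num + (1.0 - prior) * P1[x_prev, u, x_new]
    return num / den

def declare_change(p, x, thresholds):
    """State-dependent threshold rule: switch controllers once the posterior
    probability of change exceeds the threshold attached to the current state x."""
    return p >= thresholds[x]
```

In a run of such a detector, `p` would start at the prior probability that the change occurred before the first observation and be updated once per transition, with the applied action `u` supplied by the pre-change controller.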
In Section 6, we present numerical experiments to demonstrate the performance of the controller switching policy that we design on example non-stationary MDPs. Our results demonstrate that change-detection based controller switching performs very similarly to mode-observation based switching under a variety of parameter choices.
Perhaps the closest in spirit to our work is [BLH17], where the authors provide several interesting insights into the use of QCD for controller switching in non-stationary MDPs. Compared to [BLH17], however, our primary goal is a theoretical analysis that establishes the optimality of the state-dependent threshold-type Bayesian change detector for approximate regret minimization, and provides conditions under which such approximations are expected to be accurate. We conclude the paper in Section 7. Proofs of results are included in the appendix.
2 Formulating the Quickest Transition Kernel Change Detection Problem
Consider a controlled, non-stationary Markovian dynamical system evolving over a finite state space $\mathcal{X}$ in discrete time ($t \in \mathbb{Z}_+$) under the action of a controller choosing actions from a finite action space $\mathcal{U}$.
The system is non-stationary in the sense that the state transition kernel remains fixed up until a random time $\Gamma$, when the system changes its mode. When the mode change happens, the state evolution, i.e., the transition kernel, changes. Denote the pre-change and post-change modes by $\theta = 1$ and $\theta = 2$, respectively. The corresponding transition kernels are given by $P_1(\cdot \mid x, u)$ and $P_2(\cdot \mid x, u)$, respectively, for $x \in \mathcal{X}$, $u \in \mathcal{U}$. The non-stationary state transition kernel is then given by
$$
P_t(x_{t+1} \mid x_t, u_t) = \begin{cases} P_1(x_{t+1} \mid x_t, u_t), & \text{if } t < \Gamma, \\ P_2(x_{t+1} \mid x_t, u_t), & \text{otherwise.} \end{cases} \tag{1}
$$
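For intuition, a short rollout routine consistent with (1) might look as follows. It is an illustrative sketch, not code from the paper; the array layout of the kernels, the geometric change-point draw, and the function name are our assumptions.

```python
import numpy as np

def simulate_nonstationary_mdp(P1, P2, policy, x0, horizon, rho, rng):
    """Roll out the dynamics in (1): transitions follow P1 before the change point
    Gamma and P2 afterwards. Gamma is drawn from a geometric(rho) distribution
    supported on {1, 2, ...}; P1, P2 have shape (num_states, num_actions, num_states),
    and `policy` maps a state to an action."""
    change_point = rng.geometric(rho)           # Gamma >= 1
    states, actions = [x0], []
    x = x0
    for t in range(horizon):
        u = policy(x)
        P = P1 if t < change_point else P2      # mode switch at t = Gamma, as in (1)
        x = rng.choice(P.shape[2], p=P[x, u])
        actions.append(u)
        states.append(x)
    return np.array(states), np.array(actions), change_point
```

A call such as `simulate_nonstationary_mdp(P1, P2, lambda x: pi1[x], x0=0, horizon=100, rho=0.05, rng=np.random.default_rng(0))` produces one sample trajectory together with the realized change point.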
In general, the setting described thus far could additionally include mode-dependent stage-cost functions. For simplicity, assume that stage costs are not mode dependent and are given by the time-invariant function $c : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$. Using the notation $Y_0^t$ to denote the sequence $(Y_0, \ldots, Y_t)$ for an arbitrary variable $Y$, we define $h_t := (x_0^t, u_0^{t-1})$ as the state-action history available prior to taking an action at time $t$.
The objective is to choose a policy $\pi$ to minimize the long-term cost, i.e., for all $x \in \mathcal{X}$,
$$
J^\star(x) = \min_{\pi \in \Pi} J(x; \pi) := \min_{\pi \in \Pi} \; \mathbb{E}_{\substack{X_t \sim P_t(\cdot \mid x_{t-1}, u_{t-1}) \\ U_t \sim \pi(\cdot \mid h_t)}} \left[ \sum_{t=0}^{\infty} \gamma^t c(X_t, U_t) \,\Big|\, X_0 = x \right]. \tag{2}
$$
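Since (2) involves an infinite-horizon discounted sum, a truncated Monte Carlo estimate gives a quick sanity check of the cost attained by any fixed policy under the non-stationary dynamics. The sketch below reuses the rollout routine above; the truncation horizon and parameter names are our own choices, not part of the paper.

```python
import numpy as np

def discounted_cost_mc(P1, P2, policy, cost, x0, gamma, rho,
                       num_rollouts=1000, horizon=200, seed=0):
    """Monte Carlo estimate of the objective in (2) for a fixed policy: average the
    truncated discounted cost over independent rollouts of the non-stationary dynamics.
    `cost[x, u]` is the stage cost c(x, u); the truncation error is O(gamma**horizon)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_rollouts):
        xs, us, _ = simulate_nonstationary_mdp(P1, P2, policy, x0, horizon, rho, rng)
        total += sum(gamma**t * cost[xs[t], us[t]] for t in range(horizon))
    return total / num_rollouts
```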
As the change point $\Gamma$ is not observable, a Markovian policy may not, generally speaking, be optimal in (2). Considering $\theta_t \in \{1, 2\}$ together with $x_t$ as part of an augmented state process $s_t := (x_t, \theta_t)$, (2) can be formulated as a partially observed Markov decision process (POMDP). Direct solutions to POMDPs are computationally intensive. We propose and analyze an alternative approach: employ change detection and switch between two mode-specific controllers, i.e., optimal stationary policies for each of the system modes. We denote the mode-specific policies by $\pi_i : \mathcal{X} \to \mathcal{U}$ for $i = 1, 2$, noting that in the fixed-mode settings we may restrict to deterministic, Markovian policies without loss of optimality [Put14]. In particular, $\pi_i \in \Pi$ for $i = 1, 2$, and
$$
J(x; \pi_i, P_i) = J^\star(x; P_i) := \min_{\pi \in \Pi} J(x; \pi, P_i) := \min_{\pi \in \Pi} \; \mathbb{E}_{\substack{X_t \sim P_i(\cdot \mid x_{t-1}, \pi(x_{t-1})) \\ U_t \sim \pi(\cdot \mid h_t)}} \left[ \sum_{t=0}^{\infty} \gamma^t c(X_t, \pi(X_t)) \,\Big|\, X_0 = x \right]. \tag{3}
$$
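The mode-specific policies $\pi_1$ and $\pi_2$ in (3) are ordinary optimal policies of two fixed-mode MDPs, so they can be computed by standard value iteration. A minimal sketch follows; the implementation choices (array layout, tolerance, return values) are ours, not the authors'.

```python
import numpy as np

def mode_specific_policy(P, cost, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration for a fixed-mode MDP with kernel P (shape (S, A, S)),
    stage cost `cost` (shape (S, A)), and discount gamma. Returns a deterministic
    Markovian policy, as in (3), together with its value function."""
    num_states = P.shape[0]
    V = np.zeros(num_states)
    for _ in range(max_iter):
        Q = cost + gamma * (P @ V)              # Q[x, u] = c(x, u) + gamma * E[V(X') | x, u]
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return Q.argmin(axis=1), V                  # greedy (deterministic) policy and its value

# pi1, _ = mode_specific_policy(P1, cost, gamma)
# pi2, _ = mode_specific_policy(P2, cost, gamma)
```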
Limiting our search to policies that switch between $\pi_1$ and $\pi_2$ at a single time $t = \tau$, the more general policy design problem in (2) reduces to an optimal stopping problem, where stopping corresponds to switching from $\pi_1$ to $\pi_2$. In this context, a natural performance baseline policy is one which observes the mode directly and switches controllers at time $t = \Gamma$. Intuitively, if an algorithm can accurately detect the change point, then a switching policy employing such an algorithm should well approximate the performance achieved when mode observations are available. Our goal is to utilize tools from the QCD literature to develop a change detection algorithm for the described non-stationary MDP setting, and to characterize the regret of the resulting change-detection based controller with respect to the mode-observing baseline.
We remark that our regret analysis will not exactly compare a CD-based algorithm with the "optimal" control. To see why, note that $\pi_1$ is designed to minimize the long-run discounted cost, assuming that the system remains in mode 1 for the entire infinite horizon. In other words, the design of $\pi_1$ is oblivious to the possibility of a mode change, and hence, is not strictly optimal in our non-stationary setting, even when $\theta_t = 1$. The choice of operating with $\pi_2$, however, is optimal post-change, as we preclude the possibility of the mode switching back from $\theta = 2$ to $\theta = 1$. On the other hand, optimal policies arising from the POMDP formulation previously mentioned would take into account the potential mode change and, further, continuously adjust actions in accordance with a belief regarding the system mode. Tackling such complications is relegated to future work.
3 Dening the Regret
Let $X = \{X_t\}_{t \geq 0}$ and $X' = \{X'_t\}_{t \geq 0}$ denote the sequences of state observations generated by the change detection (CD) based and mode-observing controllers, under the same random change point $\Gamma$. Note that the state trajectories of the CD and mode-observing controllers will always agree prior to the change point or the time at which the CD controller switches to $\pi_2$, whichever is earlier.
We seek to identify a causal switching rule $\tau$ that defines an extended integer-valued random variable, adapted to the filtration $\{\sigma(X_1, \ldots, X_t)\}$ generated by the stochastic process. We will often use $\tau$ to denote the random stopping time chosen by the underlying decision rule. Note that we need not adapt $\tau$ to the sequence $U = \{U_t\}_{t \geq 0}$, as the observed states are mapped deterministically to actions by both $\pi_1$ and $\pi_2$.
Let $\Theta_t \in \{1, 2\}$ denote the mode of the system at time $t$. Further, let $D_t$ denote the switching decision at time $t$, i.e., $D_t = 1$ for $t < \tau$, and $D_t = 2$ otherwise. Then, the regret minimization problem can be written as
$$
\min_{\tau \in \mathcal{S}} \; \underbrace{\mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t(X_t, D_t, \Theta_t) \right]}_{=:\, R(\tau)}, \tag{4}
$$
where $\mathcal{S}$ is the set of all causal switching policies satisfying $\mathbb{P}(\tau < \infty) = 1$, and
$$
r_t(X_t, D_t, \Theta_t) = \begin{cases}
c(X_t, \pi_1(X_t)) - c(X'_t, \pi_2(X'_t)), & \text{if } D_t = 1,\ \Theta_t = 2, \\
c(X_t, \pi_2(X_t)) - c(X'_t, \pi_1(X'_t)), & \text{if } D_t = 2,\ \Theta_t = 1, \\
c(X_t, \pi_2(X_t)) - c(X'_t, \pi_2(X'_t)), & \text{if } D_t = 2,\ \Theta_t = 2.
\end{cases} \tag{5}
$$
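A single sample of the regret in (4)-(5) can be generated by running the CD-based and mode-observing controllers on a common change point with common random numbers, so that the two trajectories agree until the earlier of the change point and the switching time; averaging such samples estimates $R(\tau)$. The sketch below is illustrative only: the callable `tau_rule` standing in for the switching rule, the coupling via shared uniform draws, and the truncation horizon are all our assumptions.

```python
import numpy as np

def regret_rollout(P1, P2, pi1, pi2, cost, tau_rule, change_point, x0,
                   gamma, horizon, rng):
    """One truncated sample of the discounted regret in (4)-(5). The CD-based
    controller switches from pi1 to pi2 when `tau_rule(t, x)` first returns True;
    the mode-observing baseline switches exactly at the change point."""
    x, x_ref = x0, x0                            # CD-based and mode-observing trajectories
    regret, switched = 0.0, False
    for t in range(horizon):
        P = P1 if t < change_point else P2       # true (unobserved) mode dynamics
        if not switched and tau_rule(t, x):
            switched = True
        u = pi2[x] if switched else pi1[x]                       # CD-based action
        u_ref = pi2[x_ref] if t >= change_point else pi1[x_ref]  # baseline action
        d, theta = (2 if switched else 1), (2 if t >= change_point else 1)
        if (d, theta) != (1, 1):                 # r_t vanishes when D_t = 1, Theta_t = 1
            regret += gamma**t * (cost[x, u] - cost[x_ref, u_ref])
        z = rng.random()                         # common random number couples both chains
        x = int(np.searchsorted(np.cumsum(P[x, u]), z))
        x_ref = int(np.searchsorted(np.cumsum(P[x_ref, u_ref]), z))
    return regret
```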
In order to further analyze the expected regret in (4), we need to introduce additional notation. $J^m_{i|j}(Q)$ denotes the expected $m$-step horizon cost when the initial state is distributed according to $Q$ and policy $\pi_i$ is used under system mode $j$, for $i, j = 1, 2$. If the initial state $X$ is given, we write $J^m_{i|j}(\delta_X)$, where $\delta_X$ is the measure concentrated at $X$. Let $P^m_{i|j}(\delta_X)$ be the $m$-step probability distribution, given that the system started in state $X$ and policy $\pi_i$ was used in system mode $j$. Additionally, we denote the stationary state distribution arising from the use of policy $\pi_i$ in mode $j$, for $i, j = 1, 2$, as $\mu_{i|j}$, and the associated Markov chain as $M_{i|j}$. With this notation in hand, we can give expected regret terms corresponding to the false alarm and delay cases, conditioned on the history of state observations until time $t$.
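These quantities are straightforward to compute for finite chains. As an illustration (the array layout and the eigenvector-based computation are our choices, not the paper's), the sketch below forms the chain $M_{i|j}$ induced by running $\pi_i$ under kernel $P_j$, its stationary distribution, and the $m$-step cost $J^m_{i|j}(\delta_X)$.

```python
import numpy as np

def induced_chain(P_j, pi_i):
    """Transition matrix of the Markov chain M_{i|j}: policy pi_i run under kernel P_j."""
    return np.array([P_j[x, pi_i[x]] for x in range(P_j.shape[0])])

def stationary_distribution(M):
    """Stationary distribution of transition matrix M, via the left eigenvector for
    eigenvalue 1 (assumes the chain has a unique stationary law)."""
    vals, vecs = np.linalg.eig(M.T)
    mu = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return mu / mu.sum()

def m_step_cost(P_j, pi_i, cost, x, gamma, m):
    """J^m_{i|j}(delta_x): expected m-step discounted cost from state x,
    using policy pi_i while the system is in mode j."""
    M = induced_chain(P_j, pi_i)
    c_pi = np.array([cost[s, pi_i[s]] for s in range(P_j.shape[0])])
    dist = np.zeros(P_j.shape[0])
    dist[x] = 1.0                                # delta_x
    total = 0.0
    for t in range(m):
        total += gamma**t * dist @ c_pi
        dist = dist @ M                          # dist becomes P^{t+1}_{i|j}(delta_x)
    return total
```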
For switching policies, $D_t = 2$ implies that $D_{t+k} = 2$ for all $k > 0$; that is, the decision to switch ends the initial phase during which $D_t = 1$. Thus, we can fold the remaining infinite-horizon cost-to-go into the cost of switching, and the optimal switching policy for (4) can be obtained from dynamic programming.