Exploration Policies for On-the-Fly Controller Synthesis:
A Reinforcement Learning Approach
Tomás Delgado1, Marco Sánchez Sorondo1, Víctor Braberman1, Sebastián Uchitel1,2
1Universidad de Buenos Aires
2Imperial College
tdelgado@dc.uba.ar, msorondo@dc.uba.ar, vbraber@dc.uba.ar, suchitel@dc.uba.ar
Abstract
Controller synthesis is in essence a case of model-based planning for non-deterministic environments in which plans (actually "strategies") are meant to preserve system goals indefinitely. In the case of supervisory control, environments are specified as the parallel composition of state machines, and valid strategies are required to be "non-blocking" (i.e., always enabling the environment to reach certain marked states) in addition to safe (i.e., keeping the system within a safe zone). Recently, On-the-fly Directed Controller Synthesis techniques were proposed to avoid the exploration of the entire (and exponentially large) environment space, at the cost of non-maximal permissiveness, to either find a strategy or conclude that there is none. The incremental exploration of the plant is currently guided by a domain-independent human-designed heuristic.
In this work, we propose a new method for obtaining heuristics based on Reinforcement Learning (RL). The synthesis algorithm is thus framed as an RL task with an unbounded action space, and a modified version of DQN is used. With a simple and general set of features that abstracts both states and actions, we show that it is possible to learn heuristics on small versions of a problem that generalize to larger instances, effectively performing zero-shot policy transfer. Our agents learn from scratch in a highly partially observable RL task and, overall, outperform the existing heuristic on instances unseen during training.
1 Introduction
Reactive systems in domains such as communication networks, automated manufacturing, air traffic control, and robotics can benefit from the automated construction of correct control strategies. Discrete Event Control (Wonham and Ramadge 1987), Automated Planning (Nau, Ghallab, and Traverso 2004) and Reactive Synthesis (Pnueli and Rosner 1989) are fields that address this automated construction problem. Although they differ in representation, expressiveness, and algorithms, all three must deal with the state explosion that results from analysing a compact input description with exponentially large underlying semantics.
Recently, an increasing number of studies have aimed at relating these fields (e.g., Ehlers et al. 2017; Sardiña and D'Ippolito 2015; Camacho, Bienvenu, and McIlraith 2021; Hoffmann et al. 2020).
In this paper we study an approach to discrete event
systems (DES) control in which a plant to be controlled
is specified modularly as the parallel composition of
communicating finite state automata (Wonham and Ramadge
The aim is to build a director that is safe and
non-blocking. That is, it should guarantee that a marked state
of the plant can always be reached while ensuring that unsafe
plant states are never reached. Directors (Huang and Kumar
2008) are controllers that enable at most one controllable
event at any time, in contrast to supervisors, which are
maximally permissive.
Composing the automata of the plant can result in an
exponential state explosion. Approaches that first build the
full plant and compute a director can fail within a time and
memory budget even when there is a director that keeps the
system in a very small proportion of the full plant state space.
On-the-fly Directed Controller Synthesis (OTF-DCS) (Ciolek
et al. 2023) attempts to avoid state explosion by exploring the
composed plant incrementally, checking for the existence of
directors after each new transition is added. If guided by good
heuristics, this process allows finding controllers by building
only the parts of the plant that the controllers themselves
enable reaching. In the same work, several manually designed heuristics were proposed, and it was shown that for certain domains it is possible to solve instances that cannot be solved by an approach that fully composes the plant before synthesising.
In this work we propose replacing manually designed heuristics for OTF-DCS with an exploration policy learned via Reinforcement Learning (RL) that is efficient on large instances of a parameterized control problem despite having been trained only on small instances of that problem. Note that we cannot use RL to learn the control strategy itself, because RL cannot provide full guarantees that the controller will satisfy the safety and non-blocking properties. We use RL to learn an exploration strategy that minimizes the number of transitions added to the plant by the OTF-DCS algorithm. Additionally, note that RL has traditionally focused on tasks where agents are evaluated in the same environment in which they are trained. This is not useful in our application because we are ultimately interested in solving large instances that cannot currently be solved in reasonable amounts of time. We therefore use RL to find policies that generalize to unseen instances.
For this purpose, the OTF-DCS algorithm (and not the control problem being solved) is framed as a Markov Decision Process in which the cumulative reward is maximized by minimizing the number of explored transitions of the plant. This task can be challenging because (1) it has a sparse reward ("there is no information about whether one action sequence is better than another until it terminates", Gehring et al. (2022)), (2) the action set is large (one action per plant transition) and each action is used only once per episode, and (3) the state has a complex graph structure.
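To make this framing concrete, the sketch below casts one OTF-DCS run as an episodic RL environment. All names here (SynthesisExplorationEnv, the synthesizer object and its restart, frontier, expand_and_propagate, and initial_state_classified methods) are our own illustrative assumptions, not the paper's implementation; the reward of -1 per expansion is one simple way to make the return equal minus the number of expanded transitions.

```python
class SynthesisExplorationEnv:
    """Hypothetical episodic RL view of one OTF-DCS run (illustrative interface)."""

    def __init__(self, synthesizer):
        # `synthesizer` is assumed to expose the partial exploration structure,
        # its current frontier, and an expandAndPropagate-style update.
        self.synthesizer = synthesizer

    def reset(self):
        self.synthesizer.restart()            # empty partial exploration; frontier = transitions leaving the initial state
        return self.synthesizer.frontier()    # observation: the current candidate transitions

    def step(self, action_index):
        transition = self.synthesizer.frontier()[action_index]
        self.synthesizer.expand_and_propagate(transition)   # add one transition, re-classify states
        done = self.synthesizer.initial_state_classified()  # episode ends once the initial state is known winning or losing
        reward = -1.0                                        # each expansion costs 1, so return = -(expanded transitions)
        return self.synthesizer.frontier(), reward, done
```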
We use a modified version of DQN (Mnih et al. 2013), and we abstract both actions and states into a feature space that is shared by all instances of a parameterized control problem. This addresses challenges (2) and (3) and allows for generalization, but it makes the RL task highly partially observable due to the information lost by the feature abstraction. We propose training on small instances of a problem (partially addressing challenge (1)) for a relatively short time with a small neural network. Then, a uniform sample of the policies obtained during training is tested on slightly larger instances. The policy that shows the best generalization capabilities is selected and then fully evaluated with a large time-out.
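As an illustration of how a DQN-style agent can cope with an action set whose size changes at every step, the following sketch scores each frontier transition independently from its feature vector and chooses epsilon-greedily. The architecture and names (ActionValueNet, select_action, frontier_features) are assumptions made for illustration, not the exact model used in this work.

```python
import random
import torch
import torch.nn as nn

class ActionValueNet(nn.Module):
    """Small MLP mapping a per-transition feature vector to a scalar Q-value."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, features):               # features: (n_actions, n_features)
        return self.net(features).squeeze(-1)  # (n_actions,) Q-values

def select_action(q_net, frontier_features, epsilon):
    """Epsilon-greedy choice over however many frontier transitions exist right now."""
    if random.random() < epsilon:
        return random.randrange(frontier_features.shape[0])
    with torch.no_grad():
        q_values = q_net(frontier_features)
    return int(torch.argmax(q_values).item())
```

Because the network scores actions individually, the same weights can be applied to any instance of the domain, regardless of how many transitions its frontier contains.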
Our results show, first, that with this technique it is possible to learn heuristics that are competitive on the training instances; and second, that these policies are effective when used on larger instances. Our agents are evaluated both in terms of expanded transitions and in terms of instances solved within a time budget, and overall they outperform the best heuristic from Ciolek et al. (2023), pushing the boundaries of instances solved in several of the benchmark problems.
2 Background
2.1 Modular Directed Control
The discrete event system (DES) or plant to be controlled is defined as a tuple $E = (S_E, A_E, \rightarrow_E, \bar{s}, M_E)$, where $S_E$ is a finite set of states; $A_E$ is a finite set of event labels, partitioned as $A^C_E \cup A^U_E$, the controllable and uncontrollable events; $\rightarrow_E : S_E \times A_E \mapsto S_E$ is a partial function that encodes the transitions; $\bar{s} \in S_E$ is the initial state; and $M_E \subseteq S_E$ is a set of marked states.
This automaton defines a language $\mathcal{L}(E) \subseteq A^*_E$, where $*$ denotes the Kleene closure in the usual manner. A word $w \in A^*_E$ belongs to the language if it follows $\rightarrow_E$ with a sequence of states $\bar{s} = s_0 \ldots s_t$. In this case we note $\bar{s} \xrightarrow{w}_E s_t$.
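As a minimal illustration of these definitions, the following Python encoding (the field and function names are ours, not part of the paper) represents a DES by its partial transition function and follows a word from the initial state:

```python
from dataclasses import dataclass

@dataclass
class DES:
    states: set          # S_E
    controllable: set    # A^C_E
    uncontrollable: set  # A^U_E
    delta: dict          # partial function ->_E : (state, event) -> state
    initial: object      # the initial state
    marked: set          # M_E

    @property
    def events(self):
        """A_E, the union of controllable and uncontrollable events."""
        return self.controllable | self.uncontrollable

    def run(self, word):
        """Follow `word` from the initial state; return the reached state, or None if ->_E is undefined somewhere."""
        s = self.initial
        for event in word:
            if (s, event) not in self.delta:
                return None   # word is not in L(E)
            s = self.delta[(s, event)]
        return s
```

With this encoding, `des.run(w) is not None` corresponds to $w \in \mathcal{L}(E)$, and `des.run(w) in des.marked` corresponds to $w$ reaching a marked state.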
A function that, based on the observed behaviour of a plant, decides which controllable events are allowed is referred to as a control function. Given a DES $E$, a controller is a function $\sigma : A^*_E \mapsto \mathcal{P}(A^C_E)$. A word $w \in \mathcal{L}(E)$ belongs to $\mathcal{L}_\sigma(E)$, the language generated by $\sigma$, if each event $l_i$ is either uncontrollable or enabled by $\sigma(l_0 \ldots l_{i-1})$.
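Continuing the sketch above, a controller is simply a function from the observed trace to a set of controllable events. The helpers below are illustrative (they assume a `run` function and a `controllable` set as in the earlier DES sketch); they check membership in $\mathcal{L}_\sigma(E)$ and the director condition defined next.

```python
def in_controlled_language(run, controllable, sigma, word):
    """Check w in L_sigma(E): w must be in L(E), and every controllable event
    must be enabled by sigma on the prefix observed before it."""
    if run(word) is None:                # not in L(E)
        return False
    return all(e not in controllable or e in sigma(tuple(word[:i]))
               for i, e in enumerate(word))

def is_director(sigma, word):
    """A director enables at most one controllable event after every observed prefix."""
    return all(len(sigma(tuple(word[:i]))) <= 1 for i in range(len(word) + 1))
```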
Given a plant, we wish to find a controller that ensures that a marked state can be reached from any reachable state (even from marked states). The non-blocking property captures this idea. Formally, a controller $\sigma$ for a given DES is non-blocking if for any trace $w \in \mathcal{L}_\sigma(E)$ there is a non-empty word $w' \in A^*_E$ such that the concatenation $ww' \in \mathcal{L}_\sigma(E)$ and $\bar{s} \xrightarrow{ww'}_E s_m$ for some $s_m \in M_E$. Additionally, a controller is a director if $|\sigma(w)| \leq 1$ for all $w \in A^*_E$.
Note that with this definition a non-blocking controller
must also be safe in the sense that it cannot allow a deadlock
state to be reachable (i.e. a state with no outgoing transitions).
Unsafe states can be modelled as deadlock states.
Modular modelling of DES control problems (Ramadge
and Wonham 1989) supports describing the plant by means
of multiple deterministic automata and their synchronous
product or parallel composition.
The parallel composition ($\|$) of two DES $T$ and $Q$ yields a DES $T \| Q = (S_T \times S_Q, A_T \cup A_Q, \rightarrow_{T\|Q}, \langle \bar{t}, \bar{q} \rangle, M_T \times M_Q)$, where $A^C_{T\|Q} = A^C_T \cup A^C_Q$ and $\rightarrow_{T\|Q}$ is the smallest relation that satisfies the following rules:
(i) if $t \xrightarrow{\ell}_T t'$ and $\ell \in A_T \setminus A_Q$ then $\langle t, q \rangle \xrightarrow{\ell}_{T\|Q} \langle t', q \rangle$;
(ii) if $q \xrightarrow{\ell}_Q q'$ and $\ell \in A_Q \setminus A_T$ then $\langle t, q \rangle \xrightarrow{\ell}_{T\|Q} \langle t, q' \rangle$;
(iii) if $t \xrightarrow{\ell}_T t'$, $q \xrightarrow{\ell}_Q q'$, and $\ell \in A_T \cap A_Q$ then $\langle t, q \rangle \xrightarrow{\ell}_{T\|Q} \langle t', q' \rangle$.
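The three rules translate almost literally into code. The sketch below is our own helper (reusing the dict-based transition encoding from the earlier DES sketch) and builds only the reachable part of $\rightarrow_{T\|Q}$; the marked set $M_T \times M_Q$ and the controllable set $A^C_T \cup A^C_Q$ are omitted from the return value for brevity.

```python
from collections import deque

def parallel_composition(delta_t, events_t, init_t, delta_q, events_q, init_q):
    """Build the reachable part of ->_{T||Q} for two DES given as dicts.

    delta_t, delta_q   : partial transition functions, (state, event) -> state
    events_t, events_q : the alphabets A_T and A_Q (sets of event labels)
    Returns (delta, initial), where initial is the pair (init_t, init_q).
    """
    initial = (init_t, init_q)
    delta = {}
    queue, seen = deque([initial]), {initial}
    while queue:
        t, q = queue.popleft()
        for event in events_t | events_q:
            if event in events_t and event in events_q:
                # rule (iii): shared event, both components must move together
                if (t, event) not in delta_t or (q, event) not in delta_q:
                    continue
                succ = (delta_t[(t, event)], delta_q[(q, event)])
            elif event in events_t:
                # rule (i): event local to T, Q stays put
                if (t, event) not in delta_t:
                    continue
                succ = (delta_t[(t, event)], q)
            else:
                # rule (ii): event local to Q, T stays put
                if (q, event) not in delta_q:
                    continue
                succ = (t, delta_q[(q, event)])
            delta[((t, q), event)] = succ
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return delta, initial
```

Note that this helper only generates product states reachable from the pair of initial states, which is also the part of the plant that an on-the-fly exploration can hope to restrict itself to.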
A Modular Directed Control Problem, or simply control problem in this paper, is given by a set of deterministic automata $E = (E_1, \ldots, E_n)$. A solution to this problem is a non-blocking director for $E_1 \| \ldots \| E_n$. Additionally, the control problems that we aim to solve in this work are parametric. A control problem domain is a set of instances $\Pi = \{E_p : p \in C\}$, where each $E_p$ is a (modular) control problem and $C$ is a set of possible parameters. In our case, control problems in each domain are generated by the same specification, which takes the parameters $p$ as input and is written in the FSP language (Magee and Kramer 2014).
2.2 On-the-Fly Modular Directed Control
The Modular Directed Control Problem can be solved by fully building the composed plant and running a monolithic algorithm such as the one presented by Huang and Kumar (2008). While this quickly becomes intractable, there are problems for which the state explosion can be delayed significantly by exploring a small subset of the plant that is enough to determine a control strategy (or to conclude that there is none). The OTF-DCS algorithm (Ciolek et al. 2023) is briefly summarized in Algorithm 1. It performs a best-first search of the composed plant, adding one transition at a time from the exploration frontier to a partial exploration structure (A).
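In schematic form, the loop of Algorithm 1 can be pictured as follows. This is an illustrative Python rendering, not the actual implementation of Ciolek et al. (2023); the partial-exploration interface (frontier, expand_and_propagate, and the queries on the classification of the initial state) is assumed.

```python
def otf_dcs_loop(A, heuristic):
    """Schematic OTF-DCS: expand one frontier transition at a time, best first.

    `A` is an assumed partial exploration structure holding the explored
    transitions and the current winning (WS) / losing (LS) classification.
    `heuristic` ranks frontier transitions; this is the function we learn with RL.
    """
    while not A.initial_is_classified():            # stop once the initial state is in WS or LS
        transition = max(A.frontier(), key=heuristic)   # best-first choice of the next transition
        A.expand_and_propagate(transition)          # add it and propagate winning/losing information
    return A.extract_director() if A.initial_is_winning() else None
```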
Formally, given $E = (S_E, A_E, \rightarrow_E, \bar{s}, M_E)$, a control problem, and $h = \{a_0, \ldots, a_t\} \subseteq \rightarrow_E$, a sequence of transitions, the exploration frontier of $E$ after expanding sequence $h$ is $\mathcal{F}(E, h)$, the set of transitions $(s, \ell, s') \in (\rightarrow_E \setminus h)$ such that $s = \bar{s}$ or $(s'', \ell', s) \in h$ for some $s''$ and $\ell'$. An exploration sequence for $E$ is $\{a_0, \ldots, a_t\} \subseteq \rightarrow_E$ such that $a_i \in \mathcal{F}(E, \{a_0, \ldots, a_{i-1}\})$ for $0 \leq i \leq t$.
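The frontier is straightforward to compute from this definition. The sketch below is plain Python over the dict-based transition encoding used earlier, with the expanded sequence $h$ given as a set of triples; the function name is ours.

```python
def frontier(delta, initial, expanded):
    """F(E, h): unexpanded transitions whose source is the initial state
    or the target of an already-expanded transition.

    delta    : dict (state, event) -> state, i.e. ->_E
    expanded : set of triples (s, event, s') already added (the sequence h, as a set)
    """
    reached = {initial} | {target for (_, _, target) in expanded}
    return {(s, event, s2)
            for (s, event), s2 in delta.items()
            if (s, event, s2) not in expanded and s in reached}
```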
With each added transition, expandAndPropagate updates the classification of states into sets of losing ($LS$), winning ($WS$), or neither. We say that a state $s \in S_E$ is winning (resp. losing) in a plant $E$ if there is a (resp. there is no) solution for $E_s$, where $E_s$ is the result of changing the initial state of $E$ to $s$. Essentially, a state will be winning if it is part of a loop that has a marked state and that has no uncontrollable events that go to states outside the loop, or