For this purpose, the OTF-DCS algorithm (and not the control problem being solved) is framed as a Markov Decision Process in which maximizing reward corresponds to minimizing the number of explored transitions of the plant. This task can
be challenging because (1) it has a sparse reward (“there is no
information about whether one action sequence is better than
another until it terminates”, Gehring et al. (2022)), (2) the
action set is large (one per plant transition) and actions are
used only once per episode, and (3) the state has a complex
graph structure.
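To make this framing concrete, the sketch below shows one episode under this view. It is a minimal sketch, assuming a hypothetical `exploration` wrapper over the OTF-DCS machinery of Section 2.2; none of these names come from the paper's code.

```python
def run_episode(exploration, policy):
    """One RL episode = one run of guided OTF-DCS (hypothetical API).
    `policy` maps the current frontier to one of its transitions. The
    per-step reward of -1 makes return maximization equivalent to
    minimizing the number of explored transitions."""
    total_reward = 0
    while not exploration.done():                # solved, or proven unsolvable
        action = policy(exploration.frontier())  # pick a frontier transition
        exploration.expand(action)               # expand it in the plant
        total_reward -= 1                        # uniform cost per expansion
    return total_reward
```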
We use a modified version of DQN (Mnih et al. 2013), and we abstract both actions and states to a feature space that is shared by all instances of a parameterized control problem. This addresses challenges (2) and (3) and allows for generalization, but it makes the RL task highly partially observable due to the information lost by the feature abstraction.
We propose training on small instances of a problem (partially addressing challenge (1)) for a relatively short time with a small neural network. Then, a uniform sample of the policies obtained during training is tested on slightly larger instances. The policy that shows the best generalization capabilities is selected and then fully evaluated with a large time-out.
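The selection step can be sketched as follows; `evaluate` and all other names here are illustrative assumptions, not the paper's interface.

```python
def select_policy(checkpoints, larger_instances, evaluate, budget):
    """checkpoints: a uniform sample of the policies saved during training.
    evaluate(policy, instance, budget): hypothetical scorer returning the
    number of expanded transitions, or `budget` if the instance was not
    solved in time. The checkpoint with the lowest total cost is kept."""
    def total_cost(policy):
        return sum(evaluate(policy, inst, budget) for inst in larger_instances)
    return min(checkpoints, key=total_cost)
```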
Our results show, first, that with this technique it is possible to learn competitive heuristics on the training instances; and second, that these policies remain effective when used on larger instances. Our agents are evaluated both in terms of expanded transitions and in terms of instances solved within a time budget, and overall they outperform the best heuristic from Ciolek et al. (2023), pushing the boundary of solved instances in several of the benchmark problems.
2 Background
2.1 Modular Directed Control
The discrete event system (DES) or plant to be controlled is defined as a tuple $E = (S_E, A_E, \rightarrow_E, \bar{s}, M_E)$, where $S_E$ is a finite set of states; $A_E$ is a finite set of event labels, partitioned as $A_E^C \cup A_E^U$, the controllable and uncontrollable events; $\rightarrow_E : S_E \times A_E \mapsto S_E$ is a partial function that encodes the transitions; $\bar{s} \in S_E$ is the initial state; and $M_E \subseteq S_E$ is a set of marked states.
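As a reading aid, the tuple transcribes almost literally into code. The following Python dataclass is a sketch of the definition above, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class DES:
    """Plant E = (S_E, A_E, ->_E, s0, M_E), transcribed from the text."""
    states: set        # S_E: finite set of states
    events: set        # A_E: event labels
    controllable: set  # A_E^C; A_E^U is events - controllable
    delta: dict        # ->_E: partial function (state, event) -> state
    initial: object    # s-bar, the initial state
    marked: set        # M_E: marked states
```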
This automaton defines a language $\mathcal{L}(E) \subseteq A_E^*$, where $*$ denotes the Kleene closure in the usual manner. A word $w \in A_E^*$ belongs to the language if it follows $\rightarrow_E$ through a sequence of states $\bar{s} = s_0 \ldots s_t$. In this case we write $\bar{s} \xrightarrow{w}_E s_t$.
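Membership in $\mathcal{L}(E)$ then amounts to following the partial transition function; a small sketch over the `DES` dataclass above:

```python
def run_word(des, word):
    """Return the state sequence s0..st if `word` follows ->_E from the
    initial state (i.e. the word is in L(E)), or None otherwise."""
    state, trace = des.initial, [des.initial]
    for event in word:
        if (state, event) not in des.delta:  # ->_E is partial: move undefined
            return None
        state = des.delta[(state, event)]
        trace.append(state)
    return trace
```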
A control function is a function that, based on the observed behaviour of the plant, decides which controllable events are allowed. Given a DES $E$, a controller is a function $\sigma : A_E^* \mapsto \mathcal{P}(A_E^C)$. A word $w \in \mathcal{L}(E)$ belongs to $\mathcal{L}_\sigma(E)$, the language generated by $\sigma$, if each event $\ell_i$ is either uncontrollable or enabled by $\sigma(\ell_0 \ldots \ell_{i-1})$.
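Membership in $\mathcal{L}_\sigma(E)$ adds one check per event. A sketch reusing `run_word` above, where `sigma` maps a word (a tuple of events) to a set of controllable events:

```python
def in_controlled_language(des, sigma, word):
    """Check w in L_sigma(E): w must be in L(E), and every controllable
    event must be enabled by sigma on the prefix preceding it."""
    if run_word(des, word) is None:
        return False
    return all(event not in des.controllable
               or event in sigma(tuple(word[:i]))
               for i, event in enumerate(word))
```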
Given a plant, we wish to find a controller that ensures that a marked state can be reached from any reachable state (even from marked states). The non-blocking property captures this idea. Formally, a controller $\sigma$ for a given DES is non-blocking if for any trace $w \in \mathcal{L}_\sigma(E)$ there is a non-empty word $w' \in A_E^*$ such that the concatenation $ww' \in \mathcal{L}_\sigma(E)$ and $\bar{s} \xrightarrow{ww'}_E s_m$ for some $s_m \in M_E$. Additionally, a controller is a director if $|\sigma(w)| \leq 1$ for all $w \in A_E^*$.
Note that with this definition a non-blocking controller
must also be safe in the sense that it cannot allow a deadlock
state to be reachable (i.e. a state with no outgoing transitions).
Unsafe states can be modelled as deadlock states.
Modular modelling of DES control problems (Ramadge
and Wonham 1989) supports describing the plant by means
of multiple deterministic automata and their synchronous
product or parallel composition.
The parallel composition ($\|$) of two DES $T$ and $Q$ yields a DES $T \| Q = (S_T \times S_Q, A_T \cup A_Q, \rightarrow_{T \| Q}, \langle \bar{t}, \bar{q} \rangle, M_T \times M_Q)$, where $A_{T \| Q}^C = A_T^C \cup A_Q^C$ and $\rightarrow_{T \| Q}$ is the smallest relation that satisfies the following rules:
(i) if $t \xrightarrow{\ell}_T t'$ and $\ell \in A_T \setminus A_Q$ then $\langle t, q \rangle \xrightarrow{\ell}_{T \| Q} \langle t', q \rangle$;
(ii) if $q \xrightarrow{\ell}_Q q'$ and $\ell \in A_Q \setminus A_T$ then $\langle t, q \rangle \xrightarrow{\ell}_{T \| Q} \langle t, q' \rangle$;
(iii) if $t \xrightarrow{\ell}_T t'$, $q \xrightarrow{\ell}_Q q'$, and $\ell \in A_T \cap A_Q$ then $\langle t, q \rangle \xrightarrow{\ell}_{T \| Q} \langle t', q' \rangle$.
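Rules (i)-(iii) translate directly into code; the following sketch builds $T \| Q$ over the `DES` dataclass above (unoptimized: it enumerates all state pairs rather than only the reachable ones):

```python
def parallel(T, Q):
    """Synchronous product T || Q per rules (i)-(iii); a sketch, not the
    paper's implementation."""
    events = T.events | Q.events
    delta = {}
    for t in T.states:
        for q in Q.states:
            for ev in events:
                t2 = T.delta.get((t, ev))
                q2 = Q.delta.get((q, ev))
                if ev in T.events and ev in Q.events:    # (iii) shared: synchronize
                    if t2 is not None and q2 is not None:
                        delta[((t, q), ev)] = (t2, q2)
                elif ev in T.events and t2 is not None:  # (i) T moves alone
                    delta[((t, q), ev)] = (t2, q)
                elif ev in Q.events and q2 is not None:  # (ii) Q moves alone
                    delta[((t, q), ev)] = (t, q2)
    return DES(states={(t, q) for t in T.states for q in Q.states},
               events=events,
               controllable=T.controllable | Q.controllable,
               delta=delta,
               initial=(T.initial, Q.initial),
               marked={(t, q) for t in T.marked for q in Q.marked})
```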
A Modular Directed Control Problem, or simply control problem in this paper, is given by a set of deterministic automata $E = (E_1, \ldots, E_n)$. A solution to this problem is a non-blocking director for $E_1 \| \ldots \| E_n$. Additionally, the control problems that we aim to solve in this work are parametric. A control problem domain is a set of instances $\Pi = \{E_p : p \in \mathcal{C}\}$, where each $E_p$ is a (modular) control problem and $\mathcal{C}$ is a set of possible parameters. In our case, the control problems in each domain are generated by the same specification, which takes the parameters $p$ as input and is written in the FSP language (Magee and Kramer 2014).
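For intuition, a domain can be viewed as a plant generator indexed by the parameters. The following hypothetical one-parameter family (a capacity-$n$ buffer over the `DES` dataclass above) is for illustration only; the paper's domains are written as FSP specifications, not Python:

```python
def buffer_plant(n):
    """Hypothetical instance E_p for p = n: a capacity-n buffer with a
    controllable `put` and an uncontrollable `get`."""
    delta = {}
    for i in range(n):
        delta[(i, 'put')] = i + 1  # controllable: the controller may fill
        delta[(i + 1, 'get')] = i  # uncontrollable: the environment drains
    return DES(states=set(range(n + 1)), events={'put', 'get'},
               controllable={'put'}, delta=delta, initial=0, marked={0})

# The domain is then {buffer_plant(n) : n in C} for a parameter set C.
```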
2.2 On-the-Fly Modular Directed Control
The Modular Directed Control Problem can be solved by
fully building the composed plant and running a monolithic
algorithm such as the one presented by Huang and Kumar (2008).
While this quickly becomes intractable, there are problems
for which the state explosion can be delayed significantly
by exploring a small subset of the plant that is enough to
determine a control strategy (or to conclude that there is
none). The OTF-DCS algorithm (Ciolek et al. 2023) is briefly
summarized in Algorithm 1. It performs a best-first search of
the composed plant, adding one transition at a time from the
exploration frontier to a partial exploration structure (A).
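The shape of the loop can be sketched as follows; the callbacks are hypothetical stand-ins for Algorithm 1's components, which the paper defines precisely:

```python
import heapq

def otf_dcs(initial_frontier, expand_and_propagate, heuristic):
    """Loop skeleton only. expand_and_propagate(a) adds transition `a` to
    the partial exploration, updates the winning/losing classification
    (described below), and returns (new frontier transitions, verdict),
    where verdict is 'winning' or 'losing' once the initial state is
    classified and None otherwise. heuristic(a): lower explores first."""
    heap, tick = [], 0                        # tick breaks ties in the heap
    for a in initial_frontier:
        heap.append((heuristic(a), tick, a))
        tick += 1
    heapq.heapify(heap)
    while heap:
        _, _, a = heapq.heappop(heap)         # best-first: lowest estimate
        new_frontier, verdict = expand_and_propagate(a)
        if verdict is not None:
            return verdict                    # director found, or none exists
        for b in new_frontier:
            heapq.heappush(heap, (heuristic(b), tick, b))
            tick += 1
    return 'losing'                           # frontier exhausted
```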
Formally, given a control problem $E = (S_E, A_E, \rightarrow_E, \bar{s}, M_E)$ and a sequence of transitions $h = \{a_0, \ldots, a_t\} \subseteq \rightarrow_E$, the exploration frontier of $E$ after expanding the sequence $h$ is $\mathcal{F}(E, h)$, the set of transitions $(s, \ell, s') \in (\rightarrow_E \setminus h)$ such that $s = \bar{s}$ or $(s'', \ell', s) \in h$ for some $s''$ and $\ell'$. An exploration sequence for $E$ is $\{a_0, \ldots, a_t\} \subseteq \rightarrow_E$ such that $a_i \in \mathcal{F}(E, \{a_0, \ldots, a_{i-1}\})$ for $0 \leq i \leq t$.
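This definition also translates directly; a sketch over the `DES` dataclass above, with $h$ given as a set of $(s, \ell, s')$ triples:

```python
def frontier(des, h):
    """F(E, h): transitions of ->_E not in h whose source is the initial
    state or the target of some already-expanded transition in h."""
    reached = {des.initial} | {target for (_, _, target) in h}
    return {(s, ev, s2)
            for (s, ev), s2 in des.delta.items()
            if s in reached and (s, ev, s2) not in h}
```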
With each added transition, expandAndPropagate updates the classification of states into the sets of losing ($LS$) and winning ($WS$) states, or neither. We say that a state $s \in S_E$ is winning (resp. losing) in a plant $E$ if there is a (resp. there is no) solution for $E_s$, where $E_s$ is the result of changing the initial state of $E$ to $s$. Essentially, a state will be winning if it is part of a loop that contains a marked state and has no uncontrollable events leading to states outside the loop, or