Exploration Policies for On-the-Fly Controller Synthesis:
A Reinforcement Learning Approach
Tomás Delgado1, Marco Sánchez Sorondo1, Víctor Braberman1, Sebastián Uchitel1,2
1Universidad de Buenos Aires
2Imperial College
tdelgado@dc.uba.ar, msorondo@dc.uba.ar, vbraber@dc.uba.ar, suchitel@dc.uba.ar
Abstract
Controller synthesis is in essence a case of model-based planning for non-deterministic environments in which plans (actually "strategies") are meant to preserve system goals indefinitely. In the case of supervisory control, environments are specified as the parallel composition of state machines, and valid strategies are required to be "non-blocking" (i.e., always enabling the environment to reach certain marked states) in addition to safe (i.e., keeping the system within a safe zone). Recently, On-the-fly Directed Controller Synthesis techniques were proposed to avoid the exploration of the entire (and exponentially large) environment space, at the cost of non-maximal permissiveness, to either find a strategy or conclude that there is none. The incremental exploration of the plant is currently guided by a domain-independent human-designed heuristic.
In this work, we propose a new method for obtaining heuristics based on Reinforcement Learning (RL). The synthesis algorithm is thus framed as an RL task with an unbounded action space, and a modified version of DQN is used. With a simple and general set of features that abstracts both states and actions, we show that it is possible to learn heuristics on small versions of a problem that generalize to larger instances, effectively performing zero-shot policy transfer. Our agents learn from scratch in a highly partially observable RL task and, overall, outperform the existing heuristic on instances unseen during training.
1 Introduction
Reactive systems in domains such as communication networks, automated manufacturing, air traffic control, and robotics can benefit from the automated construction of correct control strategies. Discrete Event Control (Wonham and Ramadge 1987), Automated Planning (Nau, Ghallab, and Traverso 2004) and Reactive Synthesis (Pnueli and Rosner 1989) are fields that address this automated construction problem. Although they differ in representation, expressiveness, and algorithms, all three must deal with the state explosion that results from analysing a compact input description with exponentially large underlying semantics.
Recently, an increasing number of studies have aimed at relating these fields (e.g., Ehlers et al. 2017; Sardiña and D'Ippolito 2015; Camacho, Bienvenu, and McIlraith 2021; Hoffmann et al. 2020).
In this paper we study an approach to discrete event
systems (DES) control in which a plant to be controlled
is specified modularly as the parallel composition of
communicating finite state automata (Wonham and Ramadge
The aim is to build a director that is safe and
non-blocking. That is, it should guarantee that a marked state
of the plant can always be reached while ensuring that unsafe
plant states are never reached. Directors (Huang and Kumar
2008) are controllers that enable at most one controllable
event at any time, in contrast to supervisors, which are
maximally permissive.
Composing the automata of the plant can result in an
exponential state explosion. Approaches that first build the
full plant and compute a director can fail within a time and
memory budget even when there is a director that keeps the
system in a very small proportion of the full plant state space.
On-the-fly Directed Controller Synthesis (OTF-DCS) (Ciolek
et al. 2023) attempts to avoid state explosion by exploring the
composed plant incrementally, checking for the existence of
directors after each new transition is added. If guided by good
heuristics, this process allows finding controllers by building
only the parts of the plant that the controllers themselves
enable reaching. In the same work, several manually designed heuristics were proposed, and it was shown that for certain domains it is possible to solve instances that cannot be solved by an approach that fully composes the plant before synthesising.
In this work we propose replacing manually designed heuristics for OTF-DCS with an exploration policy learned via Reinforcement Learning (RL) that is efficient on large instances of a parameterized control problem despite having been trained only on small instances of that problem. Note that we cannot use RL to learn the control strategy itself, because RL cannot provide full guarantees that the controller will satisfy the safety and non-blocking properties. We use RL to learn an exploration strategy that minimizes the number of transitions added to the plant by the OTF-DCS algorithm. Additionally, note that RL has traditionally focused on tasks where agents are evaluated in the same environment in which they are trained. This is not useful in our application because we are ultimately interested in solving large instances that cannot currently be solved in reasonable amounts of time. We therefore use RL to find policies that generalize to unseen instances.
For this purpose, the OTF-DCS algorithm (and not the control problem being solved) is framed as a Markov Decision Process in which the cumulative reward is maximized by minimizing the number of explored transitions of the plant. This task can be challenging because (1) it has a sparse reward ("there is no information about whether one action sequence is better than another until it terminates", Gehring et al. (2022)), (2) the action set is large (one action per plant transition) and each action is used only once per episode, and (3) the state has a complex graph structure.
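To make this framing concrete, the sketch below casts one OTF-DCS run as an episodic RL environment. All names here (SynthesisExplorationEnv, the synthesizer object and its restart, frontier, expand_and_propagate, and initial_state_classified methods) are our own illustrative assumptions, not the paper's implementation; the reward of -1 per expansion is one simple way to make the return equal minus the number of expanded transitions.

```python
class SynthesisExplorationEnv:
    """Hypothetical episodic RL view of one OTF-DCS run (illustrative interface)."""

    def __init__(self, synthesizer):
        # `synthesizer` is assumed to expose the partial exploration structure,
        # its current frontier, and an expandAndPropagate-style update.
        self.synthesizer = synthesizer

    def reset(self):
        self.synthesizer.restart()            # empty partial exploration; frontier = transitions leaving the initial state
        return self.synthesizer.frontier()    # observation: the current candidate transitions

    def step(self, action_index):
        transition = self.synthesizer.frontier()[action_index]
        self.synthesizer.expand_and_propagate(transition)   # add one transition, re-classify states
        done = self.synthesizer.initial_state_classified()  # episode ends once the initial state is known winning or losing
        reward = -1.0                                        # each expansion costs 1, so return = -(expanded transitions)
        return self.synthesizer.frontier(), reward, done
```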
We use a modified version of DQN (Mnih et al. 2013), and we abstract both actions and states into a feature space that is shared by all instances of a parameterized control problem. This addresses challenges (2) and (3) and allows for generalization, but it makes the RL task highly partially observable due to the information lost by the feature abstraction. We propose training on small instances of a problem (partially addressing challenge (1)) for a relatively short time with a small neural network. Then, a uniform sample of the policies obtained during training is tested on slightly larger instances. The policy that shows the best generalization capabilities is selected and then fully evaluated with a large time-out.
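As an illustration of how a DQN-style agent can cope with an action set whose size changes at every step, the following sketch scores each frontier transition independently from its feature vector and chooses epsilon-greedily. The architecture and names (ActionValueNet, select_action, frontier_features) are assumptions made for illustration, not the exact model used in this work.

```python
import random
import torch
import torch.nn as nn

class ActionValueNet(nn.Module):
    """Small MLP mapping a per-transition feature vector to a scalar Q-value."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, features):               # features: (n_actions, n_features)
        return self.net(features).squeeze(-1)  # (n_actions,) Q-values

def select_action(q_net, frontier_features, epsilon):
    """Epsilon-greedy choice over however many frontier transitions exist right now."""
    if random.random() < epsilon:
        return random.randrange(frontier_features.shape[0])
    with torch.no_grad():
        q_values = q_net(frontier_features)
    return int(torch.argmax(q_values).item())
```

Because the network scores actions individually, the same weights can be applied to any instance of the domain, regardless of how many transitions its frontier contains.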
Our results show, first, that with this technique it is possible to learn heuristics that are competitive on the training instances; and second, that these policies are effective when used on larger instances. Our agents are evaluated both in terms of expanded transitions and in terms of instances solved within a time budget, and overall they outperform the best heuristic from Ciolek et al. (2023), pushing the boundaries of instances solved in several of the benchmark problems.
2 Background
2.1 Modular Directed Control
The discrete event system (DES) or plant to be controlled is defined as a tuple $E = (S_E, A_E, \rightarrow_E, \bar{s}, M_E)$, where $S_E$ is a finite set of states; $A_E$ is a finite set of event labels, partitioned as $A^C_E \cup A^U_E$, the controllable and uncontrollable events; $\rightarrow_E : S_E \times A_E \mapsto S_E$ is a partial function that encodes the transitions; $\bar{s} \in S_E$ is the initial state; and $M_E \subseteq S_E$ is a set of marked states.
This automaton defines a language $\mathcal{L}(E) \subseteq A^*_E$, where $*$ denotes the Kleene closure in the usual manner. A word $w \in A^*_E$ belongs to the language if it follows $\rightarrow_E$ with a sequence of states $\bar{s} = s_0 \ldots s_t$. In this case we note $\bar{s} \xrightarrow{w}_E s_t$.
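As a minimal illustration of these definitions, the following Python encoding (the field and function names are ours, not part of the paper) represents a DES by its partial transition function and follows a word from the initial state:

```python
from dataclasses import dataclass

@dataclass
class DES:
    states: set          # S_E
    controllable: set    # A^C_E
    uncontrollable: set  # A^U_E
    delta: dict          # partial function ->_E : (state, event) -> state
    initial: object      # the initial state
    marked: set          # M_E

    @property
    def events(self):
        """A_E, the union of controllable and uncontrollable events."""
        return self.controllable | self.uncontrollable

    def run(self, word):
        """Follow `word` from the initial state; return the reached state, or None if ->_E is undefined somewhere."""
        s = self.initial
        for event in word:
            if (s, event) not in self.delta:
                return None   # word is not in L(E)
            s = self.delta[(s, event)]
        return s
```

With this encoding, `des.run(w) is not None` corresponds to $w \in \mathcal{L}(E)$, and `des.run(w) in des.marked` corresponds to $w$ reaching a marked state.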
A function that, based on the observed behaviour of a plant, decides which controllable events are allowed is referred to as a control function. Given a DES $E$, a controller is a function $\sigma : A^*_E \mapsto \mathcal{P}(A^C_E)$. A word $w \in \mathcal{L}(E)$ belongs to $\mathcal{L}_\sigma(E)$, the language generated by $\sigma$, if each event $l_i$ is either uncontrollable or enabled by $\sigma(l_0 \ldots l_{i-1})$.
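Continuing the sketch above, a controller is simply a function from the observed trace to a set of controllable events. The helpers below are illustrative (they assume a `run` function and a `controllable` set as in the earlier DES sketch); they check membership in $\mathcal{L}_\sigma(E)$ and the director condition defined next.

```python
def in_controlled_language(run, controllable, sigma, word):
    """Check w in L_sigma(E): w must be in L(E), and every controllable event
    must be enabled by sigma on the prefix observed before it."""
    if run(word) is None:                # not in L(E)
        return False
    return all(e not in controllable or e in sigma(tuple(word[:i]))
               for i, e in enumerate(word))

def is_director(sigma, word):
    """A director enables at most one controllable event after every observed prefix."""
    return all(len(sigma(tuple(word[:i]))) <= 1 for i in range(len(word) + 1))
```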
Given a plant, we wish to find a controller that ensures that a marked state can be reached from any reachable state (even from marked states). The non-blocking property captures this idea. Formally, a controller $\sigma$ for a given DES is non-blocking if for any trace $w \in \mathcal{L}_\sigma(E)$ there is a non-empty word $w' \in A^*_E$ such that the concatenation $ww' \in \mathcal{L}_\sigma(E)$ and $\bar{s} \xrightarrow{ww'}_E s_m$ for some $s_m \in M_E$. Additionally, a controller is a director if $|\sigma(w)| \leq 1$ for all $w \in A^*_E$.
Note that with this definition a non-blocking controller
must also be safe in the sense that it cannot allow a deadlock
state to be reachable (i.e. a state with no outgoing transitions).
Unsafe states can be modelled as deadlock states.
Modular modelling of DES control problems (Ramadge
and Wonham 1989) supports describing the plant by means
of multiple deterministic automata and their synchronous
product or parallel composition.
The parallel composition ($\|$) of two DES $T$ and $Q$ yields a DES $T \| Q = (S_T \times S_Q, A_T \cup A_Q, \rightarrow_{T\|Q}, \langle \bar{t}, \bar{q} \rangle, M_T \times M_Q)$, where $A^C_{T\|Q} = A^C_T \cup A^C_Q$ and $\rightarrow_{T\|Q}$ is the smallest relation that satisfies the following rules:
(i) if $t \xrightarrow{\ell}_T t'$ and $\ell \in A_T \setminus A_Q$ then $\langle t, q \rangle \xrightarrow{\ell}_{T\|Q} \langle t', q \rangle$;
(ii) if $q \xrightarrow{\ell}_Q q'$ and $\ell \in A_Q \setminus A_T$ then $\langle t, q \rangle \xrightarrow{\ell}_{T\|Q} \langle t, q' \rangle$;
(iii) if $t \xrightarrow{\ell}_T t'$, $q \xrightarrow{\ell}_Q q'$, and $\ell \in A_T \cap A_Q$ then $\langle t, q \rangle \xrightarrow{\ell}_{T\|Q} \langle t', q' \rangle$.
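The three rules translate almost literally into code. The sketch below is our own helper (reusing the dict-based transition encoding from the earlier DES sketch) and builds only the reachable part of $\rightarrow_{T\|Q}$; the marked set $M_T \times M_Q$ and the controllable set $A^C_T \cup A^C_Q$ are omitted from the return value for brevity.

```python
from collections import deque

def parallel_composition(delta_t, events_t, init_t, delta_q, events_q, init_q):
    """Build the reachable part of ->_{T||Q} for two DES given as dicts.

    delta_t, delta_q   : partial transition functions, (state, event) -> state
    events_t, events_q : the alphabets A_T and A_Q (sets of event labels)
    Returns (delta, initial), where initial is the pair (init_t, init_q).
    """
    initial = (init_t, init_q)
    delta = {}
    queue, seen = deque([initial]), {initial}
    while queue:
        t, q = queue.popleft()
        for event in events_t | events_q:
            if event in events_t and event in events_q:
                # rule (iii): shared event, both components must move together
                if (t, event) not in delta_t or (q, event) not in delta_q:
                    continue
                succ = (delta_t[(t, event)], delta_q[(q, event)])
            elif event in events_t:
                # rule (i): event local to T, Q stays put
                if (t, event) not in delta_t:
                    continue
                succ = (delta_t[(t, event)], q)
            else:
                # rule (ii): event local to Q, T stays put
                if (q, event) not in delta_q:
                    continue
                succ = (t, delta_q[(q, event)])
            delta[((t, q), event)] = succ
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return delta, initial
```

Note that this helper only generates product states reachable from the pair of initial states, which is also the part of the plant that an on-the-fly exploration can hope to restrict itself to.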
A Modular Directed Control Problem, or simply control problem in this paper, is given by a set of deterministic automata $E = (E_1, \ldots, E_n)$. A solution to this problem is a non-blocking director for $E_1 \| \ldots \| E_n$. Additionally, the control problems that we aim to solve in this work are parametric. A control problem domain is a set of instances $\Pi = \{E_p : p \in C\}$, where each $E_p$ is a (modular) control problem and $C$ is a set of possible parameters. In our case, control problems in each domain are generated by the same specification, which takes the parameters $p$ as input and is written in the FSP language (Magee and Kramer 2014).
2.2 On-the-Fly Modular Directed Control
The Modular Directed Control Problem can be solved by fully building the composed plant and running a monolithic algorithm such as the one presented by Huang and Kumar (2008). While this quickly becomes intractable, there are problems for which the state explosion can be delayed significantly by exploring a small subset of the plant that is enough to determine a control strategy (or to conclude that there is none). The OTF-DCS algorithm (Ciolek et al. 2023) is briefly summarized in Algorithm 1. It performs a best-first search of the composed plant, adding one transition at a time from the exploration frontier to a partial exploration structure (A).
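In schematic form, the loop of Algorithm 1 can be pictured as follows. This is an illustrative Python rendering, not the actual implementation of Ciolek et al. (2023); the partial-exploration interface (frontier, expand_and_propagate, and the queries on the classification of the initial state) is assumed.

```python
def otf_dcs_loop(A, heuristic):
    """Schematic OTF-DCS: expand one frontier transition at a time, best first.

    `A` is an assumed partial exploration structure holding the explored
    transitions and the current winning (WS) / losing (LS) classification.
    `heuristic` ranks frontier transitions; this is the function we learn with RL.
    """
    while not A.initial_is_classified():            # stop once the initial state is in WS or LS
        transition = max(A.frontier(), key=heuristic)   # best-first choice of the next transition
        A.expand_and_propagate(transition)          # add it and propagate winning/losing information
    return A.extract_director() if A.initial_is_winning() else None
```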
Formally, given $E = (S_E, A_E, \rightarrow_E, \bar{s}, M_E)$, a control problem, and $h = \{a_0, \ldots, a_t\} \subseteq \rightarrow_E$, a sequence of transitions, the exploration frontier of $E$ after expanding sequence $h$ is $\mathcal{F}(E, h)$, the set of transitions $(s, \ell, s') \in (\rightarrow_E \setminus h)$ such that $s = \bar{s}$ or $(s'', \ell', s) \in h$ for some $s''$ and $\ell'$. An exploration sequence for $E$ is $\{a_0, \ldots, a_t\} \subseteq \rightarrow_E$ such that $a_i \in \mathcal{F}(E, \{a_0, \ldots, a_{i-1}\})$ for $0 \leq i \leq t$.
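The frontier is straightforward to compute from this definition. The sketch below is plain Python over the dict-based transition encoding used earlier, with the expanded sequence $h$ given as a set of triples; the function name is ours.

```python
def frontier(delta, initial, expanded):
    """F(E, h): unexpanded transitions whose source is the initial state
    or the target of an already-expanded transition.

    delta    : dict (state, event) -> state, i.e. ->_E
    expanded : set of triples (s, event, s') already added (the sequence h, as a set)
    """
    reached = {initial} | {target for (_, _, target) in expanded}
    return {(s, event, s2)
            for (s, event), s2 in delta.items()
            if (s, event, s2) not in expanded and s in reached}
```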
With each added transition, expandAndPropagate updates the classification of states into sets of losing ($LS$), winning ($WS$), or neither. We say that a state $s \in S_E$ is winning (resp. losing) in a plant $E$ if there is a (resp. there is no) solution for $E_s$, where $E_s$ is the result of changing the initial state of $E$ to $s$. Essentially, a state will be winning if it is part of a loop that has a marked state and that has no uncontrollable events that go to states outside the loop, or