Reinforcement learning approach for multi-agent
flexible scheduling problems
Hongjian Zhou1, Boyang Gu2 and Chenghao Jin3
1University of Oxford, Oxford OX1 2JD, United Kingdom
2Imperial College London, London SW7 2BX, United Kingdom
E-mail: 1hongjian.zhou@exeter.ox.ac.uk, 2boyang.gu19@imperial.ac.uk,
3steven jin gbhg@foxmail.com
Abstract. Scheduling plays an important role in automated production. Its impact can
be found in various fields such as the manufacturing industry, the service industry and the
technology industry. A scheduling problem (NP-hard) is the task of finding a sequence of job
assignments on a given set of machines with the goal of optimizing a defined objective.
Methods such as Operations Research, Dispatching Rules, and Combinatorial Optimization have
been applied to scheduling problems, but none guarantees finding the optimal solution.
The recent development of Reinforcement Learning has shown success in sequential decision-
making problems. This research presents a Reinforcement Learning approach for scheduling
problems. In particular, this study delivers an OpenAI gym environment with search-space
reduction for JSSP and provides a heuristic-guided Q-Learning solution with state-of-the-art
performance for Multi-agent Flexible Job Shop Problems.
Key words. scheduling, job shop, flexible job shop, reinforcement learning, heuristic-guided
reinforcement learning, Q-learning, multi-agent system, optimization
1. Introduction
Production scheduling is defined as the optimization of the allocation of resources over time
among activities performed by machines. It is crucial to real-world production since resources
are always limited and must be scheduled for optimal outcomes. In this article, we mainly discuss
the Job Shop Scheduling Problem (JSSP), a very common application in production scheduling.
Recently, many approaches have been developed for the JSSP. In order to study them, we propose an
environment which can be used to test different algorithms and can be customized to handle
other variations of the JSSP. We also present a new Q-learning algorithm which outperforms most
traditional approaches.
The article is organized as follows: Section 2 gives a literature review of the related works
of the scheduling problem, including the common models, related problems, and solutions.
Section 3 reviews the background knowledge needed to understand the JSSP model. Section 4
provides a detailed explanation of our environment designed for the JSSP. Section 5 explains
our improved Q-learning algorithm. Section 6 analyzes and evaluates the performance of our
algorithm. Section 7 concludes the article and discusses possible future work.
2. Related Work
One classical model of scheduling problems is the Markov Decision Process (MDP), since most
scheduling problems base their decisions on the current state rather than on the entire history.
Of course, mathematically most scheduling problems are not Markovian, so the ordinary MDP model
requires some adjustments. The Generalized Semi-Markov Decision Process (GSMDP) is one suitable
adjusted model [1]: it allows an enabled event to trigger at any time and optimizes the makespan.
When stochasticity is added to the model, the GSMDP can use continuous phase-type distributions
to simulate durations, such as the duration of a task [2].
Before the introduction of reinforcement learning (RL) to scheduling problems, many
traditional approaches were developed. One famous scheduling problem is the Traveling
Salesman Problem (TSP), for which the Lin-Kernighan-Helsgaun (LKH) solver was once the
state-of-the-art heuristic [3]. The LKH solver transforms asymmetric problems into symmetric
ones and uses the number of violated constraints to scale the penalty. Another popular problem
related to scheduling is the Resource-Constrained Project Scheduling Problem (RCPSP). Kolisch
extended the precedence-based minimum slack priority rule (MSLK) to a precedence- and
resource-based slack priority rule, replacing RSM, to solve the RCPSP [4]. If we relax the
resource constraint and add stochasticity to the activities themselves, we get the Stochastic
Scheduling Problem (SSP). There are many non-RL approaches to it. For example, the Rollout
algorithm is used to solve the Quiz Problem, a variation of the SSP [5]. There are also many
heuristic approaches that capture different aspects of the stochastic features, which allows
a computationally tractable closed-loop algorithm to be used to solve the SSP [6].
Another well-known area of scheduling problems is the Job Shop Scheduling Problem (JSSP),
which we discuss further in Section 4. Miloš introduced a classical model of it, including
job-based, operation-based, and disjunctive graph-based representations [7], though there are
in fact many variations of the JSSP. According to the number of operations, the number of
available machines for each operation, and whether stochasticity is introduced to the problem,
we have the Parallel Machines Job Shop Scheduling Problem (JSSP-PM), Flexible Job Shop
Scheduling (FJSSP), and Stochastic Flexible Job Shop Scheduling (SFJSSP) [8]. Wittrock
introduced a non-RL algorithm for a variation of the FJSSP, which decomposes the problem into
three sub-problems (machine allocation, order sequencing, and execution time), each of which is
solved using a fast heuristic [9]. There are also many RL-based algorithms developed for solving
the JSSP. Rinciog and Meyer explored the possibility of using RL-based algorithms to solve the
JSSP and gave a general design [10]. They separated MDPs for production scheduling into
operation sequencing, routing before sequencing, interlaced routing and sequencing,
transport-centric routing, and re-scheduling. They also classified RL algorithms along three
discrete axes: value, policy, or actor-critic methods; model-free or model-based methods; and
on- or off-policy methods. For more specific work, Cheng, Gen and Tsujimura introduced the nine
most common representations of the JSSP (job-based, operation-based, preference list-based,
job pair relation-based, priority rule-based, disjunctive graph-based, completion time-based,
machine-based, and random keys) that are suitable for further Genetic Algorithm (GA) work [11];
these representations are also suitable for other algorithms, both RL and non-RL. They also gave
a full review of the different variations of GA that can be applied to the JSSP, including
adapting problems to the genetic algorithms, adapting the genetic algorithms to problems, and
adapting both [12]. Hasan, Sarker and Cornforth also gave a detailed improved GA algorithm
which includes phenotype creation, feasibility checking, solution ranking or heuristic ordering,
and generation updates (crossover and mutation) [13]. The Type-Aware Graph Attention (TAGAT)
algorithm can also solve the JSSP through ScheduleNet, a decentralized decision-making policy
that can coordinate multiple agents to complete tasks [14]. Besides those two algorithms, there
are also many works on solving the JSSP based on the Q-learning algorithm. The main variation
among different Q-learning algorithms lies in their reward or cost strategies, designed for
different aims such as makespan [8] or idle time [15]. It is worth mentioning that, besides the
JSSP, RL-based algorithms have also been developed for other scheduling problems. For example,
Kuhnle et al. applied RL to create an adaptive production control system, using the real-world
example of order dispatching in a complex job shop [16].
2.1. Flexible Job Shop Schedule Solutions
There are three main approaches to solving the FJSSP: hierarchical approaches, integrated
approaches, and RL approaches [8].
2.1.1. Hierarchical Approaches Hierarchical approaches decompose the problem into smaller
problems with the intent of reducing complexity. The most common decomposition is to split
the original problem into an assignment part and a sequencing part. The assignment part
determines the assignment of operations to machines; the problem is then transformed into a
JSSP where each operation has an assigned machine for execution. Finally, the sequencing part
determines the sequence of execution. Two common algorithms adopting the hierarchical approach
are tabu search [17] and simulated annealing [18]. Mohammad and Parviz [17] proposed a tabu
search procedure to find the best sequence of operations and choice of machine alternatives
simultaneously. The procedure in [17] uses a metaheuristic search method to find the best
choice of machine alternative for each job together with the operation sequence. These
sequences are then improved iteratively by comparing them with neighbourhood solutions. The
proposed algorithm succeeds in solving smaller instances of the FJSSP (fewer than 5 jobs and
machines) but fails to solve larger instances (more than 5 jobs and machines) [17]. On the
other hand, Yazdani and Gholami [18] used a simulated annealing algorithm to address the FJSSP.
They proposed a meta-heuristic algorithm that explores the solution space using a stochastic
local search with probabilistic moves to escape local optima [18]. Similar to [17], the proposed
algorithm succeeds on smaller instances but struggles to find the optimal solution for larger
instances.
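To make the local-search idea concrete, the following is a minimal sketch of a simulated annealing acceptance loop over schedules represented as operation sequences. The neighbourhood move, cooling schedule, and makespan evaluation are illustrative placeholders supplied by the caller, not the exact procedure of [18].

```python
import math
import random

def simulated_annealing(initial_seq, makespan, neighbour,
                        t_start=100.0, t_end=0.1, alpha=0.95, iters_per_temp=50):
    """Generic simulated annealing over operation sequences.

    makespan(seq)  -> float: evaluates a candidate schedule (placeholder).
    neighbour(seq) -> list:  returns a random neighbouring sequence (placeholder).
    """
    current, best = list(initial_seq), list(initial_seq)
    current_cost = best_cost = makespan(current)
    temp = t_start
    while temp > t_end:
        for _ in range(iters_per_temp):
            candidate = neighbour(current)
            delta = makespan(candidate) - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temp) to escape local optima.
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = list(current), current_cost
        temp *= alpha  # geometric cooling schedule
    return best, best_cost
```

The acceptance rule is the standard Metropolis criterion: early on (high temperature) worse schedules are accepted frequently, and as the temperature decays the search behaves more and more greedily.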
2.1.2. Integrated Approaches Integrated approaches attempt to solve the problem as a whole
instead of decomposing it. Common practices among integrated approaches are dispatching rules
[19] and genetic algorithms [20]. Jun and Lee proposed a Random Forest for Obtaining Rules
for Scheduling (RANFORS) approach which consists of schedule generation, rule learning with
data transformation, and rule improvement with discretization [19]. The results show success
in capturing both explicit and implicit knowledge while outperforming prevalent dispatching
rules in performance and robustness. Alternatively, Pezzella and Morganti [20] presented a
genetic algorithm which integrates different strategies for generating the initial population
and reproducing new individuals. The results obtained are comparable to those of the best-known
algorithms based on tabu search.
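As a point of reference for what a dispatching rule looks like in code, below is a minimal sketch of the classic shortest-processing-time (SPT) rule applied whenever a machine becomes idle. It illustrates the general idea of rule-based dispatching only; the tuple layout is an illustrative assumption and this is not the rule set learned by RANFORS [19].

```python
def spt_dispatch(ready_operations):
    """Shortest Processing Time rule: among the operations whose predecessors
    are finished (ready), pick the one with the smallest processing time.
    `ready_operations` is a list of (job_id, op_id, processing_time) tuples,
    an illustrative data layout."""
    return min(ready_operations, key=lambda op: op[2])

# Example: three ready operations competing for an idle machine.
ready = [(0, 1, 7.0), (2, 0, 3.5), (1, 2, 5.0)]
job, op, p = spt_dispatch(ready)
print(f"Dispatch job {job}, operation {op} (processing time {p})")
```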
2.1.3. RL Approaches RL approaches learn state-action values that give the expected utility of
taking a given action in a given state, and then choose the optimal pairs as the solution. Both
Reyna [21] and Martinez [22] used Q-learning to solve the FJSSP. While Martinez decomposes the
original FJSSP into assignment and sequencing, Reyna approaches the FJSSP as a single problem.
The results obtained from both approaches show success on small and medium-sized problems
(fewer than 10 jobs and machines), outperforming classical methods such as genetic algorithms
and tabu search. However, they struggle to find the optimal solution for large-sized problems
(more than 10 jobs and machines). In addition to Q-learning, researchers have tried more
advanced learning methods such as Deep Reinforcement Learning (DRL). Tassel and Gebser [23]
and Chang, He and Yu [24] both used DRL to solve the JSSP. While Tassel and Gebser focused on
solving the classical JSSP, where the machine and operation pairs are pre-determined, Chang,
He and Yu focused on solving the dynamic FJSSP with random job arrivals. Tassel and Gebser
used two Multi-Layer Perceptron (MLP) models for action selection and state-value prediction,
respectively. On the other hand, Chang, He and Yu used a Deep Q-Network (DQN) to learn from
experience tuples of states, actions and rewards. The results obtained from both studies show
success on small, medium and large-sized problems, outperforming most classical approaches
and RL approaches.
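For readers unfamiliar with the tabular form underlying these Q-learning approaches, the following is a minimal sketch of the standard one-step Q-learning update with epsilon-greedy action selection. The environment interface (`reset`, `step`, `actions`), state encoding, and reward signal are illustrative assumptions, not the designs of the cited works.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to expose reset() -> state,
    step(action) -> (next_state, reward, done), and actions(state) -> list;
    this interface is an assumption for illustration only."""
    Q = defaultdict(float)  # maps (state, action) -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # One-step Q-learning target: r + gamma * max_a' Q(s', a').
            best_next = 0.0 if done else max(
                Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```

The works cited above differ mainly in how the state is encoded, how actions (operation or machine choices) are defined, and what reward shape (e.g. makespan- or idle-time-based) drives the update.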
3. Problem Formulation
The Markov Decision Process (MDP) and its variations are the most common ways of describing
scheduling problems; we introduce them below. Note that all stochastic processes in this
section are time-homogeneous.
3.1. MDP
The decision maker makes decisions as the system evolves through time. We denote by $T$ the set of time points at which the state is observed and an action must be chosen, i.e., the set of all decision epochs or stages. $T$ can be either discrete or continuous, with or without a finite horizon. We will focus on the discrete case with an infinite horizon for the rest of the article. Conventionally, we let $T = \mathbb{N}$ in this case. For any $t \in T$, we denote by $S_t$ the set of all possible states at time $t$ and by $X_t$ the random variable of the state at time $t$. For any $s \in S_t$, we denote by $A_{s,t}$ the set of all applicable actions when state $s$ is observed at time $t$, and by $A_t$ the random variable of the action at time $t$. According to the action $a$ the agent chose, the system will transfer into another state at the next time point, i.e., $s' \in S_{t+1}$, with a probability $p(s, a, s')$, and receive a reward. This stochastic process is a Markov Decision Process (MDP) if

$$P(X_{t+1} = s_{t+1} \mid X_0 = s_0, A_0 = a_0, \ldots, X_t = s_t, A_t = a_t) = P(X_{t+1} = s_{t+1} \mid X_t = s_t, A_t = a_t), \quad \forall t, s, a. \tag{1}$$

Here we denote

$$p(s_t, a_t, s_{t+1}) = P(X_{t+1} = s_{t+1} \mid X_t = s_t, A_t = a_t), \tag{2}$$

so clearly $\sum_{s_{t+1} \in S_{t+1}} p(s_t, a_t, s_{t+1}) = 1$.

To summarize, the standard discrete MDP consists of five components: Decision Epochs ($T$), State Space ($S$), Action Space ($A$), Transition Probability ($P$), and Reward Function ($R$) [25].
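To make the five components concrete, below is a minimal sketch of a finite MDP encoded as plain Python dictionaries, together with a check that each transition distribution sums to 1 as in equation (2). The tiny two-state example (an idle/busy machine) is purely illustrative and unrelated to any benchmark instance.

```python
# A finite MDP as plain data: states, per-state actions, transition
# probabilities P[(s, a)] = {s': prob}, and rewards R[(s, a, s')].
states = ["idle", "busy"]
actions = {"idle": ["assign", "wait"], "busy": ["wait"]}
P = {
    ("idle", "assign"): {"busy": 0.9, "idle": 0.1},
    ("idle", "wait"):   {"idle": 1.0},
    ("busy", "wait"):   {"idle": 0.3, "busy": 0.7},
}
R = {
    ("idle", "assign", "busy"): 1.0,   # reward for starting an operation
    ("busy", "wait", "idle"):   0.5,   # reward for completing it
}

# Sanity check corresponding to equation (2): the probabilities over the
# successor states must sum to 1 for every (state, action) pair.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
print("All transition distributions are valid.")
```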
3.2. Semi-MDP
The main difference between the MDP and the semi-MDP is that we consider the time difference between two consecutive decision epochs. We first introduce the Markov Renewal Process.

Suppose we keep the same definitions of $A_n$, $S$, and $A$, but modify the definition of the set of decision epochs $T$ as follows: instead of considering $T$ as consecutive time points, we consider only the time points at which the observation changes and an action is made. For each $n \in \mathbb{N}$, we have a random variable $T_n \in [0, +\infty)$ such that $T_0 \le T_1 \le \cdots$.

The stochastic process $(X, T)$ is a Markov Renewal Process if

$$P(X_{n+1} = s_{n+1}, T_{n+1} - T_n \le t \mid X_0 = s_0, A_0 = a_0, T_0 = t_0, \ldots, X_n = s_n, A_n = a_n, T_n = t_n) = P(X_{n+1} = s_{n+1}, T_{n+1} - T_n \le t \mid X_n = s_n, A_n = a_n), \quad \forall n, s, a, t. \tag{3}$$

If we define $(Y, B) = \{Y_t, B_t : t \ge 0\}$ as

$$Y_t = X_n \quad \text{if } T_n \le t < T_{n+1}, \qquad B_t = A_n \quad \text{if } T_n \le t < T_{n+1}, \tag{4}$$
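As a concrete illustration of equation (4), the sketch below simulates a small Markov renewal process and recovers the continuous-time processes $Y_t$ and $B_t$ by holding the last observed state and action constant between decision epochs. The two-state chain and exponential sojourn times are illustrative assumptions only, not part of the formulation above.

```python
import bisect
import random

random.seed(0)

# Illustrative two-state chain with one action per state and exponential
# sojourn times (the rates are arbitrary assumptions).
next_state = {"setup": "run", "run": "setup"}
action_of = {"setup": "load_job", "run": "process"}
rate = {"setup": 2.0, "run": 0.5}

# Simulate decision epochs T_0 <= T_1 <= ... and the embedded chain X_n, A_n.
epochs, states_seq, actions_seq = [0.0], ["setup"], ["load_job"]
for _ in range(10):
    s = states_seq[-1]
    sojourn = random.expovariate(rate[s])      # T_{n+1} - T_n
    epochs.append(epochs[-1] + sojourn)
    s_next = next_state[s]
    states_seq.append(s_next)
    actions_seq.append(action_of[s_next])

def Y_B(t):
    """Return (Y_t, B_t) = (X_n, A_n) for the unique n with T_n <= t < T_{n+1}."""
    n = bisect.bisect_right(epochs, t) - 1
    return states_seq[n], actions_seq[n]

print(Y_B(0.1), Y_B(1.0), Y_B(2.5))
```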