Reinforcement learning approach for multi-agent
flexible scheduling problems
Hongjian Zhou1, Boyang Gu2 and Chenghao Jin3
1University of Oxford, Oxford OX1 2JD, United Kingdom
2Imperial College London, London SW7 2BX, United Kingdom
E-mail: 1hongjian.zhou@exeter.ox.ac.uk, 2boyang.gu19@imperial.ac.uk,
3steven jin gbhg@foxmail.com
Abstract. Scheduling plays an important role in automated production. Its impact can
be found in various fields such as the manufacturing industry, the service industry and the
technology industry. A scheduling problem (NP-hard) is the task of finding a sequence of job
assignments on a given set of machines with the goal of optimizing a defined objective.
Methods such as Operations Research, Dispatching Rules, and Combinatorial Optimization have
been applied to scheduling problems, but none guarantees finding the optimal solution.
The recent development of Reinforcement Learning has shown success in sequential decision-
making problems. This research presents a Reinforcement Learning approach for scheduling
problems. In particular, this study delivers an OpenAI gym environment with search-space
reduction for JSSP and provides a heuristic-guided Q-Learning solution with state-of-the-art
performance for Multi-agent Flexible Job Shop Problems.
Key words. scheduling, job shop, flexible job shop, reinforcement learning, heuristic-guided
reinforcement learning, Q-learning, multi-agent system, optimization
1. Introduction
Production scheduling is defined as the optimization of the allocation of resources over time
among activities performed by machines. It is crucial to real-world production since resources
are always limited and must be scheduled for optimal outcomes. In this article, we mainly discuss
the Job Shop Scheduling Problem (JSSP), a very common application in production scheduling.
Recently, many approaches have been developed for the JSSP. In order to study them, we propose an
environment which can be used to test different algorithms and can be customized to handle
other variations of the JSSP. We also present a new Q-learning algorithm which outperforms most
traditional approaches.
The article is organized as follows: Section 2 gives a literature review of the related works
of the scheduling problem, including the common models, related problems, and solutions.
Section 3 reviews the background knowledge needed to understand the JSSP model. Section 4
provides a detailed explanation of our environment designed for the JSSP. Section 5 explains
our improved Q-learning algorithm. Section 6 analyzes and evaluates the performance of our
algorithm. Section 7 concludes the article and discusses possible future work.
2. Related Work
One classical model of scheduling problems is the Markov Decision Process (MDP), since most
scheduling problems base their decisions on the current state rather than on the entire history.
Of course, mathematically most scheduling problems are not Markovian, so the ordinary MDP model
requires some adjustments. The Generalized Semi-Markov Decision Process (GSMDP) is one suitable
adjusted model [1]: it allows an enabled event to trigger at any time and optimizes the makespan.
When stochasticity is added to the model, the GSMDP can use continuous phase-type distributions
to simulate durations, such as the duration of a task [2].
Before the introduction of reinforcement learning (RL) to scheduling problems, many
traditional approaches were developed. One famous scheduling problem is the Traveling
Salesman Problem (TSP), for which the Lin-Kernighan-Helsgaun (LKH) solver was once the
state-of-the-art heuristic [3]. The LKH solver transforms asymmetric problems into symmetric
ones and uses the number of violated constraints to scale the penalty. Another popular problem
related to scheduling is the Resource-Constrained Project Scheduling Problem (RCPSP). Kolisch
extended the precedence-based minimum slack priority rule (MSLK) to a precedence- and
resource-based slack priority rule, replacing RSM, to solve the RCPSP [4]. If we relax the
resource constraint and add stochasticity to the activities themselves, we get the Stochastic
Scheduling Problem (SSP). There are many non-RL approaches to it. For example, the Rollout
algorithm is used to solve the Quiz Problem, a variation of the SSP [5]. There are also many
heuristic approaches that capture different aspects of the stochastic features, which allows
a computationally tractable closed-loop algorithm to be used to solve the SSP [6].
Another well-known area of scheduling problems is the Job Shop Scheduling Problem (JSSP),
which we discuss further in Section 4. Miloš introduced a classical model of it, including
job-based, operation-based, and disjunctive graph-based representations [7], though there are
in fact many variations of the JSSP. According to the number of operations, the number of
available machines for each operation, and whether stochasticity is introduced to the problem,
we have the Parallel Machines Job Shop Scheduling Problem (JSSP-PM), Flexible Job Shop
Scheduling (FJSSP), and Stochastic Flexible Job Shop Scheduling (SFJSSP) [8]. Wittrock
introduced a non-RL algorithm for a variation of the FJSSP, which decomposes the problem into
three sub-problems (machine allocation, order sequencing, and execution time), each of which is
solved using a fast heuristic [9]. There are also many RL-based algorithms developed for solving
the JSSP. Rinciog and Meyer explored the possibility of using RL-based algorithms to solve the
JSSP and gave a general design [10]. They separated MDPs for production scheduling into
operation sequencing, routing before sequencing, interlaced routing and sequencing,
transport-centric routing, and re-scheduling. They also classified RL algorithms along three
discrete axes: value, policy, or actor-critic methods; model-free or model-based methods; and
on- or off-policy methods. For more specific work, Cheng, Gen and Tsujimura introduced the nine
most common representations of the JSSP (job-based, operation-based, preference list-based,
job pair relation-based, priority rule-based, disjunctive graph-based, completion time-based,
machine-based, and random keys) that are suitable for further Genetic Algorithm (GA) work [11];
these representations are also suitable for other algorithms, both RL and non-RL. They also gave
a full review of the different variations of GA that can be applied to the JSSP, including
adapting problems to the genetic algorithms, adapting the genetic algorithms to problems, and
adapting both [12]. Hasan, Sarker and Cornforth also gave a detailed improved GA algorithm
which includes phenotype creation, feasibility checking, solution ranking or heuristic ordering,
and generation updates (crossover and mutation) [13]. The Type-Aware Graph Attention (TAGAT)
algorithm can also solve the JSSP through ScheduleNet, a decentralized decision-making policy
that can coordinate multiple agents to complete tasks [14]. Besides those two algorithms, there
are also many works on solving the JSSP based on the Q-learning algorithm. The main variation
among different Q-learning algorithms lies in their reward or cost strategies, designed for
different aims such as makespan [8] or idle time [15]. It is worth mentioning that, besides the
JSSP, RL-based algorithms have also been developed for other scheduling problems. For example,
Kuhnle et al. applied RL to create an adaptive production control system, using the real-world
example of order dispatching in a complex job shop [16].
2.1. Flexible Job Shop Schedule Solutions
There are three main approaches to solving the FJSSP: hierarchical approaches, integrated
approaches, and RL approaches [8].
2.1.1. Hierarchical Approaches Hierarchical approaches decompose the problem into smaller
problems with the intent of reducing complexity. The most common decomposition is to split
the original problem into an assignment part and a sequencing part. The assignment part
determines the assignment of operations to machines; the problem is then transformed into a
JSSP where each operation has an assigned machine for execution. Finally, the sequencing part
determines the sequence of execution. Two common algorithms adopting the hierarchical approach
are tabu search [17] and simulated annealing [18]. Mohammad and Parviz [17] proposed a tabu
search procedure to find the best sequence of operations and choice of machine alternatives
simultaneously. The procedure in [17] uses a metaheuristic search method to find the best
choice of machine alternative for each job together with the operation sequence. These
sequences are then improved iteratively by comparing them with neighbourhood solutions. The
proposed algorithm succeeds in solving smaller instances of the FJSSP (fewer than 5 jobs and
machines) but fails to solve larger instances (more than 5 jobs and machines) [17]. On the
other hand, Yazdani and Gholami [18] used a simulated annealing algorithm to address the FJSSP.
They proposed a meta-heuristic algorithm that explores the solution space using a stochastic
local search with probabilistic moves to escape local optima [18]. Similar to [17], the proposed
algorithm succeeds on smaller instances but struggles to find the optimal solution for larger
instances.
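To make the local-search idea concrete, the following is a minimal sketch of a simulated annealing acceptance loop over schedules represented as operation sequences. The neighbourhood move, cooling schedule, and makespan evaluation are illustrative placeholders supplied by the caller, not the exact procedure of [18].

```python
import math
import random

def simulated_annealing(initial_seq, makespan, neighbour,
                        t_start=100.0, t_end=0.1, alpha=0.95, iters_per_temp=50):
    """Generic simulated annealing over operation sequences.

    makespan(seq)  -> float: evaluates a candidate schedule (placeholder).
    neighbour(seq) -> list:  returns a random neighbouring sequence (placeholder).
    """
    current, best = list(initial_seq), list(initial_seq)
    current_cost = best_cost = makespan(current)
    temp = t_start
    while temp > t_end:
        for _ in range(iters_per_temp):
            candidate = neighbour(current)
            delta = makespan(candidate) - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temp) to escape local optima.
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                current, current_cost = candidate, current_cost + delta
                if current_cost < best_cost:
                    best, best_cost = list(current), current_cost
        temp *= alpha  # geometric cooling schedule
    return best, best_cost
```

The acceptance rule is the standard Metropolis criterion: early on (high temperature) worse schedules are accepted frequently, and as the temperature decays the search behaves more and more greedily.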
2.1.2. Integrated Approaches Integrated approaches attempt to solve the problem as a whole
instead of decomposing it. Common practices among integrated approaches are dispatching rules
[19] and genetic algorithms [20]. Jun and Lee proposed a Random Forest for Obtaining Rules
for Scheduling (RANFORS) approach which consists of schedule generation, rule learning with
data transformation, and rule improvement with discretization [19]. The results show success
in capturing both explicit and implicit knowledge while outperforming prevalent dispatching
rules in performance and robustness. Alternatively, Pezzella and Morganti [20] presented a
genetic algorithm which integrates different strategies for generating the initial population
and reproducing new individuals. The results obtained are comparable to those of the best-known
algorithms based on tabu search.
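As a point of reference for what a dispatching rule looks like in code, below is a minimal sketch of the classic shortest-processing-time (SPT) rule applied whenever a machine becomes idle. It illustrates the general idea of rule-based dispatching only; the tuple layout is an illustrative assumption and this is not the rule set learned by RANFORS [19].

```python
def spt_dispatch(ready_operations):
    """Shortest Processing Time rule: among the operations whose predecessors
    are finished (ready), pick the one with the smallest processing time.
    `ready_operations` is a list of (job_id, op_id, processing_time) tuples,
    an illustrative data layout."""
    return min(ready_operations, key=lambda op: op[2])

# Example: three ready operations competing for an idle machine.
ready = [(0, 1, 7.0), (2, 0, 3.5), (1, 2, 5.0)]
job, op, p = spt_dispatch(ready)
print(f"Dispatch job {job}, operation {op} (processing time {p})")
```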
2.1.3. RL Approaches RL approaches learn state-action values that give the expected utility of
taking a given action in a given state, and then choose the optimal pairs as the solution. Both
Reyna [21] and Martinez [22] used Q-learning to solve the FJSSP. While Martinez decomposes the
original FJSSP into assignment and sequencing, Reyna approaches the FJSSP as a single problem.
The results obtained from both approaches show success on small and medium-sized problems
(fewer than 10 jobs and machines), outperforming classical methods such as genetic algorithms
and tabu search. However, they struggle to find the optimal solution for large-sized problems
(more than 10 jobs and machines). In addition to Q-learning, researchers have tried more
advanced learning methods such as Deep Reinforcement Learning (DRL). Tassel and Gebser [23]
and Chang, He and Yu [24] both used DRL to solve the JSSP. While Tassel and Gebser focused on
solving the classical JSSP, where the machine and operation pairs are pre-determined, Chang,
He and Yu focused on solving the dynamic FJSSP with random job arrivals. Tassel and Gebser
used two Multi-Layer Perceptron (MLP) models for action selection and state-value prediction,
respectively. On the other hand, Chang, He and Yu used a Deep Q-Network (DQN) to learn from
experience tuples of states, actions and rewards. The results obtained from both studies show
success on small, medium and large-sized problems, outperforming most classical approaches
and RL approaches.
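For readers unfamiliar with the tabular form underlying these Q-learning approaches, the following is a minimal sketch of the standard one-step Q-learning update with epsilon-greedy action selection. The environment interface (`reset`, `step`, `actions`), state encoding, and reward signal are illustrative assumptions, not the designs of the cited works.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to expose reset() -> state,
    step(action) -> (next_state, reward, done), and actions(state) -> list;
    this interface is an assumption for illustration only."""
    Q = defaultdict(float)  # maps (state, action) -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            actions = env.actions(state)
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # One-step Q-learning target: r + gamma * max_a' Q(s', a').
            best_next = 0.0 if done else max(
                Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```

The works cited above differ mainly in how the state is encoded, how actions (operation or machine choices) are defined, and what reward shape (e.g. makespan- or idle-time-based) drives the update.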
3. Problem Formulation
The Markov Decision Process (MDP) and its variations are the most common ways of describing
scheduling problems; we introduce them below. Note that all stochastic processes in this
section are time-homogeneous.
3.1. MDP
The decision maker makes decisions as the system evolves through time. We denote by $T$ the set of time points at which the state is observed and an action must be chosen, i.e., the set of all decision epochs or stages. $T$ can be either discrete or continuous, with or without a finite horizon. We will focus on the discrete case with an infinite horizon for the rest of the article. Conventionally, we let $T = \mathbb{N}$ in this case. For any $t \in T$, we denote by $S_t$ the set of all possible states at time $t$ and by $X_t$ the random variable of the state at time $t$. For any $s \in S_t$, we denote by $A_{s,t}$ the set of all applicable actions when state $s$ is observed at time $t$, and by $A_t$ the random variable of the action at time $t$. According to the action $a$ the agent chose, the system will transfer into another state at the next time point, i.e., $s' \in S_{t+1}$, with a probability $p(s, a, s')$, and receive a reward. This stochastic process is a Markov Decision Process (MDP) if

$$P(X_{t+1} = s_{t+1} \mid X_0 = s_0, A_0 = a_0, \ldots, X_t = s_t, A_t = a_t) = P(X_{t+1} = s_{t+1} \mid X_t = s_t, A_t = a_t), \quad \forall t, s, a. \tag{1}$$

Here we denote

$$p(s_t, a_t, s_{t+1}) = P(X_{t+1} = s_{t+1} \mid X_t = s_t, A_t = a_t), \tag{2}$$

so clearly $\sum_{s_{t+1} \in S_{t+1}} p(s_t, a_t, s_{t+1}) = 1$.

To summarize, the standard discrete MDP consists of five components: Decision Epochs ($T$), State Space ($S$), Action Space ($A$), Transition Probability ($P$), and Reward Function ($R$) [25].
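To make the five components concrete, below is a minimal sketch of a finite MDP encoded as plain Python dictionaries, together with a check that each transition distribution sums to 1 as in equation (2). The tiny two-state example (an idle/busy machine) is purely illustrative and unrelated to any benchmark instance.

```python
# A finite MDP as plain data: states, per-state actions, transition
# probabilities P[(s, a)] = {s': prob}, and rewards R[(s, a, s')].
states = ["idle", "busy"]
actions = {"idle": ["assign", "wait"], "busy": ["wait"]}
P = {
    ("idle", "assign"): {"busy": 0.9, "idle": 0.1},
    ("idle", "wait"):   {"idle": 1.0},
    ("busy", "wait"):   {"idle": 0.3, "busy": 0.7},
}
R = {
    ("idle", "assign", "busy"): 1.0,   # reward for starting an operation
    ("busy", "wait", "idle"):   0.5,   # reward for completing it
}

# Sanity check corresponding to equation (2): the probabilities over the
# successor states must sum to 1 for every (state, action) pair.
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
print("All transition distributions are valid.")
```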
3.2. Semi-MDP
The main difference between the MDP and the semi-MDP is that we consider the time difference between two consecutive decision epochs. We first introduce the Markov Renewal Process.

Suppose we keep the same definitions of $A_n$, $S$, and $A$, but modify the definition of the set of decision epochs $T$ as follows: instead of considering $T$ as consecutive time points, we consider only the time points at which the observation changes and an action is made. For each $n \in \mathbb{N}$, we have a random variable $T_n \in [0, +\infty)$ such that $T_0 \le T_1 \le \cdots$.

The stochastic process $(X, T)$ is a Markov Renewal Process if

$$P(X_{n+1} = s_{n+1}, T_{n+1} - T_n \le t \mid X_0 = s_0, A_0 = a_0, T_0 = t_0, \ldots, X_n = s_n, A_n = a_n, T_n = t_n) = P(X_{n+1} = s_{n+1}, T_{n+1} - T_n \le t \mid X_n = s_n, A_n = a_n), \quad \forall n, s, a, t. \tag{3}$$

If we define $(Y, B) = \{Y_t, B_t : t \ge 0\}$ as

$$Y_t = X_n \quad \text{if } T_n \le t < T_{n+1}, \qquad B_t = A_n \quad \text{if } T_n \le t < T_{n+1}, \tag{4}$$
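As a concrete illustration of equation (4), the sketch below simulates a small Markov renewal process and recovers the continuous-time processes $Y_t$ and $B_t$ by holding the last observed state and action constant between decision epochs. The two-state chain and exponential sojourn times are illustrative assumptions only, not part of the formulation above.

```python
import bisect
import random

random.seed(0)

# Illustrative two-state chain with one action per state and exponential
# sojourn times (the rates are arbitrary assumptions).
next_state = {"setup": "run", "run": "setup"}
action_of = {"setup": "load_job", "run": "process"}
rate = {"setup": 2.0, "run": 0.5}

# Simulate decision epochs T_0 <= T_1 <= ... and the embedded chain X_n, A_n.
epochs, states_seq, actions_seq = [0.0], ["setup"], ["load_job"]
for _ in range(10):
    s = states_seq[-1]
    sojourn = random.expovariate(rate[s])      # T_{n+1} - T_n
    epochs.append(epochs[-1] + sojourn)
    s_next = next_state[s]
    states_seq.append(s_next)
    actions_seq.append(action_of[s_next])

def Y_B(t):
    """Return (Y_t, B_t) = (X_n, A_n) for the unique n with T_n <= t < T_{n+1}."""
    n = bisect.bisect_right(epochs, t) - 1
    return states_seq[n], actions_seq[n]

print(Y_B(0.1), Y_B(1.0), Y_B(2.5))
```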