Application of Deep Q Learning with Simulation
Results for Elevator Optimization
Zheng Cao1, Raymond Guo2, Caesar M. Tuguinay3,
Mark Pock4, Jiayi Gao5, Ziyu Wang6
University of Washington, Seattle, USA
1 Department of Mathematics, zc68@uw.edu
2 Department of Computer Science & Mathematics, rpg360@uw.edu
3 Department of Mathematics, ctuguina@uw.edu
4 Department of Computer Science, markpock@uw.edu
5 Academy for Young Scholars, jerrygao@uw.edu
6 Department of Economics, ziyuw5@uw.edu
Abstract
This paper presents a methodology for combining programming and mathe-
matics to optimize elevator wait times. Based on simulated user data generated
according to the canonical three-peak model of elevator traffic, we first develop
a naïve model from an intuitive understanding of the logic behind elevators. We
take into consideration a general array of features including capacity, acceleration,
and maximum wait time thresholds to adequately model realistic circumstances.
Using the same evaluation framework, we proceed to develop a Deep Q Learning
model in an attempt to match the hard-coded naïve approach for elevator control.
Throughout the majority of the paper, we work under a Markov Decision Process
(MDP) schema, but later explore how the assumption fails to characterize the highly
stochastic overall Elevator Group Control System (EGCS).
Keywords: Deep Q Learning, Optimization, Simulation, Markov Decision Process,
Temporal Difference, Elevator Group Control Systems
arXiv:2210.00065v3 [cs.LG] 23 Dec 2022
Contents
1 Introduction
1.1 Framing and Literature
1.2 Strategy
2 Theoretical Background
2.1 Reinforcement Learning, Markov Decision Processes
2.2 Q-Learning
3 Simulation
3.1 Individual Person Values
3.2 Office Building Values
3.3 Simulated Table
4 Modeling
4.1 Elevator
4.2 Environment
4.3 Model Interactions
4.4 Design
4.5 Naïve Approach to Elevator Control
5 Modeling via Deep Q Network
5.1 Motivations
5.2 EGCS Data Encoding and Decoding
5.3 DQN Action Model
6 Results
6.1 Model Results
6.2 Model Failure and Why The Model Failed
6.3 Possible Solutions
7 Future Work
8 Conclusion
9 Acknowledgement
1 Introduction
Elevators figure prominently in the daily life of the ordinary urbanite. Elevator wait
times may seem negligible, but only as the result of decades of optimization and
improvement in EGCS drawing on a wide array of fields. Minimizing elevator wait
times becomes especially crucial during the down-peak and up-peak periods, when many
elevators are crowded. Poor algorithmic design results in frustrated and tired workers
cramming around elevator doors, wasting valuable time.
This research group has approached elevator optimization from two angles, addressed
by two separate teams: explicit mathematical modelling and machine learning for ap-
proximate optimization. This paper focuses on the latter approach. Our team's source
code is contained in the "Elevator Project" GitHub repository [4]. We first generate data
according to the canonical three-peak model and use it to build a base-case model to analyze
the performance of traditional elevator design. We subsequently turn to Deep Q
Learning to attempt optimization over the naïve base case.
1.1 Framing and Literature
We frame our discussion of elevator optimization through the hierarchical paradigm
of Elevator Group Control Systems (EGCS), the central mechanisms in multi-elevator
buildings which control and monitor elevator motion. Where elevators idle by default,
which elevators are dispatched to various hall calls, and similar decisions are all managed
by the EGCS. Its importance to internal transportation has led to an array of research in
which innovations from across engineering disciplines have been combined and synthesized
to produce the modern elevator, a far cry from the early elevators that ran on fixed schedules.
Throughout the paper, we interpret the idea of an EGCS abstractly, independent of its
material implementation: as the overall state of a particular building, the algorithm for
responding to that state, and the building's state transition functions.
Several authors before us have attempted to apply machine learning to EGCS – par-
ticle swarm optimizations [1], Convolutional Neural Networks (CNNs), and neuro-fuzzy
systems [5], amongst other approaches. Combining machine learning with more rigorous
mathematical methods has proven particularly fruitful for other problems as well, for
example in Zheng Cao's previous paper "Application of Convolutional Neural Networks
with Quasi-Reversibility Method Results for Option Forecasting" [3].
1.2 Strategy
Our approach to optimization is straightforward: an application of Deep Q Learning
to what we characterize as a classification problem for an arbitrary decision algorithm
that takes a simplified version of a building's current state and outputs commands to the
elevator(s) therein. A minimal sketch of this decision interface follows.
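The sketch below shows how such a decision algorithm can be realized as a small Q-network whose greedy output over a discrete command set plays the role of the "classification". It is an illustration under assumed dimensions, not the paper's actual architecture (which is described in Section 5.3); STATE_DIM, NUM_ACTIONS, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder dimensions; the real project's state encoding is described in
# Section 5.2. These values are for illustration only.
STATE_DIM = 24        # simplified building state (floors, calls, elevator positions)
NUM_ACTIONS = 3       # e.g. move up, move down, stay/open doors

class QNetwork(nn.Module):
    """Small fully connected network mapping a state vector to Q-values,
    one per discrete elevator command."""
    def __init__(self) -> None:
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(STATE_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, NUM_ACTIONS),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.layers(state)

def choose_command(net: QNetwork, state: torch.Tensor) -> int:
    """Greedy action selection: the command with the highest predicted Q-value,
    which is what makes the control problem look like classification."""
    with torch.no_grad():
        return int(net(state).argmax().item())
```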
2 Theoretical Background
2.1 Reinforcement Learning, Markov Decision Processes
The generic Reinforcement Learning (RL) problem is framed as an interaction between
an agent and an environment. At each time-step, the agent selects an action out of a set
of possibilities. The environment responds by shifting to a different state and presenting
that state to the agent alongside a scalar reward. This interaction continues until the
environment reaches a terminal state (there are also RL problems involving environments
without a terminal state, but the problem discussed here is not one of them). A complete
sequence of actions from the agent and responses from the environment, from start to
terminal state, is known as an "episode". We denote the $n$th state by $S_n$, the $n$th action
by $A_n$, and the $n$th reward by $R_n$ (where $S_0$ is the initial state, $A_0$ is the first action, and
$R_0$ is the reward given in response to that action).
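This interaction loop can be summarized in a short sketch; `env` and `agent` here are hypothetical placeholder objects, not the simulator built later in the paper.

```python
def run_episode(env, agent, max_steps=10_000):
    """Generic RL interaction loop: at each step the agent picks an action A_n
    from the current state S_n, and the environment replies with the next state
    and the reward R_n, until a terminal state is reached. `env` and `agent`
    are placeholder objects with the interface shown here."""
    state = env.reset()                            # S_0
    states, actions, rewards = [state], [], []
    for _ in range(max_steps):
        action = agent.act(state)                  # A_n
        state, reward, done = env.step(action)     # S_{n+1}, R_n, terminal flag
        actions.append(action)
        rewards.append(reward)
        states.append(state)
        if done:
            break
    return states, actions, rewards
```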
In this sense, RL problems can be thought of as a series of classification problems
in which the agent, at every time step $n$, is tasked with choosing the action that maximizes
some function of the rewards following the $n$th action. This function is usually (and is
here) the discounted return,
$$\sum_{i=n}^{t} \lambda^{\,i-n} R_i,$$
where $t$ is the time step after which the terminal state is reached. The "discount factor"
$0 \le \lambda < 1$ is chosen as a hyperparameter in training; lower values make the model
prioritize immediate rewards, while higher values make the model weight long-term
rewards more heavily.
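As a concrete check of the formula, the short sketch below computes the discounted return from a list of rewards, following the indexing convention above; the function name and example numbers are ours.

```python
def discounted_return(rewards, n, discount):
    """Discounted return following action n: sum over i >= n of discount**(i - n) * R_i.
    `rewards[i]` is the reward R_i given in response to action A_i."""
    return sum(discount ** (i - n) * r for i, r in enumerate(rewards) if i >= n)

# Example: rewards R_0..R_3 with discount 0.9, return following action 1:
# 0 + 0.9 * 2.0 + 0.81 * 1.0 = 2.61
print(discounted_return([1.0, 0.0, 2.0, 1.0], n=1, discount=0.9))
```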
A finite Markov Decision Process (MDP) is a special case of these RL problems where
the number of states and possible rewards is finite, the number of actions that can be
chosen in response to each state is finite, and most importantly, the probability of any
state-reward pair given in response to any action and previous state is dependent only
on that action and previous state (and not any of the actions and states that preceded
them). This is known as the “Markov Property” and can be expressed symbolically by
$$\Pr\{R_{t+1}=r,\, S_{t+1}=s \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\} = \Pr\{R_{t+1}=r,\, S_{t+1}=s \mid S_t, A_t\}$$
for any r, s that lie in the set of possible rewards and states respectively. A more formal
definition is given in the appendix. MDPs are important because most proofs that provide
guarantees that RL methods converge are based on the MDP case, although empirically
great success has been achieved in applying these methods to non-MDP RL problems. Its
importance will be expounded upon in a later section. [8]
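As a toy illustration of the Markov property (not drawn from the paper's simulator), the dynamics of a finite MDP can be stored in a table keyed only on the current state-action pair; the sampled next state and reward never depend on earlier history. States, actions, and probabilities below are arbitrary placeholders.

```python
import random

# Each (state, action) pair maps to a list of ((next_state, reward), probability)
# outcomes. Because the table is keyed only on (state, action), the transition
# distribution satisfies the Markov property by construction.
dynamics = {
    ("idle", "up"):   [(("moving_up", 0.0), 0.9), (("idle", -1.0), 0.1)],
    ("idle", "stay"): [(("idle", 0.0), 1.0)],
}

def step(state, action):
    """Sample (next_state, reward) from the tabulated distribution."""
    outcomes, weights = zip(*dynamics[(state, action)])
    next_state, reward = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```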
2.2 Q-Learning
Finite MDPs are solved by finding a good policy $\pi(a \mid s)$, which is a probabilistic
function that defines the actions of the agent. In particular, $\pi(a \mid s)$ denotes the probability
of the agent choosing action $a$ next, if the last state were to be $s$. With respect to such
a policy $\pi$, we can define a value function $v_\pi(s)$, which gives the expected return of an
agent starting at state $s$ and following policy $\pi$. Similarly, we can define an action-value
function $q_\pi(s, a)$, which gives the expected return of an agent starting at state $s$, taking
action $a$, and thereafter following policy $\pi$.
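The remainder of this section is cut off in this excerpt. As background, the standard tabular Q-learning temporal-difference update that a Deep Q Network approximates can be sketched as follows; the learning rate and table initialization are illustrative choices, not values from the paper.

```python
from collections import defaultdict

# Standard tabular Q-learning update (the rule that a Deep Q Network
# approximates with a neural network). Hyperparameters are placeholders.
ALPHA = 0.1       # learning rate
LAMBDA = 0.95     # discount factor, as defined in Section 2.1

q_table = defaultdict(float)   # maps (state, action) -> estimated q(s, a)

def q_update(state, action, reward, next_state, possible_actions):
    """Move q(s, a) toward the bootstrapped target r + lambda * max_a' q(s', a')."""
    best_next = max(q_table[(next_state, a)] for a in possible_actions)
    target = reward + LAMBDA * best_next
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
```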