Application of Deep Q Learning with Simulation
Results for Elevator Optimization
Zheng Cao1, Raymond Guo2, Caesar M. Tuguinay3,
Mark Pock4, Jiayi Gao5, Ziyu Wang6
University of Washington, Seattle, USA
1 Department of Mathematics, zc68@uw.edu
2 Department of Computer Science & Mathematics, rpg360@uw.edu
3 Department of Mathematics, ctuguina@uw.edu
4 Department of Computer Science, markpock@uw.edu
5 Academy for Young Scholars, jerrygao@uw.edu
6 Department of Economics, ziyuw5@uw.edu
Abstract
This paper presents a methodology for combining programming and mathe-
matics to optimize elevator wait times. Based on simulated user data generated
according to the canonical three-peak model of elevator traffic, we first develop
a naïve model from an intuitive understanding of the logic behind elevators. We
take into consideration a general array of features including capacity, acceleration,
and maximum wait time thresholds to adequately model realistic circumstances.
Using the same evaluation framework, we proceed to develop a Deep Q Learning
model in an attempt to match the hard-coded naïve approach for elevator control.
Throughout the majority of the paper, we work under a Markov Decision Process
(MDP) schema, but later explore how the assumption fails to characterize the highly
stochastic overall Elevator Group Control System (EGCS).
Keywords: Deep Q Learning, Optimization, Simulation, Markov Decision Process,
Temporal Difference, Elevator Group Control Systems
arXiv:2210.00065v3 [cs.LG] 23 Dec 2022
Contents
1 Introduction
1.1 Framing and Literature
1.2 Strategy
2 Theoretical Background
2.1 Reinforcement Learning, Markov Decision Processes
2.2 Q-Learning
3 Simulation
3.1 Individual Person Values
3.2 Office Building Values
3.3 Simulated Table
4 Modeling
4.1 Elevator
4.2 Environment
4.3 Model Interactions
4.4 Design
4.5 Naïve Approach to Elevator Control
5 Modeling via Deep Q Network
5.1 Motivations
5.2 EGCS Data Encoding and Decoding
5.3 DQN Action Model
6 Results
6.1 Model Results
6.2 Model Failure and Why The Model Failed
6.3 Possible Solutions
7 Future Work
8 Conclusion
9 Acknowledgement
1 Introduction
Elevators figure prominently in the daily life of the ordinary urbanite. Elevator wait
times may seem negligible, but only as the result of decades of optimization and
improvement in EGCS drawing on a wide array of fields. Minimizing elevator wait
times becomes especially crucial during the down-peak and up-peak periods, when many
elevators are crowded. Poor algorithmic design results in frustrated and tired workers
cramming around elevator doors, wasting valuable time.
This research group has approached elevator optimization from two angles, addressed
by two separate teams: explicit mathematical modelling and machine learning for ap-
proximate optimization. This paper focuses on the latter approach. Our team's source
code is contained in the "Elevator Project" GitHub repository [4]. We first generate data
according to the canonical three-peak model and use it to build a base-case model to analyze
the performance of traditional elevator design. We subsequently turn to Deep Q
Learning to attempt optimization over the naïve base case.
1.1 Framing and Literature
We frame our discussion of elevator optimization through the hierarchical paradigm
of Elevator Group Control Systems (EGCS), the central mechanisms in multi-elevator
buildings which control and monitor elevator motion. Where elevators idle by default,
which elevators are dispatched to various hall calls, and similar decisions are all managed
by the EGCS. Its importance to internal transportation has led to an array of research in
which innovations from across engineering disciplines have been combined and synthesized
to produce the modern elevator, a far cry from the early elevators that ran on fixed schedules.
Throughout the paper, we interpret the idea of an EGCS abstractly, independent of its
material implementation: as the overall state of a particular building, the algorithm for
responding to that state, and the building's state transition functions.
Several authors before us have attempted to apply machine learning to EGCS – par-
ticle swarm optimizations [1], Convolutional Neural Networks (CNNs), and neuro-fuzzy
systems [5], amongst other approaches. Combining machine learning with more rigorous
mathematical methods has proven particularly fruitful for other problems as well, for
example in Zheng Cao's previous paper "Application of Convolutional Neural Networks
with Quasi-Reversibility Method Results for Option Forecasting" [3].
1.2 Strategy
Our approach to optimization is straightforward: an application of Deep Q Learning
to what we characterize as a classification problem for an arbitrary decision algorithm
that takes a simplified version of a building's current state and outputs commands to the
elevator(s) therein. A minimal sketch of this decision interface follows.
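The sketch below shows how such a decision algorithm can be realized as a small Q-network whose greedy output over a discrete command set plays the role of the "classification". It is an illustration under assumed dimensions, not the paper's actual architecture (which is described in Section 5.3); STATE_DIM, NUM_ACTIONS, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder dimensions; the real project's state encoding is described in
# Section 5.2. These values are for illustration only.
STATE_DIM = 24        # simplified building state (floors, calls, elevator positions)
NUM_ACTIONS = 3       # e.g. move up, move down, stay/open doors

class QNetwork(nn.Module):
    """Small fully connected network mapping a state vector to Q-values,
    one per discrete elevator command."""
    def __init__(self) -> None:
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(STATE_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, NUM_ACTIONS),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.layers(state)

def choose_command(net: QNetwork, state: torch.Tensor) -> int:
    """Greedy action selection: the command with the highest predicted Q-value,
    which is what makes the control problem look like classification."""
    with torch.no_grad():
        return int(net(state).argmax().item())
```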
2 Theoretical Background
2.1 Reinforcement Learning, Markov Decision Processes
The generic Reinforcement Learning (RL) problem is framed as an interaction between
an agent and an environment. At each time-step, the agent selects an action out of a set
of possibilities. The environment responds by shifting to a different state and presenting
that state to the agent alongside a scalar reward. This interaction continues until the
environment reaches a terminal state (there are also RL problems involving environments
without a terminal state, but the problem discussed here is not one of them). A complete
sequence of actions from the agent and responses from the environment, from start to
terminal state, is known as an "episode". We denote the $n$th state by $S_n$, the $n$th action
by $A_n$, and the $n$th reward by $R_n$ (where $S_0$ is the initial state, $A_0$ is the first action, and
$R_0$ is the reward given in response to that action).
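This interaction loop can be summarized in a short sketch; `env` and `agent` here are hypothetical placeholder objects, not the simulator built later in the paper.

```python
def run_episode(env, agent, max_steps=10_000):
    """Generic RL interaction loop: at each step the agent picks an action A_n
    from the current state S_n, and the environment replies with the next state
    and the reward R_n, until a terminal state is reached. `env` and `agent`
    are placeholder objects with the interface shown here."""
    state = env.reset()                            # S_0
    states, actions, rewards = [state], [], []
    for _ in range(max_steps):
        action = agent.act(state)                  # A_n
        state, reward, done = env.step(action)     # S_{n+1}, R_n, terminal flag
        actions.append(action)
        rewards.append(reward)
        states.append(state)
        if done:
            break
    return states, actions, rewards
```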
In this sense, RL problems can be thought of as a series of classification problems
in which the agent, at every time step $n$, is tasked with choosing the action that maximizes
some function of the rewards following the $n$th action. This function is usually (and is
here) the discounted return,
$$\sum_{i=n}^{t} \lambda^{\,i-n} R_i,$$
where $t$ is the time step after which the terminal state is reached. The "discount factor"
$0 \le \lambda < 1$ is chosen as a hyperparameter in training; lower values make the model
prioritize immediate rewards, while higher values make the model weight long-term
rewards more heavily.
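As a concrete check of the formula, the short sketch below computes the discounted return from a list of rewards, following the indexing convention above; the function name and example numbers are ours.

```python
def discounted_return(rewards, n, discount):
    """Discounted return following action n: sum over i >= n of discount**(i - n) * R_i.
    `rewards[i]` is the reward R_i given in response to action A_i."""
    return sum(discount ** (i - n) * r for i, r in enumerate(rewards) if i >= n)

# Example: rewards R_0..R_3 with discount 0.9, return following action 1:
# 0 + 0.9 * 2.0 + 0.81 * 1.0 = 2.61
print(discounted_return([1.0, 0.0, 2.0, 1.0], n=1, discount=0.9))
```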
A finite Markov Decision Process (MDP) is a special case of these RL problems where
the number of states and possible rewards is finite, the number of actions that can be
chosen in response to each state is finite, and most importantly, the probability of any
state-reward pair given in response to any action and previous state is dependent only
on that action and previous state (and not any of the actions and states that preceded
them). This is known as the “Markov Property” and can be expressed symbolically by
$$\Pr\{R_{t+1}=r,\, S_{t+1}=s \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\} = \Pr\{R_{t+1}=r,\, S_{t+1}=s \mid S_t, A_t\}$$
for any r, s that lie in the set of possible rewards and states respectively. A more formal
definition is given in the appendix. MDPs are important because most proofs that provide
guarantees that RL methods converge are based on the MDP case, although empirically
great success has been achieved in applying these methods to non-MDP RL problems. Its
importance will be expounded upon in a later section. [8]
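As a toy illustration of the Markov property (not drawn from the paper's simulator), the dynamics of a finite MDP can be stored in a table keyed only on the current state-action pair; the sampled next state and reward never depend on earlier history. States, actions, and probabilities below are arbitrary placeholders.

```python
import random

# Each (state, action) pair maps to a list of ((next_state, reward), probability)
# outcomes. Because the table is keyed only on (state, action), the transition
# distribution satisfies the Markov property by construction.
dynamics = {
    ("idle", "up"):   [(("moving_up", 0.0), 0.9), (("idle", -1.0), 0.1)],
    ("idle", "stay"): [(("idle", 0.0), 1.0)],
}

def step(state, action):
    """Sample (next_state, reward) from the tabulated distribution."""
    outcomes, weights = zip(*dynamics[(state, action)])
    next_state, reward = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```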
2.2 Q-Learning
Finite MDPs are solved by finding a good policy $\pi(a \mid s)$, which is a probabilistic
function that defines the actions of the agent. In particular, $\pi(a \mid s)$ denotes the probability
of the agent choosing action $a$ next, if the last state were to be $s$. With respect to such
a policy $\pi$, we can define a value function $v_\pi(s)$, which gives the expected return of an
agent starting at state $s$ and following policy $\pi$. Similarly, we can define an action-value
function $q_\pi(s, a)$, which gives the expected return of an agent starting at state $s$, taking
action $a$, and thereafter following policy $\pi$.
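The remainder of this section is cut off in this excerpt. As background, the standard tabular Q-learning temporal-difference update that a Deep Q Network approximates can be sketched as follows; the learning rate and table initialization are illustrative choices, not values from the paper.

```python
from collections import defaultdict

# Standard tabular Q-learning update (the rule that a Deep Q Network
# approximates with a neural network). Hyperparameters are placeholders.
ALPHA = 0.1       # learning rate
LAMBDA = 0.95     # discount factor, as defined in Section 2.1

q_table = defaultdict(float)   # maps (state, action) -> estimated q(s, a)

def q_update(state, action, reward, next_state, possible_actions):
    """Move q(s, a) toward the bootstrapped target r + lambda * max_a' q(s', a')."""
    best_next = max(q_table[(next_state, a)] for a in possible_actions)
    target = reward + LAMBDA * best_next
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
```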