
A Bibliometric Analysis and Review on Reinforcement Learning for Transportation Applications
Can Li^a, Lei Bai^b, Lina Yao^a, S. Travis Waller^c, Wei Liu^{d,*}
^a School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
^b School of Electrical and Information Engineering, University of Sydney, Sydney, NSW 2008, Australia
^c Lighthouse Professorship “Transport Modelling and Simulation”, Faculty of Transport and Traffic Sciences, Technische Universität Dresden, Germany
^d Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University, Hong Kong, China
^* Corresponding author. Email address: wei.w.liu@polyu.edu.hk (Wei Liu)
Abstract
Transportation is the backbone of the economy and urban development. Improving the efficiency, sustainability, resilience, and intelligence of transportation systems is critical and also challenging. The constantly changing traffic conditions, the uncertain influence of external factors (e.g., weather, accidents), and the interactions among multiple travel modes and multi-type flows result in the dynamic and stochastic nature of transportation systems. The planning, operation, and control of transportation systems require flexible and adaptable strategies in order to deal with uncertainty, non-linearity, variability, and high complexity. In this context, Reinforcement Learning (RL), which enables autonomous decision-makers to interact with a complex environment, learn from experience, and select optimal actions, has rapidly emerged as one of the most useful approaches for smart transportation. This paper conducts a bibliometric analysis to identify the development of RL-based methods for transportation applications, the typical journals/conferences, and the leading topics in the field of intelligent transportation over the past ten years. Then, this paper presents a comprehensive literature review on applications of RL in transportation by categorizing different methods with respect to their specific application domains. Potential future research directions of RL applications and developments are also discussed.
Keywords: Machine Learning; Reinforcement Learning; Transportation; Bibliometric Analysis
1. Introduction
Travel demand is increasing along with the growth of social and economic activities, which results in great challenges in terms of crowding, congestion, emissions, energy, and safety. Meanwhile, massive amounts of multi-source data have been continuously and/or automatically collected. In this context, artificial intelligence (AI) methods have been proposed to take advantage of the growing data availability in order to address the challenges faced by transportation systems and travelers and thus improve system safety, sustainability, resilience, and efficiency.
Reinforcement Learning (RL) is an essential branch of AI-based methods; it is an experience-driven autonomous learning strategy and is often formulated based on Markov Decision Processes (MDPs). RL can be regarded as a process where the agent learns optimal behaviors/decisions by trial-and-error interactions with the environment (Kaelbling et al., 1996). It is more practical than supervised learning methods in many situations, since it does not necessarily require prior experience or sufficient historical data to train the agent (Ye et al., 2019). Some RL-based models scale well to high-dimensional systems (Desjardins and Chaib-Draa, 2011), making them adaptable to complex problems based on simple instances. Moreover, RL-based approaches that do not need re-optimization when changes occur in the environment may save computational effort and increase practicality (Zhou et al., 2019b). Also, the RL-based strategy is capable of
capturing the long-term effect of current actions and achieving greater efficiency and profits (Pan et al., 2019). These advantages of Reinforcement Learning have indeed attracted substantial research efforts that adopt and develop RL-based models for decision-making, especially in game playing. For example, the DeepMind team first applied RL to Atari 2600 games to learn optimal policies and achieved better performance than human players (Mnih et al., 2015). AlphaGo (Silver et al., 2016) further verified the superiority of Reinforcement Learning with an extremely high winning ratio.
In line with the success of Reinforcement Learning in the field of game playing, many studies have developed and/or applied RL strategies in the transportation sector. Experimental results on real-world or synthetic datasets demonstrate the effectiveness of Reinforcement Learning in learning and managing transportation systems, improving accuracy and efficiency, and reducing resource consumption. There are several existing reviews on RL studies in the transportation domain. In particular, Mannion et al. (2016) and Yau et al. (2017) focus on traffic signal control with RL; Aradi (2022) and Kiran et al. (2021) focus on deep RL models for autonomous driving. Three additional review studies (Abdulhai and Kattan, 2003; Haydari and Yilmaz, 2022; Farazi et al., 2021) have covered more transportation applications with Reinforcement Learning. Abdulhai and Kattan (2003) was published in 2003 and thus does not cover the substantial development of RL methods in transportation in recent years. Farazi et al. (2021) mainly focuses on deep RL methods for applications in transportation (e.g., autonomous driving and traffic signal control); however, non-deep RL models are not examined. Haydari and Yilmaz (2022) discusses both deep RL and non-deep RL methods and covers a wide range of RL applications in transportation (including traffic signal control, energy management for electric vehicles, road control, and autonomous driving). However, the importance of fairness in developing RL methods for transportation applications is ignored in these previous works. Moreover, none of them provides a bibliometric analysis of RL methods for transportation applications. Differently, this study takes advantage of bibliometric analysis to provide a systematic review on applications of both deep RL and non-deep RL methods in transportation, and provides more comprehensive coverage of applications than related existing reviews (e.g., including RL applications in taxi and bus systems that have not been covered by Haydari and Yilmaz (2022)). Besides, this paper summarizes several aspects that require substantial efforts in developing RL methods for real-world transportation applications, i.e., scalability, practicality, transferability, and fairness.
Fig. 1. Classification of RL Applications in Transportation
Specifically, this study provides a summary of applications of RL to address relevant transportation issues and takes advantage of the bibliometric analysis approach to construct network connections of journals/conferences and keywords, so as to identify the influential journals/conferences and the areas of concern. Several future directions of RL studies in transportation are also discussed. The major transportation topics involving RL methods discussed in this study include traffic control, taxi and ride-sourcing/sharing, assistant and autonomous driving, routing, public transportation and bike-sharing systems, and electric vehicles. The detailed classification of topics is shown in Fig. 1. This review collects over a hundred related papers mostly published in the last ten years in major journals in the transportation domain (e.g., Transportation Research Part B, Part C, IEEE Transactions on Intelligent Transportation Systems, IET Intelligent Transport Systems) and major related conferences in the computer science domain (e.g., AAAI, KDD, WWW, CIKM), which will be discussed in Section 3. To summarize, this paper provides a reference point for researchers conducting interdisciplinary Reinforcement Learning research in transportation and computer science.
The rest of this paper is structured as follows. Section 2 introduces the basic formulations of Reinforcement Learning and Section 3 conducts the bibliometric study. The reviews of the six topic categories of transportation applications with RL are presented in Sections 4 to 9, respectively. Future directions of RL in transportation and the conclusion of this paper are discussed in Section 10.
2. Preliminary
This section presents the basic formulation of the Markov Decision Process (MDP) and Reinforcement Learning, and the usage of data for RL algorithms. Three main categories of RL algorithms are also summarized, i.e., Value-based RL, Policy-based RL, and Actor-Critic-based RL.
2.1. Markov Decision Process
Reinforcement Learning is formulated based on the Markov Decision Process (MDP), a framework applied in stochastic control theory (Sutton and Barto, 2018). An MDP consists of five elements, $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ represents the set of states, $\mathcal{A}$ denotes the set of actions, $\mathcal{P}$ is the probabilistic transition function, $\mathcal{R}$ is the reward function, and $\gamma \in [0,1]$ denotes the discount factor. At time step $t$, under a state $s_t \in \mathcal{S}$, the agent performs an action $a_t \in \mathcal{A}$ and then receives an immediate reward $r_t(s_t, a_t) \in \mathcal{R}$ from the environment. The environment state then changes to $s_{t+1} \in \mathcal{S}$ based on the transition probability $\mathcal{P}(s_{t+1} \mid s_t, a_t)$. The goal of the agent is to find an optimal policy $\pi^*$ that maximizes the discounted cumulative reward, where $G = \sum_{t=1}^{T} \gamma^t r_t$, $\pi^* = \arg\max_{\pi} \mathbb{E}[G \mid \pi]$, and $\mathbb{E}$ represents the expected value.
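As a minimal illustration of this agent-environment loop, the following Python sketch rolls out one episode and accumulates the discounted return $G$. The two-state MDP, its transition probabilities, rewards, and the random policy are purely hypothetical and only serve to make the notation concrete.

```python
import random

# Illustrative two-state MDP: states {0, 1}, actions {0, 1}.
# P[s][a] is a list of (next_state, probability); R[s][a] is the immediate reward.
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
     1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 2.0}}
GAMMA = 0.95

def step(state, action):
    """Sample the next state from P(s'|s, a) and return (next_state, reward)."""
    next_states, probs = zip(*P[state][action])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, R[state][action]

def rollout(policy, horizon=20):
    """Run one episode and return the discounted cumulative reward G."""
    state, G = 0, 0.0
    for t in range(horizon):
        action = policy(state)
        state, reward = step(state, action)
        G += (GAMMA ** t) * reward
    return G

if __name__ == "__main__":
    uniform_random_policy = lambda s: random.choice([0, 1])
    print("Discounted return of one episode:", rollout(uniform_random_policy))
```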
2.2. Reinforcement Learning Algorithms
Based on the policy $\pi$, in order to evaluate the current state and the corresponding action in RL, the state-value function $V_\pi(s)$ and the state-action value function $Q_\pi(s, a)$ are introduced below:

$V_\pi(s) = \mathbb{E}[G \mid s]$   (1)

$Q_\pi(s, a) = \mathbb{E}[G \mid s, a]$   (2)

$V_\pi(s) = \sum_{a} \pi(a \mid s) \, Q_\pi(s, a)$   (3)

$Q_\pi(s, a) = \sum_{s'} \mathcal{P}(s' \mid s, a) \big( r(s, a) + \gamma V_\pi(s') \big)$   (4)

The optimal policy can be obtained by letting $\pi(s) = \arg\max_a Q(s, a)$, and the corresponding state-value function is $V_\pi(s) = \max_a Q_\pi(s, a)$. Then, the Bellman Expectation Equation (Bellman, 1952) can be used to solve the value function:

$V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma V_\pi(s') \big]$   (5)
Policy iteration and value iteration are utilized to solve for $\pi$ and $V_\pi(s)$. According to the iteration mechanism, RL algorithms can be divided into three categories, i.e., Value-based RL, Policy-based RL, and Actor-Critic-based RL. The basic definitions of these three categories are introduced below.
2.2.1. Value-based Reinforcement Learning
In the value iteration approach, the value function is updated following the Bellman Optimality Equation (Bellman, 1952):

$V_{k+1}(s) = \max_{a} \mathbb{E}\big[ r_{t+1} + \gamma V_k(S_{t+1}) \mid S_t = s, A_t = a \big]$   (6)
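A tabular sketch of this value iteration procedure is given below; the three-state, two-action MDP and its transition/reward tables are made-up illustrative values rather than any transportation model from the reviewed literature.

```python
import numpy as np

# Hypothetical MDP: 3 states, 2 actions.
# P[a, s, s'] is the transition probability; R[s, a] is the expected immediate reward.
P = np.array([[[0.8, 0.2, 0.0],
               [0.0, 0.9, 0.1],
               [0.1, 0.0, 0.9]],
              [[0.1, 0.9, 0.0],
               [0.0, 0.2, 0.8],
               [0.5, 0.0, 0.5]]])
R = np.array([[0.0, 1.0],
              [0.5, 2.0],
              [1.0, 0.0]])
GAMMA, TOL = 0.9, 1e-6

V = np.zeros(3)
while True:
    # Bellman optimality backup (Eq. 6): Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
    Q = R + GAMMA * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < TOL:
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)  # pi(s) = argmax_a Q(s, a)
print("Optimal state values:", V_new)
print("Greedy policy:", greedy_policy)
```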
Two classic approaches have been used to estimate $V_\pi(s)$: the Monte-Carlo-based approach (MC) and the Temporal-Difference-based approach (TD). In MC, starting from the current state $s_t$, the agent interacts with the environment until reaching a termination condition. Then, the cumulative reward $G_t$ can be calculated based on given rules. The aim is to drive $V_t(s)$ close to $G_t$, which leads to the following update rule:

$V_t(s) \leftarrow V_t(s) + \alpha \big( G_t - V_t(s) \big)$   (7)

where $\alpha$ is the learning rate. Since the return used by MC is only obtained at the end of the episode, there can be large variance in the cumulative reward. In contrast, TD only simulates one step of the episode and the update rule is as follows:

$V_t(s) \leftarrow V_t(s) + \alpha \big( R_t + \gamma V_t(s_{t+1}) - V_t(s) \big)$   (8)

which yields smaller variance but can be less accurate due to the lack of an overview of the whole episode.
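The difference between Eq. (7) and Eq. (8) can be summarized in two short update functions. This is a plain-Python sketch; the dictionary-based value table, the episode format, and the hyperparameter values are illustrative choices.

```python
ALPHA, GAMMA = 0.1, 0.95
V = {}  # state -> estimated value, defaulting to 0.0

def mc_update(episode):
    """Monte-Carlo: wait until the episode ends, then update every visited state
    towards its observed return G_t (Eq. 7). episode = [(s_0, r_1), (s_1, r_2), ...]."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + GAMMA * G  # discounted return from this state onwards
        V[state] = V.get(state, 0.0) + ALPHA * (G - V.get(state, 0.0))

def td_update(state, reward, next_state):
    """Temporal-Difference: update after a single step using the bootstrapped
    target R_t + gamma * V(s_{t+1}) (Eq. 8)."""
    target = reward + GAMMA * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + ALPHA * (target - V.get(state, 0.0))
```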
Typical TD-based strategies are Q-learning (Watkins and Dayan, 1992) and the State-Action-Reward-State-Action (Sarsa) algorithm (Sutton, 1996), which replace $V_\pi(s)$ with $Q_\pi(s, a)$ following Eq. (8). The update rule of Q-learning is:

$Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a_{t+1}} Q_\pi(s_{t+1}, a_{t+1}) - Q_\pi(s_t, a_t) \big)$   (9)

and the update rule of Sarsa is:

$Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) + \alpha \big( r_t + \gamma Q_\pi(s_{t+1}, a_{t+1}) - Q_\pi(s_t, a_t) \big)$   (10)

Both Q-learning and Sarsa involve two policies: a behavior policy that interacts with the environment and samples potential actions with randomness, and a target policy that is improved with the help of the sampled data so as to obtain the optimal policy. Furthermore, according to the data usage when updating value functions, RL can be divided into on-policy and off-policy methods. On-policy methods update the policy that is used to make decisions, while off-policy methods update a policy different from that used to generate the data (Sutton et al., 1998). Sarsa is an on-policy strategy (i.e., the target policy is the same as the behavior policy), while Q-learning is an off-policy method (i.e., the target policy greedily assumes the action with the largest estimated value when updating the value function, regardless of the action actually taken by the behavior policy).
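A minimal tabular sketch of the two update rules in Eq. (9) and Eq. (10) is given below; the ε-greedy behavior policy, the action set, and the hyperparameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
ACTIONS = [0, 1, 2]
Q = defaultdict(float)  # (state, action) -> estimated value

def epsilon_greedy(state):
    """Behavior policy: random action with probability epsilon, otherwise greedy."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next):
    """Off-policy (Eq. 9): the target uses max_a' Q(s', a'), independent of the next action taken."""
    target = r + GAMMA * max(Q[(s_next, a_next)] for a_next in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    """On-policy (Eq. 10): the target uses the action a' actually chosen by the behavior policy."""
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```

The only difference between the two update functions is the bootstrapped term, which is exactly the on-policy/off-policy distinction discussed above.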
In some applications, the large number of states and actions can hardly be handled by tabular Q-learning. Thus, deep models are used to approximate the value function. Mnih et al. (2015) proposes the Deep Q-Network (DQN) for finding the optimal policy. Given a Q-function $Q$ and a target Q-function $\hat{Q}$ initialized by $\hat{Q} = Q$, an experience replay buffer is utilized to store the transition $(s_t, a_t, r_t, s_{t+1})$ at each time step, where $a_t$ is obtained from $Q$. When enough samples have been collected, a mini-batch of samples is chosen randomly and the target value is computed by $\hat{Q}$:

$y = r_i + \gamma \max_{a} \hat{Q}(s_{i+1}, a)$   (11)

Then, the parameters of $Q$ are updated by driving $Q(s_i, a_i)$ close to $y$ with gradient descent. The target network $\hat{Q}$ is reset to $\hat{Q} = Q$ every $C$ steps.
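The DQN training step described above can be sketched as follows. This is a PyTorch sketch under illustrative assumptions about the network size, state dimension, and hyperparameters; the environment interaction and ε-greedy action selection are omitted.

```python
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA, BATCH_SIZE, TARGET_SYNC_EVERY = 0.99, 32, 1000  # illustrative hyperparameters
STATE_DIM, N_ACTIONS = 8, 4                             # illustrative problem size

def make_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_net = make_net()                                  # Q
target_net = make_net()                             # Q_hat
target_net.load_state_dict(q_net.state_dict())      # initialize Q_hat = Q
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)               # stores (s_t, a_t, r_t, s_{t+1}, done)

def train_step(step):
    """One gradient update from a random mini-batch of stored transitions."""
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)
    s = torch.tensor([t[0] for t in batch], dtype=torch.float32)       # (B, STATE_DIM)
    a = torch.tensor([t[1] for t in batch], dtype=torch.int64)         # (B,)
    r = torch.tensor([t[2] for t in batch], dtype=torch.float32)       # (B,)
    s_next = torch.tensor([t[3] for t in batch], dtype=torch.float32)  # (B, STATE_DIM)
    done = torch.tensor([t[4] for t in batch], dtype=torch.float32)    # (B,)

    # Target value y = r + gamma * max_a Q_hat(s', a) (Eq. 11); no bootstrap on terminal states.
    with torch.no_grad():
        y = r + GAMMA * (1.0 - done) * target_net(s_next).max(dim=1).values

    # Drive Q(s, a) towards y with gradient descent on the squared error.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodically reset the target network: Q_hat <- Q.
    if step % TARGET_SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```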
Further DQN-based methods such as Double-DQN (Van Hasselt et al., 2016) and Dueling-DQN (Wang et al., 2016) have been developed for more robust and faster policy learning. In detail, to reduce the overestimation caused by DQN (i.e., the estimated value being larger than the true value), Double-DQN decouples the selection and the evaluation of actions with two value functions, $Q^A(s, a)$ and $Q^B(s, a)$. The update functions can be represented as:

$Q^A(s_t, a_t) \leftarrow Q^A(s_t, a_t) + \alpha \big( r_t + \gamma Q^B(s_{t+1}, \arg\max_{a} Q^A(s_{t+1}, a)) - Q^A(s_t, a_t) \big)$
$Q^B(s_t, a_t) \leftarrow Q^B(s_t, a_t) + \alpha \big( r_t + \gamma Q^A(s_{t+1}, \arg\max_{a} Q^B(s_{t+1}, a)) - Q^B(s_t, a_t) \big)$   (12)

Moreover, Dueling-DQN decomposes the action-value function into a state-value function and an advantage function, i.e., $Q_\pi(s_t, a_t) = V_\pi(s_t) + A_\pi(s_t, a_t)$, where $A_\pi(s_t, a_t)$ denotes the advantage function used for strategy evaluation. An advantage larger than zero means that the action is better than the average action; otherwise, the current action is worse than the average action.
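In the deep setting, the decoupling in Eq. (12) amounts to selecting the greedy action with the online network and evaluating it with the target network, and the dueling decomposition amounts to a two-stream output head. The PyTorch sketch below is a minimal illustration; the function and class names, the terminal-state masking via `done`, and the mean-subtraction in the dueling head follow common practice and are assumptions rather than verbatim reproductions of the cited papers.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Double-DQN target value: the online network selects argmax_a Q(s', a),
    and the target network evaluates that action, reducing overestimation."""
    with torch.no_grad():
        best_actions = q_net(s_next).argmax(dim=1, keepdim=True)        # action selection
        q_eval = target_net(s_next).gather(1, best_actions).squeeze(1)  # action evaluation
        return r + gamma * (1.0 - done) * q_eval

class DuelingHead(torch.nn.Module):
    """Dueling-DQN head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a); subtracting the
    mean advantage (as in Wang et al., 2016) keeps V and A identifiable."""
    def __init__(self, hidden_dim, n_actions):
        super().__init__()
        self.value = torch.nn.Linear(hidden_dim, 1)
        self.advantage = torch.nn.Linear(hidden_dim, n_actions)

    def forward(self, h):
        adv = self.advantage(h)
        return self.value(h) + adv - adv.mean(dim=1, keepdim=True)
```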
2.2.2. Policy-based Reinforcement Learning
Value-based RL models are often effective in discrete control tasks but are hard to adapt to continuous action spaces (Arulkumaran et al., 2017). Policy-based RL can help address such issues and is more applicable in high-dimensional action spaces.
Sutton et al. (2000) introduces the Policy Gradient method, where the policy is written as $\pi_\theta(s, a)$ with the parameter vector $\theta$. The objective is to maximize the average reward $\rho(\pi)$, which can be obtained by following the policy gradient:

$\frac{\partial \rho}{\partial \theta} = \sum_{s} d^\pi(s) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} Q^\pi(s, a)$   (13)

where $d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid s_0, \pi)$ and $Q^\pi(s, a) = \mathbb{E}\big[ \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \mid s_t = s, a_t = a, \pi \big]$. Then, the Policy Gradient with Function Approximation can be written as:

$\frac{\partial \rho}{\partial \theta} = \sum_{s} d^\pi(s) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} f_w(s, a)$   (14)

where $\frac{\partial f_w(s, a)}{\partial w} = \frac{\partial \pi(s, a)}{\partial \theta} \frac{1}{\pi(s, a)}$.
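In practice, Eq. (13) is estimated from sampled trajectories. The following REINFORCE-style sketch in PyTorch is one such Monte-Carlo estimator; the softmax policy network, the hyperparameters, and the return-to-go computation are assumptions for illustration rather than the exact formulation of Sutton et al. (2000).

```python
import torch
import torch.nn as nn

GAMMA = 0.99
STATE_DIM, N_ACTIONS = 8, 4  # illustrative sizes

policy_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def policy_gradient_update(states, actions, rewards):
    """One update from a single episode: gradient of sum_t log pi(a_t|s_t) * G_t.

    states: (T, STATE_DIM) float tensor; actions: (T,) long tensor; rewards: list of floats.
    """
    # Discounted return-to-go G_t for each step of the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + GAMMA * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    logits = policy_net(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)

    # Ascend the policy gradient by descending the negative objective.
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```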
Proximal Policy Optimization (PPO) (Schulman et al., 2017) is an improved and widely adopted on-policy algorithm which addresses the problem that the choice of step size in the Policy Gradient algorithm is not straightforward.
2.2.3. Actor-Critic-based Reinforcement Learning
Actor-Critic-based (AC-based) RL (Sutton et al., 2000) takes advantage of both value-based and policy-based functions. The actor network interacts with the environment and generates actions. The critic network uses the value function to evaluate the performance of the actor and to guide the actor's actions in the next time step.
Some widely used algorithms in AC-based RL are Deterministic Policy Gradient (DPG) (Silver et al., 2014), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), Advantage Actor-Critic (A2C), and Asynchronous Advantage Actor-Critic (A3C) (Babaeizadeh et al., 2017). DPG and DDPG are off-policy methods that can be easier to train in high-dimensional action spaces, and DDPG is based on deep learning. A2C and A3C are on-policy algorithms, where A2C adopts a synchronous control method and A3C adopts an asynchronous control method for updating the actor network. A3C is often adopted in transportation problems for policy-making and is introduced in detail here as an example to illustrate the mechanism of asynchronous methods. A3C takes advantage of the Actor-Critic framework and introduces the asynchronous method to improve performance and efficiency. Multiple threads are utilized in A3C to collect data in parallel, i.e., each thread is an independent agent that explores an independent environment, and each agent can use a different strategy to sample data. Sampling data independently yields uncorrelated samples and increases the sampling speed.
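To make the actor and critic roles concrete, the sketch below implements a one-step advantage actor-critic update for a single worker (i.e., the synchronous A2C flavor rather than the multi-threaded A3C itself); the network architectures, loss weighting, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

GAMMA = 0.99
STATE_DIM, N_ACTIONS = 8, 4  # illustrative sizes

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_update(s, a, r, s_next, done):
    """One-step advantage actor-critic update on a batch of transitions.

    s, s_next: (B, STATE_DIM) float tensors; a: (B,) long tensor; r, done: (B,) float tensors.
    """
    v_s = critic(s).squeeze(1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(1)
        td_target = r + GAMMA * (1.0 - done) * v_next
    advantage = td_target - v_s  # A(s, a) ~ r + gamma * V(s') - V(s)

    log_probs = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_probs * advantage.detach()).mean()  # actor follows the critic's evaluation
    critic_loss = advantage.pow(2).mean()                   # critic regresses V(s) towards the TD target
    loss = actor_loss + 0.5 * critic_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In A3C, multiple workers each run such an update loop asynchronously against shared actor and critic parameters, which is what yields the uncorrelated samples and the higher sampling speed noted above.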
2.3. Data Usage
Both synthetic data and real-world data are widely used in studies of transportation applications with RL. On the one hand, it is easier and more feasible to obtain synthetic data, and a large number of scenarios/samples with different characteristics can be constructed to evaluate the proposed methods. However, some uncertainties, disruptions, and accidents occurring in practice are hard to measure or simulate, which leaves a certain but unknown gap from actual environments. On the other hand, real-world data can reflect the actual situations more accurately, which means the proposed method can be put into practice for the scenario corresponding to the collected data. However, it is harder to obtain complete and diverse real-world data due to several reasons, e.g., the confidentiality of various sources and the lack of information. Also, a real-world