2.2.1. Value-based Reinforcement Learning
In the value iteration approach, the value function is updated following the Bellman optimality
equation (Bellman, 1952):
V_{k+1}(s) = \max_a \mathbb{E}\left[ r_{t+1} + \gamma V_k(S_{t+1}) \mid S_t = s, A_t = a \right]   (6)
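A minimal sketch of this backup is given below for a small tabular MDP with a known transition
model. The data layout P[s][a] = [(prob, next_state, reward), ...], the state/action counts, and the
stopping tolerance are illustrative assumptions, not part of the original text.

import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
    """Repeatedly apply the Bellman optimality backup of Eq. (6) until convergence."""
    V = np.zeros(n_states)
    while True:
        V_new = np.empty_like(V)
        for s in range(n_states):
            # V_{k+1}(s) = max_a E[r + gamma * V_k(s')], expectation over the transition model
            V_new[s] = max(
                sum(p * (r + gamma * V[s_next]) for p, s_next, r in P[s][a])
                for a in range(n_actions)
            )
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new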
Two classic approaches have been used to estimate V^\pi(s): the Monte-Carlo-based approach (MC) and
the Temporal-Difference-based approach (TD). In MC, starting from the current state s_t, the agent
interacts with the environment until a termination condition is reached. Then, the cumulative
reward G_t can be calculated based on given rules. The aim is to drive V_t(s) close to G_t, which
leads to the following update rule:
V_t(s) \leftarrow V_t(s) + \alpha \left( G_t - V_t(s) \right)   (7)
where α is the learning rate. Since the reward obtained by MC is only estimated at the end of the
episode, the cumulative reward can have a large variance. In contrast, TD simulates only one step of
the episode and uses the following update rule:
V_t(s_t) \leftarrow V_t(s_t) + \alpha \left( R_t + \gamma V_t(s_{t+1}) - V_t(s_t) \right)   (8)
which yields smaller variance but can be less accurate due to a lack of an overview of the whole
episode.
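The contrast between Eqs. (7) and (8) can be made concrete with the small sketch below, which
updates a tabular value function V (a dict mapping states to floats). The episode representation as
a list of (state, reward) pairs is an assumption for illustration.

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Eq. (7): episode is a list of (state, reward) pairs collected until termination."""
    G = 0.0
    # Walk backwards so G accumulates the discounted return observed from each state.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Eq. (8): single-step bootstrap, applied without waiting for the episode to end."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])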
Typical TD-based strategies are Q-learning (Watkins and Dayan, 1992) and the State-Action-
Reward-State-Action (Sarsa) algorithm (Sutton, 1996), which replace V^\pi(s) with Q(s, a) in
Eq. (8). The update rule of Q-learning is:
Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a_{t+1}} Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t) \right)   (9)
The update rule of Sarsa is:
Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha \left( r_t + \gamma Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t) \right)   (10)
Both Q-learning and Sarsa involve two policies: a behavior policy that interacts with the
environment and samples actions with some randomness, and a target policy that is improved with
the help of the sampled data so as to obtain the optimal policy. Furthermore, according to how data
are used when updating value functions, RL methods can be divided into on-policy and off-policy
methods. On-policy methods update the policy that is used to make decisions, while off-policy
methods update a policy different from the one used to generate the data (Sutton et al., 1998).
Sarsa is an on-policy strategy (i.e., the target policy is the same as the behavior policy), while
Q-learning is an off-policy method (i.e., the target policy assumes the action with the largest
estimated value is selected when updating the value function).
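The sketch below contrasts the two update rules of Eqs. (9) and (10) and makes the on-policy versus
off-policy distinction explicit. The ε-greedy behavior policy and the Q-table layout (a NumPy array
of shape [n_states, n_actions]) are illustrative assumptions.

import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    # Behavior policy: explore with probability epsilon, otherwise act greedily.
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy (Eq. 9): the target bootstraps from the greedy (max) action,
    # regardless of which action the behavior policy actually takes next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy (Eq. 10): the target uses the action a_next actually chosen
    # by the same ε-greedy behavior policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])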
In some applications, a large number of states and actions can hardly be captured by tabular
Q-learning. Thus, deep models are used to approximate the value function. Mnih et al. (2015)
propose the Deep Q-Network (DQN) to find the optimal policy. Given a Q-function Q and a target
Q-function \hat{Q} initialized by \hat{Q} = Q, an experience replay buffer is used to store the
transition (s_t, a_t, r_t, s_{t+1}) at each time step, where a_t is obtained from Q. When enough
samples have been collected, a mini-batch of samples is drawn at random and the target value is
computed with \hat{Q}:
y = r_i + \gamma \max_a \hat{Q}(s_{i+1}, a)   (11)
Then, the parameters of Q are updated by driving Q(s_i, a_i) close to y with the gradient descent
method. The target network \hat{Q} is reset every C steps by \hat{Q} = Q.
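A compact training-step sketch of this procedure is given below: transitions are stored in a replay
buffer, the mini-batch target follows Eq. (11) using the frozen network \hat{Q}, and \hat{Q} is hard-reset
to Q every C steps. The small fully connected network (4-dimensional states, 2 actions), the PyTorch
implementation, the MSE loss, and the hyper-parameters are illustrative assumptions; Mnih et al.
(2015) used a convolutional network on raw pixels.

import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

# Online network Q and target network Q_hat, initialized identically (Q_hat = Q).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)   # experience replay buffer of (s, a, r, s_next, done)
gamma, C = 0.99, 1_000

def train_step(step, batch_size=32):
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)             # uniform random mini-batch
    s, a, r, s_next, done = zip(*batch)
    s = torch.tensor(np.array(s), dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(np.array(s_next), dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)
    # Eq. (11): y_i = r_i + gamma * max_a Q_hat(s_{i+1}, a); no bootstrap at terminal states.
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s_i, a_i)
    loss = nn.functional.mse_loss(q_sa, y)                # drive Q(s_i, a_i) toward y
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:                                     # hard reset Q_hat = Q every C steps
        target_net.load_state_dict(q_net.state_dict())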
Further DQN-based methods, such as Double-DQN (Van Hasselt et al., 2016) and Dueling-DQN
(Wang et al., 2016), have been developed for more robust and faster policy learning. In detail, to
reduce the overestimation caused by DQN (i.e., the estimated value being larger than the true value),
Double-DQN separates the selection and the evaluation of actions using two different value functions,
Q_A(s, a) and Q_B(s, a). The update rules can be represented as:
Q_A(s_t, a_t) \leftarrow Q_A(s_t, a_t) + \alpha \left( r_t + \gamma Q_B(s_{t+1}, \arg\max_a Q_A(s_{t+1}, a)) - Q_A(s_t, a_t) \right)
Q_B(s_t, a_t) \leftarrow Q_B(s_t, a_t) + \alpha \left( r_t + \gamma Q_A(s_{t+1}, \arg\max_a Q_B(s_{t+1}, a)) - Q_B(s_t, a_t) \right)   (12)
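The sketch below shows Eq. (12) in its tabular double Q-learning form: one table selects the action
and the other evaluates it, which reduces the maximization bias of standard Q-learning. The table
layout and the random choice of which table to update are illustrative assumptions.

import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Apply one of the two updates of Eq. (12), chosen at random."""
    if np.random.rand() < 0.5:
        a_star = int(np.argmax(QA[s_next]))          # select the action with QA ...
        target = r + gamma * QB[s_next, a_star]      # ... but evaluate it with QB
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        b_star = int(np.argmax(QB[s_next]))          # select the action with QB ...
        target = r + gamma * QA[s_next, b_star]      # ... but evaluate it with QA
        QB[s, a] += alpha * (target - QB[s, a])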