Self-Adaptive Driving in Nonstationary Environments through
Conjectural Online Lookahead Adaptation
Tao Li,1 Haozhe Lei,1 and Quanyan Zhu1
Abstract— Powered by deep representation learning, reinforcement learning (RL) provides an end-to-end learning framework capable of solving self-driving (SD) tasks without manual designs. However, time-varying nonstationary environments cause proficient but specialized RL policies to fail at execution time. For example, an RL-based SD policy trained under sunny days does not generalize well to rainy weather. Even though meta learning enables the RL agent to adapt to new tasks/environments, its offline operation fails to equip the agent with online adaptation ability when facing nonstationary environments. This work proposes an online meta reinforcement learning algorithm based on conjectural online lookahead adaptation (COLA). COLA determines the online adaptation at every step by maximizing the agent's conjecture of its future performance over a lookahead horizon. Experimental results demonstrate that under dynamically changing weather and lighting conditions, COLA-based self-adaptive driving outperforms the baseline policies in terms of online adaptability. A demo video, source code, and appendices are available at https://github.com/Panshark/COLA
I. INTRODUCTION
Recent breakthroughs from machine learning [1]–[3] have
spurred wide interest and explorations in learning-based
self-driving (SD) [4]. Among these endeavors, end-to-end
reinforcement learning [5] has attracted particular attention.
Unlike modularized approaches, where different modules
handle perception, localization, decision-making, and motion
control, end-to-end learning approaches aim to output a
synthesized driving policy from raw sensor data.
However, the limited generalization ability prevents RL
from wide application in real SD systems. To obtain a
satisfying driving policy, RL methods such as Q-learning and
its variants [2], [6] or policy-based ones [7], [8] require an
offline training process. Training is performed in advance in
a stationary environment, producing a policy that can be used
to make decisions at execution time in the same environments
as seen during training. However, the assumption that the
agent interacts with the same stationary environment as train-
ing time is often violated in practical problems. Unexpected
perturbations from the nonstationary environments pose a
great challenge to existing RL approaches, as the trained
policy does not generalize to new environments [9].
To elaborate on the limited generalization issue, we con-
sider the vision-based lane-keeping task under changing
weather conditions shown in Figure 1. The agent needs to
The authors contributed equally to this work.
1The authors are with the Department of Electrical and Computer En-
gineering, New York University, Brooklyn, NY, 11201, USA. {tl2636,
hl4155, qz494}@nyu.edu. This work is partially supported by
grant ECCS-1847056 from the National Science Foundation (NSF).
Fig. 1: An illustration of conjectural online lookahead adaptation. When driving in a changing environment, the agent first uses a residual neural network and Bayesian filtering to calibrate its belief about the hidden mode at every time step. Based on this belief, the agent conjectures its performance over a future lookahead horizon. The policy is then adapted through conjectural lookahead optimization, leading to a suboptimal (empirically) online control.
provide automatic driving controls (e.g., throttling, braking,
and steering) to keep a vehicle in its travel lane, using only
images from the front-facing camera as input. The driving
testbed is built on the CARLA platform [10]. As shown
in Figure 1, different weather conditions create significant
visual differences, and a vision-based policy learned under a
single weather condition may not generalize to other condi-
tions. As later observed in one experiment [see Figure 3a],
a vision-based SD policy trained under the cloudy daytime
condition does not generalize to the rainy condition. The
trained policy relies on the solid yellow lines in the camera
images to guide the vehicle. However, such a policy fails on
a rainy day when the lines are barely visible.
The challenge of limited generalization capabilities has
motivated various research efforts. In particular, as a
learning-to-learn approach, meta learning [11] stands out
as one of the well-accepted frameworks for designing fast
adaptation strategies. Note that the current meta learning
approaches primarily operate in an offline manner. For exam-
ple, in model-agnostic meta learning (MAML) [12], the meta
policy needs to be first trained in advance. When adapting
the meta policy to a new environment at execution
time, the agent needs a batch of sample trajectories to
compute the policy gradients to update the policy. However,
when interacting with a nonstationary environment online,
the agent has no access to prior knowledge or batches of
sample data. Hence, it is unable to learn a meta policy before-
hand. In addition, past experiences only reveal information
about previous environments. The gradient updates based
on past observations may not suffice to prepare the agent
for the future. In summary, the challenge mentioned above
persists when the RL agent interacts with a nonstationary
environment in an online manner.
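For concreteness, the batch-based MAML adaptation step discussed above can be sketched as follows; it highlights the dependence on a batch of new-task trajectories that is unavailable in the online setting. This is a minimal illustration under our own naming, not the MAML authors' implementation: theta, alpha, and policy_gradient_estimate are hypothetical placeholders for the meta-trained parameters, the adaptation step size, and a batch-based estimate of the policy gradient in the new environment.

def maml_adapt(theta, policy_gradient_estimate, alpha=0.1):
    # One MAML-style inner-loop update: starting from meta-trained parameters
    # theta, take a single gradient-ascent step along an estimate of the policy
    # gradient computed from a batch of trajectories collected in the new task.
    return [p + alpha * g for p, g in zip(theta, policy_gradient_estimate)]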
Our Contributions: To address the challenge of limited
online adaptation ability, this work proposes an online meta
reinforcement learning algorithm based on conjectural online
lookahead adaptation (COLA). Unlike previous meta RL
formulations focusing on policy learning, COLA is con-
cerned with learning meta adaptation strategies online in
a nonstationary environment. We refer to this novel learn-
ing paradigm as online meta adaptation learning (OMAL).
Specifically, COLA determines the adaptation mapping at
every step by maximizing the agent's conjecture of its
future performance over a lookahead horizon. This lookahead
optimization is approximately solved using off-policy data,
achieving real-time adaptation. A schematic illustration of
the proposed COLA algorithm is provided in Figure 1.
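The per-step flow of COLA described above can be summarized in the following conceptual sketch. All names here (update_belief, conjecture_objective, lookahead_adapt) are hypothetical placeholders standing in for the belief calibration, conjecture, and lookahead-optimization components; this is a schematic of the idea, not the released implementation.

def cola_step(theta, belief, observation, off_policy_buffer,
              update_belief, conjecture_objective, lookahead_adapt, horizon=5):
    # 1) Calibrate the belief over the hidden environment mode from the latest
    #    observation (e.g., a learned encoder plus Bayesian filtering).
    belief = update_belief(belief, observation)
    # 2) Form a conjecture of the performance over the next `horizon` steps
    #    under the current belief, reusing previously collected off-policy data.
    conjectured_return = conjecture_objective(theta, belief,
                                              off_policy_buffer, horizon)
    # 3) Adapt the policy parameters by approximately maximizing the
    #    conjectured lookahead objective.
    theta = lookahead_adapt(theta, conjectured_return)
    return theta, belief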
In summary, the main contributions of this work include
that 1) we formulate the problem of learning adaptation
strategies online in a nonstationary environment as online
meta-adaptation learning (OMAL); 2) we develop a real-
time OMAL algorithm based on conjectural online lookahead
adaptation (COLA); 3) experiments show that COLA equips
the self-driving agent with online adaptability, leading to self-
adaptive driving under dynamic weather.
Fig. 2: An example of a lane-keeping task in an urban driving environment under time-varying weather conditions.
Dynamic weather is realized by varying three weather pa-
rameters: cloudiness, rain, and puddles. Different weather
conditions cause significant visual differences in the low-
resolution image.
II. SELF-DRIVING IN NONSTATIONARY ENVIRONMENTS:
MODELING AND CHALLENGES
We model the lane-keeping task under time-varying weather conditions shown in Figure 2 as a Hidden-Mode Markov Decision Process (HM-MDP) [13]. The state input at time $t$ is a low-resolution image, denoted by $s_t \in \mathcal{S}$. Based on the state input, the agent applies a control action $a_t \in \mathcal{A}$, including acceleration, braking, and steering. The discrete control commands used in our experiment can be found in Appendix Section III. The driving performance at $s_t$ when taking $a_t$ is measured by a reward function $r_t = r(s_t, a_t)$. Upon receiving the control commands, the vehicle changes its pose/motion and moves to the next position. The new surrounding traffic conditions captured by the camera serve as the new state input, subject to a transition kernel $P(s_{t+1}|s_t, a_t; z_t)$, where $z_t \in \mathcal{Z}$ is the environment mode or latent variable hidden from the agent, corresponding to the weather condition at time $t$. The transition kernel $P(s_{t+1}|s_t, a_t; z_t)$ specifies how likely the agent is to observe a given image $s_{t+1}$ under the current weather condition $z_t$. The CARLA simulator controls the weather through three parameters: cloudiness, rain, and puddles, and $z_t$ is a three-dimensional vector whose entries are the values of these parameters.
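A minimal interface capturing this HM-MDP structure is sketched below for illustration; the sampler functions are hypothetical placeholders, and all CARLA-specific details (image rendering, vehicle dynamics) are abstracted away.

class HiddenModeMDP:
    # Hidden-Mode MDP: the transition kernel P(s'|s,a;z) depends on a mode z
    # that evolves on its own and is never revealed to the agent.
    def __init__(self, sample_transition, reward_fn, sample_next_mode,
                 init_state, init_mode):
        self.sample_transition = sample_transition  # (s, a, z) -> next state
        self.reward_fn = reward_fn                  # (s, a) -> scalar reward
        self.sample_next_mode = sample_next_mode    # z -> next hidden mode
        self.state, self.mode = init_state, init_mode

    def step(self, action):
        reward = self.reward_fn(self.state, action)
        self.state = self.sample_transition(self.state, action, self.mode)
        self.mode = self.sample_next_mode(self.mode)  # hidden mode drifts
        return self.state, reward                     # the mode is not returned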
In an HM-MDP, the hidden mode shifts stochastically according to a Markov chain $p_z(z_{t+1}|z_t)$ with initial distribution $\rho_z(z_1)$. As detailed in Appendix Section III, the hidden-mode (weather condition) shifts in our experiment are realized by varying the three weather parameters according to a periodic function. One realization is shown in Figure 2. Note that the hidden mode $z_t$, as its name suggests, is not observable in the decision-making process.
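For illustration, a periodic schedule of this kind could look like the following sketch; the period, phases, and amplitudes are hypothetical placeholders rather than the values used in our experiment (those are given in Appendix Section III).

import math

def weather_mode(t, period=200.0):
    # Hypothetical periodic schedule for the hidden mode z_t = (cloudiness,
    # rain, puddles); each parameter oscillates within [0, 100].
    phase = 2.0 * math.pi * t / period
    cloudiness = 50.0 * (1.0 + math.sin(phase))
    rain = 50.0 * (1.0 + math.sin(phase + 2.0))      # phase-shifted
    puddles = 50.0 * (1.0 + math.sin(phase + 4.0))   # phase-shifted further
    return cloudiness, rain, puddles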
Let $I_t = \{s_t, a_{t-1}, r_{t-1}\}$ be the set of the agent's observations at time $t$, referred to as the information structure [14]. Then, the agent's policy $\pi$ is a mapping from the past observations $\bigcup_{k=1}^{t} I_k$ to a distribution over the action set, $\Delta(\mathcal{A})$.
a) Reinforcement Learning and Meta Learning: Standard RL concerns a stationary MDP, where the mode remains unchanged (i.e., $z_t = z$) for the whole decision horizon $H$. Due to stationarity, one can search for the optimal policy within the class of Markov policies [15], where the policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ depends only on the current state input. This work considers a neural network policy $\pi(s, a; \theta)$, $\theta \in \Theta \subseteq \mathbb{R}^d$, as the state inputs are high-dimensional images. The RL problem for the stationary MDP (fixing $z$) is to find an optimal policy maximizing the expected cumulative reward discounted by $\gamma \in (0, 1]$:
\[
\max_{\theta}\; J_z(\theta) := \mathbb{E}_{P(\cdot\mid\cdot;z),\,\pi(\cdot\mid\cdot;\theta)}\Big[\sum_{t=1}^{H} \gamma^{t} r(s_t, a_t)\Big]. \tag{1}
\]
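For reference, a neural network policy of the form $\pi(s, a; \theta)$ over discrete control commands can be sketched as a small convolutional network that maps a camera image to a categorical action distribution. The architecture below is a hypothetical placeholder for illustration, not the network trained in our experiments.

import torch
import torch.nn as nn

class ImagePolicy(nn.Module):
    # pi(s, a; theta): maps a low-resolution camera image to a categorical
    # distribution over a discrete set of control commands.
    def __init__(self, num_actions, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.logits = nn.Linear(32, num_actions)

    def forward(self, image):
        # Returns a distribution object; sampling from it yields an action.
        return torch.distributions.Categorical(logits=self.logits(self.features(image)))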
We employ the policy gradient method to solve this maximization using sample trajectories $\tau = (s_1, a_1, \ldots, s_H, a_H)$. The idea is to apply gradient ascent with respect to the objective function $J_z(\theta)$. Following the policy gradient theorem [7], we obtain $\nabla J_z(\theta) = \mathbb{E}[g(\tau; \theta)]$, where $g(\tau; \theta) = \sum_{h=1}^{H} \nabla_{\theta} \log \pi(a_h|s_h; \theta) R_h(\tau)$ and $R_h(\tau) = \sum_{t=h}^{H} r(s_t, a_t)$. In RL practice, the policy gradient $\nabla J_z(\theta)$ is replaced by its Monte Carlo (MC) estimate, since evaluating the exact expectation is intractable. Given a batch of trajectories $\mathcal{D}(\theta) = \{\tau\}$ collected under the policy $\pi(\cdot|\cdot;\theta)$, the MC estimate is $\widehat{\nabla} J(\theta, \mathcal{D}(\theta)) := \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}(\theta)} g(\tau; \theta)$. Our implementation applies the actor-critic (AC) method [16], a variant of the policy gradient method in which a learned value-function critic replaces the MC return to reduce the variance of the gradient estimate.
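As a concrete illustration of the estimator $\widehat{\nabla} J$ above, a minimal Monte Carlo policy-gradient computation is sketched below. It assumes a PyTorch policy that maps a state to a torch.distributions.Distribution over actions; this assumption is for illustration only and is not the actor-critic implementation we use.

import torch

def mc_policy_gradient_loss(policy, trajectories):
    # Surrogate loss whose gradient equals -(1/|D|) * sum_tau g(tau; theta),
    # with g(tau; theta) = sum_h grad log pi(a_h|s_h; theta) * R_h(tau) and
    # R_h(tau) = sum_{t>=h} r(s_t, a_t). Each trajectory is a list of
    # (state, action, reward) tuples; policy(s) returns an action distribution.
    total = 0.0
    for tau in trajectories:
        # Rewards-to-go R_h(tau), computed backwards over the trajectory.
        returns, running = [], 0.0
        for (_, _, r) in reversed(tau):
            running += r
            returns.append(running)
        returns.reverse()
        for (s, a, _), ret in zip(tau, returns):
            total = total - policy(s).log_prob(a) * ret
    return total / len(trajectories)  # minimizing this ascends J_z(theta)

Calling .backward() on this loss and taking an optimizer step performs one gradient-ascent update on $J_z(\theta)$.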