
the meta policy to the new environment at execution
time, the agent needs a batch of sample trajectories to
compute the policy gradients to update the policy. However,
when interacting with a nonstationary environment online,
the agent has no access to prior knowledge or batches of
sample data. Hence, it is unable to learn a meta policy before-
hand. In addition, past experiences only reveal information
about previous environments. The gradient updates based
on past observations may not suffice to prepare the agent
for the future. In summary, the challenge mentioned above
persists when the RL agent interacts with a nonstationary
environment in an online manner.
Our Contributions To address the challenge of limited
online adaptation ability, this work proposes an online meta
reinforcement learning algorithm based on conjectural online
lookahead adaptation (COLA). Unlike previous meta RL
formulations focusing on policy learning, COLA is con-
cerned with learning meta adaptation strategies online in
a nonstationary environment. We refer to this novel learn-
ing paradigm as online meta adaptation learning (OMAL).
Specifically, COLA determines the adaptation mapping at every step by maximizing the agent's conjecture of its future performance over a lookahead horizon. This lookahead
optimization is approximately solved using off-policy data,
achieving real-time adaptation. A schematic illustration of
the proposed COLA algorithm is provided in Figure 1.
In summary, the main contributions of this work are as follows: 1) we formulate the problem of learning adaptation
strategies online in a nonstationary environment as online
meta-adaptation learning (OMAL); 2) we develop a real-
time OMAL algorithm based on conjectural online lookahead
adaptation (COLA); 3) experiments show that COLA equips
the self-driving agent with online adaptability, leading to self-
adaptive driving under dynamic weather.
Fig. 2: An example of a lane-keeping task in an urban
driving environment under time-varying weather conditions.
Dynamic weather is realized by varying three weather pa-
rameters: cloudiness, rain, and puddles. Different weather
conditions cause significant visual differences in the low-
resolution image.
II. SELF-DRIVING IN NONSTATIONARY ENVIRONMENTS:
MODELING AND CHALLENGES
We model the lane-keeping task under time-varying weather conditions shown in Figure 2 as a Hidden-Mode Markov Decision Process (HM-MDP) [13]. The state input at time $t$ is a low-resolution image, denoted by $s_t \in \mathcal{S}$. Based on the state input, the agent employs a control action $a_t \in \mathcal{A}$, including acceleration, braking, and steering. The discrete control commands used in our experiment can be found in Appendix Section III. The driving performance at $s_t$ when taking $a_t$ is measured by a reward function $r_t = r(s_t, a_t)$. Upon receiving the control commands, the vehicle changes its pose/motion and moves to the next position. The new surrounding traffic conditions captured by the camera serve as the new state input, subject to a transition kernel $P(s_{t+1} \mid s_t, a_t; z_t)$, where $z_t \in \mathcal{Z}$ is the environment mode or latent variable hidden from the agent, corresponding to the weather condition at time $t$. The transition kernel $P(s_{t+1} \mid s_t, a_t; z_t)$ specifies how likely the agent is to observe a certain image $s_{t+1}$ under the current weather condition $z_t$. The CARLA simulator controls the weather condition through three weather parameters: cloudiness, rain, and puddles; $z_t$ is a three-dimensional vector whose entries are the values of these parameters.
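For concreteness, the following minimal sketch shows how such a three-dimensional mode vector $z_t$ could be written into the simulator through CARLA's Python API. The mapping of "rain" and "puddles" to CARLA's `precipitation` and `precipitation_deposits` fields, as well as the host/port values, are our own assumptions for illustration, not details taken from this paper.

```python
import carla  # CARLA Python API


def apply_mode(world, z):
    """Push the hidden mode z_t = (cloudiness, rain, puddles) into the simulator.

    Assumption: "rain" and "puddles" correspond to CARLA's precipitation and
    precipitation_deposits parameters (all three range over [0, 100]).
    """
    weather = world.get_weather()
    weather.cloudiness = float(z[0])
    weather.precipitation = float(z[1])
    weather.precipitation_deposits = float(z[2])
    world.set_weather(weather)


if __name__ == "__main__":
    client = carla.Client("localhost", 2000)  # assumed default host/port
    client.set_timeout(5.0)
    apply_mode(client.get_world(), z=(80.0, 60.0, 40.0))  # illustrative values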
In an HM-MDP, the hidden mode shifts stochastically according to a Markov chain $p_z(z_{t+1} \mid z_t)$ with initial distribution $\rho_z(z_1)$. As detailed in Appendix Section III, the hidden-mode (weather condition) shifts in our experiment are realized by varying the three weather parameters according to a periodic function. One realization example is provided in Figure 2. Note that the hidden mode $z_t$, as its name suggests, is not observable in the decision-making process. Let $I_t = \{s_t, a_{t-1}, r_{t-1}\}$ be the set of the agent's observations at time $t$, referred to as the information structure [14]. Then, the agent's policy $\pi$ is a mapping from the past observations $\cup_{k=1}^{t} I_k$ to a distribution over the action set $\Delta(\mathcal{A})$.
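As a rough illustration of this hidden-mode structure, the sketch below rolls out an HM-MDP in which the mode follows a periodic schedule and the agent only ever observes $(s_t, a_{t-1}, r_{t-1})$. The period, amplitudes, and environment interface (`env.reset`, `env.step`, `env.set_mode`) are illustrative assumptions, not the settings used in the paper's appendix.

```python
import numpy as np


def periodic_mode(t, period=200):
    """Illustrative periodic schedule for z_t = (cloudiness, rain, puddles) in [0, 100]."""
    phase = 2.0 * np.pi * t / period
    return 50.0 * (1.0 + np.sin(phase + np.array([0.0, 1.0, 2.0])))


def rollout(env, policy, horizon=1000):
    """Roll out one episode; the agent sees I_t = {s_t, a_{t-1}, r_{t-1}}, never z_t."""
    s = env.reset()
    a_prev, r_prev = None, None
    history = []
    for t in range(1, horizon + 1):
        env.set_mode(periodic_mode(t))   # hidden-mode shift, invisible to the agent
        a = policy(s)                    # Markov policy: depends on the current image only
        s_next, r, done = env.step(a)
        history.append({"s": s, "a_prev": a_prev, "r_prev": r_prev})  # information structure
        s, a_prev, r_prev = s_next, a, r
        if done:
            break
    return history
```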
a) Reinforcement Learning and Meta Learning: Standard RL concerns a stationary MDP, where the mode remains unchanged (i.e., $z_t = z$) for the whole decision horizon $H$. Due to stationarity, one can search for the optimal policy within the class of Markov policies [15], where the policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ depends only on the current state input. This work considers a neural network policy $\pi(s, a; \theta)$, $\theta \in \Theta \subset \mathbb{R}^d$, since the state inputs are high-dimensional images. The RL problem for the stationary MDP (fixing $z$) is to find an optimal policy maximizing the expected cumulative rewards discounted by $\gamma \in (0, 1]$:
$$\max_{\theta} \; J_z(\theta) := \mathbb{E}_{P(\cdot|\cdot;z),\,\pi(\cdot;\theta)}\left[\sum_{t=1}^{H} \gamma^t r(s_t, a_t)\right]. \tag{1}$$
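As a small sanity check of the objective in (1), the following sketch estimates $J_z(\theta)$ by Monte Carlo from a batch of sampled trajectories; the trajectory format (a list of per-step rewards for each trajectory) is an assumption made purely for illustration.

```python
import numpy as np


def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=1}^{H} gamma^t * r_t for one trajectory (rewards indexed from t = 1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


def estimate_objective(reward_batches, gamma=0.99):
    """Monte Carlo estimate of J_z(theta): average discounted return over sampled trajectories."""
    return float(np.mean([discounted_return(rs, gamma) for rs in reward_batches]))
```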
We employ the policy gradient method to solve this maximization problem using sample trajectories $\tau = (s_1, a_1, \ldots, s_H, a_H)$. The idea is to apply gradient ascent with respect to the objective function $J_z(\theta)$. Following the policy gradient theorem [7], we obtain $\nabla J_z(\theta) = \mathbb{E}[g(\tau;\theta)]$, where $g(\tau;\theta) = \sum_{h=1}^{H} \nabla_\theta \log \pi(a_h \mid s_h; \theta) R_h(\tau)$ and $R_h(\tau) = \sum_{t=h}^{H} r(s_t, a_t)$. In RL practice, the policy gradient $\nabla J_z(\theta)$ is replaced by its Monte Carlo (MC) estimate, since evaluating the exact value is intractable. Given a batch of trajectories $\mathcal{D} = \{\tau\}$ collected under the policy $\pi(\cdot \mid \cdot; \theta)$, the MC estimate is $\hat{\nabla} J(\theta, \mathcal{D}(\theta)) := \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}(\theta)} g(\tau;\theta)$.
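To make the estimator $\hat{\nabla} J$ concrete, here is a minimal PyTorch-style sketch of the score-function gradient $g(\tau;\theta)$ averaged over a batch of trajectories. The policy-network interface and the trajectory container are hypothetical placeholders introduced only for illustration; the paper's actual implementation is actor-critic, as noted next.

```python
import torch


def policy_gradient_loss(policy_net, batch):
    """Surrogate loss whose gradient equals -1/|D| * sum_tau g(tau; theta).

    Assumed interfaces: `policy_net(s)` returns a torch.distributions.Categorical
    over discrete controls; `batch` is a list of trajectories, each a list of
    (state, action, reward) tuples.
    """
    losses = []
    for traj in batch:
        rewards = [r for (_, _, r) in traj]
        # Reward-to-go R_h(tau) = sum_{t=h}^{H} r(s_t, a_t), computed backwards.
        returns, running = [], 0.0
        for r in reversed(rewards):
            running += r
            returns.append(running)
        returns.reverse()
        for (s, a, _), r_to_go in zip(traj, returns):
            dist = policy_net(torch.as_tensor(s, dtype=torch.float32))
            log_prob = dist.log_prob(torch.as_tensor(a))
            losses.append(-log_prob * r_to_go)  # minimizing -J performs ascent on J
    return torch.stack(losses).sum() / len(batch)


# Usage sketch: optimizer.zero_grad(); policy_gradient_loss(net, D).backward(); optimizer.step()
```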
Our implementation applies the actor-critic (AC) method [16], a