
the meta policy to the new environment at execution
time, the agent needs a batch of sample trajectories to
compute the policy gradients to update the policy. However,
when interacting with a nonstationary environment online,
the agent has no access to prior knowledge or batches of
sample data. Hence, it is unable to learn a meta policy before-
hand. In addition, past experiences only reveal information
about previous environments. The gradient updates based
on past observations may not suffice to prepare the agent
for the future. In summary, the challenge mentioned above
persists when the RL agent interacts with a nonstationary
environment in an online manner.
Our Contributions To address the challenge of limited
online adaptation ability, this work proposes an online meta
reinforcement learning algorithm based on conjectural online
lookahead adaptation (COLA). Unlike previous meta RL
formulations focusing on policy learning, COLA is con-
cerned with learning meta adaptation strategies online in
a nonstationary environment. We refer to this novel learn-
ing paradigm as online meta adaptation learning (OMAL).
Specifically, COLA determines the adaptation mapping at every step by maximizing the agent's conjecture of its future performance over a lookahead horizon. This lookahead
optimization is approximately solved using off-policy data,
achieving real-time adaptation. A schematic illustration of
the proposed COLA algorithm is provided in Figure 1.
In summary, the main contributions of this work are as follows: 1) we formulate the problem of learning adaptation
strategies online in a nonstationary environment as online
meta-adaptation learning (OMAL); 2) we develop a real-
time OMAL algorithm based on conjectural online lookahead
adaptation (COLA); 3) experiments show that COLA equips
the self-driving agent with online adaptability, leading to self-
adaptive driving under dynamic weather.
Fig. 2: An example of a lane-keeping task in an urban
driving environment under time-varying weather conditions.
Dynamic weather is realized by varying three weather pa-
rameters: cloudiness, rain, and puddles. Different weather
conditions cause significant visual differences in the low-
resolution image.
II. SELF-DRIVING IN NONSTATIONARY ENVIRONMENTS:
MODELING AND CHALLENGES
We model the lane-keeping task under time-varying weather conditions shown in Figure 2 as a Hidden-Mode Markov Decision Process (HM-MDP) [13]. The state input at time $t$ is a low-resolution image, denoted by $s_t \in \mathcal{S}$. Based on the state input, the agent employs a control action $a_t \in \mathcal{A}$, including acceleration, braking, and steering. The discrete control commands used in our experiment can be found in Appendix Section III. The driving performance at $s_t$ when taking $a_t$ is measured by a reward function $r_t = r(s_t, a_t)$. Upon receiving the control commands, the vehicle changes its pose/motion and moves to the next position. The new surrounding traffic conditions captured by the camera serve as the new state input, subject to a transition kernel $P(s_{t+1} \mid s_t, a_t; z_t)$, where $z_t \in \mathcal{Z}$ is the environment mode or latent variable hidden from the agent, corresponding to the weather condition at time $t$. The transition kernel $P(s_{t+1} \mid s_t, a_t; z_t)$ specifies how likely the agent is to observe a certain image $s_{t+1}$ under the current weather condition $z_t$. The CARLA simulator controls the weather condition through three weather parameters: cloudiness, rain, and puddles; $z_t$ is a three-dimensional vector whose entries are the values of these parameters.
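For concreteness, the following minimal sketch shows how such a three-dimensional mode vector $z_t$ could be written into the simulator through CARLA's Python API. The mapping of "rain" and "puddles" to CARLA's `precipitation` and `precipitation_deposits` fields, as well as the host/port values, are our own assumptions for illustration, not details taken from this paper.

```python
import carla  # CARLA Python API


def apply_mode(world, z):
    """Push the hidden mode z_t = (cloudiness, rain, puddles) into the simulator.

    Assumption: "rain" and "puddles" correspond to CARLA's precipitation and
    precipitation_deposits parameters (all three range over [0, 100]).
    """
    weather = world.get_weather()
    weather.cloudiness = float(z[0])
    weather.precipitation = float(z[1])
    weather.precipitation_deposits = float(z[2])
    world.set_weather(weather)


if __name__ == "__main__":
    client = carla.Client("localhost", 2000)  # assumed default host/port
    client.set_timeout(5.0)
    apply_mode(client.get_world(), z=(80.0, 60.0, 40.0))  # illustrative values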
In an HM-MDP, the hidden mode shifts stochastically according to a Markov chain $p_z(z_{t+1} \mid z_t)$ with initial distribution $\rho_z(z_1)$. As detailed in Appendix Section III, the hidden-mode (weather condition) shifts in our experiment are realized by varying the three weather parameters according to a periodic function. One realization example is provided in Figure 2. Note that the hidden mode $z_t$, as its name suggests, is not observable in the decision-making process. Let $I_t = \{s_t, a_{t-1}, r_{t-1}\}$ be the set of the agent's observations at time $t$, referred to as the information structure [14]. Then, the agent's policy $\pi$ is a mapping from the past observations $\cup_{k=1}^{t} I_k$ to a distribution over the action set $\Delta(\mathcal{A})$.
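As a rough illustration of this hidden-mode structure, the sketch below rolls out an HM-MDP in which the mode follows a periodic schedule and the agent only ever observes $(s_t, a_{t-1}, r_{t-1})$. The period, amplitudes, and environment interface (`env.reset`, `env.step`, `env.set_mode`) are illustrative assumptions, not the settings used in the paper's appendix.

```python
import numpy as np


def periodic_mode(t, period=200):
    """Illustrative periodic schedule for z_t = (cloudiness, rain, puddles) in [0, 100]."""
    phase = 2.0 * np.pi * t / period
    return 50.0 * (1.0 + np.sin(phase + np.array([0.0, 1.0, 2.0])))


def rollout(env, policy, horizon=1000):
    """Roll out one episode; the agent sees I_t = {s_t, a_{t-1}, r_{t-1}}, never z_t."""
    s = env.reset()
    a_prev, r_prev = None, None
    history = []
    for t in range(1, horizon + 1):
        env.set_mode(periodic_mode(t))   # hidden-mode shift, invisible to the agent
        a = policy(s)                    # Markov policy: depends on the current image only
        s_next, r, done = env.step(a)
        history.append({"s": s, "a_prev": a_prev, "r_prev": r_prev})  # information structure
        s, a_prev, r_prev = s_next, a, r
        if done:
            break
    return history
```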
a) Reinforcement Learning and Meta Learning: Standard RL concerns a stationary MDP, where the mode remains unchanged (i.e., $z_t = z$) for the whole decision horizon $H$. Due to stationarity, one can search for the optimal policy within the class of Markov policies [15], where the policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ depends only on the current state input. This work considers a neural network policy $\pi(s, a; \theta)$, $\theta \in \Theta \subset \mathbb{R}^d$, since the state inputs are high-dimensional images. The RL problem for the stationary MDP (fixing $z$) is to find an optimal policy maximizing the expected cumulative rewards discounted by $\gamma \in (0, 1]$:
$$\max_{\theta} \; J_z(\theta) := \mathbb{E}_{P(\cdot|\cdot;z),\,\pi(\cdot;\theta)}\left[\sum_{t=1}^{H} \gamma^t r(s_t, a_t)\right]. \tag{1}$$
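As a small sanity check of the objective in (1), the following sketch estimates $J_z(\theta)$ by Monte Carlo from a batch of sampled trajectories; the trajectory format (a list of per-step rewards for each trajectory) is an assumption made purely for illustration.

```python
import numpy as np


def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t=1}^{H} gamma^t * r_t for one trajectory (rewards indexed from t = 1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


def estimate_objective(reward_batches, gamma=0.99):
    """Monte Carlo estimate of J_z(theta): average discounted return over sampled trajectories."""
    return float(np.mean([discounted_return(rs, gamma) for rs in reward_batches]))
```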
We employ the policy gradient method to solve this maximization problem using sample trajectories $\tau = (s_1, a_1, \ldots, s_H, a_H)$. The idea is to apply gradient ascent with respect to the objective function $J_z(\theta)$. Following the policy gradient theorem [7], we obtain $\nabla J_z(\theta) = \mathbb{E}[g(\tau;\theta)]$, where $g(\tau;\theta) = \sum_{h=1}^{H} \nabla_\theta \log \pi(a_h \mid s_h; \theta) R_h(\tau)$ and $R_h(\tau) = \sum_{t=h}^{H} r(s_t, a_t)$. In RL practice, the policy gradient $\nabla J_z(\theta)$ is replaced by its Monte Carlo (MC) estimate, since evaluating the exact value is intractable. Given a batch of trajectories $\mathcal{D} = \{\tau\}$ collected under the policy $\pi(\cdot \mid \cdot; \theta)$, the MC estimate is $\hat{\nabla} J(\theta, \mathcal{D}(\theta)) := \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}(\theta)} g(\tau;\theta)$.
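To make the estimator $\hat{\nabla} J$ concrete, here is a minimal PyTorch-style sketch of the score-function gradient $g(\tau;\theta)$ averaged over a batch of trajectories. The policy-network interface and the trajectory container are hypothetical placeholders introduced only for illustration; the paper's actual implementation is actor-critic, as noted next.

```python
import torch


def policy_gradient_loss(policy_net, batch):
    """Surrogate loss whose gradient equals -1/|D| * sum_tau g(tau; theta).

    Assumed interfaces: `policy_net(s)` returns a torch.distributions.Categorical
    over discrete controls; `batch` is a list of trajectories, each a list of
    (state, action, reward) tuples.
    """
    losses = []
    for traj in batch:
        rewards = [r for (_, _, r) in traj]
        # Reward-to-go R_h(tau) = sum_{t=h}^{H} r(s_t, a_t), computed backwards.
        returns, running = [], 0.0
        for r in reversed(rewards):
            running += r
            returns.append(running)
        returns.reverse()
        for (s, a, _), r_to_go in zip(traj, returns):
            dist = policy_net(torch.as_tensor(s, dtype=torch.float32))
            log_prob = dist.log_prob(torch.as_tensor(a))
            losses.append(-log_prob * r_to_go)  # minimizing -J performs ascent on J
    return torch.stack(losses).sum() / len(batch)


# Usage sketch: optimizer.zero_grad(); policy_gradient_loss(net, D).backward(); optimizer.step()
```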
Our implementation applies the actor-critic (AC) method [16], a