CW-ERM: Improving Autonomous Driving Planning with
Closed-loop Weighted Empirical Risk Minimization
Eesha Kumar1,∗, Yiming Zhang2, Stefano Pini1, Simon Stent1,
Ana Ferreira2, Sergey Zagoruyko1, Christian S. Perone1,∗
Abstract— The imitation learning of self-driving vehicle poli-
cies through behavioral cloning is often carried out in an open-
loop fashion, ignoring the effect of actions on future states.
Training such policies purely with Empirical Risk Minimization
(ERM) can be detrimental to real-world performance, as it
biases policy networks towards matching only open-loop behav-
ior, showing poor results when evaluated in closed-loop. In this
work, we develop an efficient and simple-to-implement principle
called Closed-loop Weighted Empirical Risk Minimization (CW-
ERM), in which a closed-loop evaluation procedure is first
used to identify training data samples that are important for
practical driving performance, and these samples are then used to
help debias the policy network. We evaluate CW-ERM on a
challenging urban driving dataset and show that this procedure
yields a significant reduction in collisions as well as in other non-
differentiable closed-loop metrics.
I. INTRODUCTION
Learning effective planning policies for self-driving vehi-
cles (SDVs) from data such as human demonstrations remains
one of the major challenges in robotics and machine learning.
Since early works such as ALVINN [1], Imitation Learning
has seen major recent developments using modern Deep
Neural Networks (DNNs) [2]–[7]. Imitation Learning (IL),
and especially Behavioral Cloning (BC), however, still face
fundamental challenges [8], including causal confusion [9]
(later identified as a feedback-driven covariate shift [10]) and
dataset biases [8], to name a few.
There is one particular limitation of IL policies trained
with BC that is, however, often overlooked: the mismatch
between training and inference-time execution of the policy
actions. Most of the time, BC policies are trained in an open-
loop fashion, predicting the next action given the immediate
previous action and optionally conditioned on recent past
actions [2]–[5], [7]. When executed in the real world, however,
these policies affect future states. Small prediction errors
can drive covariate shift and make the network predict in an
out-of-distribution regime.
In this work, we address the mismatch between training
and inference through the development of a simple training
principle. Using a closed-loop simulator, we first identify and
then reweight samples that are important for the closed-loop
performance of the planner. We call this approach CW-ERM (Closed-loop Weighted Empirical Risk Minimization), since we use Weighted ERM [11] to correct the training distribution in favour of closed-loop performance. We extensively evaluate this principle on real-world urban driving data and show that it can achieve significant improvements on planner metrics that matter for real-world performance (e.g. collisions).

1 Author is with Woven Planet United Kingdom Limited, 114-116 Curtain Road, London, United Kingdom, EC2A 3AH. firstname.lastname@woven-planet.global
2 Author is with Woven Planet North America, Inc., 900 Arastradero Rd, Palo Alto, CA, USA 94304. firstname.lastname@woven-planet.global
∗ Equal contribution.
Our contributions are therefore the following:
• We motivate and propose Closed-loop Weighted Empirical Risk Minimization (CW-ERM), a technique that leverages closed-loop evaluation metrics acquired from policy rollouts in a simulator to debias the policy network and reduce the distributional differences between training (open-loop) and inference time (closed-loop);
• we evaluate CW-ERM experimentally on a challenging urban driving dataset in a closed-loop fashion to show that our method, although simple to implement, yields significant improvements in closed-loop performance without requiring complex and computationally expensive closed-loop training methods;
• we also show an important connection of our method to a family of methods that addresses covariate shift through density ratio estimation.
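To make the two-stage procedure concrete, the listing below sketches one possible implementation in PyTorch-style Python: an ERM-trained policy is rolled out in a closed-loop simulator to flag scenes with failures (e.g. collisions), and samples from those scenes are then upweighted in a weighted-ERM retraining step. The simulator API, the UPWEIGHT constant, the binary weighting scheme and all helper names are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn.functional as F

UPWEIGHT = 5.0  # assumed weight multiplier for samples from closed-loop failure scenes

def identify_failure_scenes(policy, simulator, scene_ids):
    # Stage 1: roll the ERM-trained policy out in closed loop and flag scenes
    # that exhibit closed-loop failures (e.g. collisions or off-road events).
    # 'simulator.rollout' is a hypothetical API returning per-scene metrics.
    failed = set()
    for scene_id in scene_ids:
        metrics = simulator.rollout(policy, scene_id)
        if metrics["collisions"] > 0 or metrics["off_road"] > 0:
            failed.add(scene_id)
    return failed

def cw_erm_step(policy, optimizer, batch, failed_scenes):
    # Stage 2: one weighted-ERM update in which per-sample weights are larger
    # for samples coming from scenes that failed in closed-loop evaluation.
    states, expert_actions, scene_ids = batch
    pred_actions = policy(states)
    per_sample_loss = F.mse_loss(pred_actions, expert_actions,
                                 reduction="none").mean(dim=-1)
    weights = torch.tensor(
        [UPWEIGHT if sid in failed_scenes else 1.0 for sid in scene_ids],
        device=per_sample_loss.device)
    loss = (weights * per_sample_loss).sum() / weights.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Here the weights are a simple binary upweighting chosen purely for illustration.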
In Section II, we detail the proposed CW-ERM, and in Section IV we present experiments comparing CW-ERM against ERM.
II. METHODOLOGY
A. Problem Setup
The traditional supervised-learning formulation of imitation learning, also called behavioral cloning (BC), seeks the policy $\hat{\pi}_{BC}$:

$$\hat{\pi}_{BC} = \operatorname*{argmin}_{\pi \in \Pi} \; \mathbb{E}_{s \sim d_{\pi^*},\, a \sim \pi^*(s)}\big[\ell(s, a, \pi)\big] \quad (1)$$
where the state $s$ is sampled from the expert state distribution $d_{\pi^*}$ induced by following the expert policy $\pi^*$, and actions $a$ are sampled from the expert policy $\pi^*(s)$. The loss $\ell$, also known as the surrogate loss, is minimized to find the policy $\hat{\pi}_{BC}$ that best mimics the unknown expert policy $\pi^*$. In practice, we only observe a finite set of state-action pairs $\{(s_i, a_i^*)\}_{i=1}^{m}$, so the optimization is only approximate and we follow the Empirical Risk Minimization (ERM) principle to find the policy $\pi$ from the policy class $\Pi$.
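For reference, a minimal empirical counterpart of Eq. (1) can be sketched as below; the L2 surrogate loss and the deterministic policy network are illustrative assumptions rather than choices prescribed by the paper.

import torch

def erm_bc_loss(policy, states, expert_actions):
    # Empirical risk over the m observed expert state-action pairs:
    # (1/m) * sum_i || policy(s_i) - a_i* ||^2
    pred = policy(states)  # shape (m, action_dim)
    return ((pred - expert_actions) ** 2).sum(dim=-1).mean()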
If we let $\mathbb{E}_{s \sim d_{\pi^*},\, a \sim \pi^*(s)}[\ell(s, a, \pi)] = \epsilon$, then it follows that $J(\pi) \le J(\pi^*) + T^2 \epsilon$, as shown by the proof in [13], where $J$ is the total cost and $T$ is the task horizon. As we can see, the total cost can grow quadratically in the task horizon $T$.
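As a rough intuition for this bound (a back-of-the-envelope sketch, not the actual proof in [13]): if the learned policy incurs surrogate error $\epsilon$ at each step and a mistake made at step $t$ can affect up to the $T - t + 1$ remaining steps of the episode, the excess cost accumulates as

$$J(\pi) - J(\pi^*) \;\lesssim\; \sum_{t=1}^{T} (T - t + 1)\,\epsilon \;=\; \frac{T(T+1)}{2}\,\epsilon \;\le\; T^2 \epsilon .$$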