are explicitly completed to a sequence of states w.r.t. the
system dynamics represented as equality constraints. Then, a
gradient-based correction accounts for inequality constraints
while satisfying the equality constraints.
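As a minimal sketch of these two steps (assuming explicit Euler integration and a finite-difference gradient; the names `complete`, `correct`, and `violation` are illustrative, not the paper's implementation):

```python
import numpy as np

def complete(u_seq, x0, f, dt):
    # Completion: unroll the dynamics f over the predicted controls so the
    # state sequence satisfies the equality constraints (the dynamics) by construction.
    xs = [x0]
    for u in u_seq:
        xs.append(xs[-1] + dt * f(xs[-1], u))  # explicit Euler step (an assumption)
    return np.stack(xs)

def fd_grad(fun, u, eps=1e-5):
    # Finite-difference gradient; a differentiable framework would use autodiff here.
    g = np.zeros_like(u)
    for idx in np.ndindex(u.shape):
        d = np.zeros_like(u)
        d[idx] = eps
        g[idx] = (fun(u + d) - fun(u - d)) / (2.0 * eps)
    return g

def correct(u_seq, x0, f, dt, violation, steps=20, lr=0.05):
    # Correction: gradient descent on the scalar inequality-constraint violation,
    # performed in control space so every iterate stays dynamically feasible.
    u = u_seq.astype(float).copy()
    for _ in range(steps):
        u -= lr * fd_grad(lambda v: violation(complete(v, x0, f, dt)), u)
    return u
```

The corrected controls are then unrolled once more to obtain the final constraint-compliant trajectory.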
Contributions. To summarize, the paper makes the following contributions: (i) It proposes a general Differentiable Constraint Imitation Learning (DCIL) framework for incorporating constraints, which is agnostic to the particular neural network architecture. (ii) It demonstrates the approach's effectiveness in one mobile robot and one automated driving environment during closed-loop evaluation, where the approach outperforms multiple state-of-the-art baselines across a variety of metrics.
II. RELATED WORK
The proposed approach is situated within the broader body of work on integrating constraints into learning-based methods and IL in the robotics and automated driving literature. This section classifies related work into two major categories.
Modification of the Training Loss. The first class of approaches incorporates constraints by modifying the training loss. A simple approach adds the constraints as weighted penalties to the imitation loss (a generic form is sketched below). [10] proposes an application for automated driving and shows that additional loss functions penalizing constraint violations improve the closed-loop performance. [11] modifies the training process with a primal-dual formulation and converts the constrained optimization problem into an alternating min-max optimization with Lagrangian variables.
with Lagrangian variables. [12] uses an energy-based for-
mulation. During training, the loss pushes down the energy
of positive samples (close to the expert demonstration) and
pulls up the energy-values on negative samples, which violate
constraints (e.g., colliding trajectories). While these methods
are more robust to errors in constraint-specifications, they
often fail in OOD scenarios as errors made by the learned
model still compound over time. That can lead to unexpected
behavior like leaving the driving corridor [8].
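For illustration, a generic form of such a penalty-augmented loss (a sketch in our notation, not the exact loss of [10] or [11]) is
\[
\mathcal{L}(\theta) = \mathcal{L}_{\text{IL}}(\theta) + \sum_{j} w_j \max\big(0,\, g_j(\hat{y})\big)^2,
\]
where $\mathcal{L}_{\text{IL}}$ denotes the imitation loss, $g_j(\hat{y}) \le 0$ the inequality constraints, and $w_j > 0$ hand-tuned penalty weights.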
Projection onto Feasible Sets. The second group of approaches projects the neural network's output onto a solution that is compliant with the constraints. Instead of predicting a future sequence of states, a neural network predicts a sequence of controls [13]; unrolling a dynamics model then generates a state trajectory that is feasible with respect to the robot's system dynamics. However, the approach does not account for general nonlinear inequality constraints. [14] presents an inverse reinforcement learning approach in which a set of safe trajectories is sampled first, and learning is only performed on the safe samples. SafetyNet [15] trains an IL planner and proposes a sampling-based fallback layer performing sanity checks. [16] proposes a similar approach using quadratic optimization. Other works incorporate quadratic programs [17] or convex optimization programs [18] as an implicit layer in neural network architectures (a generic form is sketched below); this layer forms the network's last layer and projects the output onto a set of feasible solutions. [19] directly modifies the network architecture by encoding convex polytopes. However, sampling, quadratic optimization, and convexity severely restrict the solution space.
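In their generic form (a sketch; the concrete formulations in [16]-[18] differ), such projection layers solve
\[
\hat{y}_{\text{proj}} = \arg\min_{y} \; \|y - \hat{y}\|_2^2 \quad \text{s.t.} \quad A y \le b,
\]
a quadratic program whose solution can be differentiated with respect to the network output $\hat{y}$; the convexity that makes the layer tractable is also what restricts the solution space.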
Most closely related to our approach is the work of [9]. The authors present a hybrid approach that accounts for nonconvex, nonlinear constraints; their experiments deal with numerical examples and simple network architectures. We extend this work to the real-world-oriented robot IL setting with more complex architectures for high-dimensional feature spaces. Further, we use an explicit completion by unrolling a robot dynamics model.
Recently, concurrent works have proposed approaches that also incorporate nonlinear constraints using Signal Temporal Logic [20] and differentiable control barrier functions [21], which emphasizes the importance of handling nonlinear constraints. In contrast, our approach relies on a differentiable completion and gradient-based correction procedure, and the training is guided by auxiliary losses. [20] evaluates on simple toy examples, whereas our analysis considers a more realistic environment. [21] evaluates in real-world experiments but only uses a circular robot footprint and object representation, whereas this work evaluates using different constraints. Moreover, our approach is able to resolve incorrect constraints that render the problem infeasible.
III. PROBLEM FORMULATION
Assume robot dynamics described by nonlinear, time-invariant differential equations with time $t \in \mathbb{R}$, state $x \in \mathcal{X}$, and controls $u \in \mathcal{U} \subset \mathbb{R}^{n_u}$:
\[
\dot{x}(t) = f\big(x(t), u(t)\big). \tag{1}
\]
The state space $\mathcal{X}$ of dimension $n_x$ is the union of an arbitrary number of real spaces and non-Euclidean rotation groups $SO(2)$. In addition to the low-dimensional state representation $x$, assume access to a high-dimensional environment representation $e \in \mathcal{E} \subset \mathbb{R}^{n_e}$ (e.g., a bird's-eye-view (BEV) image of the scene). Further, the system is bounded by a set of nonlinear constraints $\mathcal{C}$ (e.g., by control bounds, rules, or safety constraints).
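As an illustrative instance of (1) and $\mathcal{C}$ (an assumption for exposition, not necessarily the model used in our experiments), consider a unicycle with state $x = (p_x, p_y, \psi) \in \mathbb{R}^2 \times SO(2)$ and controls $u = (v, \omega)$:
\[
\dot{p}_x = v \cos\psi, \qquad \dot{p}_y = v \sin\psi, \qquad \dot{\psi} = \omega,
\]
with control bounds such as $|v| \le v_{\max}$ and $|\omega| \le \omega_{\max}$ as simple members of $\mathcal{C}$.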
A (sub-)optimal expert, pursuing a policy $\pi_{\text{exp}}$, controls the robot and generates a dataset $\mathcal{D} = \{(x_i, u_i, e_i, \mathcal{C}_i)\}_{i=0}^{I}$ with $I \in \mathbb{N}^+$ samples. A future trajectory of length $H \in \mathbb{N}^+$ containing states and controls belonging to sample $i$ is given by $y_{\text{GT}} = \big[x_i^\top, u_i^\top, \dots, x_{i+H}^\top, u_{i+H-1}^\top\big]^\top$. During training, the objective is to find the optimal parameters $\theta \in \mathbb{R}^{n_\theta}$ via maximum likelihood estimation:
\[
\theta^{*} = \arg\min_{\theta} \; \mathbb{E}\big[\, d(y_{\text{GT}}, \hat{y}) \,\big], \tag{2}
\]
subject to equation (1) and the constraints $\mathcal{C}$. The function $d$ denotes a distance measure, and $\hat{y} = \pi_\theta(x_i, e_i)$ is the output of the function $\pi_\theta$ parameterized by $\theta$. The function $\pi_\theta$ is described by a neural network $N_\theta$ and the completion $f_{\text{compl}}$ and correction $f_{\text{corr}}$ procedures. During inference, given the environment representation, the robot's goal is to predict a sequence of states and controls compliant with the constraints. In the spirit of a model predictive control (MPC) framework, the first control vector is applied, or an underlying tracking controller regulates the robot along the reference.
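A minimal sketch of this receding-horizon usage (the helpers `env.observe`, `policy`, and `apply_first_control` are hypothetical placeholders for the trained $\pi_\theta$ and the plant or simulator, not part of our framework):

```python
def run_receding_horizon(x0, env, policy, apply_first_control, n_steps=100):
    # MPC-style deployment: replan at every step, apply only the first control.
    x = x0
    for _ in range(n_steps):
        e = env.observe()                 # high-dimensional representation, e.g., a BEV image
        states, controls = policy(x, e)   # constraint-compliant plan over horizon H
        x = apply_first_control(x, controls[0])  # alternatively, track the full reference
    return x
```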