
The adjoint approach suggested with the introduction of Neural ODEs (Chen et al., 2018) is a modern variant of Control Vector Iteration (CVI) (Luus, 2009), a sequential indirect strategy that optimizes a fixed vector of parameters by relaxing just one of the necessary conditions of optimality². The relaxed condition is approximated iteratively over optimization rounds in which the dynamical and adjoint equations are always satisfied, as in feasible-path methods (Chachuat, 2007). An algorithmic improvement introduced by Chen et al. (2018) is the efficient use of reverse-mode Automatic Differentiation (AD) to calculate the vector-Jacobian products that appear in the differential equations defining the optimality conditions of the problem. This makes it possible to handle high-dimensional parameter problems efficiently, which is crucial for neural policies within continuous-time dynamical systems. Furthermore, it crucially avoids the symbolic derivations historically associated with indirect methods, including CVI, in the numerical optimal control literature (Biegler, 2010).
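As an illustration of this mechanism, the sketch below (an assumption about tooling, not the implementation used here) differentiates through an adjoint-based ODE solver in JAX, whose `odeint` backward pass assembles the required vector-Jacobian products along the adjoint equations; the policy, dynamics and cost are placeholders.

```python
# Hedged sketch: gradient of a terminal cost w.r.t. neural feedback policy
# parameters, obtained via reverse-mode AD over an adjoint-based ODE solver.
# Names, dynamics and cost are illustrative assumptions.
import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint  # backward pass solves the adjoint ODE

def policy(params, x):
    """Small tanh network mapping state -> control (illustrative)."""
    W1, b1, W2, b2 = params
    h = jnp.tanh(W1 @ x + b1)
    return jnp.tanh(W2 @ h + b2)           # saturated control in [-1, 1]

def dynamics(x, t, params):
    """White-box environment: a controlled double integrator (illustrative)."""
    u = policy(params, x)
    return jnp.array([x[1], u[0]])

def terminal_cost(params, x0, ts):
    xs = odeint(dynamics, x0, ts, params)  # forward integration of the ODE
    return jnp.sum(xs[-1] ** 2)            # phi(x(t_f))

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = (0.1 * jax.random.normal(k1, (16, 2)), jnp.zeros(16),
          0.1 * jax.random.normal(k2, (1, 16)), jnp.zeros(1))
x0, ts = jnp.array([1.0, 0.0]), jnp.linspace(0.0, 2.0, 50)

# Reverse-mode AD: vector-Jacobian products are evaluated automatically
# while the adjoint system is integrated backwards in time.
grads = jax.grad(terminal_cost)(params, x0, ts)
```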
The use of neural networks within control systems was originally explored in the 1990s (Chen, 1989; Miller et al., 1990), where the focus was on discrete-time systems (Hunt et al., 1992). The use of neural control policies in this discrete vein has also attracted attention recently in nonlinear control (Rackauckas et al., 2020a; Adhau et al., 2021; Jin et al., 2020) and MPC (Amos et al., 2018; Karg and Lucia, 2020; Drgona et al., 2022). The gradients required for optimization are computed through direct application of AD over the evolution of a discrete system, a strategy coined differentiable control or, more generally, differentiable simulation. These strategies revive some of the original ideas that brought interest to neural networks in nonlinear control (Cao, 2005) with modern computational tooling for AD calculations (Baydin et al., 2017).
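A minimal sketch of this discrete-time, differentiable-simulation strategy is given below; the linear saturated policy, explicit Euler discretization and quadratic cost are illustrative assumptions rather than the setups of the cited works.

```python
# Hedged sketch of differentiable simulation in discrete time: the rollout is
# unrolled explicitly and reverse-mode AD is applied to the whole computation.
import jax
import jax.numpy as jnp

def rollout_cost(params, x0, dt=0.05, steps=100):
    def step(x, _):
        u = jnp.tanh(params["W"] @ x + params["b"])   # linear policy + saturation
        x_next = x + dt * jnp.array([x[1], u[0]])     # explicit Euler step
        return x_next, jnp.sum(x_next ** 2) * dt      # running cost contribution
    _, costs = jax.lax.scan(step, x0, None, length=steps)
    return jnp.sum(costs)

params = {"W": jnp.zeros((1, 2)), "b": jnp.zeros(1)}
grads = jax.grad(rollout_cost)(params, jnp.array([1.0, 0.0]))
```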
Reinforcement Learning (RL) commonly leverages neural parametrizations of policies within Markov decision processes (MDPs) (Sutton et al., 1992) and has seen rising success in scaling to high-dimensional problems thanks to deep learning (Schmidhuber, 2015). The adaptation of RL approaches to continuous-time scenarios based on Hamilton-Jacobi-Bellman formulations was originally explored in (Munos, 1997; Doya, 2000). A continuous-time actor-critic variation based on discrete-time data was analysed more recently in (Yildiz et al., 2021). The extension of the model-free policy gradient method to continuous time was studied by Munos (2006). In continuous-time settings with deterministic dynamics as an environment, neural policies may be trained with techniques from dynamic optimization such as CVI, avoiding the noisy sampling estimates of classic RL. A comparison of the training performance of deterministic model-based neural policies against model-free policy gradients was showcased in (Ainsworth et al., 2021).
...procedure that departs from the optimality conditions of the problem, also called indirect approaches. Since the adjoint equations are part of the optimality conditions in dynamic optimization, the classic indirect classification includes both variants in the ML literature. From an optimization perspective, there is no difference in how the gradients are calculated, since both do so through adjoints; the selection of the approach is merely practical, based on the peculiarities of each problem (Ma et al., 2021).
² This is covered in detail in Section 3.1.
With a view to practical applications, it is crucial to be able to satisfy constraints while also optimizing the policy performance. While this has been a major focus in optimal control since its inception (Bryson and Ho, 1975), little attention has been paid to general nonlinear scenarios. Most works assume either linear dynamics or fixed control profiles instead of state feedback policies. In model-free RL, constraint enforcement is an active area of research (Brunke et al., 2021). Recent work has explored variants of objective penalties (Achiam et al., 2017), Lyapunov functions (Chow et al., 2019) and chance-constraint satisfaction techniques (Petsagkourakis et al., 2022).
Here we develop a strategy that allows continuous-time policies to solve general nonlinear control problems while successfully satisfying constraints. Our approach is based on the deterministic calculation of the gradient of the cost functional with respect to the parameters of a static feedback policy, given a white-box dynamical system as the environment. Saturation built into the policy architecture enforces hard control constraints, while state constraints are enforced through relaxed logarithmic penalties and an adaptive barrier update strategy. We furthermore showcase how the inclusion of the feedback controller within the ODE definition shapes the whole phase space of the system. This quality of Neural ODEs is impossible to achieve in nonlinear systems with standard optimal control methods that only provide controls as a function of time.
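For concreteness, the sketch below shows one standard form of relaxed logarithmic barrier with a smooth quadratic extension; the precise relaxation and the adaptive update of the barrier weight used in this work may differ from this illustrative variant.

```python
# Illustrative relaxed logarithmic barrier for a state constraint g(x) <= 0.
# The exact relaxation and the (mu, delta) update schedule used in the paper
# may differ; this is one common quadratic-extension variant.
import jax.numpy as jnp

def relaxed_log_barrier(g, delta=0.1):
    """-log(-g) for g <= -delta, C1-continuous quadratic extension otherwise."""
    safe = -jnp.log(jnp.clip(-g, delta))                      # exact barrier branch
    relaxed = 0.5 * (((g + 2.0 * delta) / delta) ** 2 - 1.0) - jnp.log(delta)
    return jnp.where(g <= -delta, safe, relaxed)

def penalized_running_cost(running_cost, g_values, mu=1.0, delta=0.1):
    """Augment the running cost with weighted relaxed barriers; mu and delta
    can be decreased between training rounds to tighten the constraints."""
    return running_cost + mu * jnp.sum(relaxed_log_barrier(g_values, delta))
```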
2. PROBLEM STATEMENT
We consider continuous-time optimal control problems with fixed initial condition and fixed final time. The cost functional is in Bolza form, including both a running cost ($\ell$) and a terminal cost ($\phi$) in the objective functional ($J$) (Bryson and Ho, 1975):
\begin{equation}
\begin{aligned}
\min_{\theta} \quad & J = \int_{t_0}^{t_f} \ell\big(x(t), \pi_\theta(x)\big) \, dt + \phi\big(x(t_f)\big), \\
\text{s.t.} \quad & \dot{x}(t) = f\big(x(t), \pi_\theta(x), t\big), \\
& x(t_0) = x_0, \\
& g\big(x(t), \pi_\theta(x)\big) \leq 0,
\end{aligned}
\tag{1}
\end{equation}
where the time window $t \in [t_0, t_f]$ is fixed, $x(t) \in \mathbb{R}^{n_x}$ is the state, $\pi_\theta(x) : \mathbb{R}^{n_x} \to \mathbb{R}^{n_u}$ is the state feedback controller, and $\theta \in \mathbb{R}^{n_\theta}$ are its parameters.
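One common way to evaluate the Bolza objective in (1) numerically (an illustrative convention, not necessarily the exact implementation used here) is to augment the state with a quadrature variable that accumulates the running cost alongside the dynamics:

```python
# Hedged sketch: evaluating J of problem (1) by state augmentation.
import jax.numpy as jnp
from jax.experimental.ode import odeint

def bolza_objective(params, x0, ts, f, l, phi, policy):
    """J = integral of l(x, pi_theta(x)) dt + phi(x(t_f)), with u = pi_theta(x)."""
    def augmented_dynamics(aug, t, params):
        x = aug[:-1]                        # physical state x(t)
        u = policy(params, x)               # feedback control
        return jnp.concatenate([f(x, u, t), jnp.array([l(x, u)])])
    aug0 = jnp.concatenate([x0, jnp.zeros(1)])   # extra quadrature state starts at 0
    traj = odeint(augmented_dynamics, aug0, ts, params)
    x_f, running_integral = traj[-1, :-1], traj[-1, -1]
    return running_integral + phi(x_f)
```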
3. METHODOLOGY
The most common parametrization in trajectory optimization methods utilizes low-order polynomials with a predefined set of time intervals to approximate the controller as a function of time (Rao, 2009; Teo et al., 2021). In contrast, in our work the control function is posed as the output of a parametrized state feedback controller. The controller parameters are constant over the whole integration horizon (they statically define the nonlinear feedback controller) and are the only optimization variables of the problem. This transforms the original dynamic optimization problem into a parameter estimation one (Teo et al., 2021). The approach approximates an optimal closed-loop policy, which is a continuous function that is well-defined for states outside the optimal path. This quality allows