Neural ODEs as Feedback Policies for Nonlinear Optimal Control ⋆
Ilya Orson Sandoval, Panagiotis Petsagkourakis, Ehecatl Antonio del Rio-Chanona
Centre for Process Systems Engineering
Imperial College London, London, United Kingdom
(os220@ic.ac.uk, p.petsagkourakis@imperial.ac.uk &
a.del-rio-chanona@imperial.ac.uk).
Abstract: Neural ordinary differential equations (Neural ODEs) define continuous time dynamical systems with neural networks. Interest in their application to modelling has grown recently, spanning hybrid system identification problems and time series analysis. In this work we propose the use of a neural control policy capable of satisfying state and control constraints to solve nonlinear optimal control problems. The control policy optimization is posed as a Neural ODE problem to efficiently exploit the availability of a dynamical system model. We showcase the efficacy of this type of deterministic neural policy in two constrained systems: the controlled Van der Pol system and a bioreactor control problem. This approach represents a practical approximation to the intractable closed-loop solution of nonlinear control problems.
Keywords: Optimal Control, Feedback Policy, Reinforcement Learning, Adjoint Sensitivity
Analysis, Control Vector Iteration, Penalty Methods, Nonlinear Optimization.
1. INTRODUCTION
Neural policies represent the dominant approach to parametrize controllers in Reinforcement Learning (RL) research. Their attractiveness relies on their favourable scaling with dimension and their universal approximation capacity. However, their training procedure usually relies on inefficient sampling-based strategies to estimate the gradient of the objective function to be optimized. In contrast, when an environment model is available as a differential equation system, it is possible to leverage it for efficiency through methods based on optimal control and dynamic optimization (Ainsworth et al., 2021; Yildiz et al., 2021). In this work we explore this setting, exploiting a neural policy to parametrize a deterministic control function as a state feedback controller. This approach provides an approximation to the practically intractable optimal closed-loop policy in continuous time.
The optimization of such a controller follows the same strategy as the training of Neural Ordinary Differential Equations (Neural ODEs) (Chen et al., 2018). In our application, the weights of the network only define the control function within a predefined system, instead of defining the whole differential equation system as in Neural ODEs. To understand the training procedure, it is fruitful to overview the close connection between the adjoint sensitivity analysis used in dynamical systems and the backpropagation algorithm used in neural networks. We review the literature surrounding these topics and their use in recent applications where both fields meet.
⋆ This work has been submitted to IFAC for possible publication.
Backpropagation may be seen as a judicious application of the chain rule in computational routines, introduced in optimal control (Griewank, 2012) and popularized within the neural network community (Rumelhart et al., 1986). The first derivations trace back to the introduction of the Kelley-Bryson gradient method to solve multistage nonlinear control problems (Dreyfus, 1962). In neural networks, it may be derived from the optimization problem where Lagrange multipliers (adjoint variables) enforce the transitions between states (Mizutani et al., 2000).
In continuous time optimal control problems, the optimality conditions from Pontryagin's Maximum Principle (Pontryagin et al., 1986) establish a connection between the sensitivities of a functional cost and the adjoint variables. This relationship is exploited in continuous sensitivity analysis, where it is used to estimate the influence of parameters on the solution of differential equation systems (Serban and Hindmarsh, 2005; Jorgensen, 2007). When time is discretized in an ODE system, backpropagation is analogous to the adjoint system of the maximum principle (Griewank, 2012; Baydin et al., 2017), and its use to propagate sensitivities is called discrete sensitivity analysis. Modern differential equation solvers include implementations of either continuous or discrete sensitivity analysis, relying on the solution of secondary differential equation systems (optimize-then-discretize) or automatic differentiation of the integrator routines (discretize-then-optimize).¹
¹ The discretize-then-optimize distinction has a different meaning in the optimal control literature (Biegler, 2010). There it refers to direct approaches to the optimization problem, which first discretize the dynamical equations and afterwards solve the finite-dimensional nonlinear optimization. Optimize-then-discretize refers to any procedure that departs from the optimality conditions of the problem, also called indirect approaches. Since the adjoint equations are part of the optimality conditions in dynamic optimization, the classic indirect classification includes both variants in the ML literature. From an optimization perspective, there is no difference in how the gradients are calculated, since both do so through adjoints; the selection of approach is merely practical, based on the peculiarities of each problem (Ma et al., 2021).
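As a brief recap of this standard machinery (using standard notation, with the feedback policy absorbed into a generic parametrized vector field $\dot{x} = f(x, \theta, t)$, a fixed initial state, and $\lambda$ denoting the adjoint or costate variables), the continuous adjoint system that propagates the sensitivities of a Bolza objective $J = \int_{t_0}^{t_f} \ell \, \mathrm{d}t + \phi(x(t_f))$ takes the well-known form

$$
\begin{aligned}
\dot{\lambda}(t)^{\top} &= -\lambda(t)^{\top} \frac{\partial f}{\partial x} - \frac{\partial \ell}{\partial x},
\qquad \lambda(t_f) = \left. \frac{\partial \phi}{\partial x} \right|_{t_f}^{\top}, \\
\frac{\mathrm{d}J}{\mathrm{d}\theta} &= \int_{t_0}^{t_f} \left( \frac{\partial \ell}{\partial \theta} + \lambda(t)^{\top} \frac{\partial f}{\partial \theta} \right) \mathrm{d}t,
\end{aligned}
$$

so the gradient of the objective follows from one forward integration of the dynamics and one backward integration of the adjoint, with a number of ODE solves that is independent of the number of parameters.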
The adjoint approach suggested with the introduction of Neural ODEs (Chen et al., 2018) is a modern variant of Control Vector Iteration (CVI) (Luus, 2009), a sequential indirect strategy that optimizes a fixed vector of parameters by relaxing just one of the necessary conditions of optimality². The relaxed condition is approximated iteratively along optimization rounds where the dynamical and adjoint equations are always satisfied, as in feasible-path methods (Chachuat, 2007). An algorithmic improvement introduced by Chen et al. (2018) is the efficient use of reverse-mode Automatic Differentiation (AD) to calculate the vector-Jacobian products that appear in the differential equations defining the optimality conditions of the problem. This makes it possible to deal with high-dimensional parameter problems efficiently, which is crucial for neural policies within dynamical systems in continuous time. Furthermore, it crucially avoids the symbolic derivations historically associated with indirect methods, including CVI, in the numerical optimal control literature (Biegler, 2010).
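As a minimal illustration of this point (a sketch of our own in PyTorch, with an arbitrary controlled Van der Pol vector field standing in for $f$ and arbitrary numerical values), the product $\lambda^{\top} \partial f / \partial x$ needed by the adjoint equation is obtained from a single reverse-mode AD call, without forming the Jacobian explicitly:

```python
import torch

def f(x, u):
    # Illustrative two-state vector field (controlled Van der Pol, mu = 1).
    x1, x2 = x[0], x[1]
    return torch.stack([x2, (1.0 - x1 ** 2) * x2 - x1 + u])

x = torch.tensor([0.5, -0.3], requires_grad=True)   # current state
u = torch.tensor(0.1)                                # current control value
lam = torch.tensor([1.0, 2.0])                       # adjoint (costate) vector

y = f(x, u)
# Reverse-mode AD contracts lambda with df/dx in a single pass: the result is
# the vector-Jacobian product lambda^T (df/dx); the 2x2 Jacobian is never built.
vjp_x, = torch.autograd.grad(y, x, grad_outputs=lam)
print(vjp_x)
```

The same call with the policy parameters as inputs returns $\lambda^{\top} \partial f / \partial \theta$, so each adjoint evaluation costs roughly one backward pass regardless of the number of policy parameters.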
The use of neural networks within control systems was originally explored in the 90s (Chen, 1989; Miller et al., 1990), where the focus was on discrete time systems (Hunt et al., 1992). The use of neural control policies in this discrete vein has also attracted attention recently in nonlinear control (Rackauckas et al., 2020a; Adhau et al., 2021; Jin et al., 2020) and MPC (Amos et al., 2018; Karg and Lucia, 2020; Drgona et al., 2022). The gradients required for optimization are computed through direct application of AD over the evolution of a discrete system, a strategy coined differentiable control or, more generally, differentiable simulation. These strategies revive some of the original ideas that brought interest to neural networks in nonlinear control (Cao, 2005) with modern computational tooling for AD calculations (Baydin et al., 2017).
Reinforcement Learning commonly leverages neural parametrizations of policies within Markov decision processes (MDPs) (Sutton et al., 1992), and its scaling to high-dimensional problems has seen rising success thanks to deep learning (Schmidhuber, 2015). The adaptation of RL approaches to continuous time scenarios based on Hamilton-Jacobi-Bellman formulations was originally explored in (Munos, 1997; Doya, 2000). A continuous time actor-critic variation based on discrete time data was analysed more recently in (Yildiz et al., 2021). The extension of the model-free policy gradient method to continuous time was explored by Munos (2006). In continuous time settings with deterministic dynamics as an environment, neural policies may be trained with techniques from dynamic optimization like CVI, avoiding noisy sampling estimations as in classic RL. A comparison of the training performance between deterministic model-based neural policies and model-free policy gradient methods was showcased in (Ainsworth et al., 2021).
² This is covered in detail in Section 3.1.
With a view to practical applications, it is crucial to be able to satisfy constraints while also optimizing the policy performance. While this has been a major focus in optimal control since its inception (Bryson and Ho, 1975), there has been little attention to general nonlinear scenarios: most works assume either linear dynamics or fixed control profiles instead of state feedback policies. In model-free RL, constraint enforcement is an active area of research (Brunke et al., 2021). Recent work has explored variants of objective penalties (Achiam et al., 2017), Lyapunov functions (Chow et al., 2019) and chance-constraint satisfaction techniques (Petsagkourakis et al., 2022).
Here we develop a strategy that allows continuous time policies to solve general nonlinear control problems while successfully satisfying constraints. Our approach is based on the deterministic calculation of the gradient of the cost functional with respect to the static feedback policy parameters, given a white-box dynamical system environment. Saturation in the policy architecture enforces hard control constraints, while state constraints are enforced through relaxed logarithmic penalties and an adaptive barrier update strategy. We furthermore showcase how the inclusion of the feedback controller within the ODE definition shapes the whole phase space of the system. This quality of Neural ODEs is impossible to achieve in nonlinear systems with standard optimal control methods that only provide controls as a function of time.
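The sketch below (our own illustration in PyTorch; the class and function names, the network size and the particular quadratic extension of the barrier are assumptions, not the exact implementation used here) shows the two mechanisms: a tanh output layer rescaled to the control bounds makes violation of the control constraint impossible by construction, and a relaxed logarithmic barrier keeps the state-constraint penalty finite for infeasible trajectories so it can be tightened adaptively across optimization rounds.

```python
import math
import torch
import torch.nn as nn

class SaturatedPolicy(nn.Module):
    """Neural state-feedback policy with hard box bounds on the control."""

    def __init__(self, n_x, n_u, u_min, u_max, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_x, hidden), nn.Tanh(),
            nn.Linear(hidden, n_u),
        )
        self.u_min, self.u_max = u_min, u_max

    def forward(self, x):
        # tanh squashes the output to (-1, 1); rescaling maps it to
        # [u_min, u_max], so the control bound holds for every state.
        s = torch.tanh(self.net(x))
        return self.u_min + 0.5 * (s + 1.0) * (self.u_max - self.u_min)


def relaxed_log_barrier(z, delta):
    """Relaxed logarithmic barrier for a constraint z = -g(x) >= 0.

    For z > delta it is the usual -log(z); below delta it switches to a
    quadratic extension matching value and slope at z = delta, so the
    penalty stays finite and differentiable even for infeasible states.
    Decreasing delta across optimization rounds recovers a sharper barrier
    (one possible adaptive update strategy).
    """
    log_branch = -torch.log(torch.clamp(z, min=1e-12))
    quad_branch = 0.5 * (((z - 2.0 * delta) / delta) ** 2 - 1.0) - math.log(delta)
    return torch.where(z > delta, log_branch, quad_branch)
```

A state constraint $g(x(t)) \leq 0$ then enters the objective as an extra running cost of the form $\mu \cdot \texttt{relaxed\_log\_barrier}(-g(x), \delta)$, with the weight $\mu$ and relaxation $\delta$ updated between optimization rounds.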
2. PROBLEM STATEMENT
We consider continuous time optimal control problems with a fixed initial condition and a fixed final time. The cost functional is in the Bolza form, including both a running cost ($\ell$) and a terminal cost ($\phi$) in the functional objective ($J$) (Bryson and Ho, 1975):

$$
\begin{aligned}
\min_{\theta} \; J &= \int_{t_0}^{t_f} \ell\big(x(t), \pi_\theta(x)\big)\,\mathrm{d}t + \phi\big(x(t_f)\big), \\
\text{s.t.}\quad \dot{x}(t) &= f\big(x(t), \pi_\theta(x), t\big), \\
x(t_0) &= x_0, \\
g\big(x(t), \pi_\theta(x)\big) &\leq 0,
\end{aligned}
\tag{1}
$$

where the time window is fixed, $t \in [t_0, t_f]$; $x(t) \in \mathbb{R}^{n_x}$ is the state, $\pi_\theta(x) \colon \mathbb{R}^{n_x} \to \mathbb{R}^{n_u}$ is the state feedback controller and $\theta \in \mathbb{R}^{n_\theta}$ are its parameters.
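As a concrete rendering of (1) (a sketch of our own; the function names, the explicit Euler discretization, the rectangle-rule quadrature and the omission of the path constraint $g$ are simplifying assumptions), the closed loop $\dot{x} = f(x, \pi_\theta(x), t)$ can be rolled out while accumulating the Bolza objective, leaving the policy parameters $\theta$ as the only decision variables:

```python
import torch

def bolza_cost(policy, f, x0, t0, tf, running_cost, terminal_cost, n_steps=200):
    """Simulate x' = f(x, policy(x), t) with explicit Euler and accumulate
    J = int_{t0}^{tf} l(x, u) dt + phi(x(tf)) for the closed-loop system."""
    dt = (tf - t0) / n_steps
    x, t = x0, t0
    J = torch.zeros(())
    for _ in range(n_steps):
        u = policy(x)                      # state feedback: u = pi_theta(x)
        J = J + running_cost(x, u) * dt    # rectangle rule for the integral
        x = x + dt * f(x, u, t)            # one explicit Euler step
        t = t + dt
    return J + terminal_cost(x)
```

Because every operation in the rollout is differentiable, reverse-mode AD through it yields $\mathrm{d}J/\mathrm{d}\theta$ directly (the discrete sensitivity route); the continuous adjoint system recalled earlier is the optimize-then-discretize alternative.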
3. METHODOLOGY
The most common parametrization in trajectory optimization methods utilizes low-order polynomials over a predefined set of time intervals to approximate the controller as a function of time (Rao, 2009; Teo et al., 2021). In contrast, in our work the control function is posed as the output of a parametrized state feedback controller. The controller parameters are constant throughout the integration time (they statically define the nonlinear feedback controller) and are the only optimization variables of the problem. This transforms the original dynamic optimization problem into a parameter estimation one (Teo et al., 2021). The approach approximates an optimal closed-loop policy, which is a continuous function that is well-defined for states outside the optimal path. This quality allows