II. MOTION PLANNING FOR CAVS IN MIXED TRAFFIC
WITH MODEL PREDICTIVE CONTROL
In this section, we present a game-theoretic MPC formulation for motion planning of a CAV while interacting with an HDV, along with the moving horizon IRL technique to learn the objective weights of the HDV from real-time data.
A. Model Predictive Control for Motion Planning
We consider an interactive driving scenario including a CAV and an HDV whose indices are 1 and 2, respectively. The goal of the MPC motion planner is to generate the trajectory and control actions of CAV–1 while considering the real-time driving behavior of HDV–2. To guarantee that CAV–1 has data of HDV–2's real-time trajectories, we make the following assumption:
Assumption 1: A coordinator is available to collect trajectories of HDV–2 and transmit them to CAV–1 without any significant delay or error during communication.
We formulate the problem in the discrete-time domain, in which the dynamic model of each vehicle $i$ is given by
\[
x_{i,k+1} = f_i(x_{i,k}, u_{i,k}), \tag{1}
\]
where $x_{i,k}$ and $u_{i,k}$, $i = 1, 2$, are the vectors of states and control actions, respectively, at time step $k \in \mathbb{N}$. We utilize the control framework presented in [17], in which the interaction between CAV–1 and HDV–2 is modeled as a simultaneous game, i.e., a game without a leader-follower structure, in which the objective of each vehicle includes its individual objective and a shared objective.
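For illustration, the generic dynamics (1) could be instantiated, for example, by a planar double integrator discretized with forward Euler; the Python sketch below uses this model, with an assumed sampling time, as a stand-in for $f_i$, and the later sketches in this section reuse it. None of these choices are prescribed by the formulation above.

```python
import numpy as np

DT = 0.1  # sampling time [s]; illustrative value, not specified in the text

def f_double_integrator(x, u, dt=DT):
    """One possible instance of the dynamics (1): a planar double integrator.

    State x = [px, py, vx, vy], input u = [ax, ay]. This is only an assumed
    example; the formulation leaves f_i generic.
    """
    px, py, vx, vy = x
    ax, ay = u
    return np.array([px + dt * vx,
                     py + dt * vy,
                     vx + dt * ax,
                     vy + dt * ay])
```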
Let $l_1(x_{1,k+1}, u_{1,k})$ and $l_2(x_{2,k+1}, u_{2,k})$ be the individual objective functions of CAV–1 and HDV–2, respectively, and $l_{12}(x_{12,k+1}, u_{12,k})$, where $x_{12,k+1} = [x_{1,k+1}^\top, x_{2,k+1}^\top]^\top$ and $u_{12,k} = [u_{1,k}^\top, u_{2,k}^\top]^\top$, be the cooperative term at time step $k$. We assume that CAV–1 and HDV–2 share the same cooperative objective, e.g., collision avoidance. Those objective functions are usually designed as weighted sums of some features as follows:
\[
l_i(x_{i,k+1}, u_{i,k}) = \omega_i^\top \phi_i(x_{i,k+1}, u_{i,k}), \quad i = 1, 2, \tag{2}
\]
\[
l_{12}(x_{12,k+1}, u_{12,k}) = \omega_{12}^\top \phi_{12}(x_{12,k+1}, u_{12,k}), \tag{3}
\]
where $\phi_i$ and $\phi_{12}$ are vectors of features and $\omega_i \in \mathcal{W}_i$, $\omega_{12} \in \mathcal{W}_{12}$ are the corresponding vectors of weights, where $\mathcal{W}_i$ and $\mathcal{W}_{12}$ are the sets of feasible values. For ease of notation, for each $i \in \{1, 2\}$ we define $-i$ as the index of the vehicle other than vehicle $i$. We consider that, given any control actions $u_{-i,k}$ of the other vehicle, each vehicle $i$ applies the control actions $u^*_{i,k}$ that minimize the sum of its individual objective and the shared objective, i.e.,
\[
u^*_{i,k} = \arg\min_{u_{i,k}} \; l_i(x_{i,k+1}, u_{i,k}) + l_{12}(x_{12,k+1}, u_{12,k}), \quad \forall u_{-i,k}. \tag{4}
\]
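As a concrete, purely illustrative instance of the weighted objectives (2)–(3) and the best response (4), the sketch below assumes hypothetical features: control effort and deviation from a desired speed for the individual terms, and an inverse squared-distance term for the shared collision-avoidance objective. It reuses the f_double_integrator sketch above; the feature choices, desired speed, and solver are not taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

def phi_individual(x_next, u, v_des=15.0):
    """Hypothetical individual features: control effort and squared speed error."""
    speed = np.hypot(x_next[2], x_next[3])
    return np.array([float(u @ u), (speed - v_des) ** 2])

def phi_shared(x12_next, u12):
    """Hypothetical shared feature: inverse squared inter-vehicle distance.

    u12 is kept only to mirror the signature of phi_12 in (3).
    """
    d2 = float(np.sum((x12_next[:2] - x12_next[4:6]) ** 2))
    return np.array([1.0 / (d2 + 1e-3)])

def best_response(i, x1, x2, u_other, w_i, w_12):
    """Best response (4) of vehicle i to fixed actions u_other of the other vehicle."""
    def cost(u_i):
        u1, u2 = (u_i, u_other) if i == 1 else (u_other, u_i)
        x1n, x2n = f_double_integrator(x1, u1), f_double_integrator(x2, u2)
        x_i_next = x1n if i == 1 else x2n
        x12n = np.concatenate([x1n, x2n])
        # individual objective (2) plus shared objective (3)
        return float(w_i @ phi_individual(x_i_next, u_i)
                     + w_12 @ phi_shared(x12n, np.concatenate([u1, u2])))
    return minimize(cost, np.zeros(2), method="BFGS").x
```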
Next, we formulate an MPC problem with a control horizon of length $H \in \mathbb{N}$. Let $t$ be the current time step and $\mathcal{I}_t = \{t, \ldots, t+H-1\}$ be the set of all time steps in the control horizon at time step $t$. We can recast the simultaneous game between CAV–1 and HDV–2 presented above as a potential game [19], i.e., a game in which all players minimize a single global function called the potential function. In a potential game, a Nash equilibrium can be found by minimizing the potential function. The potential function of this game at each time step $k$ is
\[
l_{\mathrm{pot}}(x_{12,k+1}, u_{12,k}) = \sum_{i=1,2} l_i(x_{i,k+1}, u_{i,k}) + l_{12}(x_{12,k+1}, u_{12,k}). \tag{5}
\]
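One can verify that (5) is an exact potential for the game defined by (4): under the dynamics (1), the state $x_{-i,k+1}$, and hence the term $l_{-i}(x_{-i,k+1}, u_{-i,k})$, does not depend on $u_{i,k}$, so for any fixed $u_{-i,k}$ and any two candidate actions $u_{i,k}$ and $u'_{i,k}$,
\[
l_{\mathrm{pot}}(x_{12,k+1}, u_{12,k}) - l_{\mathrm{pot}}(x'_{12,k+1}, u'_{12,k})
= \big[ l_i(x_{i,k+1}, u_{i,k}) + l_{12}(x_{12,k+1}, u_{12,k}) \big]
- \big[ l_i(x'_{i,k+1}, u'_{i,k}) + l_{12}(x'_{12,k+1}, u'_{12,k}) \big],
\]
where the primed quantities are obtained by replacing $u_{i,k}$ with $u'_{i,k}$ while keeping $u_{-i,k}$ fixed. That is, a unilateral deviation changes the potential by exactly the change in the deviating vehicle's own objective in (4).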
Therefore, we propose utilizing the cumulative sum of the potential function over the control horizon as the objective function in the MPC problem, which can be given by
\[
J_{\mathrm{MPC}} = \sum_{k \in \mathcal{I}_t} l_{\mathrm{pot}}(x_{12,k+1}, u_{12,k}). \tag{6}
\]
Hence, the MPC problem for motion planning of CAV–1 is formulated as follows:
\[
\begin{aligned}
\underset{\{u_{12,k}\}_{k \in \mathcal{I}_t}}{\text{minimize}} \quad & J_{\mathrm{MPC}} && \text{(7a)} \\
\text{subject to:} \quad & \text{(1)}, \; i = 1, 2, && \text{(7b)} \\
& g_j(x_{12,k+1}, u_{12,k}) \le 0, \;\; \forall j \in \mathcal{J}_{\mathrm{ieq}}, && \text{(7c)} \\
& h_j(x_{12,k+1}, u_{12,k}) = 0, \;\; \forall j \in \mathcal{J}_{\mathrm{eq}}, && \text{(7d)}
\end{aligned}
\]
where (7b)–(7d) hold for all $k \in \mathcal{I}_t$. The constraints (7c) and (7d) are inequality and equality constraints, respectively, where $\mathcal{J}_{\mathrm{ieq}}$ and $\mathcal{J}_{\mathrm{eq}}$ are the corresponding sets of indices.
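A minimal single-shooting sketch of problem (7) is given below, again reusing the double-integrator dynamics and the hypothetical features from the earlier sketches. The horizon length, the minimum-distance constraint standing in for (7c), and the use of a general-purpose SLSQP solver are assumptions for illustration; a practical implementation would typically rely on a dedicated NLP solver.

```python
import numpy as np
from scipy.optimize import minimize

H = 10    # control horizon length; illustrative value only
NU = 2    # inputs per vehicle in the double-integrator sketch above

def solve_mpc(x1_t, x2_t, w1, w2, w12):
    """Sketch of (7): minimize the cumulative potential (6) over the horizon.

    The decision vector z stacks u_{12,k} = [u_{1,k}; u_{2,k}] for k in I_t, and
    the dynamics (7b) are enforced by forward simulation (single shooting).
    """
    def rollout(z):
        u = z.reshape(H, 2 * NU)
        x1, x2, J, dists = x1_t, x2_t, 0.0, []
        for k in range(H):
            u1, u2 = u[k, :NU], u[k, NU:]
            x1, x2 = f_double_integrator(x1, u1), f_double_integrator(x2, u2)
            x12 = np.concatenate([x1, x2])
            # per-step potential function (5)
            J += float(w1 @ phi_individual(x1, u1) + w2 @ phi_individual(x2, u2)
                       + w12 @ phi_shared(x12, u[k]))
            dists.append(float(np.linalg.norm(x1[:2] - x2[:2])))
        return J, dists

    # Assumed inequality constraint in place of (7c): keep >= 5 m separation.
    cons = [{"type": "ineq", "fun": lambda z: min(rollout(z)[1]) - 5.0}]
    res = minimize(lambda z: rollout(z)[0], np.zeros(H * 2 * NU),
                   constraints=cons, method="SLSQP")
    return res.x.reshape(H, 2 * NU)   # planned joint control actions over I_t
```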
In the objective function of the MPC problem (7), we assume that the features $\phi_i$, $i = 1, 2$, and $\phi_{12}$ can be pre-defined. If we learn online the weights $\omega_2$ and $\omega_{12}$ that best describe the human driving behavior, then the CAV's objective weights $\omega_1$ can be adapted to achieve the desired performance. The optimal strategy for adapting $\omega_1$ can be derived offline using Bayesian optimization, as presented in Section III.
B. Moving Horizon Inverse Reinforcement Learning
To identify the weights $\omega_2$ and $\omega_{12}$ in the individual objective function of HDV–2 and the shared objective, we utilize the feature-based IRL approach [18], [20], a machine learning technique developed to learn the underlying objective or reward of an agent by observing its behavior. We define the vector of all features and the vector of all corresponding weights in HDV–2's objective function as $f = [\phi_2^\top, \phi_{12}^\top]^\top$ and $\theta = [\omega_2^\top, \omega_{12}^\top]^\top$, respectively. Let $\tilde{f}$ be the vector of average observed feature values computed from data and $\mathbb{E}_p[f]$ be the vector of expected feature values under a given probability distribution $p$ over trajectories. With feature-based IRL, the goal is to learn the weight vector $\theta \in \Omega$, where $\Omega = \mathcal{W}_2 \times \mathcal{W}_{12}$, such that the expected feature values match the observed feature values.
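The sketch below illustrates one common way to realize this feature-matching principle, assuming a maximum-entropy trajectory model $p(r) \propto \exp(-\theta^\top f(r))$ approximated over a set of sampled candidate trajectories; the step size, the box projection onto $\Omega$, and the sampling scheme are illustrative choices, not the specific moving-horizon estimator described next.

```python
import numpy as np

def irl_weight_update(theta, f_observed, candidate_features,
                      step=0.05, theta_bounds=(0.0, 10.0)):
    """One feature-matching update for theta = [omega_2; omega_12].

    f_observed: average observed feature vector (tilde f) computed from HDV data.
    candidate_features: (N, dim) array of feature vectors of sampled candidate
        trajectories, used to approximate E_p[f] under p(r) ~ exp(-theta^T f(r)).
    The softmax model, step size, and box projection are assumed for illustration.
    """
    costs = candidate_features @ theta
    w = np.exp(-(costs - costs.min()))       # unnormalized max-entropy weights
    p = w / w.sum()
    expected_f = p @ candidate_features      # approximation of E_p[f]
    # Gradient ascent on the log-likelihood; at the optimum E_p[f] = tilde f.
    theta_new = theta + step * (expected_f - f_observed)
    return np.clip(theta_new, *theta_bounds)  # project onto a feasible box Omega
```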
In moving horizon IRL, at each time step, we utilize the $L \in \mathbb{N}$ most recent trajectory segments to update the weight estimate, where $L$ is the estimation horizon length. Let $t$ be the current time step and $\mathcal{R}_t = \{r_m\}_{m=1,\ldots,L}$ be the set of $L$ sample trajectory segments collected over the estimation horizon at time $t$, in which $r_m = (x_{12,t-m}, x_{12,t-m+1}, u_{12,t-m})$, for $m = 1, \ldots, L$, is the